
RE: [charm] Introduction


  • From: "Van Der Wijngaart, Rob F" <rob.f.van.der.wijngaart AT intel.com>
  • To: Elliott Slaughter <slaughter AT cs.stanford.edu>, Phil Miller <mille121 AT illinois.edu>
  • Cc: Sam White <white67 AT illinois.edu>, "Kale, Laxmikant V" <kale AT illinois.edu>, "Chandrasekar, Kavitha" <kchndrs2 AT illinois.edu>, "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: RE: [charm] Introduction
  • Date: Fri, 20 Oct 2017 22:21:18 +0000

You could look at the compiler optimization report. The code should vectorize in both cases, but perhaps there is a difference in alignment.
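
For instance, with the Intel 17 compiler the vectorization decisions can be inspected by compiling with an optimization report (e.g. -qopt-report=5 -qopt-report-phase=vec). If alignment turns out to be the difference, a hint in the inner loop might look roughly like the sketch below; the names are illustrative only, not the actual PRK source:

// Illustrative sketch, not the actual PRK stencil. __assume_aligned is an
// Intel-compiler hint (not a guarantee) that the buffers start on 64-byte
// boundaries; it is only valid if they really were allocated that way,
// e.g. with _mm_malloc(bytes, 64) or posix_memalign.
void relax_row(const double *in, double *out, long n, long j)
{
  __assume_aligned(in,  64);
  __assume_aligned(out, 64);

  // Simplified update along row j of an n x n grid (caller keeps
  // 1 <= j <= n-2 so the neighboring rows are in range).
  for (long i = 1; i < n - 1; i++)
    out[j*n + i] += 0.25 * (in[j*n + i - 1] + in[j*n + i + 1] +
                            in[(j-1)*n + i] + in[(j+1)*n + i]);
}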

Memory layout unfortunately can make quite a difference with simple kernels like Stencil. To see whether the difference between MPI and Charm++ is structural, you could try a few grid sizes close to 40K. With the transpose kernel I observed that what I thought was an important difference between MPI-3 shared memory and OpenMP proved to be a coincidence.

 

Rob

 

From: Elliott Slaughter [mailto:slaughter AT cs.stanford.edu]
Sent: Friday, October 20, 2017 3:12 PM
To: Phil Miller <mille121 AT illinois.edu>
Cc: Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com>; Sam White <white67 AT illinois.edu>; Kale, Laxmikant V <kale AT illinois.edu>; Chandrasekar, Kavitha <kchndrs2 AT illinois.edu>; charm AT cs.uiuc.edu
Subject: Re: Introduction

 

With a single rank Charm++ is about 14% faster (6893 vs. 6052 MFlops), which is a smaller gap, but at this problem size the two should really be identical.

 

MPI:
Rate (MFlops/s): 6052.196375  Avg time (s): 5.021965

Charm++:
Rate (MFlops): 6892.846640 Avg time (s) 4.409487

(Full logs attached.)

 

On Fri, Oct 20, 2017 at 3:02 PM, Phil Miller <mille121 AT illinois.edu> wrote:

On Fri, Oct 20, 2017 at 4:48 PM, Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com> wrote:

As a first sanity check, could you run with just a single rank/chare and see if you observe any differences? There shouldn’t be any.

 

This is indeed a critical sanity check. Earlier in the lifetime of the PRK project, we saw a 2x disadvantage for the Charm++ version of Stencil relative to all of the others. It turned out to be a difference in how the compiler vectorized the code: its aliasing analysis was tripped up by pointers coming from member variables in the Charm++ version, so it did not optimize that loop the way it did elsewhere.
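
A common way around that kind of problem is sketched below; the names are hypothetical, not the actual PRK Charm++ source, and the point is only that copying member-variable pointers into local __restrict pointers gives the compiler the aliasing information it otherwise loses:

// Hypothetical sketch, not the real code. Reading this->in / this->out
// directly inside the hot loop forces the compiler to assume the member
// pointers may alias; local __restrict copies restore what the plain C
// version gets for free.
class StencilChunk {
  double *in, *out;   // member-variable pointers: opaque to alias analysis
  long n;
public:
  void compute_interior() {
    const double * __restrict lin  = in;
    double       * __restrict lout = out;
    const long ln = n;

    for (long j = 1; j < ln - 1; j++)
      for (long i = 1; i < ln - 1; i++)
        lout[j*ln + i] += 0.25 * (lin[j*ln + i - 1] + lin[j*ln + i + 1] +
                                  lin[(j-1)*ln + i] + lin[(j+1)*ln + i]);
  }
};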

 

 

 

 

Rob

 

From: Elliott Slaughter [mailto:slaughter AT cs.stanford.edu]
Sent: Friday, October 20, 2017 2:44 PM
To: Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com>
Cc: Sam White <white67 AT illinois.edu>; Phil Miller <mille121 AT illinois.edu>; Kale, Laxmikant V <kale AT illinois.edu>; Chandrasekar, Kavitha <kchndrs2 AT illinois.edu>; charm AT cs.uiuc.edu
Subject: Re: Introduction

 

To follow up on my last email, here is a mystery I can't explain. With the PRK Stencil code and the configuration from my last email, Charm++ seems to get nearly 2x the performance of MPI on a single node, even with an overdecomposition factor of 1. I'm fairly certain that I've configured the two as closely as possible: both use Intel 17.0.4, both use -O3, same grid size, same number of PEs, etc. The problem size is generous enough that the impact of the programming model should be minimal, and nearly all of the time should be spent in the kernels. I'm attaching some sample outputs to this email in case you can spot any differences.

Do any of you know of any differences between the MPI and Charm++ stencil codes? I noticed, for example, that the Charm++ version doesn't respond to the DOUBLE define, but it seems to be hard-coded to double precision, so I don't think that should be an issue. Otherwise I'm having a hard time seeing what could cause such a large difference at this problem size. I've worked with the MPI versions of the PRK codes for some time, so I'm fairly certain I'm not misconfiguring them.
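
For what it's worth, the precision switch in the PRK reference kernels follows roughly the pattern sketched below (approximate, not copied from the actual headers); a Charm++ version that simply declares its arrays as double would then match a -DDOUBLE build of the MPI code, so precision alone shouldn't account for a 2x gap:

// Approximate sketch of the PRK-style precision switch, not the exact header.
#ifdef DOUBLE
  #define DTYPE double
#else
  #define DTYPE float
#endif
// Kernel arrays are then declared as DTYPE, e.g.:  DTYPE *in, *out;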

Thanks!

 

On Fri, Oct 20, 2017 at 2:33 PM, Elliott Slaughter <slaughter AT cs.stanford.edu> wrote:

Thanks Rob for the introduction.

I mostly just wanted to sanity check my configuration to make sure I'm doing things the Right Way (tm).

 

I downloaded Charm++ 6.8.1 and built with the following command. This is on Piz Daint, a Cray XC40/50 system.

 

module load PrgEnv-intel # and unload any other PrgEnv-*
module load craype-hugepages8M
./build charm++ gni-crayxc smp --with-production -j8

 

I wasn't sure about the SMP part, but Rob had talked about Charm++ having a dedicated core for communication, and I think this is the setting I need to get that configuration.

 

I set CHARMTOP inside PRK's make.defs file, but otherwise left the settings the same as for the other apps (i.e., -O3 and so on).

My run command looks like the following, where $n is the number of nodes and $d is the overdecomposition factor. Each node has 12 physical cores, so this leaves 2 cores free for whatever extra threads Charm++ wants to use. The stencil code is memory-bound, so I've found that even with MPI/OpenMP, filling all the cores isn't generally beneficial.

srun -n $n -N $n --ntasks-per-node 1 --cpu_bind none stencil +ppn 10 +setcpuaffinity 100 40000 $d

If anything about this configuration looks wrong, or if I'm missing any important settings (or there are settings where I should explore the performance impact of different options), please let me know.

 

On Fri, Oct 20, 2017 at 1:56 PM, Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com> wrote:

Hello Team,

 

I wanted to introduce you to Elliott Slaughter, a freshly minted PhD in computer science from Stanford and a member of the Legion team. He had some questions for me about the optimal choice of configuration, compiler, and runtime parameters when building Charm++ and running Charm++ workloads, especially the Parallel Research Kernels. I gave some generic advice, but would like to ask you (or those of you who are still at UIUC) to help him optimize his execution environment. Thanks!

 

Rob



--

Elliott Slaughter

"Don't worry about what anybody else is going to do. The best way to predict the future is to invent it." - Alan Kay




--

Elliott Slaughter

"Don't worry about what anybody else is going to do. The best way to predict the future is to invent it." - Alan Kay

 




--

Elliott Slaughter

"Don't worry about what anybody else is going to do. The best way to predict the future is to invent it." - Alan Kay



