RE: [charm] Introduction


  • From: "Van Der Wijngaart, Rob F" <rob.f.van.der.wijngaart AT intel.com>
  • To: Elliott Slaughter <slaughter AT cs.stanford.edu>
  • Cc: Sam White <white67 AT illinois.edu>, Phil Miller <mille121 AT illinois.edu>, "Kale, Laxmikant V" <kale AT illinois.edu>, "Chandrasekar, Kavitha" <kchndrs2 AT illinois.edu>, "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: RE: [charm] Introduction
  • Date: Fri, 20 Oct 2017 21:48:15 +0000

Hi Elliott,

 

As a first sanity check, could you run with just a single rank/chare and see if you observe any differences? There shouldn’t be any.
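
For instance, something along these lines (a sketch only; the binary name and arguments are taken from the runs quoted below, so adjust paths to your builds):

# MPI stencil, one rank: 100 iterations on a 40000x40000 grid
srun -n 1 -N 1 ./stencil 100 40000

# Charm++ SMP stencil, one worker PE, overdecomposition factor 1
srun -n 1 -N 1 --cpu_bind none ./stencil +ppn 1 +setcpuaffinity 100 40000 1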

 

Rob

 

From: Elliott Slaughter [mailto:slaughter AT cs.stanford.edu]
Sent: Friday, October 20, 2017 2:44 PM
To: Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com>
Cc: Sam White <white67 AT illinois.edu>; Phil Miller <mille121 AT illinois.edu>; Kale, Laxmikant V <kale AT illinois.edu>; Chandrasekar, Kavitha <kchndrs2 AT illinois.edu>; charm AT cs.uiuc.edu
Subject: Re: Introduction

 

To follow up on my last email, here is a mystery I can't explain. With the PRK Stencil code and the configuration from my last email, Charm++ seems to get nearly 2x the performance of MPI on a single node, even with an overdecomposition factor of 1. I'm fairly certain that I've configured the two as closely as possible: both use Intel 17.0.4, both use -O3, same grid size, same number of PEs, etc. The problem size is generous enough that the impact of the programming model should be minimal, and nearly all of the time should be spent in the kernels. I'm attaching some sample outputs to this email in case you can spot any differences.

Do any of you know of any known differences between the MPI and Charm++ stencil codes? I noticed, for example, that the Charm++ version doesn't respond to the DOUBLE define, but it seems to be hard-coded to double precision, so I don't think that should be an issue. Otherwise I'm having a hard time seeing what could cause such a large difference at this problem size. I've worked with the MPI versions of the PRK codes for some time, so I'm fairly certain I'm not misconfiguring them.
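
A quick way to check is something like the following (the directory names and the DTYPE macro are assumptions about the PRK layout on my end, not something I've verified):

# see where the precision switch appears in each implementation
grep -rnE 'DOUBLE|DTYPE' MPI1/Stencil CHARM++/Stencil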

Thanks!

 

On Fri, Oct 20, 2017 at 2:33 PM, Elliott Slaughter <slaughter AT cs.stanford.edu> wrote:

Thanks Rob for the introduction.

I mostly just wanted to sanity-check my configuration to make sure I'm doing things the Right Way (tm).

 

I downloaded Charm++ 6.8.1 and built with the following command. This is on Piz Daint, a Cray XC40/50 system.

 

module load PrgEnv-intel # and unload any other PrgEnv-*
module load craype-hugepages8M
./build charm++ gni-crayxc smp --with-production -j8

 

I wasn't sure about the SMP part, but Rob had mentioned that Charm++ dedicates a core to communication, and I think SMP is the setting I need to get that configuration.
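
As I understand it, the placement can also be made explicit with +pemap/+commap instead of +setcpuaffinity; for one 12-core node that would look something like this (the core numbering is an assumption on my part, and I've used an overdecomposition factor of 1 here):

# 10 worker PEs pinned to cores 0-9, communication thread on core 10
srun -n 1 -N 1 --ntasks-per-node 1 --cpu_bind none stencil +ppn 10 +pemap 0-9 +commap 10 100 40000 1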

 

I set CHARMTOP inside PRK's make.defs file, but otherwise left the settings the same as for the other apps (i.e., -O3 and so on).
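
Concretely, that change amounts to something like this (CHARMTOP is the variable make.defs expects; the path is an assumption based on the build command above):

# point PRK at the Charm++ SMP build directory (append, or edit the existing line)
echo 'CHARMTOP=$(HOME)/charm-6.8.1/gni-crayxc-smp' >> make.defs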

My run command looks like the following, where $n is the number of nodes and $d is the overdecomposition factor. Each node has 12 physical cores, so this leaves 2 cores free for whatever extra threads Charm++ wants to use. The stencil code is memory-bound, so I've found that even with MPI/OpenMP, filling all the cores isn't generally beneficial.

srun -n $n -N $n --ntasks-per-node 1 --cpu_bind none stencil +ppn 10 +setcpuaffinity 100 40000 $d
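
To spell out my reading of that command (the chare-count arithmetic is my assumption about how the PRK Charm++ stencil interprets the last argument):

# one process per node; 10 worker PEs + 1 communication thread = 11 of 12 cores used
# positional arguments: 100 iterations, a 40000x40000 grid, overdecomposition factor $d
# total chares = $n nodes * 10 PEs per node * $d
srun -n $n -N $n --ntasks-per-node 1 --cpu_bind none stencil +ppn 10 +setcpuaffinity 100 40000 $d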

If anything about this configuration looks wrong, or if I'm missing any important settings (or there are settings where I should explore the performance impact of different options), please let me know.

 

On Fri, Oct 20, 2017 at 1:56 PM, Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com> wrote:

Hello Team,

 

I wanted to introduce you to Elliott Slaughter, a freshly minted PhD in computer science from Stanford and a member of the Legion team. He had some questions for me about the optimal choice of configuration, compiler, and runtime parameters when building Charm++ and executing Charm++ workloads, especially the Parallel Research Kernels. I gave some generic advice, but would like to ask you (or those of you who are still at UIUC) to help him optimize his execution environment. Thanks!

 

Rob



--

Elliott Slaughter

"Don't worry about what anybody else is going to do. The best way to predict the future is to invent it." - Alan Kay




--

Elliott Slaughter

"Don't worry about what anybody else is going to do. The best way to predict the future is to invent it." - Alan Kay



