Re: [charm] Introduction


  • From: Elliott Slaughter <slaughter AT cs.stanford.edu>
  • To: "Van Der Wijngaart, Rob F" <rob.f.van.der.wijngaart AT intel.com>
  • Cc: Sam White <white67 AT illinois.edu>, Phil Miller <mille121 AT illinois.edu>, "Kale, Laxmikant V" <kale AT illinois.edu>, "Chandrasekar, Kavitha" <kchndrs2 AT illinois.edu>, "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] Introduction
  • Date: Fri, 20 Oct 2017 14:44:03 -0700

To follow up on my last email, here is a mystery I can't explain. With the PRK Stencil code and the configuration from my last email, Charm++ seems to get nearly 2x the performance of MPI on a single node, even with an overdecomposition factor of 1. I'm fairly certain that I've configured the two as closely as possible: both use Intel 17.0.4, both use -O3, the same grid size, the same number of PEs, etc. The problem size is quite generous, so the impact of the programming model should be minimal and nearly all of the time should be spent in the kernels. I'm attaching some sample outputs to this email in case you can spot any differences.
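
For concreteness, from the attached outputs: the Charm++ run reports 44370.3 MFlops vs. 25691.0 MFlops for MPI, a ratio of about 1.73x, which matches the ratio of the reported average times (1.183 s for MPI vs. 0.685 s for Charm++), so the two reports are at least internally consistent.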

Do any of you know of any known differences between the MPI and Charm++ stencil codes? I noticed, for example, that the Charm++ version doesn't respond to the DOUBLE define, but it seems to be hard-coded to double precision, so I don't think that should be an issue. Otherwise I'm having a hard time seeing what could cause such a large difference at this problem size. I've worked with the MPI versions of the PRK codes for some time, so I'm fairly certain I'm not mis-configuring them.
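
For reference, the precision switch I'm referring to in the MPI version looks roughly like the sketch below; I'm paraphrasing from memory, so the exact macro names and tolerance values may not match the actual PRK headers. If the Charm++ port simply uses double everywhere, the two runs above should still be computing the same thing.

/* Sketch (from memory; names/values may differ from the PRK sources) of
 * how the MPI stencil selects its floating-point type at compile time. */
#ifndef RADIUS
#define RADIUS 2                 /* "Radius of stencil = 2" in the outputs above */
#endif

#if DOUBLE
#define DTYPE   double
#define EPSILON 1.e-8            /* validation tolerance for double precision */
#else
#define DTYPE   float
#define EPSILON 1.e-4            /* looser tolerance for single precision */
#endif

/* The grid arrays and stencil weights are declared in terms of DTYPE, so
 * building with -DDOUBLE=1 vs. -DDOUBLE=0 switches the whole kernel. */
static DTYPE weight[2*RADIUS+1][2*RADIUS+1];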

Thanks!

On Fri, Oct 20, 2017 at 2:33 PM, Elliott Slaughter <slaughter AT cs.stanford.edu> wrote:
Thanks Rob for the introduction.

I mostly just wanted to sanity check my configuration to make sure I'm doing things the Right Way (tm).

I downloaded Charm++ 6.8.1 and built with the following command. This is on Piz Daint, a Cray XC40/50 system.

module load PrgEnv-intel # and unload any other PrgEnv-*
module load craype-hugepages8M
./build charm++ gni-crayxc smp --with-production -j8

I wasn't sure about the SMP part, but Rob had talked about Charm++ having a dedicated core for communication, and I think this is the setting I need to get that configuration.
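
For comparison, my understanding is that the corresponding non-SMP build is just the same command without the smp option, in which case there is no dedicated communication thread and every core runs a worker PE:

./build charm++ gni-crayxc --with-production -j8

I haven't tried that configuration yet, though.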

I set CHARMTOP inside PRK's make.defs file, but otherwise left the settings the same as the other apps. (I.e. -O3 and so on.)
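
Concretely, the only line I touched in make.defs points at the Charm++ install; the path below is just a placeholder for wherever I unpacked and built 6.8.1:

CHARMTOP=$HOME/charm-6.8.1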

My run command looks like the following, where $n is the number of nodes and $d is the overdecomposition factor. Each node has 12 physical cores, so this leaves 2 extra cores for whatever extra threads Charm++ wants to use. The stencil code is memory-bound, so I've found that even with MPI/OpenMP, filling up all the cores isn't generally beneficial.

srun -n $n -N $n --ntasks-per-node 1 --cpu_bind none stencil +ppn 10 +setcpuaffinity 100 40000 $d
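
As I understand it, +ppn 10 gives 10 worker threads plus one communication thread per process, so 11 of the 12 physical cores are in use. If explicit placement turns out to matter, I could also pin the threads with the SMP mapping options instead of relying on +setcpuaffinity, along the lines of (assuming +pemap/+commap accept core lists like this):

srun -n $n -N $n --ntasks-per-node 1 --cpu_bind none stencil +ppn 10 +pemap 0-9 +commap 10 100 40000 $d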

If anything about this configuration looks wrong, or if I'm missing any important settings (or there are settings where I should explore the performance impact of different options), please let me know.

On Fri, Oct 20, 2017 at 1:56 PM, Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com> wrote:

Hello Team,

 

I wanted to introduce you to Elliott Slaughter, a freshly minted PhD in computer science from Stanford and a member of the Legion team. He had some questions for me about the optimal choice of configuration, compiler, and runtime parameters when building Charm++ and executing Charm++ workloads, especially the Parallel Research Kernels. I gave some generic advice, but would like to ask you (or those of you who are still at UIUC) to help him optimize his execution environment. Thanks!

 

Rob







--
Elliott Slaughter

"Don't worry about what anybody else is going to do. The best way to predict the future is to invent it." - Alan Kay
Parallel Research Kernels version 2.17
MPI stencil execution on 2D grid
Number of ranks        = 10
Grid size              = 40000
Radius of stencil      = 2
Tiles in x/y-direction = 2/5
Type of stencil        = star
Data type              = double precision
Compact representation of stencil loop body
Number of iterations   = 100
Solution validates
Rate (MFlops/s): 25690.953422  Avg time (s): 1.183059

Charm++> Running on Gemini (GNI) with 1 processes
Charm++> static SMSG
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 8192K
Charm++> Running in SMP mode: numNodes 1,  10 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.8.1
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled. 
Charm++> Running on 1 unique compute nodes (24-way SMP).
Parallel Research Kernels Version 2.17
Charm++ stencil execution on 2D grid
Number of Charm++ PEs   = 10
Overdecomposition       = 1
Grid size               = 40000
Radius of stencil       = 2
Chares in x/y-direction = 2/5
Type of stencil         = star
Compact representation of stencil loop body
Number of iterations    = 100
Solution validates
Rate (MFlops): 44370.295587 Avg time (s) 0.685006
[Partition 0][Node 0] End of program


