Re: [charm] smp-enabled applications run slow on Blue Waters


  • From: Scott Field <sfield AT astro.cornell.edu>
  • To: "Galvez Garcia, Juan Jose" <jjgalvez AT illinois.edu>
  • Cc: Phil Miller <unmobile AT gmail.com>, Nikhil Jain <nikhil.jain AT acm.org>, Charm Mailing List <charm AT cs.illinois.edu>
  • Subject: Re: [charm] smp-enabled applications run slow on Blue Waters
  • Date: Tue, 29 Mar 2016 12:12:22 -0400

Hi,

  Thanks for the suggestion. With +showcpuaffinity and sched_getcpu() I have checked that the threads are correctly assigned to distinct cores. I also wrote a small test code, but it showed no timing difference between smp and non-smp jobs. Writing a more faithful benchmark might be difficult, but we are happy to provide access to our code and to help in any way we can with tracking down the issue.
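
In case it helps, the check was along these lines (a minimal sketch with a hypothetical helper name, not our actual code):

// Report which OS core each Charm++ worker PE is running on, so the
// placement can be compared against the +pemap passed at launch.
#include <sched.h>     // sched_getcpu() (GNU extension; g++ defines _GNU_SOURCE for C++)
#include "charm++.h"   // CkMyPe(), CkNumPes(), CkPrintf()

void reportAffinity() {
  CkPrintf("PE %d of %d is running on core %d\n",
           CkMyPe(), CkNumPes(), sched_getcpu());
}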

Best,
Scott

On Thu, Mar 24, 2016 at 7:14 PM, Galvez Garcia, Juan Jose <jjgalvez AT illinois.edu> wrote:
What I worked on was warning the user when multiple threads are assigned to the same physical core, but it's up to the user to pass the correct parameters to aprun and Charm++ (+pemap and +commap) to avoid this. Scott is already providing CPU affinity options, so I'm not sure that's the problem here. Scott, you might want to try running Charm++ with +showcpuaffinity to print the CPU affinity of the threads and verify that each one is going to a different core.
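
For example, taking one of the launch lines from your earlier email and adding the flag (illustrative only, not a tested command line):

aprun -N 1 -n 1 -d 31 ./pgm-smp +ppn 7 +pemap 2,4,6,8,10,12,14 +commap 0 +showcpuaffinity

Each worker and communication thread should then report the core(s) it is bound to at startup.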

Thanks,
Juan

From: Phil Miller [unmobile AT gmail.com]
Sent: Thursday, March 24, 2016 4:20 PM
To: Nikhil Jain; Galvez Garcia, Juan Jose
Cc: Charm Mailing List; Scott Field
Subject: Re: [charm] smp-enabled applications run slow on Blue Waters

There's an open bug in Redmine, reported from ChaNGa, that is very similar to this. I think Juan was addressing it.

On Mar 24, 2016 2:49 PM, "Nikhil Jain" <nikhil.jain AT acm.org> wrote:
Hello Scott,

This is indeed strange. Everything you are doing seems to be done the way it should be, yet the results are not what we would expect. Is this code a small benchmark, or something we can get access to and explore?

Thanks
Nikhil


March 21, 2016 at 18:17
Hi,

   On Blue Waters, using an smp-enabled Charm++ build, my application typically runs 2 to 5 times slower (as judged by the CPU time, computed as walltime x workers) than the corresponding non-smp build; example launch commands and timings are given at the end of this email.


I'm at a loss to explain (or fix!) this, but I have a few pieces of evidence to share:

1) On other machines, and using a verbs-* build of Charm++, I see negligible difference in CPU time between the smp and non-smp versions (an example of such a build line is given after this list).
 
2) According to Projections, my user-defined entry methods run up to 6 times slower than in the non-smp builds (averaged over time and cores).

3) Some cores are idle up to 50% of the time, compared to the non-smp version where almost no cores are idle (although the non-smp version shows significantly higher overhead, which I cannot explain either).

4) The smp slowdown is robust to how I launch the executable, although there is a bit of variability. Some examples are given at the end of this email. 
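
(Regarding item 1: an smp verbs build looks something like the line below; the exact target name depends on the machine, so this is illustrative only and not one of the builds I timed.)

./build charm++ verbs-linux-x86_64 smp --with-production -std=c++11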

An instructive comparison is the non-smp run using 1 process per floating-point unit, 16 in total (CPU time of 71 us per algorithm iteration), versus any of the smp jobs in which the worker threads are restricted to the floating-point units; those CPU times are 143 us, 166 us, 174 us and 310 us. Comparing jobs that use all 32 (logical) cores, the times are 105 us vs roughly 500 us.
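
(In other words, slowdown factors of roughly 2.0, 2.3, 2.5, and 4.4 on the floating-point units, and close to 5 when all 32 logical cores are used, which is where the 2 to 5 times figure above comes from.)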

Just to reiterate, this only appears to be an issue on Blue Waters. Has anyone else experienced this? Any suggestions? My Charm++ builds are:

./build charm++ gemini_gni-crayxe smp hugepages persistent --with-production -std=c++11

./build charm++ gemini_gni-crayxe hugepages persistent --with-production -std=c++11

Best,
Scott


# NON-SMP #
aprun -N 1 -n 1 ./pgm
 with CPU time / (points*step) (us) in us = 57.079748 

aprun -N 16 -n 16 -j 1 ./pgm
 with CPU time / (points*step) (us) in us = 71.344819 

aprun -N 32 -n 32 ./pgm
 with CPU time / (points*step) (us) in us = 105.911947 


# SMP #
aprun -N 1 -n 1 -d 31 ./pgm-smp +ppn 1 +pemap 2 +commap 0
 with CPU time / (points*step) (us) in us = 90.569283 

aprun -N 1 -n 1 -d 31 ./pgm-smp +ppn 7 +pemap 2,4,6,8,10,12,14 +commap 0
 with CPU time / (points*step) (us) in us = 166.620733 

aprun -N 1 -n 1 -d 31 ./pgm-smp +ppn 15 +pemap 2,4,6,8,10,12,14,16,18,20,22,24,26,28,30 +commap 0
 with CPU time / (points*step) (us) in us = 310.771909 

aprun -N 1 -n 1 -d 31 ./pgm-smp +ppn 30 +pemap 2-31 +commap 0
 with CPU time / (points*step) (us) in us = 529.651506 

aprun -N 2 -n 2 -d 16 ./pgm-smp +ppn 7 +pemap 2,4,6,8,10,12,14,18,20,22,24,26,28,30 +commap 0,16
 with CPU time / (points*step) (us) in us = 174.455473 

aprun -N 4 -n 4 -d 8 ./pgm-smp +ppn 3 +pemap 2,4,6,10,12,14,18,20,22,26,28,30 +commap 0,8,16,24
 with CPU time / (points*step) (us) in us = 143.428617




