charm - Re: [charm] entry method overhead on Blue Waters

  • From: James Bordner <jobordner AT gmail.com>
  • To: "Galvez Garcia, Juan Jose" <jjgalvez AT illinois.edu>
  • Cc: Michael Norman <mlnorman AT sdsc.edu>, "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] entry method overhead on Blue Waters
  • Date: Tue, 25 Oct 2016 18:23:06 -0700

Hi Juan,

The scaling problem seems not to be related to communication, but to some hidden overhead in scheduling entry methods.  I've attached a Projections plot of time vs. messages received that indicates the rate at which entry methods are invoked is what's limiting performance.  The cyan, blue, and magenta entry methods involve different numbers of chare array elements and different message sizes, yet their rates are bounded by the same constant, so this limit must be independent of both the number of chare elements and the message sizes.

Being limited by the entry method call rate is not itself a problem.  The problem is that the rate depends on the number of PEs--there seems to be a roughly O(P) term in the cost of each entry method call.  Why would this be?  Is there a way around it?  The constant is small, and I can improve my code's scalability by reducing the number of entry method calls, but that will help only up to a point.
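
To put rough numbers on it, here is an illustrative cost model (not a measurement): each call costs about t0 + alpha*P, where t0 is the ~20 us of work per entry method from the timeline and alpha is a made-up constant chosen so the per-call time comes out near the ~1.0 ms gap seen at 32K cores.  Reducing the number of calls scales the total down, but the alpha*P term still grows with P:

#include <cstdio>

// Illustrative cost model only -- t0 and alpha are not measured constants.
// Assumed per-call cost: t0 + alpha * P.
static double per_call(double t0, double alpha, long P) {
  return t0 + alpha * P;
}

int main() {
  const double t0    = 20.0e-6;  // ~20 us of work per entry method (from the timeline)
  const double alpha = 3.0e-8;   // guess, chosen so per-call ~= 1.0 ms at P = 32K
  for (long P : {4096L, 32768L, 110000L}) {
    std::printf("P = %6ld  per-call = %6.3f ms\n", P, 1.0e3 * per_call(t0, alpha, P));
  }
  return 0;
}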

Thanks,
James

On Thu, Oct 13, 2016 at 6:48 PM, Galvez Garcia, Juan Jose <jjgalvez AT illinois.edu> wrote:
Hi James,

Not sure yet if contention could be the cause. It looks like you're using a custom map. So, just to confirm: all blocks within a region are mapped to the same PE? For neighboring blocks not on the same PE (I assume in neighboring regions?), it's not clear to me what the mapping would do. Is it possible that their PE numbers differ by a significant amount (like PE 0 and PE 4096), so that the communication has to traverse multiple hops? And how frequently would that happen?

Also, how many chares per PE do you have in your current setup?

I'm forwarding this to the list since I forgot to do it before and others might be able to provide feedback.

-Juan



From: James Bordner [jobordner AT gmail.com]
Sent: Thursday, October 13, 2016 7:49 PM
To: Galvez Garcia, Juan Jose
Cc: Michael Norman
Subject: Re: [charm] entry method overhead on Blue Waters

Hi Juan,

The array mapping I'm using subdivides the 3D domain into nx * ny * nz regular regions, with all blocks within a region (ix,iy,iz) mapped to PE ix + nx*(iy + ny*iz).  Each subregion has essentially the same work and communication (it's solving a Sedov-blast-like hydrodynamics problem with adaptive mesh refinement in each region).  The mesh is refined mostly near the region centers, so most neighboring blocks are assigned to the same PE.  I can certainly try different mappings if you think that could improve scaling.
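
In case it helps to see it concretely, here is a rough sketch of that mapping written as a custom CkArrayMap, following the usual Charm++ array-mapping pattern.  The names are illustrative (not my actual code), and I'm glossing over how a block's index gets reduced to its region coordinates (ix,iy,iz):

// .ci declaration assumed (plus the generated .decl.h include):
//   group RegionMap : CkArrayMap { entry RegionMap(int nx, int ny, int nz); };
#include "charm++.h"

class RegionMap : public CkArrayMap {
  int nx_, ny_, nz_;
public:
  RegionMap(int nx, int ny, int nz) : nx_(nx), ny_(ny), nz_(nz) {}
  int procNum(int /*arrayHdl*/, const CkArrayIndex &idx) {
    // Illustrative: treat the index components as the region coordinates.
    const int *d = idx.data();
    const int ix = d[0], iy = d[1], iz = d[2];
    return ix + nx_ * (iy + ny_ * iz);   // region (ix,iy,iz) -> PE
  }
};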

Attached is a timeline screenshot from a 32K FP core run.  Here each block is calling an entry method on each of its neighbors (including edge and corner neighbors).  Each entry method takes about 20 μs, but for this run the minimum time between individual entry method calls is about 1.0 ms.  Would network contention manifest itself like this?  I've been able to get better performance by using a better mapping of PEs to cores within a node, but the issue remains.
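
For what it's worth, the gap can also be spot-checked without Projections by timestamping the entry method directly.  A sketch with placeholder names (p_refresh_recv() stands in for the real neighbor entry method; a plain static is only safe here because the run is non-SMP):

// Sketch: log long gaps between successive entry-method invocations on this PE.
static double last_invocation = 0.0;

void Block::p_refresh_recv(/* neighbor face/edge/corner data */) {
  const double now = CkWallTimer();
  if (last_invocation > 0.0 && now - last_invocation > 0.5e-3)   // gaps > 0.5 ms
    CkPrintf("[PE %d] gap = %.3f ms\n", CkMyPe(), 1.0e3 * (now - last_invocation));
  last_invocation = now;
  // ... existing refresh logic ...
}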

I'll also attach a scaling plot (note it's log-log).  The two non-scaling phases, "adapt sync" and "refresh sync", measure time spent in the "adapt" and "refresh" phases but not in my application code (that is, time in Charm++ / Converse or lower software layers during these phases).  Both phases involve neighbor-to-neighbor entry method calls, and the Projections views look like the attached timeline.  Given the regularity of the Projections views of these phases, I expect the curves to closely track the growth of the inter-entry-method time with P.

Thank you for your help!
James


On Thu, Oct 13, 2016 at 12:17 PM, Galvez Garcia, Juan Jose <jjgalvez AT illinois.edu> wrote:
Hi James,

One possibility is network contention. Depending on the volume of communication and how the chares are placed on processors, if communicating chares are not on the same physical node (or nearby nodes), the network can become congested, which hurts scaling at high node counts. Blue Waters can be particularly sensitive to this.

Are you using the default chare-to-PE mapping scheme in Charm++? If this is the problem, a different mapping might give better results. Some info about array mapping is available here:
http://charmplusplus.org/tutorial/ArrayMapping.html
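
The basic pattern from that page is to define a CkArrayMap subclass (overriding procNum) and attach it when the array is created, roughly like this (placeholder names):

// .ci (sketch):
//   group BlockMap : CkArrayMap { entry BlockMap(); };
//   array [3D] Block { entry Block(); /* ... */ };
CProxy_BlockMap mapProxy = CProxy_BlockMap::ckNew();
CkArrayOptions opts(nx, ny, nz);   // array dimensions
opts.setMap(mapProxy);
CProxy_Block blocks = CProxy_Block::ckNew(opts);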

Juan

From: James Bordner [jobordner AT gmail.com]
Sent: Thursday, October 13, 2016 1:27 PM
To: charm
Subject: [charm] entry method overhead on Blue Waters

I've run into a scaling issue on Blue Waters.  In my chare array, chares are associated with nodes of an octree and do periodic synchronizations with adjacent neighbors.  The problem, as discovered using Projections, is that the time between entry method calls on neighbors increases as the core count increases.  This overhead is very regular and does not seem to depend on the total number of chares in the array, just the core count.  Although I've run on up to 110K FP cores, the scaling curve starts to drop noticeably at about 32K FP cores.

I compiled Charm++ 6.7.1 with gni-crayxe and --with-production (non-SMP mode). I ran similar tests a couple of years ago with Charm++ 6.5.1 in SMP mode and didn't see this problem, though I only went up to 32K FP cores at the time, so it might not have been noticeable yet.

Any ideas what might be causing this scaling problem or how to get around it?

Thanks!
James


Attachment: N32-msgs-recv-time.png
Description: PNG image



