
charm - Re: [charm] TopoManager / Cray XE6 Question

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Chris Wailes <chris.wailes AT gmail.com>
  • To: Nikhil Jain <nikhil.jain AT acm.org>
  • Cc: "Galvez Garcia, Juan Jose" <jjgalvez AT illinois.edu>, charm <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] TopoManager / Cray XE6 Question
  • Date: Fri, 16 Feb 2018 11:10:03 -0500

Nikhil,

The issue I ran into was that I was trying to run under Cluster Compatibility Mode, which doesn't have `aprun`.  When I tried to use `ccmrun`, the Charm++ program complained that it could not contact some service.

This was all an attempt to get around the fact that in ESM our system views a `node` as a single CPU.  This means that the ppn argument to qsub can't be greater than 32 (the number of cores per CPU) or it will never be able to schedule the job.  As a result, the T dimension is always truncated to size 32 instead of the proper size 128.

Is there an argument I can provide to the Charm++ app directly that would allow me to override this value?

- Chris

On Thu, Feb 15, 2018 at 5:42 PM, Nikhil Jain <nikhil.jain AT acm.org> wrote:
You don’t need to use charmrun. You can directly launch using standard MOAB scripts like MPI codes.

The launch command in the MOAB script would look something like this:

aprun -n <number of desired processes> ./pgm +ppn 32

Number of desired processes = number of Charm++ logical nodes = number of physical nodes * number of Charm++ processes per node. Each Charm++ process launches +ppn threads and hence uses that many cores. It is recommended to use +pemap and +commap to specify the affinity of threads to cores.
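The arithmetic above can be sketched as a small helper. This is only an illustration: the function name and its parameters are hypothetical, and only `aprun -n`, `./pgm`, and `+ppn` come from the launch command in this message.

```python
def aprun_command(physical_nodes, procs_per_node, threads_per_proc, program="./pgm"):
    """Build the launch line described above (hypothetical helper)."""
    # Number of Charm++ processes = physical nodes * processes per node.
    total_procs = physical_nodes * procs_per_node
    # Each process starts `threads_per_proc` threads via +ppn.
    return f"aprun -n {total_procs} {program} +ppn {threads_per_proc}"


# Four physical nodes, one Charm++ process per node, 32 threads each.
print(aprun_command(4, 1, 32))  # aprun -n 4 ./pgm +ppn 32
```

Affinity flags such as +pemap and +commap are left out here because their values depend on the machine.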


aprun may need to be replaced by srun or mpirun, depending on what the cluster uses. Hope this helps.

---
Nikhil Jain
Postdoctoral Fellow, Lawrence Livermore National Laboratory
nikhil.jain AT acm.org, http://nikhil-jain.github.io/

On Feb 15, 2018, at 11:09, Chris Wailes <chris.wailes AT gmail.com> wrote:

The `charmrun` program looks like it calls the `aprun` command when building Charm++ for the gni-crayxe environment.  Is there a way to use it with Cluster Compatibility Mode?

- Chris

On Tue, Feb 13, 2018 at 5:34 PM, Chris Wailes <chris.wailes AT gmail.com> wrote:
It appears that our job manager (TORQUE / Moab) views a single CPU as a `node`.  As such, values for ppn over 32 don't seem to work.  Do you know how to get a job manager set up like this to work with Charm's T dimension?

- Chris

On Tue, Feb 13, 2018 at 12:31 PM, Chris Wailes <chris.wailes AT gmail.com> wrote:
Juan,

Thanks for your response.  That makes sense.  Do you know if there is a reliable numbering scheme for the T dimension?  From what I can gather from the Cray XE6 documentation only one processor per Cray-node is actually linked to the router, and being able to tell which processor requires an extra HyperTransport hop would really help with my performance modeling.

- Chris

On Tue, Feb 13, 2018 at 12:10 PM, Galvez Garcia, Juan Jose <jjgalvez AT illinois.edu> wrote:

Hi Chris,

 

For Cray XE6, TopoManager considers each Gemini router to be a physical node (so the 2 nodes connected to the same router have the same 3D coordinate). The size of the fourth coordinate is the total number of processors in both nodes, and the T coordinate should uniquely identify a core inside a physical node. But I think the number reported for the T dimension may depend on the ppn requested for the job: if you request ppn=16, T=32; if ppn=32, T=64.
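The two data points above are consistent with T being the requested ppn multiplied by the two nodes that share a Gemini router. A minimal sketch of that reading follows; the function is hypothetical, and the rule is an inference from the two examples, not confirmed TopoManager behavior.

```python
def expected_t_size(ppn, nodes_per_router=2):
    """Guess the reported T dimension from the requested ppn (assumed rule)."""
    # Assumption: both nodes on a router contribute `ppn` entries to T.
    return ppn * nodes_per_router


for ppn in (16, 32):
    print(f"ppn={ppn} -> T={expected_t_size(ppn)}")
```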

 

-Juan

 

From: Chris Wailes
Sent: Tuesday, February 13, 2018 10:53 AM
To: charm
Subject: [charm] TopoManager / Cray XE6 Question

 
This documentation states that the TopoManager method `getDimNT()`:


Returns the length of T dimension. TopoManager uses T dimension to represent different cores that reside within a physical node.

However, the Cray documentation for the XE6 series of machines states that each Gemini router is connected to two nodes, each of which contains two processors.

My question, then, is this: if the X, Y, and Z coordinates identify routers, are all of the cores on all CPUs of both nodes connected to a router represented using the T dimension, or are there other, hidden dimensions that specify the Cray node and processor?

I ask because the XE6 machine I'm currently using (which has 16-core, 2-way SMP-enabled CPUs) is reporting X, Y, Z, and T dimensions of 11, 6, 8, and 32.  This would seem to indicate that a coordinate quadruple doesn't uniquely identify a core, but instead identifies four cores, one on each CPU attached to each node attached to a router.
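The count behind "four cores" can be checked with a few lines of arithmetic. This sketch assumes that "2-way" means two hardware threads per core, giving 32 schedulable cores per CPU; the variable names are illustrative only.

```python
# Hardware layout as described in this message and the Cray documentation.
nodes_per_router = 2   # each Gemini router serves two nodes
cpus_per_node = 2      # each node has two processors
cores_per_cpu = 16
threads_per_core = 2   # assumption: "2-way" = two hardware threads per core

hw_threads_per_router = (
    nodes_per_router * cpus_per_node * cores_per_cpu * threads_per_core
)

# Dimensions reported by TopoManager on this machine.
dim_x, dim_y, dim_z, dim_t = 11, 6, 8, 32

routers = dim_x * dim_y * dim_z
quadruples = routers * dim_t
threads_per_quadruple = hw_threads_per_router // dim_t

print(hw_threads_per_router)   # 128
print(routers, quadruples)     # 528 16896
print(threads_per_quadruple)   # 4
```

With T reported as 32 while a router serves 128 hardware threads, each (X, Y, Z, T) quadruple would have to cover four of them.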

Any clarification would be appreciated, and thanks for your work on Charm++.

- Chris







