
charm - Re: [charm] Incorrect T-Dimension Size Information

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system



  • From: Chris Wailes <chris.wailes AT gmail.com>
  • To: "Galvez Garcia, Juan Jose" <jjgalvez AT illinois.edu>
  • Cc: charm <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Incorrect T-Dimension Size Information
  • Date: Fri, 30 Mar 2018 17:03:01 -0400

Juan,

Sorry for the delay, but I had to run a test program to verify the answer to your question: yes, Charm++ reports the correct number of PEs at the start of the application.

- Chris

On Mon, Mar 26, 2018 at 12:27 PM, Galvez Garcia, Juan Jose <jjgalvez AT illinois.edu> wrote:

Yeah, sorry, I mixed up the terms there. That is what I meant (one router, two nodes).

It seems like in your case CmiNumCores should return 64, but it is actually returning 16. The value comes from sysconf calls. You could try forcing the number of CPU cores with FORCECPUCOUNT.

By the way, do the PE and node counts reported by Charm++ (in the output at the start of the application) match what you expect?



From: Chris Wailes [chris.wailes AT gmail.com]
Sent: Monday, March 26, 2018 11:17 AM
To: Galvez Garcia, Juan Jose
Cc: charm
Subject: Re: [charm] Incorrect T-Dimension Size Information

Juan,

Thanks for the information.

Each coordinate of an XE6 has a single router.  That router is connected to two nodes; each node has two sockets, for a total of 4 CPUs.  In this machine, those CPUs are 16-core Abu Dhabi chips with SMT, giving 32 logical cores per CPU and a total of 128 cores per coordinate.
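Purely as a worked check of the arithmetic above (the names are illustrative, not Charm++ identifiers):

```cpp
// Per-coordinate logical core count on this XE6, as described in the thread.
inline int logical_cores_per_coordinate() {
    const int nodes_per_router = 2;   // one Gemini router, two nodes
    const int sockets_per_node = 2;   // two CPUs per node -> 4 CPUs total
    const int cores_per_cpu    = 16;  // Abu Dhabi physical cores
    const int smt_per_core     = 2;   // SMT: two hardware threads per core
    return nodes_per_router * sockets_per_node
         * cores_per_cpu * smt_per_core;   // 2 * 2 * 16 * 2 = 128
}
```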

It sounds like I'll need to adjust some values/code in the files you pointed to.

- Chris

On Mon, Mar 26, 2018 at 12:07 PM, Galvez Garcia, Juan Jose <jjgalvez AT illinois.edu> wrote:
Chris,

I don't fully understand the scenario you are dealing with, such as where the 128 comes from. If there are 2 Geminis per 3D coordinate, shouldn't T be at most 64?

In any case, I can give you some pointers on how the T value is calculated for XE6 and how you can change it.
The X, Y, and Z coordinates in the 3D torus are obtained via calls to the Cray rca library (the code to get these values is in `src/util/topomanager/CrayNid.c`).

The T dimension is calculated in `src/util/topomanager/XTTorus.h` as `CmiNumCores() * CPU_FACTOR`.
CmiNumCores is defined in `src/ck-core/cputopology.C` and uses sysconf calls to determine the number of cores per host. I'm not sure exactly which sysconf calls are determining the value of cores in your case, but you should be able to find out. Also, you can force your own value using the FORCECPUCOUNT environment variable.
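As a minimal sketch of the kind of logic described above (an environment override checked before sysconf), assuming FORCECPUCOUNT as the variable name per Juan's email; the helper name is hypothetical and this is not the actual `cputopology.C` implementation:

```cpp
#include <cstdlib>
#include <unistd.h>

// Hypothetical core-count helper: honor FORCECPUCOUNT if set,
// otherwise fall back to the number of online logical processors.
inline int guess_num_cores() {
    if (const char *forced = std::getenv("FORCECPUCOUNT")) {
        int n = std::atoi(forced);
        if (n > 0) return n;        // user-forced override
    }
    long n = sysconf(_SC_NPROCESSORS_ONLN);  // online logical processors
    return n > 0 ? static_cast<int>(n) : 1;
}
```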

CPU_FACTOR is set to 2 for XE6 in XTTorus.h. I assume the 2 comes from the fact that 2 Geminis make one node in the 3D topology.

-Juan


From: Chris Wailes [chris.wailes AT gmail.com]
Sent: Tuesday, March 20, 2018 9:37 AM
To: charm
Subject: [charm] Incorrect T-Dimension Size Information

I am attempting to use Charm on a Cray XE6 machine with 16-core AMD Abu Dhabi chips. The way this machine is set up, the job management system treats a single CPU as a node with 32 processing elements (16 physical cores / 32 logical cores).

I've been able to run programs from the test/ and examples/ directories using core counts from 1 to 128 (across 4 of the job manager's nodes).  Unfortunately, the size of the T dimension as reported by the TopoManager is always 32, instead of the correct value of 128.

This seems to indicate that one of three things is happening:
  1. The part of Charm++ responsible for assigning work uses the correct T-dimension size, and there is simply a discrepancy between that value and the value reported by the TopoManager.

  2. The part of Charm++ responsible for assigning jobs also believes that the T-Dimension is only 32, and as a result work is only being allocated to the first 32 processing elements connected to the router.  Everything works fine, but only a quarter of the available resources are being used.

  3. Different parts of the Charm++ runtime have different ideas of what the T-dimension size is.  Given a chance, the runtime might try to assign a chare to a PE with a T-coordinate >= 32 (assuming 0-indexing), causing a runtime error/exception, but I have been lucky enough not to encounter this yet.

My questions then are: which of these three scenarios is occurring, and how do I get the TopoManager to report the correct size for the T dimension?

- Chris





