
charm - RE: [charm] Incorrect T-Dimension Size Information

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system




  • From: "Galvez Garcia, Juan Jose" <jjgalvez AT illinois.edu>
  • To: Chris Wailes <chris.wailes AT gmail.com>, charm <charm AT lists.cs.illinois.edu>
  • Subject: RE: [charm] Incorrect T-Dimension Size Information
  • Date: Mon, 26 Mar 2018 16:07:45 +0000

Chris,

I don't fully understand the scenario you are dealing with, in particular where the 128 comes from. If there are 2 Geminis per 3D coordinate, shouldn't T be at most 64?

In any case, I can give you some pointers on how the T value is calculated for XE6 and how you can change it.
The X, Y, and Z coordinates in the 3D torus are obtained via calls to the Cray rca library (the code that gets these values is in `src/util/topomanager/CrayNid.c`).

The T dimension is calculated in `src/util/topomanager/XTTorus.h` as `CmiNumCores() * CPU_FACTOR`.
`CmiNumCores` is defined in `src/ck-core/cputopology.C` and uses sysconf calls to determine the number of cores per host. I'm not sure exactly which sysconf calls determine the core count in your case, but you should be able to find out. You can also force your own value using the FORCECPUCOUNT environment variable.

`CPU_FACTOR` is set to 2 for XE6 in `XTTorus.h`. I assume the 2 comes from the fact that 2 Geminis make one node in the 3D topology.
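
The calculation described above can be sketched roughly as follows. This is a minimal illustration, not the actual Charm++ source: the names `kCpuFactor`, `cores_from_override`, `detect_cores`, and `t_dimension` are hypothetical, the choice of `_SC_NPROCESSORS_ONLN` is an assumption about which sysconf query is in play, and the handling of an invalid FORCECPUCOUNT value (falling back to the detected count) is also an assumption.

```cpp
#include <cstdlib>
#include <unistd.h>

// Hypothetical stand-in for CPU_FACTOR, which the email says is 2 for XE6.
constexpr int kCpuFactor = 2;

// Hypothetical helper: parse a core-count override string.
// Returns `fallback` when the value is missing, non-numeric, or not positive.
// (Assumption: the real runtime may treat invalid values differently.)
int cores_from_override(const char* value, int fallback) {
  if (value == nullptr) return fallback;
  char* end = nullptr;
  long n = std::strtol(value, &end, 10);
  if (end == value || n <= 0) return fallback;
  return static_cast<int>(n);
}

// Hypothetical approximation of a per-host core count: ask sysconf for the
// number of online processors, then let FORCECPUCOUNT override it.
int detect_cores() {
  long n = sysconf(_SC_NPROCESSORS_ONLN);
  int detected = (n > 0) ? static_cast<int>(n) : 1;
  return cores_from_override(std::getenv("FORCECPUCOUNT"), detected);
}

// T dimension size as described in the email: cores per host times the factor.
int t_dimension(int cores, int factor) {
  return cores * factor;
}
```

Calling `t_dimension(detect_cores(), kCpuFactor)` on a compute node, with and without FORCECPUCOUNT set, shows how the detected core count feeds into T. For instance, a detected count of 16 with a factor of 2 gives 32, and a count of 32 gives 64; whether either of these matches what happens on a given XE6 system would need to be checked there.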

-Juan


From: Chris Wailes [chris.wailes AT gmail.com]
Sent: Tuesday, March 20, 2018 9:37 AM
To: charm
Subject: [charm] Incorrect T-Dimension Size Information

I am attempting to use Charm++ on a Cray XE6 machine with 16-core AMD Abu Dhabi chips. The way this machine is set up, the job management system treats a single CPU as a node with 32 processing elements (16 physical cores / 32 logical cores).

I've been able to run programs from the test/ and examples/ directories using core counts from 1 to 128 (across 4 of the job manager's nodes). Unfortunately, the size of the T dimension as reported by the TopoManager is always 32, instead of the correct value of 128.

This seems to indicate that one of three things is happening:
  1. The part of Charm++ responsible for assigning jobs uses the correct T-dimension size, and there is simply a discrepancy between that value and the value reported by the TopoManager.

  2. The part of Charm++ responsible for assigning jobs also believes that the T dimension is only 32, and as a result work is only being allocated to the first 32 processing elements connected to the router. Everything works fine, but only a quarter of the available resources are being used.

  3. Different parts of the Charm++ runtime have different ideas of what the T-dimension size is. Given a chance, the runtime might try to assign a chare to a PE with a T-coordinate >= 32 (assuming 0-based indexing), causing a runtime error or exception, but I have been lucky enough not to encounter this yet.

My questions, then, are: which of these three scenarios is occurring, and how do I get the TopoManager to report the correct size for the T dimension?

- Chris



