
charm - [charm] Charm++ SMP runtime problem

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Marius Micluta <marius AT biochim.ro>
  • To: charm AT cs.uiuc.edu
  • Subject: [charm] Charm++ SMP runtime problem
  • Date: Mon, 1 Jun 2009 14:15:57 +0300 (EEST)
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>


Dear CHARM developers,

I get erratic failures when launching NAMD with charmrun from the net-linux-x86_64-smp-icc build. On a small HPC cluster (4 compute nodes, each with 2 quad-core Xeon CPUs, Gigabit Ethernet interconnect), I found the SMP version of charm++/NAMD to be 25-30% faster than the precompiled version and much faster (2-3 times) than the MPI and TCP builds.

When started with the charmrun command line options ++p 32 ++ppn 8, NAMD sometimes runs fine, but in many cases charmrun freezes after displaying "cpu topology info is being gathered!". The namd2 processes are launched on all four compute nodes, but while on two or three nodes the CPU load approaches 100% on all 8 cores, as in a normal run, on the remaining node(s) the CPU load is near zero.
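For reference, the process layout implied by these options can be worked out with a few lines of Python. This is only a sketch under stated assumptions: that ++p gives the total number of worker threads (PEs), that ++ppn gives the number of worker threads per namd2 process, and that charmrun spreads the processes evenly over the nodes. The function name smp_layout and its parameters are made up for illustration and are not part of Charm++.

```python
# Sketch: process layout implied by "++p 32 ++ppn 8" on the cluster described above.
# Assumptions (not confirmed by the post): ++p is the total number of worker
# threads (PEs), ++ppn is the number of worker threads per OS process, and
# processes are spread evenly over the nodes.

def smp_layout(total_pes, pes_per_process, nodes, cores_per_node):
    """Return (processes, processes_per_node, worker_threads_per_node, fits)."""
    processes, leftover = divmod(total_pes, pes_per_process)
    if leftover:
        raise ValueError("total PEs should be a multiple of PEs per process")
    # Ceiling division: the busiest node gets this many processes.
    processes_per_node = -(-processes // nodes)
    worker_threads_per_node = processes_per_node * pes_per_process
    # Only worker threads are counted here; any extra runtime threads are ignored.
    fits = worker_threads_per_node <= cores_per_node
    return processes, processes_per_node, worker_threads_per_node, fits

# 4 nodes, each with 2 quad-core CPUs = 8 cores per node.
layout = smp_layout(total_pes=32, pes_per_process=8, nodes=4, cores_per_node=8)
print(layout)
```

Under these assumptions the options correspond to 4 namd2 processes, one per node, each with 8 worker threads, which matches the 8 fully loaded cores seen on the nodes that run normally.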

I also tried to run charmdebug, but it fails. When I run the command configured by charmdebug directly from the command line (charmrun +pN /usr/local/NAMD/namd2 apoa1.namd ++server +cpd +DebugDisplay localhost:10.0), I get a curious error at the same point where the run freezes:

Charm++> synchronizing isomalloc memory region...
CPD: Frozen processor N+1
[0] consolidated Isomalloc memory region: 0x2aaaab59b000 - 0x7a351707da18 (83404474 megs)
Charm++> cpu topology info is being gathered!
CPD: Frozen processor N
CPD: Signal received on processor N: 11
CPD: Frozen processor N
------------- Processor N Exiting: Caught Signal ------------
Signal: segmentation violation,

no matter what value between 1 and 32 I choose for N. When N=32, I get many "CPD: Frozen processor" lines, far more than the number of processors that actually exist in the cluster. I also tried specifying the .nodelist file with the ++nodelist parameter, as well as other command line options, but to no effect.
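As a sanity check on the log above, the size that Charm++ reports for the consolidated Isomalloc region can be recomputed from the two addresses it prints. The short Python sketch below does only that arithmetic; the variable names are illustrative, and it assumes "megs" means units of 1024 * 1024 bytes.

```python
# Recompute the Isomalloc region size from the addresses printed in the log.
# Assumption: "megs" in the log means units of 1024 * 1024 bytes.

region_start = 0x2aaaab59b000
region_end = 0x7a351707da18

size_bytes = region_end - region_start
size_megs = size_bytes // (1024 * 1024)   # whole megabytes, rounded down
size_tib = size_bytes / 2**40             # same size expressed in TiB

print(size_megs)
print(round(size_tib, 1))
```

The result agrees with the 83404474 megs in the log, so the figure itself is internally consistent. At roughly 79.5 TiB it is far larger than any physical memory in the cluster, so it presumably describes a range of virtual addresses rather than allocated memory.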

I compiled charm++ (version 6.1, packaged with the NAMD-2.7b1 source) both with an older Intel compiler (10.1.011) and with the latest version from Intel (11.0.083), and with various optimization levels from -O0 to -O3, but the hang still occurs sporadically.


Best regards,

Marius Micluta

--

Marius Micluta

Institute of Biochemistry of the Romanian Academy
Tel: +40 21 223 90 69
Fax: +40 21 223 90 68



