Re: [charm] [ppl] Cannot launch on stampede with >= 16k processes


  • From: Jim Phillips <jim AT ks.uiuc.edu>
  • To: Scott Field <sfield AT astro.cornell.edu>
  • Cc: charm AT cs.uiuc.edu
  • Subject: Re: [charm] [ppl] Cannot launch on stampede with >= 16k processes
  • Date: Tue, 11 Nov 2014 11:59:41 -0600 (CST)
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>


Here is my (possibly useless) suggestion:


Take a look at /home1/00288/tg455591/NAMD_scripts/runbatch_latest

I use: ++scalable-start ++mpiexec ++remote-shell $SCRIPTDIR/mpiexec

where $SCRIPTDIR/mpiexec looks like:

#!/bin/csh -f

# drop -n <N>
shift
shift

exec /usr/local/bin/ibrun $*
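
As far as I understand it, charmrun's ++mpiexec mode hands the ++remote-shell
program a leading "-n <N>" (hence the two shifts), and ibrun takes its task
count from the batch environment instead. A slightly more defensive sketch of
the same wrapper, in case the argument list ever changes, would be:

#!/bin/csh -f
# drop a leading "-n <N>" only if charmrun actually passed one;
# ibrun picks up its task count from the batch environment anyway
if ("$1" == "-n") then
  shift
  shift
endif
exec /usr/local/bin/ibrun $*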


It looks like you're doing about the same thing. I can't say I've ever actually tried 16k processes, though. At some point we switch to smp builds.

You may want to try launching an MPI hello world with ibrun at that scale just to be sure it works at all. Good luck.
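
In case it's useful, here is a rough sketch of such a sanity check as a batch
script. The job name, node/task counts, and time limit are just placeholders
matching your scaling runs, you may need to add your site's queue settings,
and the embedded C is nothing more than a minimal MPI hello world:

#!/bin/bash
#SBATCH -J hello_scale
#SBATCH -N 1024
#SBATCH -n 16384
#SBATCH -t 00:10:00

# write, build, and launch a trivial MPI program with ibrun
cat > hello_mpi.c <<'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) printf("MPI came up with %d ranks\n", size);
    MPI_Finalize();
    return 0;
}
EOF
mpicc -O2 -o hello_mpi hello_mpi.c
ibrun ./hello_mpi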

Jim


On Tue, 11 Nov 2014, Scott Field wrote:

Hi,

I am running large jobs (as part of a scaling test) on stampede. My complete
module list is "TACC-paths, cluster-paths, cluster, cmake/2.8.9, mvapich2/1.9a2,
Linux, xalt/0.4.4, TACC, gcc/4.7.1, mkl/13.0.2.146"

and charm++ (the most recent version from git) has been built with

./build charm++ net-linux-x86_64 --with-production -j8

The scaling tests span 1 node (16 procs) up to 1024 nodes (16384 procs).
When I hit 256 nodes charmrun starts reporting problems. Typically I execute
4 charmruns in a single sbatch submission. At 256 nodes the first one fails:

TACC: Starting up job 4421563
TACC: Setting up parallel environment for MVAPICH2+mpispawn.
TACC: Starting parallel tasks...
Charmrun> error 4466 attaching to node:
Timeout waiting for node-program to connect

while the next three succeed. At 16384 procs all 4 charmrun jobs fail with
the same error (although the error number is different). My "base" command
is

./charmrun ./Evolve1DScalarWave +p16384 ++mpiexec ++remote-shell mympiexec

where Evolve1DScalarWave is the executable and mympiexec is

#!/bin/csh
shift; shift; exec ibrun $*

Finally, I've tried numerous combinations of the following command-line
options:

++scalable-start
++timeout XXX
++batch YYY

where XXX is one of 60, 100, or 1000 and YYY is one of 10, 64, or 128. None
of these worked. Using only ++scalable-start I get a slightly modified error
message:


Charmrun> error 93523 attaching to node:
Error in accept.
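
For reference, a typical fully combined attempt looks something like the
following (the exact ++batch and ++timeout values shown are just one of the
combinations from the ranges above):

./charmrun ./Evolve1DScalarWave +p16384 ++scalable-start ++batch 64 ++timeout 1000 ++mpiexec ++remote-shell mympiexec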


With all three options enabled and a large timeout I get about 200,000 lines
of charmrun output repeating the same messages over and over (with different
numbers):

Charmrun remote shell(127.0.0.1.0)> remote responding...
Charmrun remote shell(127.0.0.1.0)> starting node-program...
Charmrun remote shell(127.0.0.1.0)> rsh phase successful.
Charmrun> Waiting for 16305-th client to connect.
Charmrun> client 6763 connected (IP=129.114.77.58 data_port=42030)
Charmrun> adding client 12805: "127.0.0.1", IP:127.0.0.1

until finally the job fails with

[c401-004.stampede.tacc.utexas.edu:mpispawn_1][spawn_processes] Failed to
execvp() 'sfield': (2)

Sometimes this last message is seen immediately after "TACC: Starting
parallel tasks...".

I've been able to reproduce this problem with the jacobi2D charm++ example.

Any help or suggestions would be greatly appreciated!

Best,
Scott




