
charm - [charm] Cannot launch on stampede with = 16k processes

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Scott Field <sfield AT astro.cornell.edu>
  • To: charm AT cs.uiuc.edu
  • Subject: [charm] Cannot launch on stampede with = 16k processes
  • Date: Tue, 11 Nov 2014 12:19:31 -0500
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Hi,

I am running large jobs (as part of a scaling test) on Stampede. My complete module list is "TACC-paths, cluster-paths, cluster, cmake/2.8.9, mvapich2/1.9a2, Linux, xalt/0.4.4, TACC, gcc/4.7.1, mkl/13.0.2.146"

and charm++ (the most recent version from git) has been built with

>>> ./build charm++ net-linux-x86_64 --with-production -j8

The scaling tests span 1 node (16 procs) up to 1024 nodes (16384 procs). When I hit 256 nodes, charmrun starts reporting problems. I typically execute 4 charmruns in a single sbatch submission. At 256 nodes (4096 procs) the first one fails:

TACC: Starting up job 4421563
TACC: Setting up parallel environment for MVAPICH2+mpispawn.
TACC: Starting parallel tasks...
Charmrun> error 4466 attaching to node:
Timeout waiting for node-program to connect

while the next three succeed. At 16384 procs all 4 charmrun jobs fail with the same error (although the error number is different). My "base" command is

>>> ./charmrun ./Evolve1DScalarWave +p16384 ++mpiexec ++remote-shell mympiexec

where Evolve1DScalarWave is the executable and mympiexec is

#!/bin/csh
# Drop the first two arguments passed in by charmrun, then hand the rest to ibrun.
shift; shift; exec ibrun $*

Finally, I've tried numerous possible combinations of the following command line options

++scalable-start
++timeout XXX
++batch YYY

where XXX is one of 60, 100, or 1000 and YYY is one of 10, 64, or 128. None of these worked. Using only ++scalable-start, I get a slightly different error message:


Charmrun> error 93523 attaching to node:
Error in accept.
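
For reference, the option combinations described above can be enumerated with a short Python sketch. This is only an illustration: the helper names `base_command` and `option_sweep` are hypothetical, and it assumes 16 procs per node and that each timeout/batch pair is appended to the base command together with ++scalable-start. It prints command lines and does not launch anything.

```python
from itertools import product

PROCS_PER_NODE = 16  # from the post: 1 node = 16 procs on Stampede


def base_command(nodes, exe="./Evolve1DScalarWave", wrapper="mympiexec"):
    """Build the base charmrun command line (as a list) for a node count."""
    procs = nodes * PROCS_PER_NODE
    return ["./charmrun", exe, f"+p{procs}",
            "++mpiexec", "++remote-shell", wrapper]


def option_sweep(nodes, timeouts=(60, 100, 1000), batches=(10, 64, 128)):
    """Yield one command per (timeout, batch) pair, all with ++scalable-start.

    Assumption: the options are simply appended to the base command.
    """
    for timeout, batch in product(timeouts, batches):
        yield base_command(nodes) + [
            "++scalable-start",
            "++timeout", str(timeout),
            "++batch", str(batch),
        ]


if __name__ == "__main__":
    # 1024 nodes corresponds to the 16384-process case.
    for cmd in option_sweep(1024):
        print(" ".join(cmd))
```

With the default values this yields nine command lines for the 1024-node (16384-process) case.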


With all three options enabled and a large timeout, I get about 200,000 lines of charmrun output repeating these same lines over and over (with different numbers):

Charmrun remote shell(127.0.0.1.0)> remote responding...
Charmrun remote shell(127.0.0.1.0)> starting node-program...
Charmrun remote shell(127.0.0.1.0)> rsh phase successful.
Charmrun> Waiting for 16305-th client to connect.
Charmrun> client 6763 connected (IP=129.114.77.58 data_port=42030)
Charmrun> adding client 12805: "127.0.0.1", IP:127.0.0.1

until finally the job fails with

[c401-004.stampede.tacc.utexas.edu:mpispawn_1][spawn_processes] Failed to execvp() 'sfield':  (2)

Sometimes this last message is seen immediately after "TACC: Starting parallel tasks...".

I've been able to reproduce this problem with the jacobi2D charm++ example.

Any help or suggestions would be greatly appreciated!

Best,
Scott


