Re: [charm] [ppl] Cannot launch on stampede with >= 16k processes


  • From: Bilge Acun <acun2 AT illinois.edu>
  • To: Jim Phillips <jim AT ks.uiuc.edu>
  • Cc: Scott Field <sfield AT astro.cornell.edu>, "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] [ppl] Cannot launch on stampede with >= 16k processes
  • Date: Tue, 11 Nov 2014 12:51:34 -0600
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

You should use the InfiniBand (verbs) build of Charm++ on Stampede instead of the regular net build. Your build command should look like:

>> ./build charm++ verbs-linux-x86_64 --with-production -j8
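
The verbs machine layer still launches through charmrun, so once you rebuild, your existing invocation (./charmrun ./Evolve1DScalarWave +p16384 ++mpiexec ++remote-shell mympiexec) should work unchanged.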

On 11 November 2014 11:59, Jim Phillips <jim AT ks.uiuc.edu> wrote:

Here is my (possibly useless) suggestion:


Take a look at /home1/00288/tg455591/NAMD_scripts/runbatch_latest

I use:   ++scalable-start ++mpiexec ++remote-shell $SCRIPTDIR/mpiexec

where $SCRIPTDIR/mpiexec looks like:

#!/bin/csh -f

# charmrun invokes this wrapper as:  mpiexec -n <N> <node-program> <args...>
# ibrun takes its task count from the Slurm environment, so drop "-n <N>".
shift
shift

exec /usr/local/bin/ibrun $*


It looks like you're doing about the same thing.  I can't say I've ever
actually tried 16k processes, though.  At some point we switch to smp.

You may want to try launching an MPI hello world with ibrun at that scale
just to be sure it works at all.
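
A minimal sketch (the file name hello_mpi.c is just a name I picked):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* One line per rank; at 16k processes you should see 16384 of these. */
    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}

Compile it with the MVAPICH2 wrapper and launch it through ibrun inside the
same sbatch allocation:

>> mpicc hello_mpi.c -o hello_mpi
>> ibrun ./hello_mpi

If that hangs or drops ranks at 16384, the problem is below charmrun
entirely.  Good luck.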

Jim


On Tue, 11 Nov 2014, Scott Field wrote:

> Hi,
>
> I am running large jobs (as part of a scaling test) on Stampede. My
> complete module list is "TACC-paths, cluster-paths, cluster, cmake/2.8.9,
> mvapich2/1.9a2, Linux, xalt/0.4.4, TACC, gcc/4.7.1, mkl/13.0.2.146"
>
> and charm++ (the most recent version from git) has been built with
>
>>>> ./build charm++ net-linux-x86_64 --with-production -j8
>
> The scaling tests span 1 node (16 procs) up to 1024 nodes (16384 procs).
> When I hit 256 nodes charmrun starts reporting problems. Typically I
> execute 4 charmruns in a single sbatch submission. At 256 nodes the first
> one fails:
>
> TACC: Starting up job 4421563
> TACC: Setting up parallel environment for MVAPICH2+mpispawn.
> TACC: Starting parallel tasks...
> Charmrun> error 4466 attaching to node:
> Timeout waiting for node-program to connect
>
> while the next three succeed. At 16384 procs all 4 charmrun jobs fail with
> the same error (although the error number is different). My "base" command
> is
>
>>>> ./charmrun ./Evolve1DScalarWave +p16384 ++mpiexec ++remote-shell mympiexec
>
> where Evolve1DScalarWave is the executable and mympiexec is
>
> #!/bin/csh
> shift; shift; exec ibrun $*
>
> Finally, I've tried numerous possible combinations of the following command
> line options
>
> ++scalable-start
> ++timeout XXX
> ++batch YYY
>
> where XXX is one of 60, 100, or 1000 and YYY is one of 10, 64, or 128. None
> of these worked. Using only ++scalable-start I get a slightly modified error
> message:
>
>
> Charmrun> error 93523 attaching to node:
> Error in accept.
>
>
> With all three options enabled and a large timeout, I get about 200,000
> lines from charmrun repeating the same messages over and over (with
> different numbers):
>
> Charmrun remote shell(127.0.0.1.0)> remote responding...
> Charmrun remote shell(127.0.0.1.0)> starting node-program...
> Charmrun remote shell(127.0.0.1.0)> rsh phase successful.
> Charmrun> Waiting for 16305-th client to connect.
> Charmrun> client 6763 connected (IP=129.114.77.58 data_port=42030)
> Charmrun> adding client 12805: "127.0.0.1", IP:127.0.0.1
>
> until finally the job fails with
>
> [c401-004.stampede.tacc.utexas.edu:mpispawn_1][spawn_processes] Failed to
> execvp() 'sfield':  (2)
>
> Sometimes this last message is seen immediately after "TACC: Starting
> parallel tasks...".
>
> I've been able to reproduce this problem with the jacobi2D charm++ example.
>
> Any help or suggestions would be greatly appreciated!
>
> Best,
> Scott
>
_______________________________________________
charm mailing list
charm AT cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/charm



--
Bilge Acun
PhD Candidate at University of Illinois at Urbana-Champaign
Computer Science Department


