
charm - Re: [charm] [ppl] current results on LLNL cluster

  • From: "Bennion, Brian" <Bennion1 AT llnl.gov>
  • To: Jim Phillips <jim AT ks.uiuc.edu>
  • Cc: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] [ppl] current results on LLNL cluster
  • Date: Thu, 8 Sep 2011 10:54:20 -0700
  • Accept-language: en-US
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Hello,

It seems that the fix Jim suggested below gets me different results.
However, if I am reading things right, the way charmrun works may not be
compatible with our job scheduling methods.

For example, to run jobs on the sierra cluster one logs into one of several
login nodes. For MPI-aware jobs, you submit with msub and the job contents
are sent to a free set of nodes that users cannot reach from the login nodes
until a job has been successfully launched; then the user can rsh to the
allocated nodes and monitor the jobs.

There is also a small portion of the cluster that is accessible for testing,
and one can send jobs to that partition via msub or directly via srun with
the -p pdebug option added.

I am using the srun command in my ibverbs testing.
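
For concreteness, the two submission paths look roughly like this (the job
script and program names are just placeholders, not the exact jobs from this
thread):

msub job_script.sh                  # batch path: Moab places the job on free compute nodes
srun -p pdebug -n 12 ./my_program   # interactive path: run directly on the debug partition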

When I look at the output below, I see an IP address that belongs to the
login node from which I launched the job, not one from the debug partition.
The loopback may be blocked on the login nodes and also on the pdebug nodes
(if the job would ever get sent there to run, which it isn't).

bennion1 281:~/g11Dir/NAMD_CVS-2011-08-29_Source/Linux-x86_64-ib-icc-sse3/tests>
../charmrun +p 12 ++verbose ++mpiexec ++remote-shell mympiexec /g/g14/bennion1/g11Dir/NAMD_CVS-2011-08-29_Source/Linux-x86_64-ib-icc-sse3/namd2 /g/g14/bennion1/g11Dir/NAMD_CVS-2011-08-29_Source/Linux-x86_64-ib-icc-sse3/test/apoa1.namd
Charmrun> charmrun started...
Charmrun> adding client 0: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 1: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 2: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 3: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 4: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 5: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 6: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 7: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 8: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 9: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 10: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 11: "127.0.0.1", IP:127.0.0.1
Charmrun> Charmrun = 192.168.117.53, port = 56774
Charmrun> IBVERBS version of charmrun
Charmrun> Sending "$CmiMyNode 192.168.117.53 56774 21872 0" to client 0.
Charmrun> find the node program "/g/g14/bennion1/g11Dir/NAMD_CVS-2011-08-29_Source/Linux-x86_64-ib-icc-sse3/namd2" at "/g/g11/petefred/NAMD_CVS-2011-08-29_Source/Linux-x86_64-ib-icc-sse3/tests" for 0.
Charmrun> Starting mympiexec ./charmrun.21872
Charmrun> mpiexec started
Charmrun> node programs all started
Charmrun> Waiting for 0-th client to connect.
This is dollar star -n 12 ./charmrun.21872
srun: Job is in held state, pending scheduler release
srun: job 1000962 queued and waiting for resources
srun: job 1000962 has been allocated resources
Charmrun remote shell(127.0.0.1.0)> remote responding...
Charmrun remote shell(127.0.0.1.0)> starting node-program...
Charmrun> Waiting for 1-th client to connect.
Charmrun> error 1 attaching to node:
Timeout waiting for node-program to connect
bennion1
282:~/g11Dir/NAMD_CVS-2011-08-29_Source/Linux-x86_64-ib-icc-sse3/tests>
Charmrun remote shell(127.0.0.1.0)> rsh phase successful.
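
For reference, a minimal sketch of what an srun-based mympiexec wrapper for
this setup might look like, combining the echo seen in the "This is dollar
star" line above with the /tmp workaround Jim suggests below; the -p pdebug
option and the /tmp/bennion1 prefix are assumptions taken from this thread,
and the exact wrapper used for the run above is not shown here:

#!/bin/csh
# Sketch only -- not the actual script used for the run above.
echo This is dollar star $*
# $3 is the charmrun-generated node-start script (./charmrun.21872 above);
# rewrite its /tmp/ paths to the per-user /tmp/bennion1/ area.
sed 's/\/tmp\//\/tmp\/bennion1\//g' $3 > $3.fixed
chmod +x $3.fixed
# $2 is the process count passed by charmrun (-n 12 above).
exec srun -p pdebug -n $2 $3.fixed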




________________________________________
From: Jim Phillips [jim AT ks.uiuc.edu]
Sent: Thursday, September 08, 2011 10:32 AM
To: Bennion, Brian
Cc: charm AT cs.uiuc.edu
Subject: Re: [ppl] [charm] current results on LLNL cluster

Also, you can use a trick like this in your mympiexec script to work
around the /tmp problem:

#!/bin/csh
sed 's/\/tmp\//\/tmp\/bennion1\//g' $3 > $3.fixed
chmod +x $3.fixed
exec ibrun $3.fixed


-Jim


On Thu, 8 Sep 2011, Bennion, Brian wrote:

>
>
>
> Hello,
>
> Compilation of all the suggested versions of charm++ have been completed.
> They were:
> mpi-linux-x86_64
> mpi-linux-x86_64 smp
> net-linux-x86_64 ibverbs
> net-linux-x86_64 ibverbs smp
>
> It appears that the first charm++ build is the best for namd2.8. I saw a
> 25% speedup on my regular jobs using the same processor counts. No extra
> tricks were employed in compiling charm++ or namd2.8.
>
> The ibverbs version did compile for charm++ and namd2.8 and was even able
> to start. However, sockets closed prematurely on some node and the whole
> thing burns to the ground.
>
> Two quick thoughts here. First, looking at the charm.XXXXX files that are
> created, they look for some /tmp/charm.XXXX directories that are not going
> to be available; the tmp filesystem goes by username, i.e. /tmp/bennion1/XXXXX.
> Second, I will check with our admins to see if using the IB layer is even
> permissible.
>
> Brian
>




