charm - Re: [charm] Launching AMPI application on IB cluster

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Sam White <white67 AT illinois.edu>
  • To: Maksym Planeta <mplaneta AT os.inf.tu-dresden.de>
  • Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Launching AMPI application on IB cluster
  • Date: Wed, 2 Nov 2016 09:28:15 -0500

You should be able to make a shell script with your srun command (and any options to it) like the following:

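#!/bin/csh -f
# Drop the first argument that charmrun passes (the process-count flag).
shift
# Call srun itself, not this script; extra srun options go before -n.
exec srun -n $*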

Here the shift command drops the '-np' argument. Save the script as, say, mysrun and make it executable; then, when you run your job, use it in place of srun, as in '++mpiexec ++remote-shell ./mysrun'.
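
For anyone who prefers /bin/sh, the argument handling can be checked with a dry run before pointing charmrun at the wrapper. This is only a sketch: build_cmd and EXTRA_SRUN_OPTS are hypothetical names made up for the illustration, and it assumes charmrun calls the wrapper with the process-count flag first, followed by the count, the program, and the program's arguments. It prints the command line instead of running it.

```shell
#!/bin/sh
# Dry-run sketch of a wrapper in the spirit of the script above.
# build_cmd (hypothetical name) prints the srun command line it would exec.
# Assumption: the wrapper is called as '<count-flag> <count> <program> <args...>'.
build_cmd() {
    shift    # drop the process-count flag passed in by the caller
    # EXTRA_SRUN_OPTS is a hypothetical variable holding extra srun options;
    # it is left unquoted on purpose so that it splits into separate words.
    echo srun ${EXTRA_SRUN_OPTS:-} -n "$@"
}

build_cmd -np 8 ./bin/CoMD-ampi +vp8 -i 2 -j 2 -k 2
# prints: srun -n 8 ./bin/CoMD-ampi +vp8 -i 2 -j 2 -k 2
```

In a real wrapper, the echo line would become 'exec srun ... "$@"' once the printed command looks right.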

-Sam

On Wed, Nov 2, 2016 at 9:13 AM, Maksym Planeta <mplaneta AT os.inf.tu-dresden.de> wrote:
Thank you.

I have srun and used the following command:

./bin/charmrun ++mpiexec ++remote-shell srun +p8  ./bin/CoMD-ampi   +vp8  -i 2 -j 2 -k 2 ++verbose

This way it worked out.

Sorry, I didn't manage to find this quickly: is there a way to pass srun command-line arguments this way?

If I write, for example, ++remote-shell "srun -w <hostfile>", the contents of the hostfile seem to be ignored. I can pass the hostfile via SLURM_HOSTFILE, but I'd be happy to know if there is a generic way to pass the arguments.

On 11/02/2016 02:51 PM, Sam White wrote:
> Charm++/AMPI programs can be launched using a cluster's native job
> launcher (aprun, srun, mpiexec, etc.), and using one of those is often
> necessary or at least much easier than using a nodelist. If your cluster
> allows launching jobs with mpiexec, you can just add ++mpiexec in place
> of ++local in your charmrun command, as in the following:
>
> ./bin/charmrun ++mpiexec +p8 ./bin/CoMD-ampi +vp8 -i 2 -j 2 -k 2
>
> Let us know if that doesn't work or if your system uses another job
> launcher,
> Sam
>
>
> On Wed, Nov 2, 2016 at 8:42 AM, Maksym Planeta
> <mplaneta AT os.inf.tu-dresden.de> wrote:
>
>     Dear Charm++ group,
>
>     I'm trying to run an AMPI application, namely CoMD, on an InfiniBand
>     machine, but it fails to launch.
>
>     At the same time, I managed to launch the same application on a
>     smaller machine, where I used the same command to launch the program.
>
>     In both cases I use an identical command to build AMPI:
>
>     ./build AMPI  verbs-linux-x86_64 gfortran gcc --with-production
>
>     Also, if I add the ++local flag on the machine where execution fails,
>     the program launches.
>
>     Below I show the output of two executions: the first is from
>     the machine where execution fails, and the second is from the
>     machine where execution succeeds.
>
>     While the launch process hangs at "Waiting for 0-th client to
>     connect", I log in to a remote node expecting to see some
>     processes appear in htop, but I see none. Some processes may
>     appear for a short time without htop capturing them.
>     Nevertheless, no long-lived process is launched on a remote node.
>
>     Could you please help me diagnose the problem?
>
>     Here is an example of a run that fails:
>
>     $ ./bin/charmrun ++usehostname +p8 ./bin/CoMD-ampi  ++nodelist
>     /home/<user>/hostfiles/charmhost +vp8  -i 2 -j 2 -k 2 ++verbose
>     Charmrun> scalable start enabled.
>     Charmrun> charmrun started...
>     Charmrun> using /home/<user>/hostfiles/charmhost as nodesfile
>     Charmrun> adding client 0: "clusterB6549", IP:172.24.46.252
>     Charmrun> adding client 1: "clusterB6552", IP:172.24.46.255
>     Charmrun> adding client 2: "clusterB6553", IP:172.24.47.0
>     Charmrun> adding client 3: "clusterB6554", IP:172.24.47.1
>     Charmrun> adding client 4: "clusterB6555", IP:172.24.47.2
>     Charmrun> adding client 5: "clusterB6556", IP:172.24.47.3
>     Charmrun> adding client 6: "clusterB6557", IP:172.24.47.4
>     Charmrun> adding client 7: "clusterB6558", IP:172.24.47.5
>     Charmrun> Charmrun = clusterB6549, port = 51276
>     Charmrun> IBVERBS version of charmrun
>     start_nodes_ssh
>     Charmrun> Sending "0 clusterB6549 51276 18085 0" to client 0.
>     Charmrun> find the node program "<dir>/CoMD-1.1/./bin/CoMD-ampi" at
>     "<dir>/CoMD-1.1" for 0.
>     Charmrun> Starting ssh clusterB6549 -l <user> -o
>     KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
>     NoHostAuthenticationForLocalhost=yes /bin/bash -f
>     Charmrun> remote shell (clusterB6549:0) started
>     Charmrun> Sending "1 clusterB6549 51276 18085 0" to client 1.
>     Charmrun> find the node program "<dir>/CoMD-1.1/./bin/CoMD-ampi" at
>     "<dir>/CoMD-1.1" for 1.
>     Charmrun> Starting ssh clusterB6552 -l <user> -o
>     KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
>     NoHostAuthenticationForLocalhost=yes /bin/bash -f
>     Charmrun> remote shell (clusterB6552:1) started
>     Charmrun> Sending "2 clusterB6549 51276 18085 0" to client 2.
>     Charmrun> find the node program "<dir>/CoMD-1.1/./bin/CoMD-ampi" at
>     "<dir>/CoMD-1.1" for 2.
>     Charmrun> Starting ssh clusterB6553 -l <user> -o
>     KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
>     NoHostAuthenticationForLocalhost=yes /bin/bash -f
>     Charmrun> remote shell (clusterB6553:2) started
>     Charmrun> Sending "3 clusterB6549 51276 18085 0" to client 3.
>     Charmrun> find the node program "<dir>/CoMD-1.1/./bin/CoMD-ampi" at
>     "<dir>/CoMD-1.1" for 3.
>     Charmrun> Starting ssh clusterB6554 -l <user> -o
>     KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
>     NoHostAuthenticationForLocalhost=yes /bin/bash -f
>     Charmrun> remote shell (clusterB6554:3) started
>     Charmrun> Sending "4 clusterB6549 51276 18085 0" to client 4.
>     Charmrun> find the node program "<dir>/CoMD-1.1/./bin/CoMD-ampi" at
>     "<dir>/CoMD-1.1" for 4.
>     Charmrun> Starting ssh clusterB6555 -l <user> -o
>     KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
>     NoHostAuthenticationForLocalhost=yes /bin/bash -f
>     Charmrun> remote shell (clusterB6555:4) started
>     Charmrun> Sending "5 clusterB6549 51276 18085 0" to client 5.
>     Charmrun> find the node program "<dir>/CoMD-1.1/./bin/CoMD-ampi" at
>     "<dir>/CoMD-1.1" for 5.
>     Charmrun> Starting ssh clusterB6556 -l <user> -o
>     KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
>     NoHostAuthenticationForLocalhost=yes /bin/bash -f
>     Charmrun> remote shell (clusterB6556:5) started
>     Charmrun> Sending "6 clusterB6549 51276 18085 0" to client 6.
>     Charmrun> find the node program "<dir>/CoMD-1.1/./bin/CoMD-ampi" at
>     "<dir>/CoMD-1.1" for 6.
>     Charmrun> Starting ssh clusterB6557 -l <user> -o
>     KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
>     NoHostAuthenticationForLocalhost=yes /bin/bash -f
>     Charmrun> remote shell (clusterB6557:6) started
>     Charmrun> Sending "7 clusterB6549 51276 18085 0" to client 7.
>     Charmrun> find the node program "<dir>/CoMD-1.1/./bin/CoMD-ampi" at
>     "<dir>/CoMD-1.1" for 7.
>     Charmrun> Starting ssh clusterB6558 -l <user> -o
>     KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
>     NoHostAuthenticationForLocalhost=yes /bin/bash -f
>     Charmrun> remote shell (clusterB6558:7) started
>     Charmrun> node programs all started
>     Charmrun remote shell(clusterB6549.0)> remote responding...
>     Charmrun remote shell(clusterB6549.0)> starting node-program...
>     Charmrun remote shell(clusterB6549.0)> remote shell phase successful.
>     Charmrun remote shell(clusterB6558.7)> remote responding...
>     Charmrun remote shell(clusterB6558.7)> starting node-program...
>     Charmrun remote shell(clusterB6558.7)> remote shell phase successful.
>     Charmrun remote shell(clusterB6555.4)> remote responding...
>     Charmrun remote shell(clusterB6555.4)> starting node-program...
>     Charmrun remote shell(clusterB6555.4)> remote shell phase successful.
>     Charmrun remote shell(clusterB6556.5)> remote responding...
>     Charmrun remote shell(clusterB6553.2)> remote responding...
>     Charmrun remote shell(clusterB6557.6)> remote responding...
>     Charmrun remote shell(clusterB6552.1)> remote responding...
>     Charmrun remote shell(clusterB6556.5)> starting node-program...
>     Charmrun remote shell(clusterB6556.5)> remote shell phase successful.
>     Charmrun remote shell(clusterB6557.6)> starting node-program...
>     Charmrun remote shell(clusterB6553.2)> starting node-program...
>     Charmrun remote shell(clusterB6554.3)> remote responding...
>     Charmrun remote shell(clusterB6552.1)> starting node-program...
>     Charmrun remote shell(clusterB6557.6)> remote shell phase successful.
>     Charmrun remote shell(clusterB6553.2)> remote shell phase successful.
>     Charmrun remote shell(clusterB6552.1)> remote shell phase successful.
>     Charmrun remote shell(clusterB6554.3)> starting node-program...
>     Charmrun remote shell(clusterB6554.3)> remote shell phase successful.
>     Charmrun> Waiting for 0-th client to connect.
>     Charmrun> error attaching to node 'clusterB6549':
>     Timeout waiting for node-program to connect
>
>     Execution on the second machine.
>
>     $ ./bin/charmrun +p8 ./bin/CoMD-ampi  ++nodelist ~/charmhosts
>     ++verbose  +vp8 -i 2 -j 2 -k 2
>     Charmrun> scalable start enabled.
>     Charmrun> charmrun started...
>     Charmrun> using /home/<user>/charmhosts as nodesfile
>     Charmrun> adding client 0: "machineA-n1", IP:141.76.48.45
>     Charmrun> adding client 1: "machineA-n1", IP:141.76.48.45
>     Charmrun> adding client 2: "machineA-n2", IP:141.76.48.46
>     Charmrun> adding client 3: "machineA-n2", IP:141.76.48.46
>     Charmrun> adding client 4: "machineA-n3", IP:141.76.48.47
>     Charmrun> adding client 5: "machineA-n3", IP:141.76.48.47
>     Charmrun> adding client 6: "machineA-n4", IP:141.76.48.48
>     Charmrun> adding client 7: "machineA-n4", IP:141.76.48.48
>     Charmrun> Charmrun = 141.76.48.45, port = 34881
>     Charmrun> IBVERBS version of charmrun
>     start_nodes_ssh
>     Charmrun> Sending "0 141.76.48.45 34881 422 0" to client 0.
>     Charmrun> find the node program
>     "/home/<user>/ampi/CoMD-1.1/./bin/CoMD-ampi" at
>     "/home/<user>/ampi/CoMD-1.1" for 0.
>     Charmrun> Starting ssh machineA-n1 -l <user> -o
>     KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
>     NoHostAuthenticationForLocalhost=yes /bin/bash -f
>     Charmrun> remote shell (machineA-n1:0) started
>     Charmrun> Sending "2 141.76.48.45 34881 422 0" to client 2.
>     Charmrun> find the node program
>     "/home/<user>/ampi/CoMD-1.1/./bin/CoMD-ampi" at
>     "/home/<user>/ampi/CoMD-1.1" for 2.
>     Charmrun> Starting ssh machineA-n2 -l <user> -o
>     KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
>     NoHostAuthenticationForLocalhost=yes /bin/bash -f
>     Charmrun> remote shell (machineA-n2:2) started
>     Charmrun> Sending "4 141.76.48.45 34881 422 0" to client 4.
>     Charmrun> find the node program
>     "/home/<user>/ampi/CoMD-1.1/./bin/CoMD-ampi" at
>     "/home/<user>/ampi/CoMD-1.1" for 4.
>     Charmrun> Starting ssh machineA-n3 -l <user> -o
>     KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
>     NoHostAuthenticationForLocalhost=yes /bin/bash -f
>     Charmrun> remote shell (machineA-n3:4) started
>     Charmrun> Sending "6 141.76.48.45 34881 422 0" to client 6.
>     Charmrun> find the node program
>     "/home/<user>/ampi/CoMD-1.1/./bin/CoMD-ampi" at
>     "/home/<user>/ampi/CoMD-1.1" for 6.
>     Charmrun> Starting ssh machineA-n4 -l <user> -o
>     KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
>     NoHostAuthenticationForLocalhost=yes /bin/bash -f
>     Charmrun> remote shell (machineA-n4:6) started
>     Charmrun> node programs all started
>     Charmrun remote shell(machineA-n2.2)> remote responding...
>     Charmrun remote shell(machineA-n3.4)> remote responding...
>     Charmrun remote shell(machineA-n1.0)> remote responding...
>     Charmrun remote shell(machineA-n2.2)> starting node-program...
>     Charmrun remote shell(machineA-n2.2)> remote shell phase successful.
>     Charmrun remote shell(machineA-n3.4)> starting node-program...
>     Charmrun remote shell(machineA-n4.6)> remote responding...
>     Charmrun remote shell(machineA-n3.4)> remote shell phase successful.
>     Charmrun remote shell(machineA-n1.0)> starting node-program...
>     Charmrun remote shell(machineA-n1.0)> remote shell phase successful.
>     Charmrun remote shell(machineA-n4.6)> starting node-program...
>     Charmrun remote shell(machineA-n4.6)> remote shell phase successful.
>     Charmrun> Waiting for 0-th client to connect.
>     Charmrun> Waiting for 1-th client to connect.
>     Charmrun> Waiting for 2-th client to connect.
>     Charmrun> Waiting for 3-th client to connect.
>     Charmrun> Waiting for 4-th client to connect.
>     Charmrun> Waiting for 5-th client to connect.
>     Charmrun> Waiting for 6-th client to connect.
>     Charmrun> Waiting for 7-th client to connect.
>     Charmrun> All clients connected.
>     <Program continues running...>
>
>     --
>     Regards,
>     Maksym Planeta
>
>

--
Regards,
Maksym Planeta




