
charm - [charm] Launching AMPI application on IB cluster

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Maksym Planeta <mplaneta AT os.inf.tu-dresden.de>
  • To: charm <charm AT lists.cs.illinois.edu>
  • Subject: [charm] Launching AMPI application on IB cluster
  • Date: Wed, 2 Nov 2016 14:42:43 +0100

Dear Charm++ group,

I'm trying to run an AMPI application, namely CoMD, on an InfiniBand machine,
but it fails to launch.

However, I managed to launch the same application on a smaller machine,
using the same launch command.

In both cases I use the identical command to build AMPI:

./build AMPI verbs-linux-x86_64 gfortran gcc --with-production

Also, if I add the ++local flag on the machine where execution fails, the
program launches.

Below I show the output of the two executions: the first is from the machine
where execution fails, and the second is from the machine where execution
succeeds.

While the launch process hangs at "Waiting for 0-th client to connect", I log
in to a remote node, expecting to see some processes appear in htop, but I see
none. Some processes may appear for a short period of time without htop
capturing them. Nevertheless, no long-lived process is launched on the remote
node.

Could you please help me to diagnose the problem?

Here is an example of a run that fails:

$ ./bin/charmrun ++usehostname +p8 ./bin/CoMD-ampi ++nodelist
/home/<user>/hostfiles/charmhost +vp8 -i 2 -j 2 -k 2 ++verbose
Charmrun> scalable start enabled.
Charmrun> charmrun started...
Charmrun> using /home/<user>/hostfiles/charmhost as nodesfile
Charmrun> adding client 0: "clusterB6549", IP:172.24.46.252
Charmrun> adding client 1: "clusterB6552", IP:172.24.46.255
Charmrun> adding client 2: "clusterB6553", IP:172.24.47.0
Charmrun> adding client 3: "clusterB6554", IP:172.24.47.1
Charmrun> adding client 4: "clusterB6555", IP:172.24.47.2
Charmrun> adding client 5: "clusterB6556", IP:172.24.47.3
Charmrun> adding client 6: "clusterB6557", IP:172.24.47.4
Charmrun> adding client 7: "clusterB6558", IP:172.24.47.5
Charmrun> Charmrun = clusterB6549, port = 51276
Charmrun> IBVERBS version of charmrun
start_nodes_ssh
Charmrun> Sending "0 clusterB6549 51276 18085 0" to client 0.
Charmrun> find the node program "<dir>/CoMD-1.1/./bin/CoMD-ampi" at
"<dir>/CoMD-1.1" for 0.
Charmrun> Starting ssh clusterB6549 -l <user> -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (clusterB6549:0) started
Charmrun> Sending "1 clusterB6549 51276 18085 0" to client 1.
Charmrun> find the node program "<dir>/CoMD-1.1/./bin/CoMD-ampi" at
"<dir>/CoMD-1.1" for 1.
Charmrun> Starting ssh clusterB6552 -l <user> -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (clusterB6552:1) started
Charmrun> Sending "2 clusterB6549 51276 18085 0" to client 2.
Charmrun> find the node program "<dir>/CoMD-1.1/./bin/CoMD-ampi" at
"<dir>/CoMD-1.1" for 2.
Charmrun> Starting ssh clusterB6553 -l <user> -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (clusterB6553:2) started
Charmrun> Sending "3 clusterB6549 51276 18085 0" to client 3.
Charmrun> find the node program "<dir>/CoMD-1.1/./bin/CoMD-ampi" at
"<dir>/CoMD-1.1" for 3.
Charmrun> Starting ssh clusterB6554 -l <user> -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (clusterB6554:3) started
Charmrun> Sending "4 clusterB6549 51276 18085 0" to client 4.
Charmrun> find the node program "<dir>/CoMD-1.1/./bin/CoMD-ampi" at
"<dir>/CoMD-1.1" for 4.
Charmrun> Starting ssh clusterB6555 -l <user> -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (clusterB6555:4) started
Charmrun> Sending "5 clusterB6549 51276 18085 0" to client 5.
Charmrun> find the node program "<dir>/CoMD-1.1/./bin/CoMD-ampi" at
"<dir>/CoMD-1.1" for 5.
Charmrun> Starting ssh clusterB6556 -l <user> -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (clusterB6556:5) started
Charmrun> Sending "6 clusterB6549 51276 18085 0" to client 6.
Charmrun> find the node program "<dir>/CoMD-1.1/./bin/CoMD-ampi" at
"<dir>/CoMD-1.1" for 6.
Charmrun> Starting ssh clusterB6557 -l <user> -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (clusterB6557:6) started
Charmrun> Sending "7 clusterB6549 51276 18085 0" to client 7.
Charmrun> find the node program "<dir>/CoMD-1.1/./bin/CoMD-ampi" at
"<dir>/CoMD-1.1" for 7.
Charmrun> Starting ssh clusterB6558 -l <user> -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (clusterB6558:7) started
Charmrun> node programs all started
Charmrun remote shell(clusterB6549.0)> remote responding...
Charmrun remote shell(clusterB6549.0)> starting node-program...
Charmrun remote shell(clusterB6549.0)> remote shell phase successful.
Charmrun remote shell(clusterB6558.7)> remote responding...
Charmrun remote shell(clusterB6558.7)> starting node-program...
Charmrun remote shell(clusterB6558.7)> remote shell phase successful.
Charmrun remote shell(clusterB6555.4)> remote responding...
Charmrun remote shell(clusterB6555.4)> starting node-program...
Charmrun remote shell(clusterB6555.4)> remote shell phase successful.
Charmrun remote shell(clusterB6556.5)> remote responding...
Charmrun remote shell(clusterB6553.2)> remote responding...
Charmrun remote shell(clusterB6557.6)> remote responding...
Charmrun remote shell(clusterB6552.1)> remote responding...
Charmrun remote shell(clusterB6556.5)> starting node-program...
Charmrun remote shell(clusterB6556.5)> remote shell phase successful.
Charmrun remote shell(clusterB6557.6)> starting node-program...
Charmrun remote shell(clusterB6553.2)> starting node-program...
Charmrun remote shell(clusterB6554.3)> remote responding...
Charmrun remote shell(clusterB6552.1)> starting node-program...
Charmrun remote shell(clusterB6557.6)> remote shell phase successful.
Charmrun remote shell(clusterB6553.2)> remote shell phase successful.
Charmrun remote shell(clusterB6552.1)> remote shell phase successful.
Charmrun remote shell(clusterB6554.3)> starting node-program...
Charmrun remote shell(clusterB6554.3)> remote shell phase successful.
Charmrun> Waiting for 0-th client to connect.
Charmrun> error attaching to node 'clusterB6549':
Timeout waiting for node-program to connect
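
To make a log like the one above easier to read, the per-client progress can be collected programmatically. The following is only a minimal sketch in Python, assuming the `++verbose` line formats exactly as they appear in this log; the function name `summarize_charmrun_log` and the regular expressions are hypothetical helpers written for this message, not part of Charm++.

```python
import re
from collections import defaultdict

# Line formats are assumptions taken from the ++verbose output shown above.
SHELL_RE = re.compile(r"^Charmrun remote shell\((.+)\.(\d+)\)> (.*)$")
WAIT_RE = re.compile(r"^Charmrun> Waiting for (\d+)-th client to connect\.")
ERROR_RE = re.compile(r"^Charmrun> error attaching to node '([^']+)':")


def summarize_charmrun_log(log_text):
    """Return (phases, waiting, errors) extracted from a charmrun log.

    phases:  client rank -> remote-shell messages, in order of appearance
    waiting: client ranks that charmrun reported waiting for
    errors:  (node name, reason) pairs; the reason is the line after the error
    """
    phases = defaultdict(list)
    waiting = []
    errors = []
    lines = [line.strip() for line in log_text.splitlines()]
    for i, line in enumerate(lines):
        m = SHELL_RE.match(line)
        if m:
            # Drop the trailing dots so "remote responding..." compares cleanly.
            phases[int(m.group(2))].append(m.group(3).rstrip("."))
            continue
        m = WAIT_RE.match(line)
        if m:
            waiting.append(int(m.group(1)))
            continue
        m = ERROR_RE.match(line)
        if m:
            reason = lines[i + 1] if i + 1 < len(lines) else ""
            errors.append((m.group(1), reason))
    return dict(phases), waiting, errors
```

Applied to the failing log, this would show that all eight remote shells reached "remote shell phase successful", while charmrun never got past waiting for client 0.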

Here is the execution on the second machine, where the launch succeeds:

$ ./bin/charmrun +p8 ./bin/CoMD-ampi ++nodelist ~/charmhosts ++verbose +vp8
-i 2 -j 2 -k 2
Charmrun> scalable start enabled.
Charmrun> charmrun started...
Charmrun> using /home/<user>/charmhosts as nodesfile
Charmrun> adding client 0: "machineA-n1", IP:141.76.48.45
Charmrun> adding client 1: "machineA-n1", IP:141.76.48.45
Charmrun> adding client 2: "machineA-n2", IP:141.76.48.46
Charmrun> adding client 3: "machineA-n2", IP:141.76.48.46
Charmrun> adding client 4: "machineA-n3", IP:141.76.48.47
Charmrun> adding client 5: "machineA-n3", IP:141.76.48.47
Charmrun> adding client 6: "machineA-n4", IP:141.76.48.48
Charmrun> adding client 7: "machineA-n4", IP:141.76.48.48
Charmrun> Charmrun = 141.76.48.45, port = 34881
Charmrun> IBVERBS version of charmrun
start_nodes_ssh
Charmrun> Sending "0 141.76.48.45 34881 422 0" to client 0.
Charmrun> find the node program "/home/<user>/ampi/CoMD-1.1/./bin/CoMD-ampi"
at "/home/<user>/ampi/CoMD-1.1" for 0.
Charmrun> Starting ssh machineA-n1 -l <user> -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (machineA-n1:0) started
Charmrun> Sending "2 141.76.48.45 34881 422 0" to client 2.
Charmrun> find the node program "/home/<user>/ampi/CoMD-1.1/./bin/CoMD-ampi"
at "/home/<user>/ampi/CoMD-1.1" for 2.
Charmrun> Starting ssh machineA-n2 -l <user> -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (machineA-n2:2) started
Charmrun> Sending "4 141.76.48.45 34881 422 0" to client 4.
Charmrun> find the node program "/home/<user>/ampi/CoMD-1.1/./bin/CoMD-ampi"
at "/home/<user>/ampi/CoMD-1.1" for 4.
Charmrun> Starting ssh machineA-n3 -l <user> -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (machineA-n3:4) started
Charmrun> Sending "6 141.76.48.45 34881 422 0" to client 6.
Charmrun> find the node program "/home/<user>/ampi/CoMD-1.1/./bin/CoMD-ampi"
at "/home/<user>/ampi/CoMD-1.1" for 6.
Charmrun> Starting ssh machineA-n4 -l <user> -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (machineA-n4:6) started
Charmrun> node programs all started
Charmrun remote shell(machineA-n2.2)> remote responding...
Charmrun remote shell(machineA-n3.4)> remote responding...
Charmrun remote shell(machineA-n1.0)> remote responding...
Charmrun remote shell(machineA-n2.2)> starting node-program...
Charmrun remote shell(machineA-n2.2)> remote shell phase successful.
Charmrun remote shell(machineA-n3.4)> starting node-program...
Charmrun remote shell(machineA-n4.6)> remote responding...
Charmrun remote shell(machineA-n3.4)> remote shell phase successful.
Charmrun remote shell(machineA-n1.0)> starting node-program...
Charmrun remote shell(machineA-n1.0)> remote shell phase successful.
Charmrun remote shell(machineA-n4.6)> starting node-program...
Charmrun remote shell(machineA-n4.6)> remote shell phase successful.
Charmrun> Waiting for 0-th client to connect.
Charmrun> Waiting for 1-th client to connect.
Charmrun> Waiting for 2-th client to connect.
Charmrun> Waiting for 3-th client to connect.
Charmrun> Waiting for 4-th client to connect.
Charmrun> Waiting for 5-th client to connect.
Charmrun> Waiting for 6-th client to connect.
Charmrun> Waiting for 7-th client to connect.
Charmrun> All clients connected.
<Program continues running...>
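
One way to compare the two runs is to extract from each log how the processes were placed on hosts and which address charmrun announced for itself. The sketch below assumes the "adding client" and "Charmrun = ..., port = ..." line formats shown in the two logs; the function name `placement` is hypothetical. Any difference it surfaces is only an observation and not necessarily the cause of the failure.

```python
import re
from collections import Counter

# Line formats are assumptions taken from the two ++verbose logs above.
CLIENT_RE = re.compile(r'^Charmrun> adding client (\d+): "([^"]+)", IP:(\S+)$')
ADDR_RE = re.compile(r"^Charmrun> Charmrun = (\S+), port = (\d+)$")


def placement(log_text):
    """Return (processes per host, (announced address, port)) for one log."""
    per_host = Counter()
    announced = None
    for line in log_text.splitlines():
        line = line.strip()
        m = CLIENT_RE.match(line)
        if m:
            per_host[m.group(2)] += 1
            continue
        m = ADDR_RE.match(line)
        if m:
            announced = (m.group(1), int(m.group(2)))
    return dict(per_host), announced
```

For the logs in this message, the failing run places one process on each of eight hosts and announces a host name (clusterB6549), while the successful run places two processes on each of four hosts and announces an IP address (141.76.48.45).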

--
Regards,
Maksym Planeta
