
charm - Re: [charm] charmrun timeout problem

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

Re: [charm] charmrun timeout problem


  • From: Phil Miller <mille121 AT illinois.edu>
  • To: Dominic Roehm <dominic.roehm AT gmail.com>
  • Cc: Charm Mailing List <charm AT cs.illinois.edu>
  • Subject: Re: [charm] charmrun timeout problem
  • Date: Sun, 23 Nov 2014 14:49:06 -0600
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

If you just want to get up and running on this cluster, and aren't going to be running across a huge number of nodes, you could use an mpi-linux-x86_64 build of Charm++ and launch your jobs directly with mpirun/mpiexec just as you would an MPI code. Performance at this scale should be just about the same as our native machine layers.
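
For concreteness, a minimal sketch of that route, assuming a standard Charm++ source tree and whatever MPI stack the cluster already provides (the exact build options and the mpiexec flags shown here are assumptions and may differ with your MPI implementation and job scheduler):

    # build the MPI machine layer of Charm++ (run from the Charm++ source directory)
    ./build charm++ mpi-linux-x86_64 --with-production -j8

    # after relinking the application against that build, launch it like any MPI
    # program, e.g. 16 ranks across the two 8-core nodes (some MPIs use -np)
    mpiexec -n 16 ./2D_Kriging <program args>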

To further debug this issue, I would next ask if there's any sort of firewall configured on the nodes involved (head and compute). The address and subnet configuration in question looks reasonable. The output of 'route -n' might end up being useful, but this digs us pretty deep into weird cluster configurations.
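
Something along these lines, run on both the head node and a compute node, would show what I'm after (assuming iptables is the firewall in use; a cluster running firewalld or a vendor tool has its own commands):

    # list any packet-filtering rules that could block charmrun's TCP connections
    iptables -L -n -v

    # dump the kernel routing table; the 10.1.0.0/16 and 192.168.0.0/24 networks
    # should show up on eth0 and ib0 respectively
    route -n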

Beyond that, I'd suggest we arrange a time to get on IM or Skype and try to work this out more interactively.

On Sun, Nov 23, 2014 at 11:54 AM, Dominic Roehm <dominic.roehm AT gmail.com> wrote:
That's ip addr on the login node where I compiled Charm++:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:30:48:be:c7:54 brd ff:ff:ff:ff:ff:ff
    inet 10.1.1.1/16 brd 10.1.255.255 scope global eth0
    inet6 fe80::230:48ff:febe:c754/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:30:48:be:c7:55 brd ff:ff:ff:ff:ff:ff
    inet 129.69.120.30/24 brd 129.69.120.255 scope global eth1
    inet6 fe80::230:48ff:febe:c755/64 scope link
       valid_lft forever preferred_lft forever
4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP qlen 256
    link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:06:98:3b brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 192.168.0.1/24 brd 192.168.0.255 scope global ib0
    inet6 fe80::202:c903:6:983b/64 scope link
       valid_lft forever preferred_lft forever

and that's ifconfig


eth0      Link encap:Ethernet  HWaddr 00:30:48:BE:C7:54 
          inet addr:10.1.1.1  Bcast:10.1.255.255  Mask:255.255.0.0
          inet6 addr: fe80::230:48ff:febe:c754/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:632977547 errors:0 dropped:98 overruns:0 frame:0
          TX packets:312863860 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:868991418711 (809.3 GiB)  TX bytes:40751119455 (37.9 GiB)
          Interrupt:17 Memory:fdde0000-fde00000

eth1      Link encap:Ethernet  HWaddr 00:30:48:BE:C7:55 
          inet addr:129.69.120.30  Bcast:129.69.120.255  Mask:255.255.255.0
          inet6 addr: fe80::230:48ff:febe:c755/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:12080741 errors:2 dropped:0 overruns:0 frame:2
          TX packets:6132814 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:12537898645 (11.6 GiB)  TX bytes:2253434269 (2.0 GiB)
          Interrupt:18 Memory:fdee0000-fdf00000

Ifconfig uses the ioctl access method to get the full address information, which limits hardware addresses to 8 bytes.
Because Infiniband address has 20 bytes, only the first 8 bytes are displayed correctly.
Ifconfig is obsolete! For replacement check ip.
ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 
          inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::202:c903:6:983b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:178 errors:0 dropped:0 overruns:0 frame:0
          TX packets:85 errors:0 dropped:5 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:9968 (9.7 KiB)  TX bytes:31556 (30.8 KiB)

lo        Link encap:Local Loopback 
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:215040988 errors:0 dropped:0 overruns:0 frame:0
          TX packets:215040988 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:53274274744 (49.6 GiB)  TX bytes:53274274744 (49.6 GiB)



and that's ip addr on a compute node where I want to run it

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:30:48:be:d7:78 brd ff:ff:ff:ff:ff:ff
    inet 10.1.255.240/16 brd 10.1.255.255 scope global eth0
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:30:48:be:d7:79 brd ff:ff:ff:ff:ff:ff
4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP qlen 256
    link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:00:30:48:be:d7:78:00:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 192.168.0.16/24 brd 192.168.0.255 scope global ib0

and that's ifconfig

eth0      Link encap:Ethernet  HWaddr 00:30:48:BE:D7:78 
          inet addr:10.1.255.240  Bcast:10.1.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:119394535936 errors:0 dropped:0 overruns:0 frame:0
          TX packets:103387298007 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:129850396378626 (118.0 TiB)  TX bytes:128347902469335 (116.7 TiB)
          Memory:febe0000-fec00000

Ifconfig uses the ioctl access method to get the full address information, which limits hardware addresses to 8 bytes.
Because Infiniband address has 20 bytes, only the first 8 bytes are displayed correctly.
Ifconfig is obsolete! For replacement check ip.
ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 
          inet addr:192.168.0.16  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:79330874894 errors:0 dropped:0 overruns:0 frame:0
          TX packets:22864262168 errors:0 dropped:10 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:122122588444499 (111.0 TiB)  TX bytes:119083689172574 (108.3 TiB)

lo        Link encap:Local Loopback 
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:93064 errors:0 dropped:0 overruns:0 frame:0
          TX packets:93064 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:64079665 (61.1 MiB)  TX bytes:64079665 (61.1 MiB)




On 11/22/2014 01:38 AM, Phil Miller wrote:

Could you show us the build of Charm++ you're using and the full charmrun command you used?

On Nov 21, 2014 11:42 AM, "Dominic Roehm" <dominic.roehm AT gmail.com> wrote:
Hi,

I tried to run my Charm++ code on two 8-core nodes of my local cluster, but I
get a timeout from the node-program. It times out during the startup
procedure. The code itself runs successfully on other workstations and
clusters. Does anyone have an idea what the problem is, or how to get
more information about the issue? Error message:

Dominic

Charmrun> charmrun started...
Charmrun> adding client 0: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 1: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 2: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 3: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 4: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 5: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 6: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 7: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 8: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 9: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 10: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 11: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 12: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 13: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 14: "127.0.0.1", IP:127.0.0.1
Charmrun> adding client 15: "127.0.0.1", IP:127.0.0.1
Charmrun> Charmrun = 10.1.255.254, port = 45243
Charmrun> Sending "$CmiMyNode 10.1.255.254 45243 8527 0" to client 0.
Charmrun> find the node program
"/home/dominic/work/hmm/benchmarks/charm_tp2/../../CoHMM/charm_bin/2D_Kriging"
at "/home/dominic/work/hmm/benchmarks/charm_tp2" for 0.
Charmrun> Starting mpiexec ./charmrun.8527
Charmrun> mpiexec started
Charmrun> node programs all started
Charmrun> Waiting for 0-th client to connect.
Charmrun remote shell(127.0.0.1.0)> remote responding...
Charmrun remote shell(127.0.0.1.0)> starting node-program...
Charmrun remote shell(127.0.0.1.0)> remote responding...
Charmrun remote shell(127.0.0.1.0)> remote responding...
Charmrun remote shell(127.0.0.1.0)> starting node-program...
Charmrun remote shell(127.0.0.1.0)> remote responding...
Charmrun remote shell(127.0.0.1.0)> starting node-program...
Charmrun remote shell(127.0.0.1.0)> remote responding...
Charmrun remote shell(127.0.0.1.0)> starting node-program...
Charmrun remote shell(127.0.0.1.0)> remote responding...
Charmrun remote shell(127.0.0.1.0)> starting node-program...
Charmrun remote shell(127.0.0.1.0)> remote responding...
Charmrun remote shell(127.0.0.1.0)> starting node-program...
Charmrun remote shell(127.0.0.1.0)> remote responding...
Charmrun remote shell(127.0.0.1.0)> starting node-program...
Charmrun remote shell(127.0.0.1.0)> remote responding...
Charmrun remote shell(127.0.0.1.0)> starting node-program...
Charmrun remote shell(127.0.0.1.0)> remote responding...
Charmrun remote shell(127.0.0.1.0)> starting node-program...
Charmrun remote shell(127.0.0.1.0)> remote responding...
Charmrun remote shell(127.0.0.1.0)> starting node-program...
Charmrun remote shell(127.0.0.1.0)> starting node-program...
Charmrun remote shell(127.0.0.1.0)> remote responding...
Charmrun remote shell(127.0.0.1.0)> remote responding...
Charmrun remote shell(127.0.0.1.0)> starting node-program...
Charmrun remote shell(127.0.0.1.0)> starting node-program...
Charmrun remote shell(127.0.0.1.0)> remote responding...
Charmrun remote shell(127.0.0.1.0)> starting node-program...
Charmrun remote shell(127.0.0.1.0)> remote responding...
Charmrun remote shell(127.0.0.1.0)> starting node-program...
Charmrun remote shell(127.0.0.1.0)> remote responding...
Charmrun remote shell(127.0.0.1.0)> starting node-program...
Charmrun remote shell(127.0.0.1.0)> rsh phase successful.
Charmrun remote shell(127.0.0.1.0)> rsh phase successful.
Charmrun remote shell(127.0.0.1.0)> rsh phase successful.
Charmrun remote shell(127.0.0.1.0)> rsh phase successful.
Charmrun remote shell(127.0.0.1.0)> rsh phase successful.
Charmrun remote shell(127.0.0.1.0)> rsh phase successful.
Charmrun remote shell(127.0.0.1.0)> rsh phase successful.
Charmrun remote shell(127.0.0.1.0)> rsh phase successful.
Charmrun remote shell(127.0.0.1.0)> rsh phase successful.
Charmrun remote shell(127.0.0.1.0)> rsh phase successful.
Charmrun remote shell(127.0.0.1.0)> rsh phase successful.
Charmrun remote shell(127.0.0.1.0)> rsh phase successful.
Charmrun remote shell(127.0.0.1.0)> rsh phase successful.
Charmrun remote shell(127.0.0.1.0)> rsh phase successful.
Charmrun remote shell(127.0.0.1.0)> rsh phase successful.
Charmrun remote shell(127.0.0.1.0)> rsh phase successful.
Charmrun> error 0 attaching to node:
Timeout waiting for node-program to connect
_______________________________________________
charm mailing list
charm AT cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/charm




