
charm - Re: [charm] Fault Tolerance

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: <alberto.ortiz09 AT gmail.com>
  • To: charm AT lists.cs.illinois.edu,samt.white AT gmail.com
  • Subject: Re: [charm] Fault Tolerance
  • Date: Fri, 17 Feb 2017 10:21:42 -0600

Thanks for the reply.
I have rebuilt AMPI with the command './build AMPI net-linux-arm7 syncft
--with-production'. It built correctly and AMPI runs well.
I added the lines you mentioned to the .c file and compiled it with
'ampicc mxm.c -o mxm'. When I force a fault by rebooting one of the
devices, it seems to migrate well at first, but one of the threads fails
with this error:

Charmrun> error on request socket to node 2 '192.168.1.20'--
Socket closed before recv.
Socket 6 failed
Charmrun finished launching new process in 1.423164s
charmrun says Processor 2 failed on Node 2
socket_index 2 crashed_node 2 reconnected fd 6
Charmrun> continue node: 2
[2] Restarting after crash
[2] I am restarting cur_restart_phase:3 at time: 0.006966
[2] I am restarting cur_restart_phase:3 discard charm message at time: 0.007272
[0] askProcDataHandler called with '2' cur_restart_phase:3 at time 4.592871.
[0] no checkpoint found for processor 2. This could be due to a crash before the first checkpointing.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: no checkpoint found
Fatal error on PE 0> no checkpoint found


I believe this may be due to not using isomalloc: I am using the
"in_memory" option, and the rank running on the rebooted device might keep
its checkpoint in its own memory, which cannot be accessed while the device
is rebooting (I don't know if it works like this, it is just a guess). So I
compiled the file as you suggested, using
'ampicc -memory isomalloc mxm.c -o mxm', and it compiled without errors.
The problem is that the binary compiled with the isomalloc option does not
run. I used the ++verbose option to get more information, and the output
is:


[artico1@alarm Matrices]$ ./charmrun ++verbose ++nodelist hosts +p3 ./mxm +vp5
Charmrun> scalable start enabled.
Charmrun> charmrun started...
Charmrun> using hosts as nodesfile
Charmrun> adding client 0: "192.168.1.20", IP:192.168.1.20
Charmrun> adding client 1: "192.168.1.20", IP:192.168.1.20
Charmrun> adding client 2: "192.168.1.21", IP:192.168.1.21
Charmrun> Charmrun = 192.168.1.20, port = 53873
start_nodes_ssh
Charmrun> Sending "0 192.168.1.20 53873 1384 0" to client 0.
Charmrun> find the node program "/home/artico1/Desktop/Matrices/./mxm" at "/home/artico1/Desktop/Matrices" for 0.
Charmrun> Starting ssh 192.168.1.20 -l artico1 -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (192.168.1.20:0) started
Charmrun> Sending "2 192.168.1.20 53873 1384 0" to client 2.
Charmrun> find the node program "/home/artico1/Desktop/Matrices/./mxm" at "/home/artico1/Desktop/Matrices" for 2.
Charmrun> Starting ssh 192.168.1.21 -l artico1 -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (192.168.1.21:2) started
Charmrun> node programs all started
Charmrun remote shell(192.168.1.21.2)> remote responding...
Charmrun remote shell(192.168.1.20.0)> remote responding...
Charmrun remote shell(192.168.1.21.2)> starting node-program...
Charmrun remote shell(192.168.1.20.0)> starting node-program...
Charmrun remote shell(192.168.1.21.2)> remote shell phase successful.
Charmrun remote shell(192.168.1.20.0)> remote shell phase successful.
Charmrun> Waiting for 0-th client to connect.
Charmrun> error attaching to node '192.168.1.20':
Timeout waiting for node-program to connect

I don't know whether I need to configure something else to get isomalloc
working, or whether the first error has anything to do with my guess. Out
of 5 ranks, 4 restart from the checkpoint and only one fails, saying there
is no checkpoint, but I believe there should be one.
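
One way to reason about the guess above is a toy model of in-memory
checkpointing with a remote copy. The sketch below is plain C, not AMPI
code: it assumes each processor keeps its own checkpoint plus a copy on one
other processor. The names pe_memory, buddy_of, checkpoint, crash and
restore, and the (pe + 1) % NUM_PES pairing, are hypothetical and not taken
from Charm++. It only illustrates that a rebooted processor can be restored
from a remote copy if a checkpoint completed before the crash, and that the
lookup fails otherwise, as in the "no checkpoint found" message.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_PES 3      /* matches +p3 in the run above */
#define STATE_LEN 32

/* Toy model only: each PE holds its own checkpoint and a copy of one
 * other PE's checkpoint. This layout is an assumption for illustration. */
typedef struct {
    bool has_own;              /* local copy of this PE's checkpoint */
    char own[STATE_LEN];
    bool has_guest;            /* copy held on behalf of another PE */
    char guest[STATE_LEN];
} pe_memory;

/* Hypothetical pairing: PE i stores its remote copy on PE (i + 1) % n. */
static int buddy_of(int pe) {
    assert(pe >= 0 && pe < NUM_PES);
    return (pe + 1) % NUM_PES;
}

/* Store pe's state locally and on its buddy. */
static void checkpoint(pe_memory mem[], int pe, const char *state) {
    pe_memory *buddy = &mem[buddy_of(pe)];

    strncpy(mem[pe].own, state, STATE_LEN - 1);
    mem[pe].own[STATE_LEN - 1] = '\0';
    mem[pe].has_own = true;

    strncpy(buddy->guest, state, STATE_LEN - 1);
    buddy->guest[STATE_LEN - 1] = '\0';
    buddy->has_guest = true;
}

/* A reboot wipes everything held in that PE's memory. */
static void crash(pe_memory mem[], int pe) {
    assert(pe >= 0 && pe < NUM_PES);
    memset(&mem[pe], 0, sizeof mem[pe]);
}

/* Try to restore pe from the copy on its buddy.
 * Returns false when the buddy holds no checkpoint for it. */
static bool restore(const pe_memory mem[], int pe, char *out) {
    const pe_memory *buddy = &mem[buddy_of(pe)];
    if (!buddy->has_guest) {
        return false;          /* analogous to "no checkpoint found" */
    }
    strcpy(out, buddy->guest);
    return true;
}
```

In this model, calling crash for processor 2 before any checkpoint call
makes restore return false, while checkpointing every processor first makes
it succeed. If the real run behaves like the first case, it may be worth
confirming that a checkpoint actually completed before the reboot.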

Thanks in advance,
Alberto.


