
charm - Re: [charm] Fault Tolerance

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Sam White <samt.white AT gmail.com>
  • To: "alberto.ortiz09 AT gmail.com" <alberto.ortiz09 AT gmail.com>
  • Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Fault Tolerance
  • Date: Fri, 17 Feb 2017 10:55:42 -0600

If you are not using PUP routines or Isomalloc, then fault tolerance will not work: the runtime system needs a way to migrate all of a rank's memory between address spaces for fault recovery to work. That's why the runtime complained that there was no checkpoint to restart from: you did not provide a mechanism for it to take one.
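To make the requirement concrete, here is a minimal sketch in plain C of what a pack/unpack routine has to do: flatten a rank's heap state into a buffer and rebuild it somewhere else. This is not the AMPI PUP API; `RankState`, `state_pack` and `state_unpack` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-rank state: a length plus a heap-allocated array. */
typedef struct {
    int n;
    double *data;
} RankState;

/* Number of bytes needed to hold a packed copy of the state. */
size_t state_packed_size(const RankState *s) {
    return sizeof(int) + (size_t)s->n * sizeof(double);
}

/* Pack: copy the length and the array contents into a flat buffer. */
void state_pack(const RankState *s, unsigned char *buf) {
    assert(s != NULL && buf != NULL);
    memcpy(buf, &s->n, sizeof(int));
    if (s->n > 0)
        memcpy(buf + sizeof(int), s->data, (size_t)s->n * sizeof(double));
}

/* Unpack: rebuild the state from a flat buffer into freshly allocated
   memory. Returns 0 on success, -1 on a bad length or failed allocation. */
int state_unpack(RankState *s, const unsigned char *buf) {
    assert(s != NULL && buf != NULL);
    memcpy(&s->n, buf, sizeof(int));
    s->data = NULL;
    if (s->n < 0)
        return -1;
    if (s->n == 0)
        return 0;
    s->data = malloc((size_t)s->n * sizeof(double));
    if (s->data == NULL)
        return -1;
    memcpy(s->data, buf + sizeof(int), (size_t)s->n * sizeof(double));
    return 0;
}
```

The point of the sketch is that the pointer itself is never copied, only what it points to, which is why the runtime cannot checkpoint heap data without either such a routine or Isomalloc.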

In the double in-memory checkpointing scheme, each rank sends a copy of its checkpoint to a buddy on a separate processor and also keeps a copy in its own memory. That way, ranks that did not fail can recover quickly from their own last checkpoint, and the rank(s) of the one processor that failed can be restarted on whichever processor it had designated as its buddy.
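The scheme described above can be sketched as a toy simulation in plain C. This is only an illustration under assumptions, not the runtime's implementation: the ring-style `buddy_of` assignment, the `Proc` struct and the single-int "state" are all hypothetical.

```c
#include <assert.h>

#define NPROC 3   /* hypothetical number of processors */

typedef struct {
    int state;           /* live application state (one int for brevity) */
    int own_ckpt;        /* local copy of this processor's checkpoint */
    int buddy_ckpt;      /* copy held for the processor that chose us as buddy */
    int has_buddy_ckpt;  /* nonzero once buddy_ckpt has been filled in */
} Proc;

/* Assumed buddy assignment: the next processor in a ring. */
int buddy_of(int p) { return (p + 1) % NPROC; }

/* Checkpoint: every processor keeps a local copy and sends one to its buddy. */
void checkpoint_all(Proc procs[NPROC]) {
    for (int p = 0; p < NPROC; p++) {
        int b = buddy_of(p);
        procs[p].own_ckpt = procs[p].state;
        procs[b].buddy_ckpt = procs[p].state;
        procs[b].has_buddy_ckpt = 1;
    }
}

/* Recovery after processor `failed` loses its memory. Survivors roll back
   to their local copies; the failed processor's state is read from the copy
   held by its buddy and returned through *restored_state.
   Returns 0 on success, -1 if the buddy holds no checkpoint. */
int recover(Proc procs[NPROC], int failed, int *restored_state) {
    assert(failed >= 0 && failed < NPROC);
    int b = buddy_of(failed);
    if (!procs[b].has_buddy_ckpt)
        return -1;  /* nothing to restart from */
    for (int p = 0; p < NPROC; p++) {
        if (p != failed)
            procs[p].state = procs[p].own_ckpt;
    }
    *restored_state = procs[b].buddy_ckpt;
    return 0;
}
```

In this toy model, calling `recover` before any call to `checkpoint_all` returns -1, which corresponds to the case where no checkpoint exists for the failed processor.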

The hang at startup when using Isomalloc is concerning, but it's hard to say what is happening there without more information. We don't have any arm7 machines to test Isomalloc on locally (at least none that I know of). Could you also try linking with '-memory os-isomalloc' instead of '-memory isomalloc'? That is a slightly different implementation.

-Sam

On Fri, Feb 17, 2017 at 10:21 AM, alberto.ortiz09 AT gmail.com <alberto.ortiz09 AT gmail.com> wrote:
Thanks for the reply.
I have rebuilt AMPI with the command './build AMPI net-linux-arm7 syncft
--with-production'. It built correctly and AMPI runs well.
I have added the lines you mentioned to the .c file and compiled it using
'ampicc mxm.c -o mxm'. When I try to force a fault by rebooting one of the
devices, it seems to migrate well at first, but one of the threads fails
with this error:

Charmrun> error on request socket to node 2 '192.168.1.20'--
Socket closed before recv.
Socket 6 failed
Charmrun finished launching new process in 1.423164s
charmrun says Processor 2 failed on Node 2
socket_index 2 crashed_node 2 reconnected fd 6
Charmrun> continue node: 2
[2] Restarting after crash
[2] I am restarting  cur_restart_phase:3 at time: 0.006966
[2] I am restarting  cur_restart_phase:3 discard charm message at time: 0.007272
[0] askProcDataHandler called with '2' cur_restart_phase:3 at time 4.592871.
[0] no checkpoint found for processor 2. This could be due to a crash before the first checkpointing.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: no checkpoint found
Fatal error on PE 0> no checkpoint found


I believe this may be because I am not using isomalloc: I am using the
"in_memory" option, and the rank running on the rebooted device might store
its checkpoint in its own memory, which cannot be accessed while rebooting
(I don't know if it works like this; this is just a guess). So I
compiled the file as you suggested using 'ampicc -memory isomalloc mxm.c -o
mxm', and it compiled without errors.
The problem is that the binary compiled with the isomalloc option
does not run. I used the ++verbose option to obtain more information, and the
output is:


[artico1@alarm Matrices]$ ./charmrun ++verbose ++nodelist hosts +p3 ./mxm +vp5
Charmrun> scalable start enabled.
Charmrun> charmrun started...
Charmrun> using hosts as nodesfile
Charmrun> adding client 0: "192.168.1.20", IP:192.168.1.20
Charmrun> adding client 1: "192.168.1.20", IP:192.168.1.20
Charmrun> adding client 2: "192.168.1.21", IP:192.168.1.21
Charmrun> Charmrun = 192.168.1.20, port = 53873
start_nodes_ssh
Charmrun> Sending "0 192.168.1.20 53873 1384 0" to client 0.
Charmrun> find the node program "/home/artico1/Desktop/Matrices/./mxm" at "/home/artico1/Desktop/Matrices" for 0.
Charmrun> Starting ssh 192.168.1.20 -l artico1 -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (192.168.1.20:0) started
Charmrun> Sending "2 192.168.1.20 53873 1384 0" to client 2.
Charmrun> find the node program "/home/artico1/Desktop/Matrices/./mxm" at "/home/artico1/Desktop/Matrices" for 2.
Charmrun> Starting ssh 192.168.1.21 -l artico1 -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (192.168.1.21:2) started
Charmrun> node programs all started
Charmrun remote shell(192.168.1.21.2)> remote responding...
Charmrun remote shell(192.168.1.20.0)> remote responding...
Charmrun remote shell(192.168.1.21.2)> starting node-program...
Charmrun remote shell(192.168.1.20.0)> starting node-program...
Charmrun remote shell(192.168.1.21.2)> remote shell phase successful.
Charmrun remote shell(192.168.1.20.0)> remote shell phase successful.
Charmrun> Waiting for 0-th client to connect.
Charmrun> error attaching to node '192.168.1.20':
Timeout waiting for node-program to connect

I don't know if I have to configure something else to have isomalloc working
or if the first error has anything to do with what I guessed. The thing is
that out of 5 ranks, 4 of them restart from the checkpoint and just one of
them fails saying there's no checkpoint, but I believe there should be one.

Thanks in advance,
Alberto.



