charm - [charm] Charm++ fail detection

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Cristian Camilo Ruiz Sanabria <cristian.ruiz AT inria.fr>
  • To: charm <charm AT lists.cs.illinois.edu>
  • Subject: [charm] Charm++ fail detection
  • Date: Mon, 2 Nov 2015 21:28:18 +0100 (CET)

Hello,

I was testing the fault-tolerance capabilities of Charm++,
specifically how Charm++ detects faults. I noticed that if
a node is rebooted normally, the Charm++ processes receive a signal
and execution resumes from the previous checkpoint.
But when I shut down a node uncleanly, Charm++ freezes and
does not detect that the node has failed.
I tested this on a virtual infrastructure by forcing the shutdown
of VMs, and on a real platform by triggering a hard reboot.
Does Charm++ detect the SIGPIPE signal?
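
To make the difference between the two shutdown cases concrete, here is a minimal Python sketch. It is not Charm++ code and makes no claim about how Charm++ detects failures. It only illustrates general socket behavior: when a peer closes its socket cleanly, a reader sees end-of-file, whereas a peer that simply goes silent (as after a power cut, where the remote kernel sends nothing) is only noticed if the reader uses a timeout. The names `probe_peer` and `demo` are hypothetical, and the "hard failure" is simulated by a local socket that stays open but sends nothing.

```python
import socket


def probe_peer(sock, timeout_s):
    """Do one timed read and classify the peer (hypothetical helper)."""
    sock.settimeout(timeout_s)
    try:
        data = sock.recv(1)
    except socket.timeout:
        # Nothing arrived in time: the peer may be dead or merely idle.
        return "silent"
    # An empty read means the peer closed the connection (EOF).
    return "alive" if data else "closed"


def demo():
    results = {}

    # Clean shutdown: the peer closes its end, so the reader sees EOF.
    a, b = socket.socketpair()
    b.close()
    results["clean"] = probe_peer(a, 0.2)
    a.close()

    # Simulated hard failure (assumption): the peer end stays open but
    # sends nothing, so only the timeout reveals the problem.
    a, b = socket.socketpair()
    results["hard"] = probe_peer(a, 0.2)

    # Healthy peer: a one-byte heartbeat arrives before the timeout.
    b.sendall(b"h")
    results["healthy"] = probe_peer(a, 0.2)
    a.close()
    b.close()
    return results


if __name__ == "__main__":
    print(demo())
```

Without the timeout, the read in the simulated hard-failure case would block indefinitely, which resembles the hang described here. SIGPIPE, by contrast, is only raised when a process writes to a connection that the other side is already known to have closed.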

I'm using Charm++ version 6.6.0 compiled in the following way:

./build charm++ net-linux-x86_64 syncft -O0

Then I ran the jacobi3d example:

charm-6.6.0/net-linux-x86_64-syncft/tests/charm++/jacobi3d# ./charmrun ++p 32 ++nodelist ~/nodelist ./jacobi3d 512 512 256 64 64 64

Charmrun> started all node programs in 4.001 seconds.
Converse/Charm++ Commit ID:
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> CkMemCheckPTInit mainchare is created!

STENCIL COMPUTATION WITH BARRIERS
STENCIL COMPUTATION WITH BARRIERS
Running Jacobi on 32 processors with (8, 8, 4) chares
Array Dimensions: 512 512 256
Block Dimensions: 64 64 64
Start of iteration 0 at 0.040905
Start of iteration 10 at 1.219512
Start of iteration 20 at 2.034965
Start of iteration 30 at 2.744387
Start of iteration 40 at 3.444831
Start of iteration 50 at 4.139210
Start of iteration 60 at 4.853558
Start of iteration 70 at 5.544240
Start of iteration 80 at 6.245915
Start of iteration 90 at 6.953072
Start of iteration 100 at 7.645917
Start of iteration 110 at 8.347531
Start of iteration 120 at 9.043844
Start of iteration 130 at 9.737729
Start of iteration 140 at 10.449244
Start of iteration 150 at 11.166308
Start of iteration 160 at 11.866738
Start of iteration 170 at 12.575717
Start of iteration 180 at 13.285010
Start of iteration 190 at 13.991714
[0] Start checkpointing  starter: 0...
[0] Checkpoint finished in 0.140883 seconds, sending callback ...
Start of iteration 200 at 14.848064
[0] Checkpoint Processor data: 5447
Start of iteration 210 at 15.538338
Start of iteration 220 at 16.242073
Start of iteration 230 at 16.945164
Start of iteration 240 at 17.636115
Start of iteration 250 at 18.699790
Start of iteration 260 at 19.805021


At this point one of the nodes has failed, but Charm++ does not detect it and the execution is blocked.
I waited for around half an hour and nothing happened; the execution was still blocked.

Any ideas about this behavior? Is it normal?


