charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

Re: [charm] Fault Tolerance

From: <alberto.ortiz09 AT gmail.com>
To: alberto.ortiz09 AT gmail.com,charm AT lists.cs.illinois.edu
Subject: Re: [charm] Fault Tolerance
Date: Tue, 21 Feb 2017 09:57:48 -0600

The test I ran with the 3 ZYNQs was not set up correctly: I had not switched
charm's build on the extra ZYNQ to the 'syncft' build. Even so, when
running with '+isomalloc_sync' the outcome is the same as I mentioned before,
ending with these messages:

Warning> Randomization of stack pointer is turned on in kernel.
Charm++> synchronizing isomalloc memory region...
------------- Processor 2 Exiting: Called CmiAbort ------------
Reason: isomalloc_sync failed, make sure you have a shared file system.
Fatal error on PE 2> isomalloc_sync failed, make sure you have a shared file system.

When I run without the '+isomalloc_sync' option, with the program compiled on
all 3 ZYNQ devices using '-memory os-isomalloc', the program runs correctly,
but when I reboot one of the devices the migration fails again:

Charmrun> error on request socket to node 3 '192.168.1.22'--
Socket closed before recv.
Socket 7 failed
Charmrun> Sending "3 192.168.1.20 33698 1016 0" to client 3.
Charmrun> find the node program "/home/artico1/Desktop/Matrices/./mxm" at "/home/artico1/Desktop/Matrices" for 3.
Charmrun> Starting ssh 192.168.1.21 -l artico1 -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (192.168.1.21:3) started
Charmrun remote shell(192.168.1.21.3)> remote responding...
Charmrun remote shell(192.168.1.21.3)> starting node-program...
Charmrun remote shell(192.168.1.21.3)> remote shell phase successful.
Charmrun finished launching new process in 1.387827s
charmrun says Processor 3 failed on Node 3
socket_index 3 crashed_node 3 reconnected fd 7
Charmrun> client 3 connected (IP=192.168.1.21 data_port=53221)
Charmrun> continue node: 3
[3] Restarting after crash
[3] I am restarting cur_restart_phase:2 at time: 0.007460
[3] I am restarting cur_restart_phase:2 discard charm message at time: 0.007545
[0] askProcDataHandler called with '3' cur_restart_phase:2 at time 4.706002.
[0] askProcDataHandler called with '3' cur_restart_phase:2 done at time 4.706227.
[3] ----- recoverProcDataHandler cur_restart_phase:2 at time: 0.009055
[3] ----- recoverProcDataHandler done at 0.014779
[3] restartBcastHandler cur_restart_phase=2 _diePE:3 at 0.015279.
[3] CkRestartCheckPointCallback activated for diePe: 3 at 0.015494
CkRestartCheckPoint CkMemCheckPT GID:10 at time 0.015554
[3] Process data restored in 0.008339 seconds
[3] CkMemCheckPT ----- restart.
[3] CkMemCheckPT ----- resetLB len:0 in 0.000951 seconds.
[3] CkMemCheckPT ----- removeArrayElements in 0.006037 seconds
[3] CkMemCheckPT ----- recoverBuddies starts at 0.022787
[0]got message for crashed pe 3
[3] CkMemCheckPT ----- recoverArrayElements starts at 0.100417
recover all ends
[3] CkMemCheckPT ----- recoverArrayElements streams at 0.118232
[3] CkMemCheckPT ----- recoverArrayElements in 0.033772 seconds, callback triggered
[3] Restart finished in 0.126767 seconds at 0.134313.
mpi_mm has started with 5 tasks.
Initializing arrays...
Sending 13 rows to task 1 offset=0
Sending 13 rows to task 2 offset=13
Sending 12 rows to task 3 offset=26
Sending 12 rows to task 4 offset=38
Aqui las columnas de AMPI_RANK[1]_WTH[1]
Aqui las columnas de AMPI_RANK[4]_WTH[0]
Received results from task 1
Aqui las columnas de AMPI_RANK[2]_WTH[2]
Received results from task 2
Charmrun> error on request socket to node 3 '192.168.1.21'--
Socket closed before recv.
Socket 7 failed
Charmrun> Sending "3 192.168.1.20 33698 1016 0" to client 3.
Charmrun> find the node program "/home/artico1/Desktop/Matrices/./mxm" at "/home/artico1/Desktop/Matrices" for 3.
Charmrun> Starting ssh 192.168.1.20 -l artico1 -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (192.168.1.20:3) started
Charmrun remote shell(192.168.1.20.3)> remote responding...
Charmrun remote shell(192.168.1.20.3)> starting node-program...
Charmrun remote shell(192.168.1.20.3)> remote shell phase successful.
Charmrun finished launching new process in 1.389111s
charmrun says Processor 3 failed on Node 3
socket_index 3 crashed_node 3 reconnected fd 7
Charmrun> client 3 connected (IP=192.168.1.20 data_port=33453)
Charmrun> continue node: 3
[3] Restarting after crash
[3] I am restarting cur_restart_phase:3 at time: 0.005712
[3] I am restarting cur_restart_phase:3 discard charm message at time: 0.005971
[0] askProcDataHandler called with '3' cur_restart_phase:3 at time 6.369887.
[0] no checkpoint found for processor 3. This could be due to a crash before the first checkpointing.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: no checkpoint found
Fatal error on PE 0> no checkpoint found

It seems the first migration completes correctly, but then a second failure is
reported even though I did not reboot any ZYNQ device again, and that second
migration fails. I rebooted a ZYNQ only once, which introduced the failure
that triggered the first migration, but I don't know why a second migration
is attempted.
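
To make the sequence of failures in the output above easier to follow, here is
a minimal sketch that pulls the failure-related events out of a charmrun log.
The regular expressions are written against the lines quoted in this message
only; the function name and the event labels are made up for illustration and
are not part of Charm++.

```python
import re

# Patterns written against the charmrun lines quoted in this message.
SOCKET_ERR = re.compile(r"Charmrun> error on request socket to node (\d+) '([\d.]+)'")
SSH_START = re.compile(r"Charmrun> Starting ssh ([\d.]+)")
NO_CKPT = re.compile(r"\[(\d+)\] no checkpoint found for processor (\d+)")


def summarize_failures(log_text):
    """Return (event, detail) tuples in the order they appear in the log."""
    events = []
    for line in log_text.splitlines():
        m = SOCKET_ERR.search(line)
        if m:
            events.append(("socket_error", f"node {m.group(1)} at {m.group(2)}"))
            continue
        m = SSH_START.search(line)
        if m:
            events.append(("relaunch", m.group(1)))
            continue
        m = NO_CKPT.search(line)
        if m:
            events.append(("no_checkpoint", f"PE {m.group(1)} for processor {m.group(2)}"))
    return events


# A few lines copied from the log in this message.
SAMPLE = """\
Charmrun> error on request socket to node 3 '192.168.1.22'--
Charmrun> Starting ssh 192.168.1.21 -l artico1 -o
Charmrun> error on request socket to node 3 '192.168.1.21'--
Charmrun> Starting ssh 192.168.1.20 -l artico1 -o
[0] no checkpoint found for processor 3. This could be due to a crash before
"""

if __name__ == "__main__":
    for event, detail in summarize_failures(SAMPLE):
        print(event, detail)
```

Read this way, the second socket error names 192.168.1.21, the host on which
node 3 had just been relaunched, rather than the device that was rebooted.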

Another question I have is whether there is a way to manually specify the
buddies of each processor.
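
To illustrate what a hand-written buddy table would have to satisfy, here is a
small sketch that checks a PE-to-buddy mapping against a PE-to-host layout.
Everything in it is an assumption for illustration: the ring assignment is
just one simple pairing and is not claimed to be what Charm++ uses, the
function names are hypothetical, and the PE-to-host layout is invented (only
the IP addresses come from the log above). It does not call any Charm++ API.

```python
def ring_buddies(num_pes):
    """Simple ring pairing (illustrative): PE i is paired with PE (i + 1) % num_pes."""
    return {pe: (pe + 1) % num_pes for pe in range(num_pes)}


def check_buddies(buddies, host_of):
    """Return a list of problems found in a PE -> buddy table.

    A buddy is only useful if it is a different PE on a different host,
    so that one device rebooting does not lose both checkpoint copies.
    """
    problems = []
    for pe, buddy in buddies.items():
        if buddy not in host_of:
            problems.append(f"PE {pe}: buddy {buddy} is not a known PE")
        elif buddy == pe:
            problems.append(f"PE {pe}: is its own buddy")
        elif host_of[buddy] == host_of[pe]:
            problems.append(
                f"PE {pe}: buddy {buddy} is on the same host {host_of[pe]}"
            )
    return problems


# Hypothetical layout: 4 PEs spread over the 3 devices seen in the log.
HOST_OF = {
    0: "192.168.1.20",
    1: "192.168.1.20",
    2: "192.168.1.21",
    3: "192.168.1.22",
}

if __name__ == "__main__":
    for problem in check_buddies(ring_buddies(4), HOST_OF):
        print(problem)
```

With this invented layout the ring pairing puts PE 0 and its buddy PE 1 on the
same device, which is the kind of case a manual buddy assignment would be
meant to avoid.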

Thanks in advance and sorry for all the emails sent,
Alberto.
