charm - Re: [charm] Fault Tolerance

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: <alberto.ortiz09 AT gmail.com>
  • To: charm AT lists.cs.illinois.edu,samt.white AT gmail.com
  • Subject: Re: [charm] Fault Tolerance
  • Date: Mon, 20 Feb 2017 08:33:13 -0600

Thanks for the explanation about checkpointing.
As you suggested, I tried linking against os-isomalloc by compiling the program
with 'ampicc -memory os-isomalloc mxm.c -o mxm'. The program compiled correctly,
but it again failed at run time. I also added another node with another ZYNQ to
the system, in case it worked with more nodes.
This time, running it with './charmrun ++verbose ++nodelist hosts +p5 ./mxm
+vp6 +isomalloc_sync' gives the following output:
[artico1@alarm Matrices]$ ./charmrun ++verbose ++nodelist hosts +p5 ./mxm +vp6
+isomalloc_sync
Charmrun> scalable start enabled.
Charmrun> charmrun started...
Charmrun> using hosts as nodesfile
Charmrun> adding client 0: "192.168.1.20", IP:192.168.1.20
Charmrun> adding client 1: "192.168.1.20", IP:192.168.1.20
Charmrun> adding client 2: "192.168.1.21", IP:192.168.1.21
Charmrun> adding client 3: "192.168.1.21", IP:192.168.1.21
Charmrun> adding client 4: "192.168.1.22", IP:192.168.1.22
Charmrun> Charmrun = 192.168.1.20, port = 52448
start_nodes_ssh
Charmrun> Sending "0 192.168.1.20 52448 787 0" to client 0.
Charmrun> find the node program "/home/artico1/Desktop/Matrices/./mxm" at "/
home/artico1/Desktop/Matrices" for 0.
Charmrun> Starting ssh 192.168.1.20 -l artico1 -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (192.168.1.20:0) started
Charmrun> Sending "2 192.168.1.20 52448 787 0" to client 2.
Charmrun> find the node program "/home/artico1/Desktop/Matrices/./mxm" at "/
home/artico1/Desktop/Matrices" for 2.
Charmrun> Starting ssh 192.168.1.21 -l artico1 -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (192.168.1.21:2) started
Charmrun> Sending "4 192.168.1.20 52448 787 0" to client 4.
Charmrun> find the node program "/home/artico1/Desktop/Matrices/./mxm" at "/
home/artico1/Desktop/Matrices" for 4.
Charmrun> Starting ssh 192.168.1.22 -l artico1 -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (192.168.1.22:4) started
Charmrun> node programs all started
Charmrun remote shell(192.168.1.22.4)> remote responding...
Charmrun remote shell(192.168.1.22.4)> starting node-program...
Charmrun remote shell(192.168.1.20.0)> remote responding...
Charmrun remote shell(192.168.1.22.4)> remote shell phase successful.
Charmrun remote shell(192.168.1.20.0)> starting node-program...
Charmrun remote shell(192.168.1.20.0)> remote shell phase successful.
Charmrun remote shell(192.168.1.21.2)> remote responding...
Charmrun remote shell(192.168.1.21.2)> starting node-program...
Charmrun remote shell(192.168.1.21.2)> remote shell phase successful.
Charmrun> Waiting for 0-th client to connect.
Charmrun> Waiting for 1-th client to connect.
Charmrun> Waiting for 2-th client to connect.
Charmrun> Waiting for 3-th client to connect.
Charmrun> Waiting for 4-th client to connect.
Charmrun> client 4 connected (IP=192.168.1.22 data_port=33360)
Charmrun> client 0 connected (IP=192.168.1.20 data_port=40245)
Charmrun> client 1 connected (IP=192.168.1.20 data_port=48189)
Charmrun> client 2 connected (IP=192.168.1.21 data_port=60053)
Charmrun> client 3 connected (IP=192.168.1.21 data_port=33851)
Charmrun> All clients connected.
Charmrun> IP tables sent.
Charmrun> node programs all connected
Charmrun> started all node programs in 1.486 seconds.
Converse/Charm++ Commit ID: v6.7.0-607-g4228aa601
Warning> Randomization of stack pointer is turned on in kernel.
Charm++> synchronizing isomalloc memory region...
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: isomalloc_sync failed, make sure you have a shared file system.
------------- Processor 2 Exiting: Called CmiAbort ------------
Reason: isomalloc_sync failed, make sure you have a shared file system.
------------- Processor 3 Exiting: Called CmiAbort ------------
Reason: isomalloc_sync failed, make sure you have a shared file system.
Fatal error on PE 2> isomalloc_sync failed, make sure you have a shared file
system.

The message says I need a shared file system, but I don't understand what that
refers to, given that I'm using the 'in_memory' option.
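
For reference, the abort message only asks for "a shared file system". One
plausible reading, which is an assumption and not something the log confirms, is
that every node must see the same files under the launch directory (here
/home/artico1/Desktop/Matrices). A minimal sketch of how that could be tested,
assuming Python is available on the nodes. The helper names write_marker,
marker_visible and remove_marker are hypothetical and not part of Charm++:

```python
import os
import uuid


def write_marker(directory):
    """Create a uniquely named marker file in `directory` and return its name."""
    name = ".sharedfs-check-" + uuid.uuid4().hex
    with open(os.path.join(directory, name), "w") as f:
        f.write("marker\n")
    return name


def marker_visible(directory, name):
    """Return True if a marker written elsewhere is visible in `directory`."""
    return os.path.isfile(os.path.join(directory, name))


def remove_marker(directory, name):
    """Delete the marker once every host has been checked."""
    os.remove(os.path.join(directory, name))
```

The idea would be to call write_marker on one host (say 192.168.1.20) and
marker_visible with the returned name on the other two. If the marker is not
visible there, the directory is local to each board rather than shared.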

Running the program without '+isomalloc_sync' gives the following output:
[artico1@alarm Matrices]$ ./charmrun ++verbose ++nodelist hosts +p5 ./mxm +vp6
Charmrun> scalable start enabled.
Charmrun> charmrun started...
Charmrun> using hosts as nodesfile
Charmrun> adding client 0: "192.168.1.20", IP:192.168.1.20
Charmrun> adding client 1: "192.168.1.20", IP:192.168.1.20
Charmrun> adding client 2: "192.168.1.21", IP:192.168.1.21
Charmrun> adding client 3: "192.168.1.21", IP:192.168.1.21
Charmrun> adding client 4: "192.168.1.22", IP:192.168.1.22
Charmrun> Charmrun = 192.168.1.20, port = 38935
start_nodes_ssh
Charmrun> Sending "0 192.168.1.20 38935 659 0" to client 0.
Charmrun> find the node program "/home/artico1/Desktop/Matrices/./mxm" at "/
home/artico1/Desktop/Matrices" for 0.
Charmrun> Starting ssh 192.168.1.20 -l artico1 -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (192.168.1.20:0) started
Charmrun> Sending "2 192.168.1.20 38935 659 0" to client 2.
Charmrun> find the node program "/home/artico1/Desktop/Matrices/./mxm" at "/
home/artico1/Desktop/Matrices" for 2.
Charmrun> Starting ssh 192.168.1.21 -l artico1 -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (192.168.1.21:2) started
Charmrun> Sending "4 192.168.1.20 38935 659 0" to client 4.
Charmrun> find the node program "/home/artico1/Desktop/Matrices/./mxm" at "/
home/artico1/Desktop/Matrices" for 4.
Charmrun> Starting ssh 192.168.1.22 -l artico1 -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (192.168.1.22:4) started
Charmrun> node programs all started
Charmrun remote shell(192.168.1.20.0)> remote responding...
Charmrun remote shell(192.168.1.20.0)> starting node-program...
Charmrun remote shell(192.168.1.22.4)> remote responding...
Charmrun remote shell(192.168.1.20.0)> remote shell phase successful.
Charmrun remote shell(192.168.1.22.4)> starting node-program...
Charmrun remote shell(192.168.1.22.4)> remote shell phase successful.
Charmrun remote shell(192.168.1.21.2)> remote responding...
Charmrun remote shell(192.168.1.21.2)> starting node-program...
Charmrun remote shell(192.168.1.21.2)> remote shell phase successful.
Charmrun> Waiting for 0-th client to connect.

[artico1@alarm Matrices]$ ampicc -memory os-isomalloc mxm.c -o mxm
[artico1@alarm Matrices]$ ./charmrun ++verbose ++nodelist hosts +p5 ./mxm +vp6
Charmrun> scalable start enabled.
Charmrun> charmrun started...
Charmrun> using hosts as nodesfile
Charmrun> adding client 0: "192.168.1.20", IP:192.168.1.20
Charmrun> adding client 1: "192.168.1.20", IP:192.168.1.20
Charmrun> adding client 2: "192.168.1.21", IP:192.168.1.21
Charmrun> adding client 3: "192.168.1.21", IP:192.168.1.21
Charmrun> adding client 4: "192.168.1.22", IP:192.168.1.22
Charmrun> Charmrun = 192.168.1.20, port = 36208
start_nodes_ssh
Charmrun> Sending "0 192.168.1.20 36208 770 0" to client 0.
Charmrun> find the node program "/home/artico1/Desktop/Matrices/./mxm" at "/
home/artico1/Desktop/Matrices" for 0.
Charmrun> Starting ssh 192.168.1.20 -l artico1 -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (192.168.1.20:0) started
Charmrun> Sending "2 192.168.1.20 36208 770 0" to client 2.
Charmrun> find the node program "/home/artico1/Desktop/Matrices/./mxm" at "/
home/artico1/Desktop/Matrices" for 2.
Charmrun> Starting ssh 192.168.1.21 -l artico1 -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (192.168.1.21:2) started
Charmrun> Sending "4 192.168.1.20 36208 770 0" to client 4.
Charmrun> find the node program "/home/artico1/Desktop/Matrices/./mxm" at "/
home/artico1/Desktop/Matrices" for 4.
Charmrun> Starting ssh 192.168.1.22 -l artico1 -o
KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o
NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (192.168.1.22:4) started
Charmrun> node programs all started
Charmrun remote shell(192.168.1.20.0)> remote responding...
Charmrun remote shell(192.168.1.20.0)> starting node-program...
Charmrun remote shell(192.168.1.20.0)> remote shell phase successful.
Charmrun remote shell(192.168.1.21.2)> remote responding...
Charmrun remote shell(192.168.1.22.4)> remote responding...
Charmrun remote shell(192.168.1.21.2)> starting node-program...
Charmrun remote shell(192.168.1.22.4)> starting node-program...
Charmrun remote shell(192.168.1.21.2)> remote shell phase successful.
Charmrun remote shell(192.168.1.22.4)> remote shell phase successful.
Charmrun> Waiting for 0-th client to connect.
Charmrun> Waiting for 1-th client to connect.
Charmrun> Waiting for 2-th client to connect.
Charmrun> Waiting for 3-th client to connect.
Charmrun> Waiting for 4-th client to connect.
Charmrun> client 0 connected (IP=192.168.1.20 data_port=48204)
Charmrun> client 1 connected (IP=192.168.1.20 data_port=33467)
Charmrun> client 2 connected (IP=192.168.1.21 data_port=48124)
Charmrun> client 4 connected (IP=192.168.1.22 data_port=34383)
Charmrun> client 3 connected (IP=192.168.1.21 data_port=54983)
Charmrun> All clients connected.
Charmrun> IP tables sent.
Charmrun> node programs all connected
Charmrun> started all node programs in 1.472 seconds.
Converse/Charm++ Commit ID: v6.7.0-607-g4228aa601
Warning> Randomization of stack pointer is turned on in kernel, thread
migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as
root to disable it, or try run with '+isomalloc_sync'.
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> CkMemCheckPTInit mainchare is created!

And it stays in that state forever.
I don't know whether this information is enough to work out what I am doing
wrong. If you need anything else, do not hesitate to ask, as I really need to
get checkpointing running. Resilience is one of the objectives of our ongoing
project.
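
The warning in the last run refers to /proc/sys/kernel/randomize_va_space. A
minimal sketch for reading that setting on each node, assuming a Linux /proc
file system. The function names parse_aslr and read_aslr are hypothetical, and
the value meanings are the ones given in the Linux kernel sysctl documentation:

```python
# Meaning of each randomize_va_space value, per the Linux sysctl docs.
ASLR_MODES = {
    0: "disabled",
    1: "stack, mmap base and VDSO randomized",
    2: "full randomization (heap as well)",
}


def parse_aslr(text):
    """Turn the file's content into (value, human-readable meaning)."""
    value = int(text.strip())
    return value, ASLR_MODES.get(value, "unknown")


def read_aslr(path="/proc/sys/kernel/randomize_va_space"):
    """Read and interpret the current ASLR setting on this machine."""
    with open(path) as f:
        return parse_aslr(f.read())
```

Calling read_aslr() on each of the three boards would show whether the
'echo 0 > /proc/sys/kernel/randomize_va_space' step suggested by the warning
has taken effect everywhere. A value of 0 means randomization is off.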

Thank you in advance,
Alberto.
