[charm] Migration

  • From: <alberto.ortiz09 AT gmail.com>
  • To: charm AT lists.cs.illinois.edu
  • Subject: [charm] Migration
  • Date: Wed, 15 Mar 2017 11:13:18 -0500

Hello,
I have been having problems with migration on a Zynq device, which has a dual-core ARM processor running Arch Linux. Since debugging and fixing things there is much slower than on a desktop Linux machine, I installed AMPI on my desktop with the command line './build AMPI netlrts-linux-x86_64 syncft -g -O0 -DCMK_USE_MEMPOOL_ISOMALLOC=1'.
To try out migration I ran the jacobi test included in tests/ampi/jacobi3d.
I edited the kill file to kill process 3 after 18 seconds instead of 40, and changed the checkpointing frequency to once every 10 iterations. When I run 'make syncfttest', the program behaves well: it takes the checkpoints and continues after process 3 is killed. The problem comes at the first checkpoint after the migration: the MPI_Info value changes and contains strange characters it shouldn't, and the program stops there. The output can be seen here:


iter 10 elapsed time: 13.241638 time: 0.459466 maxerr: 1432.637261
iter 11 elapsed time: 13.797741 time: 0.532956 maxerr: 1369.591734
[0] Start checkpointing starter: 0...
[0] Checkpoint finished in 1.696204 seconds, sending callback ...
[0] Checkpoint Processor data: 6765
iter 12 elapsed time: 16.032651 time: 2.202434 maxerr: 1314.155629
iter 13 elapsed time: 16.530041 time: 0.452459 maxerr: 1264.920493
iter 14 elapsed time: 17.077625 time: 0.520671 maxerr: 1220.815468
iter 15 elapsed time: 17.661705 time: 0.538240 maxerr: 1181.010637
iter 16 elapsed time: 18.325778 time: 0.621765 maxerr: 1144.851902
Charmrun> error on request socket to node 3 'localhost'--
Socket closed before recv.
Socket 11 failed
Charmrun> Sending "3 10.0.2.15 34313 24803 0" to client 3.
Charmrun> find the node program "/home/artico1/Desktop/AMPI/charm/netlrts-
linux-x86_64-syncft/tests/ampi/jacobi3d/./jacobi" at "/home/artico1/Desktop/
AMPI/charm/netlrts-linux-x86_64-syncft/tests/ampi/jacobi3d" for 3.
Charmrun> Starting ssh localhost -l artico1 -o KbdInteractiveAuthentication=no
-o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash
-f
Charmrun> remote shell (localhost:3) started
Charmrun remote shell(localhost.3)> remote responding...
Charmrun remote shell(localhost.3)> starting node-program...
Charmrun remote shell(localhost.3)> remote shell phase successful.
Charmrun finished launching new process in 1.235761s
charmrun says Processor 3 failed on Node 3
socket_index 7 crashed_node 3 reconnected fd 11
Charmrun> client 3 connected (IP=127.0.0.1 data_port=59471)
Charmrun> continue node: 3
[3] consolidated Isomalloc memory region at restart: 0x440000000 -
0x7ffb80000000 (134181888 megs)
[3] Restarting after crash
[3] I am restarting cur_restart_phase:2 at time: 0.000611
[3] I am restarting cur_restart_phase:2 discard charm message at time:
0.000629
[4] askProcDataHandler called with '3' cur_restart_phase:2 at time 20.089632.
[4] askProcDataHandler called with '3' cur_restart_phase:2 done at time
20.089733.
[3] ----- recoverProcDataHandler cur_restart_phase:2 at time: 0.017730
[3] ----- recoverProcDataHandler done at 0.018138
[3] restartBcastHandler cur_restart_phase=2 _diePE:3 at 0.018169.
[3] CkRestartCheckPointCallback activated for diePe: 3 at 0.040308
CkRestartCheckPoint CkMemCheckPT GID:10 at time 0.040343
[3] Process data restored in 0.039765 seconds
[3] CkMemCheckPT ----- restart.
[3] CkMemCheckPT ----- resetLB len:0 in 0.035940 seconds.
[3] CkMemCheckPT ----- removeArrayElements in 0.050573 seconds
[3] CkMemCheckPT ----- recoverBuddies starts at 0.126889
[4]got message for crashed pe 3
[3] CkMemCheckPT ----- recoverArrayElements starts at 0.504349
recover all ends
[3] CkMemCheckPT ----- recoverArrayElements streams at 0.534110
[3] CkMemCheckPT ----- recoverArrayElements in 0.116023 seconds, callback
triggered
[3] Restart finished in 0.619784 seconds at 0.620413.
iter 12 elapsed time: 14.680434 time: 0.862847 maxerr: 1314.155629
iter 13 elapsed time: 15.368519 time: 0.639659 maxerr: 1264.920493
iter 14 elapsed time: 16.105127 time: 0.696428 maxerr: 1220.815468
iter 15 elapsed time: 16.732691 time: 0.590407 maxerr: 1181.010637
iter 16 elapsed time: 17.240766 time: 0.482753 maxerr: 1144.851902
iter 17 elapsed time: 17.944852 time: 0.679374 maxerr: 1111.816033
iter 18 elapsed time: 18.592441 time: 0.607193 maxerr: 1081.478937
iter 19 elapsed time: 19.100480 time: 0.467898 maxerr: 1053.492816
iter 20 elapsed time: 19.628545 time: 0.487279 maxerr: 1027.569428
iter 21 elapsed time: 20.184631 time: 0.497433 maxerr: 1003.467610
WARNING: Unknown MPI_Info value (in_memory?�) given to AMPI_Migrate for key: ampi_checkpoint
WARNING: Unknown MPI_Info value (in_memory�:��) given to AMPI_Migrate for key: ampi_checkpoint
WARNING: Unknown MPI_Info value (in_memoryY�) given to AMPI_Migrate for key: ampi_checkpoint
WARNING: Unknown MPI_Info value (in_memoryp�) given to AMPI_Migrate for key: ampi_checkpoint
WARNING: Unknown MPI_Info value (in_memory�:��) given to AMPI_Migrate for key: ampi_checkpoint
WARNING: Unknown MPI_Info value (in_memory�:��) given to AMPI_Migrate for key: ampi_checkpoint
WARNING: Unknown MPI_Info value (in_memory�:��) given to AMPI_Migrate for key: ampi_checkpoint
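
For reference, this is roughly how I understand the checkpoint call to be made in the test. It is only a sketch on my side, not the exact jacobi3d source; the loop structure and names are placeholders. The Info object should carry the key 'ampi_checkpoint' with the literal value 'in_memory', which is why the trailing garbage in the warnings above looks to me like that string is being corrupted after the restart:

/* Sketch only: not the actual jacobi3d code, just the pattern I am using.
 * Compiled with ampicc, so the AMPI_Migrate extension is available via mpi.h. */
#include <mpi.h>

static void iterate_with_checkpoints(int max_iters)
{
    MPI_Info info;
    MPI_Info_create(&info);
    /* The runtime expects the exact string "in_memory" here; the warnings
     * above show it arriving with extra bytes appended. */
    MPI_Info_set(info, "ampi_checkpoint", "in_memory");

    for (int iter = 1; iter <= max_iters; iter++) {
        /* ... compute one Jacobi iteration and exchange boundaries ... */
        if (iter % 10 == 0) {      /* checkpoint once every 10 iterations */
            AMPI_Migrate(info);    /* in-memory (double) checkpoint */
        }
    }
    MPI_Info_free(&info);
}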

I have tried linking with both 'isomalloc' and 'os-isomalloc', and with and without the 'isomalloc_sync' flag. I don't know whether I am doing something wrong, and I haven't found anything related to this failure online.
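In case it helps, this is what I mean by the linking and run variants I tried; it is just a sketch from my setup, and the ampicc location, source file name, jacobi arguments, and process/VP counts are placeholders:

# link the test with Isomalloc heaps (ampicc from my syncft build tree) ...
ampicc -o jacobi jacobi3d.C -memory isomalloc
# ... or with the OS-level interception variant
ampicc -o jacobi jacobi3d.C -memory os-isomalloc

# run with and without the sync flag (program arguments are placeholders)
./charmrun +p4 ./jacobi +vp8 +isomalloc_sync
./charmrun +p4 ./jacobi +vp8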
Thank you in advance; I look forward to your response,
Alberto

