Re: [charm] Fault tolerant Jacobi


  • From: Kiril Dichev <K.Dichev AT qub.ac.uk>
  • To: Sam White <white67 AT illinois.edu>
  • Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Fault tolerant Jacobi
  • Date: Wed, 1 Aug 2018 15:24:53 +0100

Hi again,

I am afraid I need some more clarification on the way MPI recovery works after crashes in Adaptive MPI. In the sample fault-tolerant Jacobi versions for ULFM (e.g. http://fault-tolerance.org/2017/11/11/sc17-tutorial/), a lot of the MPI recovery logic lives in the actual application code. It is not an easy thing to go through, but there are well-defined phases such as:

1. revoke communicator upon failure detection
2. shrink communicator via MPI_Comm_shrink
3. expand again via MPI_Comm_spawn
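
To make sure we are talking about the same thing, here is a minimal sketch of those phases as I understand them from the ULFM examples (repair_communicator, world_size, and the executable name ./jacobi are placeholder names of mine, not from the tutorial; MPIX_Comm_revoke and MPIX_Comm_shrink are the ULFM extensions):

    /* Sketch of ULFM recovery: revoke, shrink, then respawn and merge. */
    #include <mpi.h>
    #include <mpi-ext.h>        /* ULFM extensions */

    static MPI_Comm app_comm;   /* the communicator the solver runs on */
    static int world_size;      /* communicator size, saved at startup */

    static void repair_communicator(void)
    {
        MPI_Comm shrunk, respawned, merged;
        int nsurvivors;

        /* 1. revoke: propagate knowledge of the failure to all survivors */
        MPIX_Comm_revoke(app_comm);

        /* 2. shrink: build a new communicator containing only survivors */
        MPIX_Comm_shrink(app_comm, &shrunk);

        /* 3. expand: spawn replacement processes and merge them back in */
        MPI_Comm_size(shrunk, &nsurvivors);
        MPI_Comm_spawn("./jacobi", MPI_ARGV_NULL, world_size - nsurvivors,
                       MPI_INFO_NULL, 0, shrunk, &respawned,
                       MPI_ERRCODES_IGNORE);
        MPI_Intercomm_merge(respawned, /* high = */ 0, &merged);

        MPI_Comm_free(&shrunk);
        app_comm = merged;      /* rank reordering omitted here */
    }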

Now, I have been very much focused on how checkpoint/restart happens, which is mostly contained in src/ck-core/ckmemcheckpoint.C. The only indications there of the MPI recovery are the calls to find_spare_mpirank and mpi_restart_crashed, implemented in src/arch/mpi/machine.C; however, their implementation doesn't give away too much.

For the moment, I have the following questions:

1. How exactly does Adaptive MPI perform the above steps, which seem necessary no matter how they are implemented? I understand Adaptive MPI does not implement MPI_Comm_shrink, but it must implement something along these lines. How and where does this happen? Also, since Adaptive MPI seems to be more thread-oriented, does it create a new Unix process, or a new thread within an existing process?

2. How exactly does execution continue post-failure in the application code, say from the start of a new iteration? This is more explicit in ULFM, where survivors use the C calls setjmp/longjmp to reset to the start of a compute iteration (see the sketch below). But how does that work with the Adaptive MPI runtime?
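
For reference, the ULFM pattern I mean looks roughly like this (my own condensed sketch; on_failure and restore_from_checkpoint are placeholder names, and the error handler installation is abbreviated):

    #include <setjmp.h>
    #include <mpi.h>
    #include <mpi-ext.h>

    static jmp_buf restart_point;

    void repair_communicator(void);     /* revoke/shrink/spawn, as above */
    int restore_from_checkpoint(void);  /* placeholder: the application's
                                           rollback; returns the iteration
                                           to resume from */

    /* Installed with MPI_Comm_create_errhandler/MPI_Comm_set_errhandler:
     * survivors repair the communicator, then jump back into main(). */
    static void on_failure(MPI_Comm *comm, int *errcode, ...)
    {
        repair_communicator();
        longjmp(restart_point, 1);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        /* ... create app_comm and attach on_failure to it ... */

        /* Execution resumes here after a failure, via longjmp above. */
        int iter = setjmp(restart_point) ? restore_from_checkpoint() : 0;
        for (; iter < 1000; iter++) {
            /* halo exchange + compute; a failure detected on app_comm
               raises on_failure on the survivors */
        }

        MPI_Finalize();
        return 0;
    }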

Thanks.

Regards,
Kiril

On 20 Jul 2018, at 22:10, Sam White <white67 AT illinois.edu> wrote:

Hi Kiril,

The checkpoint/restart-based fault tolerance schemes described in that paper are available in production for Charm++ and AMPI programs. That includes checkpointing to disk or in-memory, with online recovery. To build Charm++/AMPI with double in-memory checkpoint/restart support, you should build with the 'syncft' option, as in './build AMPI netlrts-linux-x86_64 syncft -j16 --with-production'. I just pushed some cleanup of tests/ampi/jacobi3d/, so if you do 'git pull origin charm' now and then run 'make syncfttest' in that directory, you should see the test run with the '+killFile <file>' option.
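
If I remember the format right, the kill file lists one '<PE> <time-in-seconds>' pair per line; something like the following (the file name is arbitrary) would kill PE 0 after 10 seconds and PE 2 after 25 seconds:

    0 10
    2 25

and you would pass it on the command line as '+killFile <file>'.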

Also, syncft is currently only supported on the netlrts and verbs communication layers, and message logging fault tolerance is not maintained as a production feature anymore, though it shouldn't be hard to revive it. If you can share, we'd be interested to hear what you're working on.
-Sam

On Fri, Jul 20, 2018 at 10:15 AM, Kiril Dichev <K.Dichev AT qub.ac.uk> wrote:
Hello,

I am a new user of Charm++ and AMPI.

I’ve done some research on fault tolerance in MPI over the last year, and I see some nice ways to couple it with AMPI (happy to explain if anyone is interested). I have used a Jacobi solver before, so it would be nice to use the same for AMPI to get going. I am especially interested in testing the parallel recovery capabilities that were presented, for Jacobi among other codes, in work such as this: https://repositoriotec.tec.ac.cr/bitstream/handle/2238/7150/Using%20Migratable%20Objects%20to%20Enhance%20Fault%20Tolerance%20Schemes%20in%20Supercomputers.pdf?sequence=1&isAllowed=y


However, I am not sure where to begin. I pulled the official Charm++ repo, which contains some MPI Jacobi code in tests/ampi/jacobi3d. In particular, it also has some kill files, which a very old tutorial tells me can be used to specify failure scenarios for PEs. However, it seems the +pkill_file option doesn't exist anymore, so that tutorial is outdated, and I don't know whether the code is up to date either.

On the other hand, according to the documentation in the main repo, there is a repo here:

… which I can’t access, and apparently it also has Jacobi codes I can run with AMPI. Maybe that is the one I need? If it is, can I use it even though I’m not affiliated with any US institution?

Any help finding the up-to-date Jacobi + AMPI code would be much appreciated. In addition, any advice on how to experiment with parallel recovery via migration would be great.


Regards,
Kiril Dichev





