
Re: [charm] Fault tolerant Jacobi


  • From: Sam White <white67 AT illinois.edu>
  • To: Kiril Dichev <K.Dichev AT qub.ac.uk>
  • Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Fault tolerant Jacobi
  • Date: Thu, 2 Aug 2018 07:52:15 -0500

Yeah, that would require several changes to the current checkpoint/restart support. Currently the runtime always packs and unpacks all of the MPI state and the user-level thread stack. The heap data can either be automatically packed and unpacked via Isomalloc or done manually via PUP routines. With PUP routines you can selectively pack/unpack only certain data. We also assume a full rollback, but I think there's been work on partial rollbacks (at least in terms of message logging fault tolerance).
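For example, a PUP routine can pack only the fields worth keeping and skip transient buffers. A minimal sketch (the JacobiChunk type and its fields are hypothetical, not taken from the example codes):

    #include "pup.h"       // Charm++ PUP framework
    #include "pup_stl.h"   // operator| overloads for STL containers
    #include <vector>

    // Hypothetical per-rank state: only the fields pup'ed below are
    // saved at a checkpoint; 'scratch' is rebuilt on restart instead.
    struct JacobiChunk {
      int iteration;
      std::vector<double> grid;
      std::vector<double> scratch;     // transient, not checkpointed

      void pup(PUP::er &p) {
        p | iteration;
        p | grid;
        if (p.isUnpacking())
          scratch.resize(grid.size()); // recompute rather than restore
      }
    };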

If you'd like to look into this more, here are some starting points: TCharm::pup() and TCharm::pupThread() in charm/src/libs/ck-libs/tcharm/tcharm.C are where the thread state is pup'ed and the user's PUP routines are invoked. The thread stack itself is pup'ed in charm/src/conv-core/threads.c.

Others can chime in if they have more info on the checkpoint/restart strategies and partial rollbacks. Hope this helps,
Sam

On Thu, Aug 2, 2018 at 7:22 AM, Kiril Dichev <K.Dichev AT qub.ac.uk> wrote:
Thanks, Sam. I now understand the recovery better. The fact that the recovery is almost transparent to the application code (almost, because I still need to provide the PUP pack/unpack routines for checkpointing) is great. The ULFM Jacobi code is more than double the size of what you've managed for AMPI, so I see the advantages there. However, there are some issues when it comes to playing around with different rollback strategies, which is what I am getting at. The important question for my research is whether I can test some partial rollback strategies, where some processes don't roll back and others do.

Can I draw a line between runtime/communicator/MPI data and application data after a failure? Can I choose to reset all MPI-related data, such as communicators (whether they reside on the heap or the stack), but disable:

a) the reset of the application data residing on the stack, which includes pretty much all static variables that hold application data (such as the iteration counter)
b) the application checkpoint (the actual array I provide pack/unpack routines for)

at an MPI process (say, P1) of my choice?

I suspect this is probably impossible to do with the existing design, but maybe you can give me your take on this.

Regards,
Kiril

On 1 Aug 2018, at 17:49, Sam White <white67 AT illinois.edu> wrote:

1. AMPI, unlike ULFM, makes fault recovery transparent to the user. It does so based on its ability to copy and migrate all of the state that an AMPI rank owns and has associated with it. We rely on the Isomalloc memory allocator to do this automatically, or else the user can write explicit Pack/UnPack (PUP) routines to serialize and deserialize their state at runtime. Either way the runtime takes care of serializing all of its own internal state and the user-level thread stack.

Here's how it all works step-by-step:

- The application periodically checkpoints its state via explicit calls to AMPI_Migrate() for either in-memory or disk checkpoints.
- The runtime system continuously monitors for failures by having processes periodically ping each other (buddies are laid out in a ring fashion).
- When a failure is detected, all AMPI ranks are rolled back to their latest checkpoint, and the AMPI ranks (user-level threads) that were on the failed process are restarted in a different (already existing) OS process. If there's a newly added imbalance of ranks per process, then the user can call AMPI_Migrate() to perform dynamic load balancing to smooth that out.
- Since all of the runtime's state, as well as the application's heap and thread memory, is restored to its last checkpoint, and because AMPI virtualizes ranks, no user-level changes (such as calling MPI_Comm_shrink) are needed in the application to continue running. All communicators will continue to work, since all AMPI ranks, even those on the failed process, continue execution from the last checkpoint. The only change is the physical location of some ranks, which AMPI already virtualizes anyway.

2. Execution proceeds from the last checkpoint (call to AMPI_Migrate), since we roll back all heap and stack memory to what it was during that call.
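In application code, that checkpoint call sits right in the main loop. A minimal sketch, using the "ampi_checkpoint" info key from the AMPI manual (the compute routine and checkpoint interval here are hypothetical):

    #include <mpi.h>             // AMPI's mpi.h declares AMPI_Migrate()

    void do_jacobi_step(void);   // hypothetical compute + halo exchange
    #define CHECKPOINT_FREQ 100  // hypothetical checkpoint interval

    void jacobi_loop(int max_iters) {
      MPI_Info chkpt;
      MPI_Info_create(&chkpt);
      MPI_Info_set(chkpt, "ampi_checkpoint", "in_memory");

      for (int iter = 0; iter < max_iters; iter++) {
        do_jacobi_step();
        if (iter % CHECKPOINT_FREQ == 0)
          AMPI_Migrate(chkpt);   // checkpoint, and the rollback point
      }
      MPI_Info_free(&chkpt);
    }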

Also note that the code in charm/src/arch/mpi/machine.C is actually the MPI communication layer upon which Charm++ can be built (just like OFI, Infiniband Verbs, Cray uGNI, IBM PAMI, etc), while Adaptive MPI lives in charm/src/libs/ck-libs/ampi/. Most of the code that makes this really work is inside Charm++ and its core location management though, and AMPI basically looks like any other Charm++ application, with the addition of Isomalloc to automate serialization.

Let us know if you have further questions about any of this,
Sam

On Wed, Aug 1, 2018 at 9:24 AM, Kiril Dichev <K.Dichev AT qub.ac.uk> wrote:
Hi again,

I am afraid I will need some more clarification on how MPI recovery works after crashes in Adaptive MPI. In the sample fault-tolerant Jacobi versions for ULFM (e.g. http://fault-tolerance.org/2017/11/11/sc17-tutorial/), a lot of the MPI recovery logic is in the actual application code. It is not an easy thing to go through, but there are well-defined phases, such as:

1. revoke communicator upon failure detection
2. shrink communicator via MPI_Comm_shrink
3. expand again via MPI_Comm_spawn
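
In outline, those phases look roughly like this (simplified from the tutorial code; the MPIX_* calls are ULFM extensions, and agreement, error checking, and checkpoint reload are elided):

    #include <mpi.h>
    #include <mpi-ext.h>   // ULFM extensions (MPIX_*)

    void repair(MPI_Comm *comm, int num_failed) {
      MPIX_Comm_revoke(*comm);                               // 1. revoke
      MPI_Comm shrunk;
      MPIX_Comm_shrink(*comm, &shrunk);                      // 2. shrink
      MPI_Comm inter, repaired;
      MPI_Comm_spawn("./jacobi", MPI_ARGV_NULL, num_failed,  // 3. respawn
                     MPI_INFO_NULL, 0, shrunk, &inter,
                     MPI_ERRCODES_IGNORE);
      MPI_Intercomm_merge(inter, 0, &repaired);              // rebuild comm
      *comm = repaired;
    }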

Now, I have been very much focused on how checkpoint/restart happens, which is mostly contained in src/ck-core/ckmemcheckpoint.C. The only indications of MPI recovery there are the calls to find_spare_mpirank and mpi_restart_crashed, the latter implemented in src/arch/mpi/machine.C; however, the implementations of these routines don't give away too much.

For the moment, I have following questions:

1. How exactly does Adaptive MPI perform the above steps, which seem necessary no matter how they are implemented? I understand Adaptive MPI does not implement MPI_Comm_shrink, but it must implement something along these lines. How and where does this happen? Also, since Adaptive MPI is more thread-oriented, does it create a new Unix process, or a new thread within an existing process?

2. How exactly does execution continue post-failure in the application code, say from the start of a new iteration? This is a bit more explicit in ULFM, where survivors use the C calls setjmp/longjmp to reset to the start of a compute iteration. But how does that work with the Adaptive MPI runtime?
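For reference, the ULFM pattern I mean is roughly the following (hypothetical helper names):

    #include <setjmp.h>

    static jmp_buf restart_point;
    int  load_checkpoint(void);  // hypothetical: restore state, return iter
    void do_jacobi_step(void);   // hypothetical compute step

    void iterate(int max_iters) {
      setjmp(restart_point);     // the error handler longjmp()s back here
      for (int iter = load_checkpoint(); iter < max_iters; iter++)
        do_jacobi_step();
    }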

Thanks.

Regards,
Kiril

On 20 Jul 2018, at 22:10, Sam White <white67 AT illinois.edu> wrote:

Hi Kiril,

The checkpoint/restart-based fault tolerance schemes described in that paper are available in production for Charm++ and AMPI programs. That includes checkpointing to disk or in-memory, with online recovery. To build Charm++/AMPI with double in-memory checkpoint/restart support, you should build with the 'syncft' option, as in './build AMPI netlrts-linux-x86_64 syncft -j16 --with-production'. I just pushed some cleanup of tests/ampi/jacobi3d/, so if you do 'git pull origin charm' now, then run 'make syncfttest' in that directory you should see the test run with the '+killFile <file>' option.

Also, syncft is currently only supported on the netlrts and verbs communication layers, and message logging fault tolerance is not maintained as a production feature anymore, though it shouldn't be hard to revive it. If you can share, we'd be interested to hear what you're working on.
-Sam

On Fri, Jul 20, 2018 at 10:15 AM, Kiril Dichev <K.Dichev AT qub.ac.uk> wrote:
Hello,

I am a new user of Charm++ and AMPI.

I’ve done some research on fault tolerance in MPI in the last year, and I see some nice ways to couple it with AMPI (happy to explain if anyone is interested). I used a Jacobi solver before, so it would be nice to use the same for AMPI to get going. I am especially interested to test the parallel recovery capabilities that were presented in work such as this one, for Jacobi among other codes: https://repositoriotec.tec.ac.cr/bitstream/handle/2238/7150/Using%20Migratable%20Objects%20to%20Enhance%20Fault%20Tolerance%20Schemes%20in%20Supercomputers.pdf?sequence=1&isAllowed=y


However, I am not sure where to begin. I pulled the official Charm++ repo, which contains some MPI Jacobi code in tests/ampi/jacobi3d. In particular, it has some kill files as well, which a very old tutorial tells me can be used to specify failure scenarios for PEs. However, it seems the +pkill_file option doesn’t even exist anymore, so that’s outdated, and I don’t know if the code is up-to-date either.

On the other hand, there is a repo here, according to the documentation in the main repo:

… which I can’t access, and apparently it also has Jacobi codes I can run with AMPI. Maybe that is the one I need? If it is, can I use it if I’m not affiliated with any US institutions?

Any help on which Jacobi + AMPI code is the up-to-date one would be much appreciated. In addition, any advice on how to experiment with parallel recovery via migration would be great.


Regards,
Kiril Dichev