
Re: [charm] Fault tolerant Jacobi


  • From: Kiril Dichev <K.Dichev AT qub.ac.uk>
  • To: Sam White <white67 AT illinois.edu>
  • Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Fault tolerant Jacobi
  • Date: Mon, 23 Jul 2018 17:23:41 +0100

Hey Sam,

On the general context: I have been experimenting with localised rollback for stencil codes such as Jacobi. By localised rollback, as opposed to global rollback, I mean that I try to make fewer processes roll back. The nice thing is that I’ve been trying to do this without the need for message logging. The real issue is that you don’t necessarily save on overall execution time, unless you have something fancy in the MPI runtime, like the ability to migrate work during the rollback.

<Drum roll sounds here> Enter AMPI. I was hoping its ability to migrate work would actually lead to overall savings in runtime as well. That’s something I am unable to get right now with other implementations, and it’s what I would like to experiment with. If you have any more questions about what I want to do, I am happy to explain further.

I seem to be able to run checkpointing and recovery with the Jacobi example now. Thanks for the help! I may come back with some more questions later.

Regards,
Kiril


On 20 Jul 2018, at 22:10, Sam White <white67 AT illinois.edu> wrote:

Hi Kiril,

The checkpoint/restart-based fault tolerance schemes described in that paper are available in production for Charm++ and AMPI programs. That includes checkpointing to disk or in memory, with online recovery. To build Charm++/AMPI with double in-memory checkpoint/restart support, you should build with the 'syncft' option, as in './build AMPI netlrts-linux-x86_64 syncft -j16 --with-production'. I just pushed some cleanup of tests/ampi/jacobi3d/, so if you do 'git pull origin charm' now and then run 'make syncfttest' in that directory, you should see the test run with the '+killFile <file>' option.
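If you want to trigger checkpoints from your own application code, it is a collective call through AMPI's extension API. Below is a minimal sketch, assuming a syncft build as above; the AMPI_Migrate(MPI_Info) call and the "ampi_checkpoint" info key are from the AMPI manual, while the iteration loop and checkpoint period are illustrative placeholders rather than code from the jacobi3d test:

    #include <mpi.h>

    #define CHECKPOINT_PERIOD 100   /* illustrative value, not from the test */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Request double in-memory checkpoints (requires a syncft build). */
        MPI_Info chkpt;
        MPI_Info_create(&chkpt);
        MPI_Info_set(chkpt, "ampi_checkpoint", "in_memory");

        for (int iter = 0; iter < 1000; iter++) {
            /* ... Jacobi sweep and halo exchange would go here ... */

            if (iter % CHECKPOINT_PERIOD == 0) {
                /* Collective over all ranks: each virtual process saves its
                   state in memory; after a failure injected via +killFile,
                   execution rolls back to the most recent checkpoint. */
                AMPI_Migrate(chkpt);
            }
        }

        MPI_Info_free(&chkpt);
        MPI_Finalize();
        return 0;
    }

Compile with ampicc, typically linking with '-memory isomalloc' so heap and stack data migrate automatically.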

Also, syncft is currently only supported on the netlrts and verbs communication layers, and message logging fault tolerance is not maintained as a production feature anymore, though it shouldn't be hard to revive it. If you can share, we'd be interested to hear what you're working on.
-Sam

On Fri, Jul 20, 2018 at 10:15 AM, Kiril Dichev <K.Dichev AT qub.ac.uk> wrote:
Hello,

I am a new user of Charm++ and AMPI.

I’ve done some research on fault tolerance in MPI in the last year, and I see some nice ways to couple it with AMPI (happy to explain if anyone is interested). I have used a Jacobi solver before, so it would be nice to use the same code with AMPI to get going. I am especially interested in testing the parallel recovery capabilities that were presented, for Jacobi among other codes, in work such as this one: https://repositoriotec.tec.ac.cr/bitstream/handle/2238/7150/Using%20Migratable%20Objects%20to%20Enhance%20Fault%20Tolerance%20Schemes%20in%20Supercomputers.pdf?sequence=1&isAllowed=y


However, I am not sure where to begin. I pulled the official Charm++ repo, which contains some MPI Jacobi code in tests/ampi/jacobi3d. In particular, it also includes some kill files, which a very old tutorial tells me can be used to specify failure scenarios for PEs. However, it seems the +pkill_file option doesn’t exist anymore, so that tutorial is outdated, and I don’t know whether the code itself is up to date either.

On the other hand, there is a repo here, according to the documentation in the main repo:

… which I can’t access, and apparently it also has Jacobi codes I can run with AMPI. Maybe that is the one I need? If it is, can I use it if I’m not affiliated with any US institutions?

Any help on which is the up-to-date Jacobi + AMPI example would be much appreciated. In addition, any pointers on how to experiment with parallel recovery via migration would be great.


Regards,
Kiril Dichev





