
charm - Re: [charm] Fault tolerant Jacobi

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system




  • From: Sam White <white67 AT illinois.edu>
  • To: Kiril Dichev <K.Dichev AT qub.ac.uk>
  • Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Fault tolerant Jacobi
  • Date: Fri, 20 Jul 2018 16:10:56 -0500

Hi Kiril,

The checkpoint/restart-based fault tolerance schemes described in that paper are available in production for Charm++ and AMPI programs. That includes checkpointing to disk or in memory, with online recovery. To build Charm++/AMPI with double in-memory checkpoint/restart support, build with the 'syncft' option, as in './build AMPI netlrts-linux-x86_64 syncft -j16 --with-production'. I just pushed some cleanup of tests/ampi/jacobi3d/, so if you run 'git pull origin charm' now and then 'make syncfttest' in that directory, you should see the test run with the '+killFile <file>' option.
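
For convenience, the steps above can be collected into a small script. This is only a sketch: it assumes it is run from the top of a Charm++ git checkout, it pulls before building (an ordering assumption, so that the cleaned-up test is present), and the DRY_RUN variable and the run helper are illustrative names, not part of Charm++. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: run from the top of a Charm++ git checkout.
# DRY_RUN and run() are hypothetical helpers for illustration only.
# DRY_RUN defaults to 1, so commands are printed but not executed;
# set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"
ARCH="netlrts-linux-x86_64"      # syncft is supported on netlrts and verbs
TEST_DIR="tests/ampi/jacobi3d"   # AMPI Jacobi test mentioned in the message

run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Fetch the recent cleanup of the jacobi3d test.
run git pull origin charm
# Build AMPI with double in-memory checkpoint/restart support.
run ./build AMPI "$ARCH" syncft -j16 --with-production
# Run the fault tolerance test, which uses the '+killFile <file>' option.
run make -C "$TEST_DIR" syncfttest
```

Running it as 'DRY_RUN=0 sh script.sh' executes the three commands in order; adjust the architecture string and the -j value to match your machine.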

Also, syncft is currently only supported on the netlrts and verbs communication layers. Message logging fault tolerance is no longer maintained as a production feature, though it shouldn't be hard to revive. If you can share, we'd be interested to hear what you're working on.
-Sam

On Fri, Jul 20, 2018 at 10:15 AM, Kiril Dichev <K.Dichev AT qub.ac.uk> wrote:
Hello,

I am a new user of Charm++ and AMPI.

I’ve done some research on fault tolerance in MPI in the last year, and I see some nice ways to couple it with AMPI (happy to explain if anyone is interested). I used a Jacobi solver before, so it would be nice to use the same for AMPI to get going. I am especially interested to test the parallel recovery capabilities that were presented in work such as this one, for Jacobi among other codes: https://repositoriotec.tec.ac.cr/bitstream/handle/2238/7150/Using%20Migratable%20Objects%20to%20Enhance%20Fault%20Tolerance%20Schemes%20in%20Supercomputers.pdf?sequence=1&isAllowed=y


However, I am not sure where to begin. I pulled the official Charm++ repo, which contains some MPI Jacobi code in tests/ampi/jacobi3d. In particular, it has some kill files as well, which a very old tutorial tells me can be used to specify failure scenarios for PEs. However, it seems the +pkill_file option doesn’t even exist anymore, so that’s outdated, and I don’t know if the code is up-to-date either.

On the other hand, there is a repo here, according to the documentation in the main repo:

… which I can’t access, and apparently it also has Jacobi codes I can run with AMPI. Maybe that is the one I need? If it is, can I use it if I’m not affiliated with any US institutions?

Any help identifying which Jacobi + AMPI code is up to date would be much appreciated. In addition, any guidance on how to experiment with parallel recovery via migration would be great.


Regards,
Kiril Dichev




