charm - Re: [charm] Fault Tolerance Documentation

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: "Kale, Laxmikant V" <kale AT illinois.edu>
  • To: "Wang, Felix Y." <wang65 AT llnl.gov>, "charm AT cs.illinois.edu" <charm AT cs.illinois.edu>
  • Subject: Re: [charm] Fault Tolerance Documentation
  • Date: Wed, 25 Jul 2012 12:19:25 +0000
  • Accept-language: en-US
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Thanks for your comments, Felix. We are in the process of updating the manuals (and, hopefully soon, writing a tutorial), and we will keep your comments in mind. For now, I would like to know: there is a Charm++ manual online (at http://charm.cs.illinois.edu/manuals/; it is the one titled "Charm++ language"). Section 6 is about fault tolerance. Was the info in there inadequate or incorrect? We plan to add at least one example program for checkpointing anyway, but it would be useful to know the specific deficiencies of that manual section as we work on improving it.


-- 
Laxmikant (Sanjay) Kale         http://charm.cs.uiuc.edu
Professor, Computer Science     kale AT illinois.edu
201 N. Goodwin Avenue           Ph:  (217) 244-0094
Urbana, IL  61801-2302          FAX: (217) 265-6582

On 7/24/12 5:28 PM, "Wang, Felix Y." <wang65 AT llnl.gov> wrote:

Hello PPL,

I'm an intern at LLNL over the summer. I've been working on a port of the LULESH proxy application to Charm++, and over the past few days I have started to put in some constructs for fault tolerance (checkpoints/restarts). Unfortunately, the documentation that is generally available online is rather sparse, and it does not point to any good examples of checkpointing and restarting as they are actually used in a program. Fortunately, I've been able to meet with Xiang to discuss how to actually do the implementation, and she was able to point me to some example code as well as explain how to build Charm++ to incorporate these constructs in the first place, among other necessary items.

Please take this email as a request to provide a more comprehensive manual section on the fault tolerance aspects of Charm++. A section/link to a tutorial, such as with the PUPers, would also be helpful.

Thanks,

--- Felix


