charm - Re: [charm] Charm/Converse fault tolerance on BW?

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Xiang Ni <xiangni2 AT illinois.edu>
  • To: "Brunner, Robert Kraemer" <rbrunner AT illinois.edu>
  • Cc: "charm AT cs.illinois.edu" <charm AT cs.illinois.edu>
  • Subject: Re: [charm] Charm/Converse fault tolerance on BW?
  • Date: Thu, 17 Mar 2016 10:52:19 -0500

Hi Robert,

Restarting and continuing execution on the remaining processes (without adding a new process) works in Charm++. However, that feature has not yet been merged into the production version. Also, we have so far tested it only with the net build of Charm++; we have not tried it on a Cray XE system.

Thanks,
Xiang

On Thu, Mar 17, 2016 at 10:18 AM, Brunner, Robert Kraemer <rbrunner AT illinois.edu> wrote:
Hi,

What is the state of fault tolerance support on Cray XE systems (in particular, Blue Waters) with respect to allowing user code to catch node failures? Can the runtime notify the user program that a node has failed and allow the program to handle the failure, perhaps continuing to run while taking the loss of the node and any associated objects into account?

Robert

----------------------------------------------
Robert Brunner
Blue Waters Science and Engineering Applications Support
National Center for Supercomputing Applications
4006F NCSA Building, MC-257
1205 W Clark St
Urbana, IL 61801
217-333-7677
rbrunner AT illinois.edu







