
charm - Re: [charm] Fault Tolerance

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Sam White <samt.white AT gmail.com>
  • To: "alberto.ortiz09 AT gmail.com" <alberto.ortiz09 AT gmail.com>
  • Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Fault Tolerance
  • Date: Thu, 16 Feb 2017 12:11:21 -0600

Hi Alberto,

To enable automatic fault detection and recovery in AMPI, you'll need to build Charm++/AMPI with 'syncft' support, as in './build AMPI netlrts-linux syncft --with-production'.

Then the only thing you have to do is checkpoint from your application every so often, using calls to AMPI_Migrate(MPI_Info). To request in-memory checkpointing, create an info object like so:

MPI_Info chkpt_info;
MPI_Info_create(&chkpt_info);
MPI_Info_set(chkpt_info, "ampi_checkpoint", "in_memory");

Then every so many iterations of your application, make calls like this:

AMPI_Migrate(chkpt_info);

When AMPI detects a fault, it will restart all ranks from their last checkpoint automatically.
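
As a rough sketch of how the info object and the periodic AMPI_Migrate call might fit into an iterative application, something like the following could work. The function name run_with_checkpoints, its parameters, and the USE_AMPI macro are hypothetical and only for illustration. The stand-in definitions in the #else branch are not part of AMPI; they exist so the sketch compiles on a machine without AMPI installed. The sketch assumes MPI_Init has already been called and that every rank reaches the checkpoint call at the same iteration.

```c
#include <assert.h>

#ifdef USE_AMPI   /* hypothetical macro: define it when compiling with ampicc */
#include <mpi.h>
#else
/* Stand-ins (NOT the real AMPI API) so the sketch compiles without AMPI. */
typedef int MPI_Info;
static int MPI_Info_create(MPI_Info *info) { *info = 0; return 0; }
static int MPI_Info_set(MPI_Info info, const char *key, const char *value)
{ (void)info; (void)key; (void)value; return 0; }
static int MPI_Info_free(MPI_Info *info) { *info = 0; return 0; }
static int AMPI_Migrate(MPI_Info hints) { (void)hints; return 0; }
#endif

/* Runs num_iters iterations and requests an in-memory checkpoint every
 * chkpt_interval iterations. Returns how many checkpoints were requested. */
int run_with_checkpoints(int num_iters, int chkpt_interval)
{
    MPI_Info chkpt_info;
    int checkpoints = 0;

    assert(chkpt_interval > 0);

    /* Same key/value pair as in the message above. */
    MPI_Info_create(&chkpt_info);
    MPI_Info_set(chkpt_info, "ampi_checkpoint", "in_memory");

    for (int iter = 1; iter <= num_iters; iter++) {
        /* ... one iteration of application work goes here ... */

        if (iter % chkpt_interval == 0) {
            AMPI_Migrate(chkpt_info);   /* all ranks call this together */
            checkpoints++;
        }
    }

    MPI_Info_free(&chkpt_info);
    return checkpoints;
}
```

How often to checkpoint is a trade-off: a shorter interval loses less work after a fault but spends more time checkpointing.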

If you are not using PUP routines for migratability, you will need to link your application with Isomalloc too, as in 'ampicc -memory isomalloc pgm.o -o pgm'

For dynamic load balancing, you can also use Isomalloc to automate migratability. The only other changes needed are to 1) create an MPI_Info object with the key "ampi_load_balance" set to the value "sync" and pass it to AMPI_Migrate() at the points where you want load balancing to happen, 2) link your application with '-memory isomalloc -module CommonLBs', and 3) select the desired load balancing strategy at runtime, as in '+balancer GreedyLB'. Load balancing migrates whole AMPI ranks along with all of the data they own.
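
A minimal sketch of the application-side change for load balancing, under the same caveats as any illustration: the function name run_with_load_balancing, the lb_period parameter, and the USE_AMPI macro are hypothetical, and the stand-ins in the #else branch are not the real AMPI API (they only let the sketch compile without AMPI). It assumes MPI_Init has already been called and that all ranks call AMPI_Migrate at the same iteration.

```c
#include <assert.h>

#ifdef USE_AMPI   /* hypothetical macro: define it when compiling with ampicc */
#include <mpi.h>
#else
/* Stand-ins (NOT the real AMPI API) so the sketch compiles without AMPI. */
typedef int MPI_Info;
static int MPI_Info_create(MPI_Info *info) { *info = 0; return 0; }
static int MPI_Info_set(MPI_Info info, const char *key, const char *value)
{ (void)info; (void)key; (void)value; return 0; }
static int MPI_Info_free(MPI_Info *info) { *info = 0; return 0; }
static int AMPI_Migrate(MPI_Info hints) { (void)hints; return 0; }
#endif

/* Runs num_iters iterations and hands control to the load balancer every
 * lb_period iterations. Returns how many times load balancing was requested. */
int run_with_load_balancing(int num_iters, int lb_period)
{
    MPI_Info lb_info;
    int lb_calls = 0;

    assert(lb_period > 0);

    /* "ampi_load_balance" = "sync" is the key/value pair named above. */
    MPI_Info_create(&lb_info);
    MPI_Info_set(lb_info, "ampi_load_balance", "sync");

    for (int iter = 1; iter <= num_iters; iter++) {
        /* ... one iteration of application work goes here ... */

        if (iter % lb_period == 0) {
            AMPI_Migrate(lb_info);   /* ranks may be moved to other cores here */
            lb_calls++;
        }
    }

    MPI_Info_free(&lb_info);
    return lb_calls;
}
```

The strategy itself is not chosen in the code: it comes from the '+balancer' option given at launch, so different balancers can be tried without recompiling.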

For more information on all of this, see the AMPI manual here: http://charm.cs.illinois.edu/manuals/html/ampi/manual.html
The section of the Charm++ manual on load balancing might also be helpful, since AMPI directly uses Charm++'s load balancing framework.

Let me know if you run into any other issues with this,
Sam

On Thu, Feb 16, 2017 at 11:40 AM, alberto.ortiz09 AT gmail.com <alberto.ortiz09 AT gmail.com> wrote:
Hi,

I am using AMPI on a Zynq cluster in which each Zynq has a dual-core ARM. Currently
I am using 3 MicroZed boards (each one has a Zynq device). I was interested in
using AMPI from the start instead of OpenMPI since it provides the user
with fault tolerance, adaptability and resilience.

The problem I have is that I don't know how to use or activate its fault
tolerance. I am programming in C using MPI and compiling the
programs with ampicc. The fault tolerance test I would like to try is to have
the 3 devices running a task and then reboot or unplug one of them, expecting AMPI
to redistribute the threads that were started on the unplugged device to the
working devices. I don't know whether this kind of fault tolerance is implemented,
or how to take advantage of the fault tolerance that is implemented.

Another thing I would like to ask is if AMPI has support for run-time load
balancing. For example, if I were to multiply 10 big matrices and one node
ended its task before others, how can I implement the run-time load balance in
order to load the node with more work taken from other overloaded nodes?

Thank you in advance for the continuous support,
Alberto.
