
charm - Re: [charm] BLCR and NAMD

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Joseph Farran <jfarran AT uci.edu>
  • To: Phil Miller <mille121 AT illinois.edu>, Paul Hargrove <phhargrove AT lbl.gov>
  • Cc: checkpoint AT lbl.gov, namd-l AT ks.uiuc.edu, Joseph Farran <jfarran AT uci.edu>, Charm Mailing List <charm AT cs.illinois.edu>
  • Subject: Re: [charm] BLCR and NAMD
  • Date: Thu, 07 Nov 2013 20:56:44 -0800
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Hi Phil.

Thank you for the details. Paul Hargrove was able to provide us with a
patch to BLCR that makes cr_restart warn, but keep running, when files
under /proc are missing.

We have been using the patched BLCR version on our cluster with great success,
and NAMD can be checkpointed and resumed without issues.

Many thanks to Paul Hargrove, as NAMD is used extensively on our campus
cluster. If you can fix this in the next release of NAMD, that would be great,
as others may then enjoy BLCR checkpointing with NAMD as well.

Cheers,
Joseph


On 11/7/2013 8:47 PM, Phil Miller wrote:
Hi Joseph and Paul,

I'm one of the Charm++ developers, and I happened to notice your conversation
in the NAMD mailing list archive about checkpointing NAMD with BLCR.

The open file descriptor under /proc is not a leak, in the sense that the Charm++ runtime system code that opens this file maintains a reference to it and will refer to it later if certain functions are called. However, it's probably not something that we should really be keeping open.

We're using it to identify which core a running task is currently mapped to. In situations in which that's useful information to have, the threads should probably be getting pinned to particular cores at startup anyway (+setcpuaffinity and +pemap flags), and thus the results can be cached and the file closed. Without thread affinity, the OS can be moving threads arbitrarily anyway, so any use of that information is apt to be stale.
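
As a rough illustration of the "read once, cache, and close" idea described above, here is a minimal Python sketch. It is not the Charm++ code; it assumes the core is taken from the `processor` field (field 39) of a Linux `/proc/<pid>/stat`-style file, and the function names and the `pinned` parameter are hypothetical.

```python
# Hypothetical sketch: read the core a task last ran on from /proc,
# close the file immediately, and cache the answer only when the task is pinned.

PROCESSOR_FIELD = 39  # 1-based index of "processor" in /proc/<pid>/stat (Linux)


def parse_processor_field(stat_line):
    """Extract the processor field from the contents of a stat file."""
    # The comm field (field 2) is parenthesised and may itself contain spaces
    # or parentheses, so split on the LAST ')' before counting fields.
    _, _, rest = stat_line.rpartition(")")
    fields = rest.split()
    # fields[0] is field 3 (state), so field N sits at index N - 3.
    return int(fields[PROCESSOR_FIELD - 3])


def read_current_core(pid="self", tid=None):
    """Open the stat file, read it once, and close it again right away."""
    if tid is None:
        path = f"/proc/{pid}/stat"
    else:
        path = f"/proc/{pid}/task/{tid}/stat"
    with open(path) as f:  # no descriptor stays open after this block
        return parse_processor_field(f.read())


_cached_core = None


def current_core(pinned):
    """Return the core; reuse the cached value only if the task is pinned."""
    global _cached_core
    if pinned and _cached_core is not None:
        return _cached_core
    core = read_current_core()
    if pinned:
        # With affinity set, the mapping cannot change, so caching is safe.
        _cached_core = core
    return core
```

With this shape, no descriptor under /proc is left open at checkpoint time; an unpinned task simply re-reads the file on each query, since a cached value could be stale.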

I'll open up an issue in the Charm++ bug tracker about caching this
information and closing the associated file descriptor.

My suggestion for how BLCR ought to handle this is to look for file descriptors under /proc matching the checkpointing process's PID/TID, and somehow mark them to be replaced with the restarting process's PID/TID. Perhaps just store the original PID/TID with the checkpoint (if you don't already), so that the restart procedure can compare against it at the point of need.
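
To make the suggestion concrete, here is a small Python sketch of the path comparison it implies. It is purely illustrative and does not reflect how BLCR is implemented; the function name and its parameters are hypothetical, and it assumes paths of the form `/proc/<pid>/...` or `/proc/<pid>/task/<tid>/...`.

```python
# Hypothetical sketch: rewrite a /proc path recorded at checkpoint time so that
# it refers to the restarted process instead of the original one.

def remap_proc_path(path, old_pid, old_tid, new_pid, new_tid):
    """Return path with the original PID/TID replaced by the restart PID/TID.

    Paths that do not refer to the checkpointed process are returned unchanged.
    """
    parts = path.split("/")
    # An absolute path "/proc/1234/stat" splits into ['', 'proc', '1234', 'stat'].
    if len(parts) < 3 or parts[0] != "" or parts[1] != "proc":
        return path
    if parts[2] != str(old_pid):
        return path  # some other process, or a non-PID entry such as cpuinfo
    parts[2] = str(new_pid)
    # Per-thread entries live under /proc/<pid>/task/<tid>/...
    if len(parts) >= 5 and parts[3] == "task" and parts[4] == str(old_tid):
        parts[4] = str(new_tid)
    return "/".join(parts)
```

For example, with an original PID/TID of 1234/1240 and a restart PID/TID of 5678/5690, `/proc/1234/task/1240/stat` would map to `/proc/5678/task/5690/stat`, while `/proc/cpuinfo` would be left alone.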

Phil




