
charm - [charm] BLCR and NAMD

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Phil Miller <mille121 AT illinois.edu>
  • To: Joseph Farran <jfarran AT uci.edu>, Paul Hargrove <phhargrove AT lbl.gov>
  • Cc: checkpoint AT lbl.gov, namd-l AT ks.uiuc.edu, Charm Mailing List <charm AT cs.illinois.edu>
  • Subject: [charm] BLCR and NAMD
  • Date: Thu, 7 Nov 2013 20:47:14 -0800
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Hi Joseph and Paul,

I'm one of the Charm++ developers, and I happened to notice your conversation in the NAMD mailing list archive about checkpointing NAMD with BLCR.

The open file descriptor under /proc is not a leak, in the sense that the Charm++ runtime system code that opens this file maintains a reference to it and will refer to it later if certain functions are called. However, it's probably not something that we should really be keeping open.

We're using it to identify which core a running task is currently mapped to. In situations where that information is useful, the threads should probably be pinned to particular cores at startup anyway (+setcpuaffinity and +pemap flags), so the result can be cached and the file closed. Without thread affinity, the OS can move threads arbitrarily, so any use of that information is apt to be stale.
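
To make the "read once, cache, close" idea concrete, here is a minimal Python sketch. It is not the Charm++ code. It assumes the file in question is the per-thread stat file under /proc, whose 39th field (per proc(5)) is the CPU the thread last ran on, and the helper names are hypothetical.

```python
import os
import threading


def cpu_from_stat_line(line: str) -> int:
    """Return the 'processor' field (field 39 in proc(5)) of a stat line."""
    # The command name (field 2) is parenthesized and may contain spaces,
    # so split after the last ')' rather than on whitespace alone.
    rest = line[line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field 39 sits at index 36.
    return int(rest[36])


def read_current_cpu(tid: int) -> int:
    """Open the thread's stat file, read it once, and close it right away."""
    # Assumption: Linux, with /proc mounted.
    with open(f"/proc/self/task/{tid}/stat") as f:
        return cpu_from_stat_line(f.read())


_cpu_cache = {}  # native thread id -> CPU number, filled only when pinned


def current_cpu() -> int:
    """Return the calling thread's CPU, caching it if the thread is pinned."""
    tid = threading.get_native_id()
    if tid in _cpu_cache:
        return _cpu_cache[tid]
    cpu = read_current_cpu(tid)
    # Only a thread restricted to a single CPU has a stable answer;
    # otherwise the value may already be stale, so it is not cached.
    if len(os.sched_getaffinity(0)) == 1:
        _cpu_cache[tid] = cpu
    return cpu
```

With this shape no descriptor under /proc stays open between calls, so a checkpointer would have nothing process-specific to carry across a restart.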

I'll open up an issue in the Charm++ bug tracker about caching this information and closing the associated file descriptor.

My suggestion for how BLCR ought to handle this is to look for file descriptors under /proc matching the checkpointing process's PID/TID, and somehow mark them to be replaced with the restarting process's PID/TID. Perhaps just store the original PID/TID with the checkpoint (if you don't already), so that the restart procedure can compare against it at the point of need.
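
As a rough illustration of that comparison, the following Python sketch rewrites a saved /proc path at restart time. It is purely hypothetical: it does not use or describe BLCR's actual interfaces, and the function name and arguments are invented for the example.

```python
import re

# Matches /proc/<pid>, optionally followed by /task/<tid>, then any tail.
_PROC_PATH = re.compile(r"^/proc/(\d+)(?:/task/(\d+))?(/.*)?$")


def remap_proc_path(path, old_pid, old_tid, new_pid, new_tid):
    """Rewrite a path recorded at checkpoint time for the restarted process.

    Paths that do not refer to the checkpointed PID/TID are returned as-is.
    """
    m = _PROC_PATH.match(path)
    if m is None:
        return path  # not a per-process /proc path, e.g. /proc/cpuinfo
    pid, tid, tail = m.group(1), m.group(2), m.group(3) or ""
    if int(pid) != old_pid:
        return path  # refers to some other process; leave it alone
    if tid is None:
        return f"/proc/{new_pid}{tail}"
    if int(tid) != old_tid:
        return path  # a different thread whose new TID is not known here
    return f"/proc/{new_pid}/task/{new_tid}{tail}"
```

For example, with the original PID/TID stored as 100/101 and the restarted process running as 200/201, "/proc/100/task/101/stat" becomes "/proc/200/task/201/stat", while "/proc/cpuinfo" is left untouched.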

Phil


