
charm - Re: [charm] BLCR and NAMD

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Paul Hargrove <phhargrove AT lbl.gov>
  • To: Phil Miller <mille121 AT illinois.edu>
  • Cc: Joseph Farran <jfarran AT uci.edu>, namd-l AT ks.uiuc.edu, Charm Mailing List <charm AT cs.illinois.edu>, checkpoint <checkpoint AT lbl.gov>
  • Subject: Re: [charm] BLCR and NAMD
  • Date: Thu, 7 Nov 2013 23:54:48 -0800
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>


On Thu, Nov 7, 2013 at 8:47 PM, Phil Miller <mille121 AT illinois.edu> wrote:
My suggestion for how BLCR ought to handle this is to look for file descriptors under /proc matching the checkpointing process's PID/TID, and somehow mark them to be replaced with the restarting process's PID/TID. Perhaps just store the original PID/TID with the checkpoint (if you don't already), so that the restart procedure can compare against it at the point of need.

Phil,

BLCR actually restores the PGID/PID/TID so the values after restart are the SAME as the ones at checkpoint time.

However, because of the way BLCR currently orders its restart operations, files are reopened before PID/TID restoration is done.  At the time files are being reopened, the restarting process temporarily has whatever PID/TID fork (or clone) just happened to allocate to it. Thus /proc/1234 (for example) will exist at the end of the restart, but not at the point in time that BLCR attempts to reopen /proc/1234/task/1234/stat.  There are technical reasons, which I won't go into here, why PID/TID restoration needs to be done "very late" in the restart process; they make completely swapping the order of file and PID/TID restoration impractical.

However, I think BLCR could treat open /proc/<pid> files and directories differently from other files and defer their reopen until even later than the restoration of PID/TID.  Unlike the case of regular files (for which we may need to checkpoint the CONTENTS), the state that needs to be "stashed" to delay the reopen of a /proc/<pid> file/dir is very small (path, mode and offset).  However, that is still a non-trivial change and not one I can make with a simple patch.
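
To illustrate how little state is involved, here is a user-space sketch in Python. It is not BLCR code (BLCR does this work in the kernel), and the names DeferredOpen, is_proc_pid_path, stash and reopen are hypothetical, chosen only for this example. It assumes that "mode" can be approximated by the open flags that fcntl(F_GETFL) reports.

```python
import fcntl
import os
import re
from typing import NamedTuple


class DeferredOpen(NamedTuple):
    """The small state needed to reopen a descriptor later (hypothetical name)."""
    path: str
    flags: int   # access mode and status flags, as reported by F_GETFL
    offset: int  # current file position


_PROC_PID = re.compile(r"^/proc/(\d+)(/.*)?$")


def is_proc_pid_path(path: str, pid: int) -> bool:
    """True if path is /proc/<pid> or lies beneath it, for the given pid."""
    m = _PROC_PID.match(path)
    return m is not None and int(m.group(1)) == pid


def stash(fd: int, path: str) -> DeferredOpen:
    """Record path, flags and offset so the open can be replayed later."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    offset = os.lseek(fd, 0, os.SEEK_CUR)
    return DeferredOpen(path, flags, offset)


def reopen(saved: DeferredOpen) -> int:
    """Replay a stashed open; in the scheme above this would run AFTER
    the original PID/TID has been restored."""
    fd = os.open(saved.path, saved.flags)
    os.lseek(fd, saved.offset, os.SEEK_SET)
    return fd
```

In this sketch only descriptors for which is_proc_pid_path() is true would be stashed; everything else would still be reopened early as it is today.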

Your suggestion might be applied at the time files are opened in BLCR now, without reordering any operations.  However, I looked into that and there is at least one non-obvious problem, because the "renamed" path will be recorded in the kernel as the path to the open file.  If another checkpoint is taken after a restart, the WRONG filename is recorded and the correspondence is lost.  Even w/o a subsequent checkpoint the wrong path would show in "ls -l /proc/<pid>/fd".
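
The recorded path is easy to observe from user space on Linux, since each entry in /proc/<pid>/fd is a symbolic link to it. The Python sketch below only illustrates that observation and is not part of BLCR; the helper names recorded_path and pid_in_recorded_path are hypothetical.

```python
import os
from typing import Optional


def recorded_path(fd: int) -> str:
    """Return the path the kernel reports for an open descriptor.

    Linux only: reads the symlink that "ls -l /proc/<pid>/fd" displays.
    """
    return os.readlink(f"/proc/self/fd/{fd}")


def pid_in_recorded_path(fd: int) -> Optional[int]:
    """Return the <pid> component if the descriptor refers to something
    under /proc/<pid>, otherwise None."""
    parts = recorded_path(fd).split("/")
    if len(parts) > 2 and parts[1] == "proc" and parts[2].isdigit():
        return int(parts[2])
    return None
```

If a file were opened via the temporary PID's /proc directory, pid_in_recorded_path() on that descriptor would be expected to report the temporary PID rather than the restored one, which is the mismatch described above.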

-Paul

--
Paul H. Hargrove                          PHHargrove AT lbl.gov
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900


