Re: [charm] [Early-users-discuss] NAMD and job dependency


  • From: Phil Miller <mille121 AT illinois.edu>
  • To: Joseph Baker <jlbaker AT uchicago.edu>, Nikhil Jain <nikhil AT illinois.edu>
  • Cc: Jeff Hammond <jhammond AT alcf.anl.gov>, wei jiang <wjiang AT alcf.anl.gov>, Charm Mailing List <charm AT cs.illinois.edu>
  • Subject: Re: [charm] [Early-users-discuss] NAMD and job dependency
  • Date: Sun, 3 Mar 2013 10:11:23 -0600
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

On Sun, Mar 3, 2013 at 10:00 AM, Phil Miller
<mille121 AT illinois.edu>
wrote:
> On Sun, Mar 3, 2013 at 9:35 AM, wei jiang
> <wjiang AT alcf.anl.gov>
> wrote:
>>> Charm++ shouldn't be calling exit(1) on correct termination. That's
>>> the only reasonable solution here and it's trivial to implement.
>>
>> The old implementation did call exit(0). The current discussion seems
>> beyond the scope of the early-user discussion for BG/Q; I would turn
>> it over to the Charm++/NAMD developers.
>
> This can hop over to the Charm++ mailing list.
>
> First question is which machine layer is being used:
> {pami,pamilrts,mpi}-bluegeneq(-smp(-async)?)?(-xlc)?

Actually, scratch that; looking at the code, it has to be
pami-bluegeneq-smp-*, and the tail doesn't matter.

It looks like the change that made thread 0 on each node call exit(1)
instead of exit(0) happened in commit
aac1a19ecf0fdf132ae93987c1271c30bbd33f37. Perhaps Nikhil knows why the
introduction of a local barrier at shutdown to synchronize threads
also included a change to the exit code, and whether it's safe for us
to make that exit(0)?

>>> On Sun, Mar 3, 2013 at 9:09 AM, wei jiang
>>> <wjiang AT alcf.anl.gov>
>>> wrote:
>>> > exit(1) ruins the job dependency. Is there a way to start a new
>>> > job regardless of dep_fail? (This is still risky: if the job we
>>> > depend on really failed, the later jobs would start from garbage.)
>>> >
>>> > Wei
>>> >
>>> >
>>> > ----- Original Message -----
>>> >> From: "Joseph Baker"
>>> >> <jlbaker AT uchicago.edu>
>>> >> To: "early-users-discuss"
>>> >> <early-users-discuss AT lists.alcf.anl.gov>
>>> >> Sent: Sunday, March 3, 2013 3:46:46 AM
>>> >> Subject: [Early-users-discuss] NAMD and job dependency
>>> >>
>>> >> Hi,
>>> >>
>>> >>
>>> >> NAMD is now shutting itself down correctly when finished running;
>>> >> however, if I include a job dependency in the qsub command for a
>>> >> subsequent NAMD job, it goes into the dep_fail state after the
>>> >> first job finishes. I looked at the Vesta user guide online, and
>>> >> it says the job must exit with status 0 for the dependency not to
>>> >> enter the fail state.
>>> >>
>>> >>
>>> >> In the error log there are these lines
>>> >>
>>> >>
>>> >> 2013-03-03 08:04:26.526 (INFO ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: job 126717 started
>>> >> 2013-03-03 08:04:43.906 (INFO ) [0xfff99191ab0] 7653:tatu.runjob.monitor: tracklib completed
>>> >> 2013-03-03 08:16:36.620 (INFO ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: exited with status 1
>>> >> 2013-03-03 08:16:36.623 (WARN ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: normal termination with status 1 from rank 397
>>> >> 2013-03-03 08:16:36.623 (INFO ) [0xfff7aa6a270] tatu.runjob.client: task exited with status 1
>>> >> 2013-03-03 08:16:36.624 (INFO ) [0xfff99191ab0] 7653:tatu.runjob.monitor: monitor terminating
>>> >> 2013-03-03 08:16:36.625 (INFO ) [0xfff7aa6a270] tatu.runjob.client: monitor completed
>>> >>
>>> >>
>>> >> and in the cobaltlog, it ends with
>>> >>
>>> >> Info: task completed normally with an exit code of 1; initiating
>>> >> job cleanup and removal
>>> >>
>>> >>
>>> >> So, it looks from this that the NAMD exit code is 1. Is this
>>> >> what is causing the problem with the job dependency?
>>> >>
>>> >>
>>> >> Thanks,
>>> >> Joe
>>> >>
>>> >>
>>> >> _______________________________________________
>>> >> early-users-discuss mailing list
>>> >> early-users-discuss AT lists.alcf.anl.gov
>>> >> https://lists.alcf.anl.gov/mailman/listinfo/early-users-discuss
>>> >>
>>>
>>>
>>>
>>> --
>>> Jeff Hammond
>>> Argonne Leadership Computing Facility
>>> University of Chicago Computation Institute
>>> jhammond AT alcf.anl.gov
>>> / (630) 252-5381
>>> http://www.linkedin.com/in/jeffhammond
>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>>



