
charm - Re: [charm] [ppl] [Early-users-discuss] NAMD and job dependency

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: "Kale, Laxmikant V" <kale AT illinois.edu>
  • To: "Miller, Philip B" <mille121 AT illinois.edu>, Joseph Baker <jlbaker AT uchicago.edu>, "Jain, Nikhil" <nikhil AT illinois.edu>
  • Cc: Jeff Hammond <jhammond AT alcf.anl.gov>, wei jiang <wjiang AT alcf.anl.gov>, Charm Mailing List <charm AT cs.illinois.edu>
  • Subject: Re: [charm] [ppl] [Early-users-discuss] NAMD and job dependency
  • Date: Sun, 3 Mar 2013 17:18:11 +0000
  • Accept-language: en-US
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Nikhil will respond with more technical detail, but for a quick piece of
information: this was done recently following a conversation between
Nikhil and Sameer (IBM). Exit(1) on BG is kind of legitimate, if I
understand the conversation right. Without accepting that, our autobuild
process was failing. If that's the case, maybe the higher-level scripts
should accept "1" as (one of the) proper exit codes.
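
As a rough illustration of what "accepting 1" in a higher-level script could look like, here is a minimal Python sketch of a wrapper that runs a command and maps a configurable set of exit statuses to 0. The names `normalize_exit_status`, `run_and_normalize`, and `main` are hypothetical and not part of Charm++, NAMD, or Cobalt, and the default accepted set `(0, 1)` is an assumption taken from this thread. Note the risk raised elsewhere in the thread: a run that genuinely fails with status 1 would also be reported as success.

```python
import subprocess
import sys


def normalize_exit_status(status, accepted=(0, 1)):
    """Map any status in `accepted` to 0; pass every other status through."""
    return 0 if status in accepted else status


def run_and_normalize(cmd, accepted=(0, 1)):
    """Run `cmd` (a list of arguments) and return its normalized exit status."""
    completed = subprocess.run(cmd)
    return normalize_exit_status(completed.returncode, accepted)


def main(argv=None):
    """Hypothetical entry point: treat the arguments as the command to run."""
    argv = sys.argv[1:] if argv is None else argv
    if not argv:
        print("usage: wrapper.py <command> [args...]", file=sys.stderr)
        return 2
    return run_and_normalize(argv)
```

A script entry point would call `sys.exit(main())`, so that the scheduler sees the normalized status rather than the raw one.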

--
Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu
<http://charm.cs.uiuc.edu/>
Professor, Computer Science
kale AT illinois.edu
201 N. Goodwin Avenue Ph: (217) 244-0094
Urbana, IL 61801-2302 FAX: (217) 265-6582






On 3/3/13 10:11 AM, "Phil Miller"
<mille121 AT illinois.edu>
wrote:

>On Sun, Mar 3, 2013 at 10:00 AM, Phil Miller
><mille121 AT illinois.edu>
>wrote:
>> On Sun, Mar 3, 2013 at 9:35 AM, wei jiang
>> <wjiang AT alcf.anl.gov>
>> wrote:
>>>> Charm++ shouldn't be calling exit(1) on correct termination. That's
>>>> the only reasonable solution here and it's trivial to implement.
>>>
>>> The old implementation did call exit(0). The current discussion seems
>>> beyond the scope of the early-user discussion for bgq.
>>> I would turn it into a discussion with charm++/namd developers.
>>
>> This can hop over to the Charm++ mailing list.
>>
>> First question is which machine layer is being used:
>> {pami,pamilrts,mpi}-bluegeneq(-smp(-async)?)?(-xlc)?
>
>Actually, scratch that; looking at the code, it has to be
>pami-bluegeneq-smp-*, and the tail doesn't matter.
>
>It looks like the change to thread 0 on each node calling exit(1)
>instead of exit(0) happened in commit
>aac1a19ecf0fdf132ae93987c1271c30bbd33f37. Perhaps Nikhil knows why the
>introduction of a local barrier at shutdown to synchronize threads
>also included a change to the exit code, and whether it's safe for us
>to make that exit(0)?
>
>>>> On Sun, Mar 3, 2013 at 9:09 AM, wei jiang
>>>> <wjiang AT alcf.anl.gov>
>>>> wrote:
>>>> > Exit(1) ruins the job dependency. Is there a way to start a new
>>>> > job regardless of dep_fail? (This is still risky: if the dependency
>>>> > job really failed, then the later jobs would start from garbage.)
>>>> >
>>>> > Wei
>>>> >
>>>> >
>>>> > ----- Original Message -----
>>>> >> From: "Joseph Baker"
>>>> >> <jlbaker AT uchicago.edu>
>>>> >> To: "early-users-discuss"
>>>> >> <early-users-discuss AT lists.alcf.anl.gov>
>>>> >> Sent: Sunday, March 3, 2013 3:46:46 AM
>>>> >> Subject: [Early-users-discuss] NAMD and job dependency
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> Hi,
>>>> >>
>>>> >>
>>>> >> NAMD is now shutting itself down correctly when finished running,
>>>> >> however if I include a job dependency in the qsub command for a
>>>> >> subsequent NAMD job, it goes into the dep_fail state after the
>>>> >> first
>>>> >> job finishes. I looked on the Vesta user guide online, and it says
>>>> >> that there is an expected exit status 0 for the dependency to not
>>>> >> enter the fail state.
>>>> >>
>>>> >>
>>>> >> In the error log there are these lines
>>>> >>
>>>> >>
>>>> >> 2013-03-03 08:04:26.526 (INFO ) [0xfff7aa6a270]
>>>> >> VST-00040-33771-1024:126717:ibm.runjob.client.Job: job 126717
>>>> >> started
>>>> >> 2013-03-03 08:04:43.906 (INFO ) [0xfff99191ab0]
>>>> >> 7653:tatu.runjob.monitor: tracklib completed
>>>> >> 2013-03-03 08:16:36.620 (INFO ) [0xfff7aa6a270]
>>>> >> VST-00040-33771-1024:126717:ibm.runjob.client.Job: exited with
>>>> >> status 1
>>>> >> 2013-03-03 08:16:36.623 (WARN ) [0xfff7aa6a270]
>>>> >> VST-00040-33771-1024:126717:ibm.runjob.client.Job: normal
>>>> >> termination with status 1 from rank 397
>>>> >> 2013-03-03 08:16:36.623 (INFO ) [0xfff7aa6a270]
>>>> >> tatu.runjob.client:
>>>> >> task exited with status 1
>>>> >> 2013-03-03 08:16:36.624 (INFO ) [0xfff99191ab0]
>>>> >> 7653:tatu.runjob.monitor: monitor terminating
>>>> >> 2013-03-03 08:16:36.625 (INFO ) [0xfff7aa6a270]
>>>> >> tatu.runjob.client:
>>>> >> monitor completed
>>>> >>
>>>> >>
>>>> >> and in the cobaltlog, it ends with
>>>> >>
>>>> >> Info: task completed normally with an exit code of 1; initiating
>>>> >> job
>>>> >> cleanup and removal
>>>> >>
>>>> >>
>>>> >> So, it looks from this that the NAMD exit code is 1. Is this what
>>>> >> is
>>>> >> causing the problem with the job dependency?
>>>> >>
>>>> >>
>>>> >> Thanks,
>>>> >> Joe
>>>> >>
>>>> >>
>>>> >> _______________________________________________
>>>> >> early-users-discuss mailing list
>>>> >> early-users-discuss AT lists.alcf.anl.gov
>>>> >> https://lists.alcf.anl.gov/mailman/listinfo/early-users-discuss
>>>> >>
>>>>
>>>>
>>>>
>>>> --
>>>> Jeff Hammond
>>>> Argonne Leadership Computing Facility
>>>> University of Chicago Computation Institute
>>>> jhammond AT alcf.anl.gov
>>>> / (630) 252-5381
>>>> http://www.linkedin.com/in/jeffhammond
>>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>>>
>_______________________________________________
>charm mailing list
>charm AT cs.uiuc.edu
>http://lists.cs.uiuc.edu/mailman/listinfo/charm
>_______________________________________________
>ppl mailing list
>ppl AT cs.uiuc.edu
>http://lists.cs.uiuc.edu/mailman/listinfo/ppl





