charm - Re: [charm] [Early-users-discuss] NAMD and job dependency

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Joseph Baker <jlbaker AT uchicago.edu>
  • To: Nikhil Jain <nikhil AT illinois.edu>
  • Cc: wei jiang <wjiang AT alcf.anl.gov>, "charm AT cs.illinois.edu" <charm AT cs.illinois.edu>, Ray Loy <rloy AT alcf.anl.gov>, early-users-discuss <early-users-discuss AT lists.alcf.anl.gov>
  • Subject: Re: [charm] [Early-users-discuss] NAMD and job dependency
  • Date: Mon, 4 Mar 2013 16:42:54 -0600
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Recompiled the new charm++ and NAMD 2.9 nightly source on Vesta. The jobs now exit upon completion, and dependencies work correctly as well. Thanks again!

Joe


On Mon, Mar 4, 2013 at 12:24 PM, Joseph Baker <jlbaker AT uchicago.edu> wrote:
Hi Nikhil,

Thanks! I'll try recompiling NAMD in just a bit and see what happens. 

Joe


On Mon, Mar 4, 2013 at 12:22 PM, Nikhil Jain <nikhil AT illinois.edu> wrote:
Wei, Joseph

I checked in a fix for this issue. With this fix, NAMD should exit cleanly
with an exit code of 0. Let me know if you notice anything odd.

--Nikhil

--
Nikhil Jain, nikhil AT illinois.edu, http://charm.cs.uiuc.edu/people/nikhil
Doctoral Candidate @ CS, UIUC






On 3/4/13 11:56 AM, "Ray Loy" <rloy AT alcf.anl.gov> wrote:

>
>Wei will sort out the real problem but just to
>mention, possible workarounds are
>  a) manually remove the dependency from the job to get it started
>     e.g. qalter --dependencies none
>  b) you can force the original run to always have exit(0) regardless
>     of actual failure using vesta:~rloy/public/bin/nofail
>     (see comments for usage).
>
>
>Ray
>
>
>
>----- Original Message -----
>> From: "wei jiang" <wjiang AT alcf.anl.gov>
>> To: "Jeff Hammond" <jhammond AT alcf.anl.gov>
>> Cc: "Joseph Baker" <jlbaker AT uchicago.edu>, "early-users-discuss"
>><early-users-discuss AT lists.alcf.anl.gov>
>> Sent: Sunday, March 3, 2013 9:35:50 AM
>> Subject: Re: [Early-users-discuss] NAMD and job dependency
>>
>> > Charm++ shouldn't be calling exit(1) on correct termination. That's
>> > the only reasonable solution here and it's trivial to implement.
>>
>> The old implementation did call exit(0). The current discussion seems
>> beyond the scope of the early-user discussion for BG/Q.
>> I would move it to a discussion with the Charm++/NAMD developers.
>>
>> Wei
>>
>> > Jeff
>> >
>> > On Sun, Mar 3, 2013 at 9:09 AM, wei jiang <wjiang AT alcf.anl.gov>
>> > wrote:
>> > > The exit(1) breaks the job dependency. Is there a way to start a new
>> > > job regardless of dep_fail? (This is still risky: if the dependency
>> > > job really failed, the later jobs would start from garbage.)
>> > >
>> > > Wei
>> > >
>> > >
>> > > ----- Original Message -----
>> > >> From: "Joseph Baker" <jlbaker AT uchicago.edu>
>> > >> To: "early-users-discuss"
>> > >> <early-users-discuss AT lists.alcf.anl.gov>
>> > >> Sent: Sunday, March 3, 2013 3:46:46 AM
>> > >> Subject: [Early-users-discuss] NAMD and job dependency
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> Hi,
>> > >>
>> > >>
>> > >> NAMD is now shutting itself down correctly when it finishes
>> > >> running. However, if I include a job dependency in the qsub
>> > >> command for a subsequent NAMD job, that job goes into the dep_fail
>> > >> state after the first job finishes. I looked at the Vesta user
>> > >> guide online, and it says that an exit status of 0 is expected
>> > >> for the dependency not to enter the fail state.
>> > >>
>> > >>
>> > >> In the error log there are these lines
>> > >>
>> > >>
>> > >> 2013-03-03 08:04:26.526 (INFO ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: job 126717 started
>> > >> 2013-03-03 08:04:43.906 (INFO ) [0xfff99191ab0] 7653:tatu.runjob.monitor: tracklib completed
>> > >> 2013-03-03 08:16:36.620 (INFO ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: exited with status 1
>> > >> 2013-03-03 08:16:36.623 (WARN ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: normal termination with status 1 from rank 397
>> > >> 2013-03-03 08:16:36.623 (INFO ) [0xfff7aa6a270] tatu.runjob.client: task exited with status 1
>> > >> 2013-03-03 08:16:36.624 (INFO ) [0xfff99191ab0] 7653:tatu.runjob.monitor: monitor terminating
>> > >> 2013-03-03 08:16:36.625 (INFO ) [0xfff7aa6a270] tatu.runjob.client: monitor completed
>> > >>
>> > >>
>> > >> and in the cobaltlog, it ends with
>> > >>
>> > >> Info: task completed normally with an exit code of 1; initiating job cleanup and removal
>> > >>
>> > >>
>> > >> So it looks from this as though the NAMD exit code is 1. Is this
>> > >> what is causing the problem with the job dependency?
>> > >>
>> > >>
>> > >> Thanks,
>> > >> Joe
>> > >>
>> > >>
>> > >> _______________________________________________
>> > >> early-users-discuss mailing list
>> > >> early-users-discuss AT lists.alcf.anl.gov
>> > >> https://lists.alcf.anl.gov/mailman/listinfo/early-users-discuss
>> > >>
>> >
>> >
>> >
>> > --
>> > Jeff Hammond
>> > Argonne Leadership Computing Facility
>> > University of Chicago Computation Institute
>> > jhammond AT alcf.anl.gov / (630) 252-5381
>> > http://www.linkedin.com/in/jeffhammond
>> > https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>> >
>>
>_______________________________________________
>early-users-discuss mailing list
>early-users-discuss AT lists.alcf.anl.gov
>https://lists.alcf.anl.gov/mailman/listinfo/early-users-discuss
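
For readers following the thread, the exit-status rule at issue can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, it is not the contents of Ray's nofail script (which is not shown in the thread), and it does not reproduce how Cobalt evaluates dependencies internally. It shows only the rule quoted from the Vesta user guide, that a dependency is satisfied when the earlier job exits with status 0, and the idea behind workaround (b), a wrapper that always reports 0.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd (a list of arguments) and return its real exit status."""
    return subprocess.run(cmd).returncode


def dependency_satisfied(status):
    """Rule described in the thread: only exit status 0 counts as success."""
    return status == 0


def run_ignoring_failure(cmd):
    """Hypothetical nofail-style wrapper: run cmd, log the real status
    to stderr, and always report 0 to the caller."""
    status = run_and_report(cmd)
    if status != 0:
        print(f"warning: {cmd[0]} exited with status {status}; reporting 0",
              file=sys.stderr)
    return 0
```

As Wei points out in the thread, masking the status this way is risky: if the first job really failed, the dependent job starts from bad data. Keeping the real status in a log, as the sketch does on stderr, at least leaves a trace to check afterwards.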






