charm - Re: [charm] [Early-users-discuss] NAMD and job dependency

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Phil Miller <mille121 AT illinois.edu>
  • To: Joseph Baker <jlbaker AT uchicago.edu>
  • Cc: Jeff Hammond <jhammond AT alcf.anl.gov>, wei jiang <wjiang AT alcf.anl.gov>, Charm Mailing List <charm AT cs.illinois.edu>
  • Subject: Re: [charm] [Early-users-discuss] NAMD and job dependency
  • Date: Sun, 3 Mar 2013 10:00:07 -0600
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

On Sun, Mar 3, 2013 at 9:35 AM, wei jiang <wjiang AT alcf.anl.gov> wrote:
>> Charm++ shouldn't be calling exit(1) on correct termination. That's
>> the only reasonable solution here and it's trivial to implement.
>
> The old implementation did call exit(0). The current discussion seems
> beyond the scope of the early-user discussion list for BG/Q.
> I would move it to a discussion with the Charm++/NAMD developers.

This can hop over to the Charm++ mailing list.

First question is which machine layer is being used:
{pami,pamilrts,mpi}-bluegeneq(-smp(-async)?)?(-xlc)?
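
The pattern above mixes shell-style alternation with regex-style optional groups. As a rough sketch, the following Python expands it into the candidate names it describes. The names come purely from expanding the pattern; whether each one corresponds to a build that actually exists is an assumption not confirmed in this thread, and the function name is hypothetical.

```python
from itertools import product

# Pieces of the pattern {pami,pamilrts,mpi}-bluegeneq(-smp(-async)?)?(-xlc)?
LAYERS = ["pami", "pamilrts", "mpi"]
SMP_VARIANTS = ["", "-smp", "-smp-async"]   # -async only appears after -smp
COMPILER_VARIANTS = ["", "-xlc"]


def candidate_targets():
    """Expand the pattern into every name it can match (hypothetical helper)."""
    return [
        f"{layer}-bluegeneq{smp}{compiler}"
        for layer, smp, compiler in product(LAYERS, SMP_VARIANTS, COMPILER_VARIANTS)
    ]


targets = candidate_targets()
for name in targets:
    print(name)
```

Knowing which of these names was built narrows down which machine layer's shutdown path produced the exit status.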

>
> Wei
>
>> Jeff
>>
>> On Sun, Mar 3, 2013 at 9:09 AM, wei jiang <wjiang AT alcf.anl.gov> wrote:
>> > Exit(1) has ruined the job dependency. Is there a way to start a new
>> > job regardless of dep_fail? (This is still risky: if the dependency
>> > job really failed, then the later jobs would start from garbage.)
>> >
>> > Wei
>> >
>> >
>> > ----- Original Message -----
>> >> From: "Joseph Baker" <jlbaker AT uchicago.edu>
>> >> To: "early-users-discuss" <early-users-discuss AT lists.alcf.anl.gov>
>> >> Sent: Sunday, March 3, 2013 3:46:46 AM
>> >> Subject: [Early-users-discuss] NAMD and job dependency
>> >>
>> >> Hi,
>> >>
>> >>
>> >> NAMD is now shutting itself down correctly when finished running;
>> >> however, if I include a job dependency in the qsub command for a
>> >> subsequent NAMD job, it goes into the dep_fail state after the first
>> >> job finishes. I looked at the Vesta user guide online, and it says
>> >> that an exit status of 0 is expected for the dependency not to
>> >> enter the fail state.
>> >>
>> >>
>> >> In the error log there are these lines
>> >>
>> >>
>> >> 2013-03-03 08:04:26.526 (INFO ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: job 126717 started
>> >> 2013-03-03 08:04:43.906 (INFO ) [0xfff99191ab0] 7653:tatu.runjob.monitor: tracklib completed
>> >> 2013-03-03 08:16:36.620 (INFO ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: exited with status 1
>> >> 2013-03-03 08:16:36.623 (WARN ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: normal termination with status 1 from rank 397
>> >> 2013-03-03 08:16:36.623 (INFO ) [0xfff7aa6a270] tatu.runjob.client: task exited with status 1
>> >> 2013-03-03 08:16:36.624 (INFO ) [0xfff99191ab0] 7653:tatu.runjob.monitor: monitor terminating
>> >> 2013-03-03 08:16:36.625 (INFO ) [0xfff7aa6a270] tatu.runjob.client: monitor completed
>> >>
>> >>
>> >> and in the cobaltlog, it ends with
>> >>
>> >> Info: task completed normally with an exit code of 1; initiating job cleanup and removal
>> >>
>> >>
>> >> So, it looks from this that the NAMD exit code is 1. Is this what is
>> >> causing the problem with the job dependency?
>> >>
>> >>
>> >> Thanks,
>> >> Joe
>> >>
>> >>
>> >> _______________________________________________
>> >> early-users-discuss mailing list
>> >> early-users-discuss AT lists.alcf.anl.gov
>> >> https://lists.alcf.anl.gov/mailman/listinfo/early-users-discuss
>> >>
>>
>>
>>
>> --
>> Jeff Hammond
>> Argonne Leadership Computing Facility
>> University of Chicago Computation Institute
>> jhammond AT alcf.anl.gov / (630) 252-5381
>> http://www.linkedin.com/in/jeffhammond
>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>
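
The runjob log quoted above reports the status on lines containing "exited with status N". A minimal sketch, assuming that wording is stable (an assumption based only on the excerpt in this thread), of pulling the status out of such a log so a wrapper can decide whether a dependent job should proceed. The function name is hypothetical, and the sample text is copied from the quoted log, including its email line wrapping.

```python
import re

# Matches entries such as "...ibm.runjob.client.Job: exited with status 1"
EXIT_RE = re.compile(r"exited with status (\d+)")


def runjob_exit_status(log_text):
    """Return the last exit status reported in the log, or None if absent."""
    # An entry may be wrapped across lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    matches = EXIT_RE.findall(flat)
    return int(matches[-1]) if matches else None


sample = """\
2013-03-03 08:16:36.620 (INFO ) [0xfff7aa6a270]
VST-00040-33771-1024:126717:ibm.runjob.client.Job: exited with
status 1
"""

status = runjob_exit_status(sample)
if status == 0:
    print("clean exit; a dependent job can proceed")
else:
    print(f"exit status {status}; a dependent job would enter dep_fail")
```

This only reads the status after the fact; it does not change what the scheduler sees, so the underlying fix is still for the application to exit with status 0 on correct termination.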



