
charm - Re: [charm] [ppl] [Early-users-discuss] NAMD and job dependency

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Jeff Hammond <jhammond AT alcf.anl.gov>
  • To: "Kale, Laxmikant V" <kale AT illinois.edu>
  • Cc: wei jiang <wjiang AT alcf.anl.gov>, "Miller, Philip B" <mille121 AT illinois.edu>, Charm Mailing List <charm AT cs.illinois.edu>, Joseph Baker <jlbaker AT uchicago.edu>
  • Subject: Re: [charm] [ppl] [Early-users-discuss] NAMD and job dependency
  • Date: Sun, 3 Mar 2013 12:35:53 -0600
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

To accept an exit code of 1 as success would violate universal
convention, which I define to include the combined conventions of MPI,
C, Linux and POSIX.

Charm++ needs to return exit code 0 on successful termination. If
Blue Gene doesn't like that, it is a system software bug and it's IBM's
problem to fix.

Please share with me the details about why BG doesn't take exit(0)
properly and I will report it to IBM. I have written my fair share of
PAMI+Pthreads programs and never once had a problem with using
exit(0).

Note also that turning exit(1) into success would break the workflow
of every single program I have ever written from scratch and almost
certainly a non-negligible fraction of our user codes as well.

Jeff

* From MPI-3 Section 8.4 "Error Codes and Classes":

"The error codes satisfy, 0 = MPI_SUCCESS < MPI_ERR_... ≤ MPI_ERR_LASTCODE."

MPICH happens to return 1 = MPI_ERR_BUFFER ("Invalid buffer pointer").

* From http://www.gnu.org/software/libc/manual/html_node/Exit-Status.html:

"There are conventions for what sorts of status values certain
programs should return. The most common convention is simply 0 for
success and 1 for failure."

* From http://tldp.org/LDP/abs/html/exit-status.html:

"Every command returns an exit status (sometimes referred to as a
return status or exit code). A successful command returns a 0, while
an unsuccessful one returns a non-zero value that usually can be
interpreted as an error code. Well-behaved UNIX commands, programs,
and utilities return a 0 exit code upon successful completion, though
there are some exceptions."

* From http://en.wikipedia.org/wiki/Exit_status#section_1:

"The C programming language allows programs exiting or returning from
the main function to signal success or failure by returning an
integer, or returning the macros EXIT_SUCCESS and EXIT_FAILURE. On
Unix-like systems these are evaluate to 0 and 1 respectively."

"POSIX-compatible systems typically use a convention of zero for
success and non zero for error."
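
To illustrate the convention quoted above, here is a minimal Python sketch of how a dependency chain typically gates on exit status. It is not taken from Cobalt, Charm++, or NAMD: the helper names run_step and run_chain are hypothetical, and each "job" is just a child Python process that exits with a chosen code.

```python
import subprocess
import sys


def run_step(code: int) -> int:
    """Run a child Python process that exits with `code`; return its status."""
    # Stand-in for a real job: the child simply calls sys.exit(code).
    result = subprocess.run(
        [sys.executable, "-c", f"import sys; sys.exit({code})"]
    )
    return result.returncode


def run_chain(exit_codes):
    """Run steps in order and stop at the first non-zero exit status.

    Returns the statuses of the steps that actually ran.
    """
    statuses = []
    for code in exit_codes:
        status = run_step(code)
        statuses.append(status)
        if status != 0:
            # Dependent steps are skipped, analogous to a dep_fail state.
            break
    return statuses
```

With this kind of gating, a first step that exits with status 1 keeps the second step from running at all, even if the first step's work actually completed. Treating 1 as success instead would leave the chain unable to tell a clean finish from a generic failure.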

On Sun, Mar 3, 2013 at 11:18 AM, Kale, Laxmikant V
<kale AT illinois.edu>
wrote:
> Nikhil will respond with more technical detail, but for a quick piece of
> information: this was done recently following a conversation between
> Nikhil and Sameer (IBM). Exit(1) on BG is kind of legitimate, if I
> understand the conversation right. Without accepting that, our autobuild
> process was failing. If that's the case, maybe the higher-level scripts
> should accept "1" as (one of the) proper exit codes.
>
> --
> Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu
> <http://charm.cs.uiuc.edu/>
> Professor, Computer Science
> kale AT illinois.edu
> 201 N. Goodwin Avenue Ph: (217) 244-0094
> Urbana, IL 61801-2302 FAX: (217) 265-6582
>
>
>
>
>
>
> On 3/3/13 10:11 AM, "Phil Miller"
> <mille121 AT illinois.edu>
> wrote:
>
>>On Sun, Mar 3, 2013 at 10:00 AM, Phil Miller
>><mille121 AT illinois.edu>
>>wrote:
>>> On Sun, Mar 3, 2013 at 9:35 AM, wei jiang
>>> <wjiang AT alcf.anl.gov>
>>> wrote:
>>>>> Charm++ shouldn't be calling exit(1) on correct termination. That's
>>>>> the only reasonable solution here and it's trivial to implement.
>>>>
>>>> The old implementation did call exit(0). The current discussion seems
>>>> beyond the scope of the early-users discussion for bgq.
>>>> I would move it to a discussion with the charm++/namd developers.
>>>
>>> This can hop over to the Charm++ mailing list.
>>>
>>> First question is which machine layer is being used:
>>> {pami,pamilrts,mpi}-bluegeneq(-smp(-async)?)?(-xlc)?
>>
>>Actually, scratch that; looking at the code, it has to be
>>pami-bluegeneq-smp-*, and the tail doesn't matter.
>>
>>It looks like the change to thread 0 on each node calling exit(1)
>>instead of exit(0) happened in commit
>>aac1a19ecf0fdf132ae93987c1271c30bbd33f37. Perhaps Nikhil knows why the
>>introduction of a local barrier at shutdown to synchronize threads
>>also included a change to the exit code, and whether it's safe for us
>>to make that exit(0)?
>>
>>>>> On Sun, Mar 3, 2013 at 9:09 AM, wei jiang
>>>>> <wjiang AT alcf.anl.gov>
>>>>> wrote:
>>>>> > Exit(1) has ruined the job dependency. Is there a way to start a new
>>>>> > job regardless of dep_fail? (This is still risky. If the dependency
>>>>> > job really failed, then the later jobs would start from garbage.)
>>>>> >
>>>>> > Wei
>>>>> >
>>>>> >
>>>>> > ----- Original Message -----
>>>>> >> From: "Joseph Baker"
>>>>> >> <jlbaker AT uchicago.edu>
>>>>> >> To: "early-users-discuss"
>>>>> >> <early-users-discuss AT lists.alcf.anl.gov>
>>>>> >> Sent: Sunday, March 3, 2013 3:46:46 AM
>>>>> >> Subject: [Early-users-discuss] NAMD and job dependency
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> Hi,
>>>>> >>
>>>>> >>
>>>>> >> NAMD is now shutting itself down correctly when finished running,
>>>>> >> however if I include a job dependency in the qsub command for a
>>>>> >> subsequent NAMD job, it goes into the dep_fail state after the first
>>>>> >> job finishes. I looked on the Vesta user guide online, and it says
>>>>> >> that there is an expected exit status 0 for the dependency to not
>>>>> >> enter the fail state.
>>>>> >>
>>>>> >>
>>>>> >> In the error log there are these lines
>>>>> >>
>>>>> >>
>>>>> >> 2013-03-03 08:04:26.526 (INFO ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: job 126717 started
>>>>> >> 2013-03-03 08:04:43.906 (INFO ) [0xfff99191ab0] 7653:tatu.runjob.monitor: tracklib completed
>>>>> >> 2013-03-03 08:16:36.620 (INFO ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: exited with status 1
>>>>> >> 2013-03-03 08:16:36.623 (WARN ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: normal termination with status 1 from rank 397
>>>>> >> 2013-03-03 08:16:36.623 (INFO ) [0xfff7aa6a270] tatu.runjob.client: task exited with status 1
>>>>> >> 2013-03-03 08:16:36.624 (INFO ) [0xfff99191ab0] 7653:tatu.runjob.monitor: monitor terminating
>>>>> >> 2013-03-03 08:16:36.625 (INFO ) [0xfff7aa6a270] tatu.runjob.client: monitor completed
>>>>> >>
>>>>> >>
>>>>> >> and in the cobaltlog, it ends with
>>>>> >>
>>>>> >> Info: task completed normally with an exit code of 1; initiating job
>>>>> >> cleanup and removal
>>>>> >>
>>>>> >>
>>>>> >> So, it looks from this that the NAMD exit code is 1. Is this what is
>>>>> >> causing the problem with the job dependency?
>>>>> >>
>>>>> >>
>>>>> >> Thanks,
>>>>> >> Joe
>>>>> >>
>>>>> >>
>>>>> >> _______________________________________________
>>>>> >> early-users-discuss mailing list
>>>>> >> early-users-discuss AT lists.alcf.anl.gov
>>>>> >> https://lists.alcf.anl.gov/mailman/listinfo/early-users-discuss
>>>>> >>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jeff Hammond
>>>>> Argonne Leadership Computing Facility
>>>>> University of Chicago Computation Institute
>>>>> jhammond AT alcf.anl.gov
>>>>> / (630) 252-5381
>>>>> http://www.linkedin.com/in/jeffhammond
>>>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>>>>
>>_______________________________________________
>>charm mailing list
>>charm AT cs.uiuc.edu
>>http://lists.cs.uiuc.edu/mailman/listinfo/charm
>>_______________________________________________
>>ppl mailing list
>>ppl AT cs.uiuc.edu
>>http://lists.cs.uiuc.edu/mailman/listinfo/ppl
>



--
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond AT alcf.anl.gov
/ (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond




