
charm - Re: [charm] [ppl] [Early-users-discuss] NAMD and job dependency

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: "Kale, Laxmikant V" <kale AT illinois.edu>
  • To: Jeff Hammond <jhammond AT alcf.anl.gov>
  • Cc: wei jiang <wjiang AT alcf.anl.gov>, "Miller, Philip B" <mille121 AT illinois.edu>, Charm Mailing List <charm AT cs.illinois.edu>, Joseph Baker <jlbaker AT uchicago.edu>
  • Subject: Re: [charm] [ppl] [Early-users-discuss] NAMD and job dependency
  • Date: Sun, 3 Mar 2013 18:42:32 +0000
  • Accept-language: en-US
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

I agree.

But let's postpone discussion until I have a chance to discuss the latest
changes with Nikhil and Sameer.
It is a two-day-old, not-even-beta-release change in the development
version of the PAMI layer, for the upcoming 6.5 version, right? (And
thanks for catching the problem.)

As an aside: I was not copied on the original email (on the early users
list?) that started this thread. Can someone forward that to me please?

Sanjay

--
Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu
<http://charm.cs.uiuc.edu/>
Professor, Computer Science
kale AT illinois.edu
201 N. Goodwin Avenue Ph: (217) 244-0094
Urbana, IL 61801-2302 FAX: (217) 265-6582






On 3/3/13 12:35 PM, "Jeff Hammond"
<jhammond AT alcf.anl.gov>
wrote:

>To accept an exit code of 1 as success would violate universal
>convention, which I define to include the combined conventions of MPI,
>C, Linux and POSIX.
>
>Charm++ needs to return exit code 0 on successful termination. If
>Blue Gene doesn't like that, it is a system software bug and it's IBM's
>problem to fix it.
>
>Please share with me the details about why BG doesn't take exit(0)
>properly and I will report it to IBM. I have written my fair share of
>PAMI+Pthreads programs and never once had a problem with using
>exit(0).
>
>Note also that turning exit(1) into success would break the workflow
>of every single program I have ever written from scratch and almost
>certainly a non-negligible fraction of our user codes as well.
>
>Jeff
>
>* From MPI-3 Section 8.4 "Error Codes and Classes":
>
>"The error codes satisfy, 0 = MPI_SUCCESS < MPI_ERR_... ≤
>MPI_ERR_LASTCODE."
>
>MPICH happens to return 1 = MPI_ERR_BUFFER ("Invalid buffer pointer").
>
>* From http://www.gnu.org/software/libc/manual/html_node/Exit-Status.html:
>
>"There are conventions for what sorts of status values certain
>programs should return. The most common convention is simply 0 for
>success and 1 for failure."
>
>* From http://tldp.org/LDP/abs/html/exit-status.html:
>
>"Every command returns an exit status (sometimes referred to as a
>return status or exit code). A successful command returns a 0, while
>an unsuccessful one returns a non-zero value that usually can be
>interpreted as an error code. Well-behaved UNIX commands, programs,
>and utilities return a 0 exit code upon successful completion, though
>there are some exceptions."
>
>* From http://en.wikipedia.org/wiki/Exit_status#section_1:
>
>"The C programming language allows programs exiting or returning from
>the main function to signal success or failure by returning an
>integer, or returning the macros EXIT_SUCCESS and EXIT_FAILURE. On
>Unix-like systems these are evaluate to 0 and 1 respectively."
>
>"POSIX-compatible systems typically use a convention of zero for
>success and non zero for error."
>
>On Sun, Mar 3, 2013 at 11:18 AM, Kale, Laxmikant V
><kale AT illinois.edu>
>wrote:
>> Nikhil will respond with more technical detail, but for a quick piece of
>> information: this was done recently following a conversation between
>> Nikhil and Sameer (IBM). Exit(1) on BG is kind of legitimate, if I
>> understand the conversation right. Without accepting that, our autobuild
>> process was failing. If that's the case, maybe the higher-level scripts
>> should accept "1" as (one of the) proper exit codes.
>>
>> --
>> Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu
>> <http://charm.cs.uiuc.edu/>
>> Professor, Computer Science
>> kale AT illinois.edu
>> 201 N. Goodwin Avenue Ph: (217) 244-0094
>> Urbana, IL 61801-2302 FAX: (217) 265-6582
>>
>>
>>
>>
>>
>>
>> On 3/3/13 10:11 AM, "Phil Miller"
>> <mille121 AT illinois.edu>
>> wrote:
>>
>>>On Sun, Mar 3, 2013 at 10:00 AM, Phil Miller
>>><mille121 AT illinois.edu>
>>>wrote:
>>>> On Sun, Mar 3, 2013 at 9:35 AM, wei jiang
>>>> <wjiang AT alcf.anl.gov>
>>>> wrote:
>>>>>> Charm++ shouldn't be calling exit(1) on correct termination. That's
>>>>>> the only reasonable solution here and it's trivial to implement.
>>>>>
>>>>> The old implementation did call exit(0). The current discussion seems
>>>>> beyond the scope of the early-user discussion list for bgq.
>>>>> I would move it to a discussion with the charm++/namd developers.
>>>>
>>>> This can hop over to the Charm++ mailing list.
>>>>
>>>> First question is which machine layer is being used:
>>>> {pami,pamilrts,mpi}-bluegeneq(-smp(-async)?)?(-xlc)?
>>>
>>>Actually, scratch that; looking at the code, it has to be
>>>pami-bluegeneq-smp-*, and the tail doesn't matter.
>>>
>>>It looks like the change to thread 0 on each node calling exit(1)
>>>instead of exit(0) happened in commit
>>>aac1a19ecf0fdf132ae93987c1271c30bbd33f37. Perhaps Nikhil knows why the
>>>introduction of a local barrier at shutdown to synchronize threads
>>>also included a change to the exit code, and whether it's safe for us
>>>to make that exit(0)?
>>>
>>>>>> On Sun, Mar 3, 2013 at 9:09 AM, wei jiang
>>>>>> <wjiang AT alcf.anl.gov>
>>>>>> wrote:
>>>>>> > Exit(1) has ruined the job dependency. Is there a way to start a new
>>>>>> > job regardless of dep_fail? (This is still risky: if the dependency
>>>>>> > job really failed, then the later jobs would start from garbage.)
>>>>>> >
>>>>>> > Wei
>>>>>> >
>>>>>> >
>>>>>> > ----- Original Message -----
>>>>>> >> From: "Joseph Baker" <jlbaker AT uchicago.edu>
>>>>>> >> To: "early-users-discuss" <early-users-discuss AT lists.alcf.anl.gov>
>>>>>> >> Sent: Sunday, March 3, 2013 3:46:46 AM
>>>>>> >> Subject: [Early-users-discuss] NAMD and job dependency
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> Hi,
>>>>>> >>
>>>>>> >>
>>>>>> >> NAMD is now shutting itself down correctly when finished running,
>>>>>> >> however if I include a job dependency in the qsub command for a
>>>>>> >> subsequent NAMD job, it goes into the dep_fail state after the first
>>>>>> >> job finishes. I looked on the Vesta user guide online, and it says
>>>>>> >> that there is an expected exit status 0 for the dependency to not
>>>>>> >> enter the fail state.
>>>>>> >>
>>>>>> >>
>>>>>> >> In the error log there are these lines
>>>>>> >>
>>>>>> >>
>>>>>> >> 2013-03-03 08:04:26.526 (INFO ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: job 126717 started
>>>>>> >> 2013-03-03 08:04:43.906 (INFO ) [0xfff99191ab0] 7653:tatu.runjob.monitor: tracklib completed
>>>>>> >> 2013-03-03 08:16:36.620 (INFO ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: exited with status 1
>>>>>> >> 2013-03-03 08:16:36.623 (WARN ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: normal termination with status 1 from rank 397
>>>>>> >> 2013-03-03 08:16:36.623 (INFO ) [0xfff7aa6a270] tatu.runjob.client: task exited with status 1
>>>>>> >> 2013-03-03 08:16:36.624 (INFO ) [0xfff99191ab0] 7653:tatu.runjob.monitor: monitor terminating
>>>>>> >> 2013-03-03 08:16:36.625 (INFO ) [0xfff7aa6a270] tatu.runjob.client: monitor completed
>>>>>> >>
>>>>>> >>
>>>>>> >> and in the cobaltlog, it ends with
>>>>>> >>
>>>>>> >> Info: task completed normally with an exit code of 1; initiating job
>>>>>> >> cleanup and removal
>>>>>> >>
>>>>>> >>
>>>>>> >> So, it looks from this that the NAMD exit code is 1. Is this what is
>>>>>> >> causing the problem with the job dependency?
>>>>>> >>
>>>>>> >>
>>>>>> >> Thanks,
>>>>>> >> Joe
>>>>>> >>
>>>>>> >>
>>>>>> >> _______________________________________________
>>>>>> >> early-users-discuss mailing list
>>>>>> >> early-users-discuss AT lists.alcf.anl.gov
>>>>>> >> https://lists.alcf.anl.gov/mailman/listinfo/early-users-discuss
>>>>>> >>
>>>>>> > _______________________________________________
>>>>>> > early-users-discuss mailing list
>>>>>> > early-users-discuss AT lists.alcf.anl.gov
>>>>>> > https://lists.alcf.anl.gov/mailman/listinfo/early-users-discuss
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jeff Hammond
>>>>>> Argonne Leadership Computing Facility
>>>>>> University of Chicago Computation Institute
>>>>>> jhammond AT alcf.anl.gov
>>>>>> / (630) 252-5381
>>>>>> http://www.linkedin.com/in/jeffhammond
>>>>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>>>>>
>>>>> _______________________________________________
>>>>> early-users-discuss mailing list
>>>>> early-users-discuss AT lists.alcf.anl.gov
>>>>> https://lists.alcf.anl.gov/mailman/listinfo/early-users-discuss
>>>_______________________________________________
>>>charm mailing list
>>>charm AT cs.uiuc.edu
>>>http://lists.cs.uiuc.edu/mailman/listinfo/charm
>>>_______________________________________________
>>>ppl mailing list
>>>ppl AT cs.uiuc.edu
>>>http://lists.cs.uiuc.edu/mailman/listinfo/ppl
>>
>
>
>
>--
>Jeff Hammond
>Argonne Leadership Computing Facility
>University of Chicago Computation Institute
>jhammond AT alcf.anl.gov
> / (630) 252-5381
>http://www.linkedin.com/in/jeffhammond
>https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
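
A minimal sketch of the convention argued for in this thread, assuming a scheduler-like wrapper that releases a dependent job only when the previous job exits with status 0. The names `run_step` and `dependency_satisfied` are hypothetical and are not part of Cobalt, runjob, Charm++, or NAMD; the child process is only a stand-in for a real job, and the "status 0 means success" policy is the assumption taken from the Vesta user guide remark quoted above.

```python
import subprocess
import sys


def run_step(exit_code):
    """Run a stand-in job that exits with `exit_code` and return its status."""
    # Hypothetical job: a child Python process that simply calls sys.exit().
    child = [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]
    return subprocess.run(child).returncode


def dependency_satisfied(status):
    """Assumed policy: only exit status 0 counts as successful completion."""
    return status == 0


if __name__ == "__main__":
    for code in (0, 1):
        status = run_step(code)
        state = "run next job" if dependency_satisfied(status) else "dep_fail"
        print(f"job exited with status {status}: {state}")
```

Under this assumed policy, a job that finishes its work but exits with status 1 is indistinguishable from a failed job, which is why the dependent NAMD job in the original report landed in dep_fail.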





