Re: [charm] [ppl] [Early-users-discuss] NAMD and job dependency

From: Jeff Hammond <jhammond AT alcf.anl.gov>
To: Jim Phillips <jim AT ks.uiuc.edu>
Cc: wei jiang <wjiang AT alcf.anl.gov>, Charm Mailing List <charm AT cs.illinois.edu>, Joseph Baker <jlbaker AT uchicago.edu>, "Kale, Laxmikant V" <kale AT illinois.edu>
Subject: Re: [charm] [ppl] [Early-users-discuss] NAMD and job dependency
Date: Sun, 3 Mar 2013 13:09:07 -0600
List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

As far as I know, Cobalt looks at $? to determine if the job succeeded
or not in order to determine if the dependent jobs should be started.
Maybe we're talking about different things, but I do not understand
how you expect Cobalt to function properly w.r.t. chains of dependent
jobs if error codes aren't treated in the conventional manner.
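
To make the convention concrete, here is a minimal Python sketch of a dependency chain that starts each step only if the previous one exited with status 0. This is not Cobalt's implementation; `run_chain` and the sample commands are hypothetical, and the only assumption is the conventional rule that 0 means success and any other status means failure.

```python
import subprocess
import sys


def run_chain(commands):
    """Run commands in order; skip everything after the first failure.

    Returns a list of (command, status) pairs, where status is the
    integer exit code, or the string "dep_fail" for steps never started.
    """
    results = []
    failed = False
    for cmd in commands:
        if failed:
            # A predecessor failed, so this step is not started at all.
            results.append((cmd, "dep_fail"))
            continue
        status = subprocess.run(cmd).returncode
        results.append((cmd, status))
        if status != 0:
            failed = True
    return results


if __name__ == "__main__":
    # Stand-in jobs: child Python processes that exit with a fixed status.
    ok = [sys.executable, "-c", "import sys; sys.exit(0)"]
    bad = [sys.executable, "-c", "import sys; sys.exit(1)"]
    for _, status in run_chain([ok, bad, ok]):
        print(status)
```

Under this rule, an application that exits with 1 after a correct run looks exactly like a failed job, so every job that depends on it is held back.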

I'm fairly confident that you can wrap NAMD in a script job in such a
way that it will succeed despite the application issuing non-success
exit codes, but I'm not going to expend any effort on figuring out how
to do this because it's the wrong way to solve this problem.
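
For illustration only, a wrapper of the kind described above might look roughly like the Python sketch below. The names `run_and_remap` and `accepted` are hypothetical and are not part of NAMD, Charm++, or Cobalt. The sketch also shows why this approach is fragile: once status 1 is remapped to 0, a genuine failure that exits with 1 can no longer be told apart from success.

```python
import subprocess
import sys


def run_and_remap(cmd, accepted=(0, 1)):
    """Run cmd and return 0 if its exit status is in `accepted`.

    Any other status is passed through unchanged.
    """
    status = subprocess.run(cmd).returncode
    return 0 if status in accepted else status


if __name__ == "__main__":
    # Stand-in for the application: a child process that exits with 1.
    child = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(run_and_remap(child))
```

A real script job would pass the wrapper's return value to `sys.exit()` so that the scheduler sees the remapped status.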

Jeff

On Sun, Mar 3, 2013 at 1:00 PM, Jim Phillips
<jim AT ks.uiuc.edu>
wrote:
>
> 1) Why does it matter what code is passed to exit()? Shouldn't it be up to
> the application to decide if it succeeded or not?
>
> 2) Isn't the condition for a clean exit just quiescence detection, which
> NAMD does on exit anyway?
>
> -Jim
>
>
> On Sun, 3 Mar 2013, Nikhil Jain wrote:
>
>> I agree with your sentiments, Jeff and Prof. Kale, and we hope to remain
>> consistent with the rest of the world by passing 0 as the exit status in
>> the future. Having said that, for now we have taken the route of 1 as the
>> exit status, as Sameer expects performance implications from managing a
>> clean exit when threads are used (keeping track of all messages
>> sent/received, etc.).
>>
>> Alternatively, I suggest using pamilrts-based builds, which are also
>> stable and exit cleanly with a status of 0 (which handles the dependency
>> issue). We are currently studying the performance of pamilrts builds
>> against pami builds, and the initial findings look promising. The build
>> commands for them are similar to those for pami:
>>
>> ./build charm++ pamilrts-bluegeneq smp --with-production
>> or, to use the async mode with separate hardware threads running the
>> progress engine:
>> ./build charm++ pamilrts-bluegeneq smp async --with-production
>>
>> And as Prof Kale said, further comments on pami are reserved till we have
>> a detailed discussion.
>>
>> Thanks
>> Nikhil
>>
>>
>>
>>
>> --
>> Nikhil Jain,
>> nikhil AT illinois.edu,
>> http://charm.cs.uiuc.edu/people/nikhil
>> Doctoral Candidate @ CS, UIUC
>>
>>
>>
>>
>>
>>
>> On 3/3/13 12:42 PM, "Kale, Laxmikant V"
>> <kale AT illinois.edu>
>> wrote:
>>
>>> I agree.
>>>
>>> But let's postpone discussion until I have a chance to discuss the latest
>>> changes with Nikhil and Sameer.
>>> It is a two-day-old, not-even-beta-release change in the development
>>> version of the PAMI layer, for the upcoming 6.5 version, right? (And
>>> thanks for catching the problem.)
>>>
>>> As an aside: I was not copied on the original email (on the early users
>>> list?) that started this thread. Can someone forward that to me please?
>>>
>>> Sanjay
>>>
>>> --
>>> Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu
>>> <http://charm.cs.uiuc.edu/>
>>> Professor, Computer Science
>>> kale AT illinois.edu
>>> 201 N. Goodwin Avenue Ph: (217) 244-0094
>>> Urbana, IL 61801-2302 FAX: (217) 265-6582
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 3/3/13 12:35 PM, "Jeff Hammond"
>>> <jhammond AT alcf.anl.gov>
>>> wrote:
>>>
>>>> To accept an exit code of 1 as success would violate universal
>>>> convention, which I define to include the combined conventions of MPI,
>>>> C, Linux and POSIX.
>>>>
>>>> Charm++ needs to return exit code 0 on successful termination. If
>>>> Blue Gene doesn't like that, it is a system software bug and it's IBM's
>>>> problem to fix it.
>>>>
>>>> Please share with me the details about why BG doesn't take exit(0)
>>>> properly and I will report it to IBM. I have written my fair share of
>>>> PAMI+Pthreads programs and never once had a problem with using
>>>> exit(0).
>>>>
>>>> Note also that turning exit(1) into success would break the workflow
>>>> of every single program I have ever written from scratch and almost
>>>> certainly a non-negligible fraction of our user codes as well.
>>>>
>>>> Jeff
>>>>
>>>> * From MPI-3 Section 8.4 "Error Codes and Classes":
>>>>
>>>> "The error codes satisfy, 0 = MPI_SUCCESS < MPI_ERR_... <=
>>>> MPI_ERR_LASTCODE."
>>>>
>>>> MPICH happens to return 1 = MPI_ERR_BUFFER ("Invalid buffer pointer").
>>>>
>>>> * From
>>>> http://www.gnu.org/software/libc/manual/html_node/Exit-Status.html:
>>>>
>>>> "There are conventions for what sorts of status values certain
>>>> programs should return. The most common convention is simply 0 for
>>>> success and 1 for failure."
>>>>
>>>> * From http://tldp.org/LDP/abs/html/exit-status.html:
>>>>
>>>> "Every command returns an exit status (sometimes referred to as a
>>>> return status or exit code). A successful command returns a 0, while
>>>> an unsuccessful one returns a non-zero value that usually can be
>>>> interpreted as an error code. Well-behaved UNIX commands, programs,
>>>> and utilities return a 0 exit code upon successful completion, though
>>>> there are some exceptions."
>>>>
>>>> * From http://en.wikipedia.org/wiki/Exit_status#section_1:
>>>>
>>>> "The C programming language allows programs exiting or returning from
>>>> the main function to signal success or failure by returning an
>>>> integer, or returning the macros EXIT_SUCCESS and EXIT_FAILURE. On
>>>> Unix-like systems these evaluate to 0 and 1 respectively."
>>>>
>>>> "POSIX-compatible systems typically use a convention of zero for
>>>> success and non zero for error."
>>>>
>>>> On Sun, Mar 3, 2013 at 11:18 AM, Kale, Laxmikant V
>>>> <kale AT illinois.edu>
>>>> wrote:
>>>>>
>>>>> Nikhil will respond with more technical detail, but for a quick piece
>>>>> of information: this was done recently following a conversation
>>>>> between Nikhil and Sameer (IBM). Exit(1) on BG is kind of legitimate,
>>>>> if I understand the conversation right. Without accepting that, our
>>>>> autobuild process was failing. If that's the case, maybe the
>>>>> higher-level scripts should accept "1" as (one of the) proper exit
>>>>> codes.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 3/3/13 10:11 AM, "Phil Miller"
>>>>> <mille121 AT illinois.edu>
>>>>> wrote:
>>>>>
>>>>>> On Sun, Mar 3, 2013 at 10:00 AM, Phil Miller
>>>>>> <mille121 AT illinois.edu>
>>>>>> wrote:
>>>>>>>
>>>>>>> On Sun, Mar 3, 2013 at 9:35 AM, wei jiang
>>>>>>> <wjiang AT alcf.anl.gov>
>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Charm++ shouldn't be calling exit(1) on correct termination.
>>>>>>>>> That's the only reasonable solution here and it's trivial to
>>>>>>>>> implement.
>>>>>>>>
>>>>>>>>
>>>>>>>> The old implementation did call exit(0). The current discussion
>>>>>>>> seems beyond the scope of the early-user discussion for BG/Q.
>>>>>>>> I would turn it over to a discussion with the Charm++/NAMD developers.
>>>>>>>
>>>>>>>
>>>>>>> This can hop over to the Charm++ mailing list.
>>>>>>>
>>>>>>> First question is which machine layer is being used:
>>>>>>> {pami,pamilrts,mpi}-bluegeneq(-smp(-async)?)?(-xlc)?
>>>>>>
>>>>>>
>>>>>> Actually, scratch that; looking at the code, it has to be
>>>>>> pami-bluegeneq-smp-*, and the tail doesn't matter.
>>>>>>
>>>>>> It looks like the change to thread 0 on each node calling exit(1)
>>>>>> instead of exit(0) happened in commit
>>>>>> aac1a19ecf0fdf132ae93987c1271c30bbd33f37. Perhaps Nikhil knows why the
>>>>>> introduction of a local barrier at shutdown to synchronize threads
>>>>>> also included a change to the exit code, and whether it's safe for us
>>>>>> to make that exit(0)?
>>>>>>
>>>>>>>>> On Sun, Mar 3, 2013 at 9:09 AM, wei jiang
>>>>>>>>> <wjiang AT alcf.anl.gov>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Exit(1) breaks the job dependency. Is there a way to start a new
>>>>>>>>>> job regardless of dep_fail? (This is still risky: if the job it
>>>>>>>>>> depends on really failed, the later jobs would start from garbage.)
>>>>>>>>>>
>>>>>>>>>> Wei
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>>
>>>>>>>>>>> From: "Joseph Baker" <jlbaker AT uchicago.edu>
>>>>>>>>>>> To: "early-users-discuss" <early-users-discuss AT lists.alcf.anl.gov>
>>>>>>>>>>>
>>>>>>>>>>> Sent: Sunday, March 3, 2013 3:46:46 AM
>>>>>>>>>>> Subject: [Early-users-discuss] NAMD and job dependency
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> NAMD is now shutting itself down correctly when finished running;
>>>>>>>>>>> however, if I include a job dependency in the qsub command for a
>>>>>>>>>>> subsequent NAMD job, it goes into the dep_fail state after the
>>>>>>>>>>> first job finishes. I looked at the Vesta user guide online, and it
>>>>>>>>>>> says that an exit status of 0 is expected for the dependency not
>>>>>>>>>>> to enter the fail state.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> In the error log there are these lines
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2013-03-03 08:04:26.526 (INFO ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: job 126717 started
>>>>>>>>>>> 2013-03-03 08:04:43.906 (INFO ) [0xfff99191ab0] 7653:tatu.runjob.monitor: tracklib completed
>>>>>>>>>>> 2013-03-03 08:16:36.620 (INFO ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: exited with status 1
>>>>>>>>>>> 2013-03-03 08:16:36.623 (WARN ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: normal termination with status 1 from rank 397
>>>>>>>>>>> 2013-03-03 08:16:36.623 (INFO ) [0xfff7aa6a270] tatu.runjob.client: task exited with status 1
>>>>>>>>>>> 2013-03-03 08:16:36.624 (INFO ) [0xfff99191ab0] 7653:tatu.runjob.monitor: monitor terminating
>>>>>>>>>>> 2013-03-03 08:16:36.625 (INFO ) [0xfff7aa6a270] tatu.runjob.client: monitor completed
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> and in the cobaltlog, it ends with
>>>>>>>>>>>
>>>>>>>>>>> Info: task completed normally with an exit code of 1; initiating job cleanup and removal
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> So, it looks from this that the NAMD exit code is 1. Is this what
>>>>>>>>>>> is causing the problem with the job dependency?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Joe
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> early-users-discuss mailing list
>>>>>>>>>>> early-users-discuss AT lists.alcf.anl.gov
>>>>>>>>>>> https://lists.alcf.anl.gov/mailman/listinfo/early-users-discuss
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jeff Hammond
>>>>>>>>> Argonne Leadership Computing Facility
>>>>>>>>> University of Chicago Computation Institute
>>>>>>>>> jhammond AT alcf.anl.gov
>>>>>>>>> / (630) 252-5381
>>>>>>>>> http://www.linkedin.com/in/jeffhammond
>>>>>>>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> charm mailing list
>>>>>> charm AT cs.uiuc.edu
>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/charm
>>>>>> _______________________________________________
>>>>>> ppl mailing list
>>>>>> ppl AT cs.uiuc.edu
>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/ppl
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>

--
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond AT alcf.anl.gov
/ (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
