
charm - Re: [charm] [ppl] [Early-users-discuss] NAMD and job dependency

  • From: Jeff Hammond <jhammond AT alcf.anl.gov>
  • To: Jim Phillips <jim AT ks.uiuc.edu>
  • Cc: wei jiang <wjiang AT alcf.anl.gov>, Charm Mailing List <charm AT cs.illinois.edu>, Joseph Baker <jlbaker AT uchicago.edu>, "Kale, Laxmikant V" <kale AT illinois.edu>
  • Subject: Re: [charm] [ppl] [Early-users-discuss] NAMD and job dependency
  • Date: Sun, 3 Mar 2013 13:31:02 -0600
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Okay, I completely agree with that. If Cobalt doesn't return
immediately upon exit(0) then something is wrong. If I had to guess,
it would be that exit(1) is nonlocal like MPI_Abort(), but that
exit(0) only terminates the process that calls it.

Can you guys add a barrier to the termination code? I wonder if some
processes aren't exiting while others are, and whether that is why
Cobalt lets the job keep running. If a process exits early, it
obviously will not respond to messages, which might be causing Charm++
to deadlock.
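
Roughly what I have in mind, sketched with MPI as a stand-in for
Charm++'s internals (the barrier call and its placement here are my
assumption, not the actual shutdown path):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /* ... application work ... */
    /* Make every process reach shutdown before any one of them
     * exits; an early exit leaves a process that no longer answers
     * messages, which could hang the others. */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    exit(EXIT_SUCCESS);  /* 0: the conventional success status */
}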

Note that this behavior should be consistent with exiting after
MPI_Finalize() on other systems, should you want to test that way.
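
For instance, a minimal test (my sketch, not NAMD code): if the system
software is behaving, this should terminate immediately and report
success to the scheduler.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;  /* the launcher should see status 0 right away */
}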

Best,

Jeff

On Sun, Mar 3, 2013 at 1:25 PM, Jim Phillips
<jim AT ks.uiuc.edu>
wrote:
>
> What I mean is, why can't Charm++ just pass 0 to exit()? It sounds like
> calling exit(0) results in Cobalt thinking the program is still running and
> waiting until the job time expires. What is the condition for "clean exit"
> that will make Cobalt both report success and exit immediately?
>
> -Jim
>
>
>
> On Sun, 3 Mar 2013, Jeff Hammond wrote:
>
>> As far as I know, Cobalt looks at $? to determine if the job succeeded
>> or not in order to determine if the dependent jobs should be started.
>> Maybe we're talking about different things, but I do not understand
>> how you expect Cobalt to function properly w.r.t. chains of dependent
>> jobs if error codes aren't treated in the conventional manner.
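>>
>> This is just POSIX wait-status plumbing. Roughly (my illustration,
>> not Cobalt's actual code), the parent recovers the child's exit
>> status, the same value $? exposes in the shell:
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <sys/types.h>
>> #include <sys/wait.h>
>> #include <unistd.h>
>>
>> int main(void)
>> {
>>     pid_t pid = fork();
>>     if (pid == 0)
>>         exit(1);  /* child signals failure, like exit(1) in NAMD */
>>     int status;
>>     waitpid(pid, &status, 0);
>>     if (WIFEXITED(status))
>>         printf("child exited with %d\n", WEXITSTATUS(status));
>>     /* a scheduler should start dependents only on status 0 */
>>     return 0;
>> }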
>>
>> I'm fairly confident that you can wrap NAMD in a script job in such a
>> way that it will succeed despite the application issuing non-success
>> error codes, but I'm not going to put any effort into figuring out how
>> to do this because it's the wrong way to solve this problem.
>>
>> Jeff
>>
>> On Sun, Mar 3, 2013 at 1:00 PM, Jim Phillips
>> <jim AT ks.uiuc.edu>
>> wrote:
>>>
>>>
>>> 1) Why does it matter what code is passed to exit()? Shouldn't it be up
>>> to the application to decide if it succeeded or not?
>>>
>>> 2) Isn't the condition for a clean exit just quiescence detection, which
>>> NAMD does on exit anyway?
>>>
>>> -Jim
>>>
>>>
>>> On Sun, 3 Mar 2013, Nikhil Jain wrote:
>>>
>>>> I agree with your sentiments, Jeff and Prof. Kale, and we hope to remain
>>>> consistent with the rest of the world by passing 0 as the exit status in
>>>> the future. Having said that, for now, we have taken the route of 1 as
>>>> the exit status, as Sameer expects performance implications in managing
>>>> a clean exit when threads are used (keeping track of all messages
>>>> sent/received, etc.).
>>>>
>>>> Alternatively, I suggest using the pamilrts-based builds, which are also
>>>> stable and exit cleanly with a status of 0 (to handle the dependency
>>>> issue). We are currently studying the performance of pamilrts builds
>>>> against pami builds, and the initial findings look promising. The build
>>>> commands for them are similar to pami's:
>>>>
>>>> ./build charm++ pamilrts-bluegeneq smp --with-production
>>>>
>>>> or, to use the async mode with separate hardware threads running the
>>>> progress engine:
>>>>
>>>> ./build charm++ pamilrts-bluegeneq smp async --with-production
>>>>
>>>> And as Prof Kale said, further comments on pami are reserved till we
>>>> have
>>>> a detailed discussion.
>>>>
>>>> Thanks
>>>> Nikhil
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Nikhil Jain,
>>>> nikhil AT illinois.edu,
>>>> http://charm.cs.uiuc.edu/people/nikhil
>>>> Doctoral Candidate @ CS, UIUC
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 3/3/13 12:42 PM, "Kale, Laxmikant V"
>>>> <kale AT illinois.edu>
>>>> wrote:
>>>>
>>>>> I agree.
>>>>>
>>>>> But let's postpone discussion until I have a chance to discuss the
>>>>> latest changes with Nikhil and Sameer.
>>>>> It is a two-day-old, not-even-beta-release change in the development
>>>>> version of the PAMI layer, for the upcoming 6.5 version, right? (And
>>>>> thanks for catching the problem.)
>>>>>
>>>>> As an aside: I was not copied on the original email (on the early-users
>>>>> list?) that started this thread. Can someone forward that to me,
>>>>> please?
>>>>>
>>>>> Sanjay
>>>>>
>>>>> --
>>>>> Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu
>>>>> <http://charm.cs.uiuc.edu/>
>>>>> Professor, Computer Science
>>>>> kale AT illinois.edu
>>>>> 201 N. Goodwin Avenue Ph: (217) 244-0094
>>>>> Urbana, IL 61801-2302 FAX: (217) 265-6582
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 3/3/13 12:35 PM, "Jeff Hammond"
>>>>> <jhammond AT alcf.anl.gov>
>>>>> wrote:
>>>>>
>>>>>> To accept an exit code of 1 as success would violate universal
>>>>>> convention, which I define to include the combined conventions of MPI,
>>>>>> C, Linux and POSIX.
>>>>>>
>>>>>> Charm++ needs to return exit code 0 on successful termination. If
>>>>>> Blue Gene doesn't like that, it is a system software bug and it's
>>>>>> IBM's problem to fix.
>>>>>>
>>>>>> Please share with me the details about why BG doesn't take exit(0)
>>>>>> properly and I will report it to IBM. I have written my fair share of
>>>>>> PAMI+Pthreads programs and never once had a problem with using
>>>>>> exit(0).
>>>>>>
>>>>>> Note also that turning exit(1) into success would break the workflow
>>>>>> of every single program I have ever written from scratch and almost
>>>>>> certainly a non-negligible fraction of our user codes as well.
>>>>>>
>>>>>> Jeff
>>>>>>
>>>>>> * From MPI-3 Section 8.4 "Error Codes and Classes":
>>>>>>
>>>>>> "The error codes satisfy, 0 = MPI_SUCCESS < MPI_ERR_... 3Ž4
>>>>>> MPI_ERR_LASTCODE."
>>>>>>
>>>>>> MPICH happens to return 1 = MPI_ERR_BUFFER ("Invalid buffer pointer").
>>>>>>
>>>>>> * From
>>>>>> http://www.gnu.org/software/libc/manual/html_node/Exit-Status.html:
>>>>>>
>>>>>> "There are conventions for what sorts of status values certain
>>>>>> programs should return. The most common convention is simply 0 for
>>>>>> success and 1 for failure."
>>>>>>
>>>>>> * From http://tldp.org/LDP/abs/html/exit-status.html:
>>>>>>
>>>>>> "Every command returns an exit status (sometimes referred to as a
>>>>>> return status or exit code). A successful command returns a 0, while
>>>>>> an unsuccessful one returns a non-zero value that usually can be
>>>>>> interpreted as an error code. Well-behaved UNIX commands, programs,
>>>>>> and utilities return a 0 exit code upon successful completion, though
>>>>>> there are some exceptions."
>>>>>>
>>>>>> * From http://en.wikipedia.org/wiki/Exit_status#section_1:
>>>>>>
>>>>>> "The C programming language allows programs exiting or returning from
>>>>>> the main function to signal success or failure by returning an
>>>>>> integer, or returning the macros EXIT_SUCCESS and EXIT_FAILURE. On
>>>>>> Unix-like systems these are evaluate to 0 and 1 respectively."
>>>>>>
>>>>>> "POSIX-compatible systems typically use a convention of zero for
>>>>>> success and non zero for error."
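>>>>>>
>>>>>> A trivial sketch of that convention (nothing Charm++-specific here,
>>>>>> just the portable idiom):
>>>>>>
>>>>>> #include <stdlib.h>
>>>>>>
>>>>>> int main(void)
>>>>>> {
>>>>>>     int ok = 1;  /* stand-in for "did the run succeed?" */
>>>>>>     /* EXIT_SUCCESS is 0 and EXIT_FAILURE is 1 on Unix-like
>>>>>>      * systems; schedulers key dependent jobs off this value. */
>>>>>>     return ok ? EXIT_SUCCESS : EXIT_FAILURE;
>>>>>> }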
>>>>>>
>>>>>> On Sun, Mar 3, 2013 at 11:18 AM, Kale, Laxmikant V
>>>>>> <kale AT illinois.edu>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Nikhil will respond with more technical detail, but for a quick piece
>>>>>>> of information: this was done recently, following a conversation
>>>>>>> between Nikhil and Sameer (IBM). Exit(1) on BG is kind of legitimate,
>>>>>>> if I understand the conversation right. Without accepting that, our
>>>>>>> autobuild process was failing. If that's the case, maybe the
>>>>>>> higher-level scripts should accept "1" as (one of the) proper exit
>>>>>>> codes.
>>>>>>>
>>>>>>> --
>>>>>>> Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu
>>>>>>> <http://charm.cs.uiuc.edu/>
>>>>>>> Professor, Computer Science
>>>>>>> kale AT illinois.edu
>>>>>>> 201 N. Goodwin Avenue Ph: (217) 244-0094
>>>>>>> Urbana, IL 61801-2302 FAX: (217) 265-6582
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 3/3/13 10:11 AM, "Phil Miller"
>>>>>>> <mille121 AT illinois.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Sun, Mar 3, 2013 at 10:00 AM, Phil Miller
>>>>>>>> <mille121 AT illinois.edu>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Mar 3, 2013 at 9:35 AM, wei jiang
>>>>>>>>> <wjiang AT alcf.anl.gov>
>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Charm++ shouldn't be calling exit(1) on correct termination.
>>>>>>>>>>> That's
>>>>>>>>>>> the only reasonable solution here and it's trivial to implement.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The old implementation did call exit(0). The current discussion
>>>>>>>>>> seems beyond the scope of the early-user discussion for BG/Q.
>>>>>>>>>> I would turn it over to a discussion with the Charm++/NAMD
>>>>>>>>>> developers.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This can hop over to the Charm++ mailing list.
>>>>>>>>>
>>>>>>>>> First question is which machine layer is being used:
>>>>>>>>> {pami,pamilrts,mpi}-bluegeneq(-smp(-async)?)?(-xlc)?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Actually, scratch that; looking at the code, it has to be
>>>>>>>> pami-bluegeneq-smp-*, and the tail doesn't matter.
>>>>>>>>
>>>>>>>> It looks like the change to thread 0 on each node calling exit(1)
>>>>>>>> instead of exit(0) happened in commit
>>>>>>>> aac1a19ecf0fdf132ae93987c1271c30bbd33f37. Perhaps Nikhil knows why
>>>>>>>> the introduction of a local barrier at shutdown to synchronize
>>>>>>>> threads also included a change to the exit code, and whether it's
>>>>>>>> safe for us to make that exit(0)?
>>>>>>>>
>>>>>>>>>>> On Sun, Mar 3, 2013 at 9:09 AM, wei jiang
>>>>>>>>>>> <wjiang AT alcf.anl.gov>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Exit(1) has ruined the job dependency. Is there a way to start a
>>>>>>>>>>>> new job regardless of dep_fail? (This is still risky: if the
>>>>>>>>>>>> dependency job really failed, then the later jobs would start
>>>>>>>>>>>> from garbage.)
>>>>>>>>>>>>
>>>>>>>>>>>> Wei
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>>>> From: "Joseph Baker" <jlbaker AT uchicago.edu>
>>>>>>>>>>>>> To: "early-users-discuss" <early-users-discuss AT lists.alcf.anl.gov>
>>>>>>>>>>>>> Sent: Sunday, March 3, 2013 3:46:46 AM
>>>>>>>>>>>>> Subject: [Early-users-discuss] NAMD and job dependency
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> NAMD is now shutting itself down correctly when finished running;
>>>>>>>>>>>>> however, if I include a job dependency in the qsub command for a
>>>>>>>>>>>>> subsequent NAMD job, it goes into the dep_fail state after the
>>>>>>>>>>>>> first job finishes. I looked at the Vesta user guide online, and
>>>>>>>>>>>>> it says that an exit status of 0 is expected for the dependency
>>>>>>>>>>>>> not to enter the fail state.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In the error log there are these lines:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2013-03-03 08:04:26.526 (INFO ) [0xfff7aa6a270]
>>>>>>>>>>>>> VST-00040-33771-1024:126717:ibm.runjob.client.Job: job 126717 started
>>>>>>>>>>>>> 2013-03-03 08:04:43.906 (INFO ) [0xfff99191ab0]
>>>>>>>>>>>>> 7653:tatu.runjob.monitor: tracklib completed
>>>>>>>>>>>>> 2013-03-03 08:16:36.620 (INFO ) [0xfff7aa6a270]
>>>>>>>>>>>>> VST-00040-33771-1024:126717:ibm.runjob.client.Job: exited with status 1
>>>>>>>>>>>>> 2013-03-03 08:16:36.623 (WARN ) [0xfff7aa6a270]
>>>>>>>>>>>>> VST-00040-33771-1024:126717:ibm.runjob.client.Job: normal termination
>>>>>>>>>>>>> with status 1 from rank 397
>>>>>>>>>>>>> 2013-03-03 08:16:36.623 (INFO ) [0xfff7aa6a270] tatu.runjob.client:
>>>>>>>>>>>>> task exited with status 1
>>>>>>>>>>>>> 2013-03-03 08:16:36.624 (INFO ) [0xfff99191ab0]
>>>>>>>>>>>>> 7653:tatu.runjob.monitor: monitor terminating
>>>>>>>>>>>>> 2013-03-03 08:16:36.625 (INFO ) [0xfff7aa6a270] tatu.runjob.client:
>>>>>>>>>>>>> monitor completed
>>>>>>>>>>>>>
>>>>>>>>>>>>> and in the cobaltlog, it ends with:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Info: task completed normally with an exit code of 1; initiating
>>>>>>>>>>>>> job cleanup and removal
>>>>>>>>>>>>>
>>>>>>>>>>>>> So, it looks from this like the NAMD exit code is 1. Is this what
>>>>>>>>>>>>> is causing the problem with the job dependency?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Joe



--
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond AT alcf.anl.gov
/ (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond




