
charm - Re: [charm] [ppl] [Early-users-discuss] NAMD and job dependency

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Jim Phillips <jim AT ks.uiuc.edu>
  • To: Nikhil Jain <nikhil AT illinois.edu>
  • Cc: Jeff Hammond <jhammond AT alcf.anl.gov>, wei jiang <wjiang AT alcf.anl.gov>, Charm Mailing List <charm AT cs.illinois.edu>, Joseph Baker <jlbaker AT uchicago.edu>, "Kale, Laxmikant V" <kale AT illinois.edu>
  • Subject: Re: [charm] [ppl] [Early-users-discuss] NAMD and job dependency
  • Date: Sun, 3 Mar 2013 13:00:27 -0600 (CST)
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>


1) Why does it matter what code is passed to exit()? Shouldn't it be up to the application to decide if it succeeded or not?

2) Isn't the condition for a clean exit just quiescence detection, which NAMD does on exit anyway?

-Jim

On Sun, 3 Mar 2013, Nikhil Jain wrote:

I agree with your sentiments, Jeff and Prof Kale, and we hope to remain
consistent with the rest of the world by passing 0 as exit status in future.
Having said that, for now we have taken the route of 1 as exit status,
because Sameer expects a performance cost from managing a clean exit when
threads are used (keeping track of all messages sent and received, etc.).

Alternatively, I suggest using pamilrts-based builds, which are also
stable and exit cleanly with a status of 0 (which handles the dependency
issue). We are currently studying the performance of pamilrts builds
against pami builds, and the initial findings look promising. The build
commands for them are similar to those for pami:

./build charm++ pamilrts-bluegeneq smp --with-production
or to use the async mode with separate hardware threads running the
progress engine
./build charm++ pamilrts-bluegeneq smp async --with-production

And as Prof Kale said, further comments on pami are reserved till we have
a detailed discussion.

Thanks
Nikhil




--
Nikhil Jain,
nikhil AT illinois.edu,
http://charm.cs.uiuc.edu/people/nikhil
Doctoral Candidate @ CS, UIUC






On 3/3/13 12:42 PM, "Kale, Laxmikant V"
<kale AT illinois.edu>
wrote:

I agree.

But let's postpone discussion until I have a chance to discuss the latest
changes with Nikhil and Sameer.
It is a two-day-old, not-even-beta-release change in the development
version of the PAMI layer, for the upcoming 6.5 version, right? (And
thanks for catching the problem.)

As an aside: I was not copied on the original email (on the early users
list?) that started this thread. Can someone forward that to me please?

Sanjay

--
Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu
<http://charm.cs.uiuc.edu/>
Professor, Computer Science
kale AT illinois.edu
201 N. Goodwin Avenue Ph: (217) 244-0094
Urbana, IL 61801-2302 FAX: (217) 265-6582






On 3/3/13 12:35 PM, "Jeff Hammond"
<jhammond AT alcf.anl.gov>
wrote:

To accept an exit code of 1 as success would violate universal
convention, which I define to include the combined conventions of MPI,
C, Linux and POSIX.

Charm++ needs to return exit code 0 on successful termination. If
Blue Gene doesn't like that, it is a system software bug and it's IBM's
problem to fix it.

Please share with me the details about why BG doesn't take exit(0)
properly and I will report it to IBM. I have written my fair share of
PAMI+Pthreads programs and never once had a problem with using
exit(0).

Note also that turning exit(1) into success would break the workflow
of every single program I have ever written from scratch and almost
certainly a non-negligible fraction of our user codes as well.

Jeff

* From MPI-3 Section 8.4 "Error Codes and Classes":

"The error codes satisfy, 0 = MPI_SUCCESS < MPI_ERR_... ≤
MPI_ERR_LASTCODE."

MPICH happens to return 1 = MPI_ERR_BUFFER ("Invalid buffer pointer").

* From
http://www.gnu.org/software/libc/manual/html_node/Exit-Status.html:

"There are conventions for what sorts of status values certain
programs should return. The most common convention is simply 0 for
success and 1 for failure."

* From http://tldp.org/LDP/abs/html/exit-status.html:

"Every command returns an exit status (sometimes referred to as a
return status or exit code). A successful command returns a 0, while
an unsuccessful one returns a non-zero value that usually can be
interpreted as an error code. Well-behaved UNIX commands, programs,
and utilities return a 0 exit code upon successful completion, though
there are some exceptions."

* From http://en.wikipedia.org/wiki/Exit_status#section_1:

"The C programming language allows programs exiting or returning from
the main function to signal success or failure by returning an
integer, or returning the macros EXIT_SUCCESS and EXIT_FAILURE. On
Unix-like systems these are evaluate to 0 and 1 respectively."

"POSIX-compatible systems typically use a convention of zero for
success and non zero for error."

On Sun, Mar 3, 2013 at 11:18 AM, Kale, Laxmikant V
<kale AT illinois.edu>
wrote:
Nikhil will respond with more technical detail, but for a quick piece of
information: this was done recently following a conversation between
Nikhil and Sameer (IBM). Exit(1) on BG is kind of legitimate, if I
understand the conversation right. Without accepting that, our autobuild
process was failing. If that's the case, maybe the higher-level scripts
should accept "1" as (one of the) proper exit codes.

--
Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu
<http://charm.cs.uiuc.edu/>
Professor, Computer Science
kale AT illinois.edu
201 N. Goodwin Avenue Ph: (217) 244-0094
Urbana, IL 61801-2302 FAX: (217) 265-6582






On 3/3/13 10:11 AM, "Phil Miller"
<mille121 AT illinois.edu>
wrote:

On Sun, Mar 3, 2013 at 10:00 AM, Phil Miller
<mille121 AT illinois.edu>
wrote:
On Sun, Mar 3, 2013 at 9:35 AM, wei jiang
<wjiang AT alcf.anl.gov>
wrote:
Charm++ shouldn't be calling exit(1) on correct termination. That's
the only reasonable solution here and it's trivial to implement.

The old implementation did call exit(0). The current discussion seems
beyond the scope of the early-users discussion for BG/Q.
I would turn it over to a discussion with the Charm++/NAMD developers.

This can hop over to the Charm++ mailing list.

First question is which machine layer is being used:
{pami,pamilrts,mpi}-bluegeneq(-smp(-async)?)?(-xlc)?
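
The shorthand above can be read as a regular expression over build-target names. A minimal sketch, assuming the braces denote alternation and that target names are spelled exactly as in the shorthand; TARGET_RE and parse_target are hypothetical names used for illustration and are not part of the Charm++ build system:

```python
import re

# The shorthand rewritten as a regular expression
# (alternation group instead of the brace list).
TARGET_RE = re.compile(r"(pami|pamilrts|mpi)-bluegeneq(-smp(-async)?)?(-xlc)?")


def parse_target(name):
    """Return (layer, smp, async_mode, xlc) for a matching name, else None."""
    m = TARGET_RE.fullmatch(name)
    if m is None:
        return None
    return (
        m.group(1),              # machine layer: pami, pamilrts, or mpi
        m.group(2) is not None,  # "-smp" present
        m.group(3) is not None,  # "-async" present (only valid after "-smp")
        m.group(4) is not None,  # "-xlc" present
    )


if __name__ == "__main__":
    for name in ("pami-bluegeneq-smp", "pamilrts-bluegeneq-smp-async", "pami-bluegeneq-async"):
        print(name, parse_target(name))
```

Because the pattern nests "-async" inside the optional "-smp" group, a name with "-async" but without "-smp" does not match.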

Actually, scratch that; looking at the code, it has to be
pami-bluegeneq-smp-*, and the tail doesn't matter.

It looks like the change to thread 0 on each node calling exit(1)
instead of exit(0) happened in commit
aac1a19ecf0fdf132ae93987c1271c30bbd33f37. Perhaps Nikhil knows why the
introduction of a local barrier at shutdown to synchronize threads
also included a change to the exit code, and whether it's safe for us
to make that exit(0)?

On Sun, Mar 3, 2013 at 9:09 AM, wei jiang
<wjiang AT alcf.anl.gov>
wrote:
Exit(1) ruins the job dependency. Is there a way to start a new
job regardless of dep_fail? (This is still risky: if the job being
depended on really failed, the later jobs would start from garbage.)

Wei
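
To make the trade-off in this thread concrete, here is a small sketch of a dependency chain that gates each job on the previous job's exit status. It is not Cobalt's implementation: run_chain, fake_job, and the ok_codes parameter are hypothetical, and only the dep_fail state name is borrowed from the thread.

```python
import subprocess
import sys


def run_chain(commands, ok_codes=frozenset({0})):
    """Run commands in order; once one exits outside ok_codes, skip the rest.

    Returns a list of (state, exit_code) tuples, where state is "done",
    "failed", or "dep_fail" (a dependent job that was never started).
    """
    results = []
    blocked = False
    for cmd in commands:
        if blocked:
            results.append(("dep_fail", None))
            continue
        code = subprocess.run(cmd).returncode
        if code in ok_codes:
            results.append(("done", code))
        else:
            results.append(("failed", code))
            blocked = True
    return results


def fake_job(exit_code):
    """Build a command that does nothing except exit with exit_code."""
    return [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]


if __name__ == "__main__":
    # Strict convention: only 0 is success, so an exit(1) blocks the next job.
    print(run_chain([fake_job(1), fake_job(0)]))
    # Lenient variant: accepting 1 unblocks the chain, but a genuine failure
    # that also exits with 1 can no longer be told apart from success.
    print(run_chain([fake_job(1), fake_job(0)], ok_codes={0, 1}))
```

The lenient variant shows the risk raised in the parenthetical above: widening the set of accepted codes lets later jobs start even when the earlier one really failed.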


----- Original Message -----
From: "Joseph Baker"
<jlbaker AT uchicago.edu>
To: "early-users-discuss"
<early-users-discuss AT lists.alcf.anl.gov>
Sent: Sunday, March 3, 2013 3:46:46 AM
Subject: [Early-users-discuss] NAMD and job dependency






Hi,


NAMD is now shutting itself down correctly when finished running;
however, if I include a job dependency in the qsub command for a
subsequent NAMD job, it goes into the dep_fail state after the first
job finishes. I looked at the Vesta user guide online, and it says
that an exit status of 0 is expected for the dependency not to
enter the fail state.


In the error log there are these lines


2013-03-03 08:04:26.526 (INFO ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: job 126717 started
2013-03-03 08:04:43.906 (INFO ) [0xfff99191ab0] 7653:tatu.runjob.monitor: tracklib completed
2013-03-03 08:16:36.620 (INFO ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: exited with status 1
2013-03-03 08:16:36.623 (WARN ) [0xfff7aa6a270] VST-00040-33771-1024:126717:ibm.runjob.client.Job: normal termination with status 1 from rank 397
2013-03-03 08:16:36.623 (INFO ) [0xfff7aa6a270] tatu.runjob.client: task exited with status 1
2013-03-03 08:16:36.624 (INFO ) [0xfff99191ab0] 7653:tatu.runjob.monitor: monitor terminating
2013-03-03 08:16:36.625 (INFO ) [0xfff7aa6a270] tatu.runjob.client: monitor completed


and in the cobaltlog, it ends with

Info: task completed normally with an exit code of 1; initiating job cleanup and removal


So, it looks from this that the NAMD exit code is 1. Is this what is
causing the problem with the job dependency?
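
A quick way to pull the exit status out of such a log is sketched below. The pattern is inferred only from the lines quoted above, not from runjob documentation, and EXIT_RE and exit_status_from_log are hypothetical names:

```python
import re

# Matches the "exited with status N" message from ibm.runjob.client.Job,
# as it appears in the error log quoted above.
EXIT_RE = re.compile(r"ibm\.runjob\.client\.Job: exited with status (\d+)")


def exit_status_from_log(text):
    """Return the exit status reported in a runjob log, or None if absent."""
    # Collapse all whitespace so a message wrapped across lines still matches.
    flat = " ".join(text.split())
    match = EXIT_RE.search(flat)
    return int(match.group(1)) if match else None


# One log entry copied from the message above, including its line wraps.
SAMPLE = """\
2013-03-03 08:16:36.620 (INFO ) [0xfff7aa6a270]
VST-00040-33771-1024:126717:ibm.runjob.client.Job: exited with
status 1
"""

if __name__ == "__main__":
    print(exit_status_from_log(SAMPLE))
```

Collapsing whitespace before matching keeps the sketch tolerant of mail-client line wrapping like that in the quoted log.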


Thanks,
Joe


_______________________________________________
early-users-discuss mailing list
early-users-discuss AT lists.alcf.anl.gov
https://lists.alcf.anl.gov/mailman/listinfo/early-users-discuss




--
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond AT alcf.anl.gov
/ (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond

_______________________________________________
charm mailing list
charm AT cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/charm
_______________________________________________
ppl mailing list
ppl AT cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/ppl






