charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

Re: [charm] Deadlock detection

From: Jozsef Bakosi <jbakosi AT lanl.gov>
To: "Kale, Laxmikant V" <kale AT illinois.edu>
Cc: "Miller, Philip B" <mille121 AT illinois.edu>, Vinicius Freitas <vinicius.mct.freitas AT gmail.com>, charm <charm AT lists.cs.illinois.edu>
Subject: Re: [charm] Deadlock detection
Date: Mon, 2 Jul 2018 14:58:58 -0600

Thanks, Sanjay, for the detailed explanation. More questions and answers below
inline.

On 06.29.2018 22:54, Kale, Laxmikant V wrote:
> > Can I assume that if an algorithm is NOT specifically designed with relying
> > on quiescence in mind, quiescence occurring is (likely) a sign of error? If
> > so, how likely?
> Yes. 100%. Quiescence should not be triggered in such a case, except because
> of a bug.

Okay, that is great to hear. I know below you are not advocating running with
QD always, but I think I will add a command line option to enable QD for
(regression) tests (and CI) and rely on it for catching errors, with QD off by
default.
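
A minimal sketch of such an opt-in switch, in plain C++. The flag name
"--quiescence" and the function name are hypothetical, chosen only for
illustration. In a Charm++ main chare, a true result would be the place to
register a callback with CkStartQD; that part is omitted here because it needs
the Charm++ runtime.

```cpp
#include <string>
#include <vector>

// Returns true if the hypothetical flag "--quiescence" is among the arguments.
// QD stays off unless the flag is given, so production runs pay no overhead.
bool quiescenceRequested(const std::vector<std::string>& args) {
    for (const std::string& a : args) {
        if (a == "--quiescence") return true;
    }
    return false;
}
```

Regression tests and CI would pass the flag; ordinary runs would not.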

> How does it work:
> Quiescence detection (QD) works by maintaining counts of created and processed
> messages on each processor, and a continuously-running distributed algorithm
> to check those counts (whether they have changed over successive iterations,
> and whether the totals of created and processed messages are equal).
> http://charm.cs.illinois.edu/papers/93-11
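
The counting check described above can be sketched in plain C++ as follows.
This is a sequential illustration under stated assumptions, not the Charm++
implementation: the type and function names are hypothetical, and the real
algorithm gathers the counts in a distributed fashion rather than in a loop.

```cpp
#include <vector>

// Per-processor message counters (hypothetical names, for illustration only).
struct Counts {
    long created = 0;
    long processed = 0;
};

// Totals gathered in one pass over all processors.
struct Snapshot {
    long created = 0;
    long processed = 0;
    bool operator==(const Snapshot& o) const {
        return created == o.created && processed == o.processed;
    }
};

// Sum the counters of all processors into one snapshot.
Snapshot gather(const std::vector<Counts>& procs) {
    Snapshot s;
    for (const Counts& c : procs) {
        s.created += c.created;
        s.processed += c.processed;
    }
    return s;
}

// Quiescence is declared only if two successive passes saw identical totals
// and every created message has also been processed.
bool quiescent(const Snapshot& prev, const Snapshot& curr) {
    return prev == curr && curr.created == curr.processed;
}
```

Requiring two matching passes guards against a message that is created and
processed between the visits to two different processors within one pass.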

Nice! Thanks for the link to the paper. So the algorithm does not rely on a
prescribed frequency (or rather, an idle-time interval), which would be hard to
guess anyway. I have a question regarding sec 2, step 2 "idle": It says "An idle
message signifies that each processor in the sub-tree below has been idle at
least once since the last idle message". Am I correct that this means an idle
message is sent up to the parent whenever a chare has processed all of its
messages? The implementation may be more sophisticated, but in principle, is
that what "idle" means?

> The algorithm is self-limiting, and so will run with relatively low overhead
> in the background; but there is some overhead. I would not advocate always
> running with QD solely for the purpose of catching errors. Nondeterministic
> errors may cause other types of issues beyond deadlock anyway.
>
> I like Phil's suggestion of having charmdebug show the states of sdag chares
> in a collapsed form. I will see if I can get one of us to develop it. In the
> meantime, the other idea Phil suggested is useful: QD callback to print
> status of each chare (or maybe add a reduction to collapse the count of chares
> in each state).

That is indeed a good suggestion. I will put in an optional QD reduction, which
is already a great help compared to a random deadlock, and will put the
gathering of more meaningful states on our to-do list.
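
A minimal sketch of the "count of chares in each state" reduction, in plain
C++. The state names and function names are hypothetical, and the loop merely
stands in for a sum reduction across chares; in Charm++ each chare would
contribute its one-hot vector to a sum reduction from the QD callback instead.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical chare states; a real application would define its own.
enum class State : std::size_t {
    WaitingForMesh = 0,
    WaitingForGhosts = 1,
    Done = 2,
    Count = 3  // number of states, not a state itself
};

using Histogram = std::array<int, static_cast<std::size_t>(State::Count)>;

// What one chare would contribute: a one-hot vector marking its current state.
Histogram contribution(State s) {
    Histogram h{};
    h[static_cast<std::size_t>(s)] = 1;
    return h;
}

// Element-wise sum of all contributions, standing in for the reduction.
Histogram sumStates(const std::vector<State>& chares) {
    Histogram total{};
    for (State s : chares) {
        const Histogram h = contribution(s);
        for (std::size_t i = 0; i < total.size(); ++i) total[i] += h[i];
    }
    return total;
}
```

The result is one small vector regardless of the number of chares, so printing
it when quiescence is detected shows at a glance where the chares are stuck.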

Thanks,
Jozsef
