charm - Re: [charm] mis-matched client callbacks in reduction messages

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Jozsef Bakosi <jbakosi AT lanl.gov>
  • To: Phil Miller <mille121 AT illinois.edu>
  • Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] mis-matched client callbacks in reduction messages
  • Date: Mon, 6 Nov 2017 07:32:04 -0700

Thanks for the update. I will revert my code to a known failing state and will
test it. I very much appreciate the help with this as I will surely run into
it again.

Jozsef

On 11.03.2017 15:09, Phil Miller wrote:
> You can try out a rough patch to print basic details of the mis-matched
> reductions here:
>
> https://charm.cs.illinois.edu/gerrit/3227
>
> Right now, it will just print the reducer and callback types as
> numbers. Deeper information would require a lot more code, and those
> values should be enough to tell a couple of suspect contribute() calls
> apart.
>
> On Fri, Nov 3, 2017 at 2:58 PM, Phil Miller <mille121 AT illinois.edu> wrote:
>
> Hi Jozsef,
>
> It's not whining at all. This is a bothersome problem to address.
>
> Randomized queues will change the order in which available messages get
> delivered to individual chares. If there is some perverse order it
> creates that leads to an inconsistent reduction sequence, that order
> could just as well occur by chance in non-randomized execution. Note
> that randomization only operates on messages queued for delivery to
> objects - the objects themselves can (and must) structure and sequence
> their processing to ensure consistent operation. Multiple reductions
> are thus *not* inconsistent with randomized queueing. If the abort is
> triggered, there's a message delivery order that causes different
> elements in an array to make a different sequence of contribute calls.
>
> In the particular case you present, I would recommend sparing yourself
> the bulk of the frustrating reasoning about message ordering, and
> moving the contribution of the diagnostics onto a bound array.
>
> If you are running under randomized queues and still getting the error,
> there may be more going on than is apparent in the code you presented.
> I'm putting together a patch that will provide deeper diagnostic
> information for you.
>
> Phil
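
The point about different elements making a different sequence of contribute calls can be illustrated with a small standalone model. This is only a sketch, not Charm++ code: it assumes that, within one reduction context, the n-th contribute() of each element is matched with the n-th contribute() of every other element. The names ReductionContext and hasMismatch and the callback labels are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for a single reduction context shared by all elements of an
// array. Assumption: contributions are matched purely by how many times each
// element has already called contribute().
class ReductionContext {
public:
    explicit ReductionContext(std::size_t nElements) : nextSeq_(nElements, 0) {}

    // Record one contribute() call; `callback` is a hypothetical label for
    // the reduction's client callback.
    void contribute(std::size_t element, const std::string& callback) {
        assert(element < nextSeq_.size());
        const int seq = nextSeq_[element]++;
        auto it = callbackOfSeq_.find(seq);
        if (it == callbackOfSeq_.end()) {
            // The first contribution seen for this sequence number fixes the callback.
            callbackOfSeq_.emplace(seq, callback);
        } else if (it->second != callback) {
            // Same sequence number, different client callback.
            mismatch_ = true;
        }
    }

    bool mismatch() const { return mismatch_; }

private:
    std::vector<int> nextSeq_;                  // per-element contribute() count
    std::map<int, std::string> callbackOfSeq_;  // callback expected per sequence number
    bool mismatch_ = false;
};

// callOrder[e] lists the callbacks element e contributes to, in call order.
bool hasMismatch(const std::vector<std::vector<std::string>>& callOrder) {
    ReductionContext ctx(callOrder.size());
    for (std::size_t e = 0; e < callOrder.size(); ++e)
        for (const std::string& cb : callOrder[e])
            ctx.contribute(e, cb);
    return ctx.mismatch();
}
```

In this model, two elements that both contribute to "advance" and then to "diagnostics" are consistent, while a run in which one element swaps the two calls is flagged, whatever the cause of the swap.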
>
> On Thu, Nov 2, 2017 at 12:24 PM, Jozsef Bakosi <jbakosi AT lanl.gov> wrote:
>
> Hi Phil,
>
> Sorry for the whining, but this error is giving me way too much trouble
> and I don't think my understanding is getting better.
>
> So I am successfully using shadow arrays and they do appear to work
> around this problem. (So far I have only tried this successfully with
> groups, though.)
>
> Since I have mainly been getting this problem with multiple reductions
> using the randomized-queue build of Charm++, I wonder whether it even
> makes sense to require logic involving SDAG and multiple reductions to
> execute correctly (i.e., without this error) under randomized queues.
> I am thinking that randomized queues will most likely fire off multiple
> reductions in a different (i.e., random) order, effectively taking the
> ordering out of my hands. Do you think that's true? Aren't multiple
> reductions inherently incompatible with randomized queues?
>
> To make it more concrete, I have the following simplified scenario in
> pseudo-code:
> class ChareArray : public CBase_ChareArray {
>   /*entry*/ void dt() {
>     // compute some dt specific to this array element
>     double dt = ...
>     // allreduce:
>     contribute( to all elements of ChareArray targeting advance(mindt),
>                 delivering the minimum of some dt to all elements )
>   }
>   /*entry*/ void advance(double mindt) {
>     contribute( to some single chare collecting some diagnostics )
>     if (continue time stepping)
>       dt();
>     else
>       contribute( to some single chare eventually calling ckExit() )
>   }
> }
> So during time stepping there are really two contribute calls and I'm
> pretty sure these two generate the "mis-matched client callbacks in
> reduction messages" error. (I don't think the logic gets to the
> contribute that will eventually get to ckExit().)
>
> When I start one of them from a bound/shadow array, I still get the
> error but only with randomized queues. The order of contributions to
> the two reductions (per single chare), I believe, is guaranteed here.
> But won't randomized queues screw up the order? Can that even be done?
> Do I want too much?
>
> Jozsef
>
> On 10.29.2017 17:22, Jozsef Bakosi wrote:
> > On 10.27.2017 11:38, Jozsef Bakosi wrote:
> > > On 10.27.2017 11:02, Phil Miller wrote:
> > > > We use an approach of creating bound 'shadow' arrays to act as
> > > > independent reduction (sequencing) contexts to address this
> > > > limitation. We've used this approach in a few places in our code,
> > > > including the LiveViz in-situ visualization library and the
> > > > collision detection library.
> > > >
> > > > In a little more detail, when constructing a chare array, it's
> > > > possible to specify that it should be bound to another existing
> > > > chare array. That means that elements of the same index will always
> > > > live on the same PE. So, you can instantiate some auxiliary arrays,
> > > > one per reduction stream, and bind them to your main computation
> > > > arrays. Since elements with corresponding indices are guaranteed to
> > > > be co-located, the main element can get a pointer to each auxiliary
> > > > via a ckLocal() call, and then call aux->contribute(...) rather
> > > > than implicitly this->contribute(). So, the setup code gets a bit
> > > > more complicated, and the code actually invoking the reductions
> > > > gets just a little more involved.
> > > >
> > > > Is that a clear description? Does that approach work for you?
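
The idea of one sequencing context per reduction stream can be sketched with a standalone model. This is a toy illustration, not Charm++ code and not the implementation used in LiveViz: it assumes each context matches the n-th contribution of every element, and it uses one SequencingContext per stream to stand in for one shadow array per stream. All type, function, and label names below are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One contribute() call: the reduction stream it belongs to and a
// hypothetical label for its client callback.
struct Contribution {
    std::string stream;
    std::string callback;
};

// Toy sequencing context: the n-th contribution of each element must carry
// the same callback as the n-th contribution of every other element.
class SequencingContext {
public:
    // Returns false if this contribution conflicts with an earlier one.
    bool contribute(std::size_t element, const std::string& callback) {
        assert(!callback.empty());
        const int seq = nextSeq_[element]++;  // missing counters start at 0
        auto inserted = callbackOfSeq_.emplace(seq, callback);
        return inserted.first->second == callback;
    }

private:
    std::map<std::size_t, int> nextSeq_;        // per-element contribution count
    std::map<int, std::string> callbackOfSeq_;  // callback expected per sequence number
};

// All streams share one context, as when every call is this->contribute().
bool consistentWithSharedContext(
        const std::vector<std::vector<Contribution>>& calls) {
    SequencingContext shared;
    bool ok = true;
    for (std::size_t e = 0; e < calls.size(); ++e)
        for (const Contribution& c : calls[e])
            ok = shared.contribute(e, c.callback) && ok;
    return ok;
}

// Each stream gets its own context, standing in for one shadow array per stream.
bool consistentWithContextPerStream(
        const std::vector<std::vector<Contribution>>& calls) {
    std::map<std::string, SequencingContext> perStream;
    bool ok = true;
    for (std::size_t e = 0; e < calls.size(); ++e)
        for (const Contribution& c : calls[e])
            ok = perStream[c.stream].contribute(e, c.callback) && ok;
    return ok;
}
```

In this model, two elements that issue the same two reductions in opposite order conflict when they share a context but not when each stream has its own. Within a single stream, every element must still contribute in a consistent order.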
> > >
> > > I think that would work and I do use bound arrays for a different
> > > purpose.
> > >
> > > So how would I have to use this? Here is what I think I need to do:
> > > I have to identify all reductions that can happen in an order that
> > > is not necessarily guaranteed to be always the same and fire them
> > > from bound arrays instead (each from a different chare array)?
> >
> > Is there a way to tell which two reductions caused the "mis-matched
> > client callbacks in reduction messages" error? I do get a traceback
> > from one, but can I get one from the other one somehow so I know which
> > reduction I have to initiate from a shadow array?
> >
> > Thanks,
> > J


