charm - Re: [charm] mis-matched client callbacks in reduction messages

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Phil Miller <mille121 AT illinois.edu>
  • To: Jozsef Bakosi <jbakosi AT lanl.gov>
  • Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] mis-matched client callbacks in reduction messages
  • Date: Fri, 3 Nov 2017 14:58:43 -0500
  • Authentication-results: illinois.edu; spf=pass smtp.mailfrom=unmobile AT gmail.com

Hi Jozsef,

It's not whining at all. This is a bothersome problem to address.

Randomized queues change the order in which available messages are delivered to individual chares. If some perverse order they produce leads to an inconsistent reduction sequence, that same order could also occur by chance in non-randomized execution. Note that the randomization only operates on messages queued for delivery to objects; the objects themselves can (and must) structure and sequence their processing to ensure consistent operation. Multiple reductions are thus *not* inconsistent with randomized queueing. If the abort is triggered, there is a message delivery order that causes different elements of an array to make different sequences of contribute calls.
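
To make the failure mode concrete, here is a toy model in plain C++ (deliberately not Charm++ code). It assumes, as a simplification, that each element's n-th contribute call is matched against every other element's n-th contribute call on the same array. The names Element and firstMismatch are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only, not Charm++ API: callbacks[n] records the target of the
// n-th contribute() call made by this element.
struct Element {
    std::vector<std::string> callbacks;
    void contribute(const std::string& target) { callbacks.push_back(target); }
};

// Returns the smallest position at which some element's callback differs
// from element 0's callback at that position, or -1 if none differs.
// Positions an element has not reached yet are not counted as mismatches.
long firstMismatch(const std::vector<Element>& elems) {
    long first = -1;
    if (elems.empty()) return first;
    const std::vector<std::string>& ref = elems[0].callbacks;
    for (const Element& e : elems) {
        std::size_t n = std::min(e.callbacks.size(), ref.size());
        for (std::size_t i = 0; i < n; ++i) {
            if (e.callbacks[i] != ref[i]) {
                long idx = static_cast<long>(i);
                if (first < 0 || idx < first) first = idx;
                break;
            }
        }
    }
    return first;
}
```

If one element contributes to "advance" and then to "diagnostics" while another does the same two calls in the opposite order, firstMismatch reports position 0. In this model that is the situation the abort describes. If both elements use the same order, it returns -1.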

In the particular case you present, I would recommend sparing yourself the bulk of the frustrating reasoning about message ordering, and moving the contribution of the diagnostics onto a bound array.
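
As a rough sketch of why a separate context helps, the following toy model (plain C++, not Charm++ API) gives each main element a second, independent contribution sequence that stands in for its co-located bound element. ReductionContext, MainElement, agree, and streamsConsistent are hypothetical names, and the model assumes contributions are only ever matched within one context.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only: one independent sequence of contribute() targets.
struct ReductionContext {
    std::vector<std::string> callbacks;
    void contribute(const std::string& target) { callbacks.push_back(target); }
};

// A main element together with its bound (shadow) element.
struct MainElement {
    ReductionContext self;    // e.g. used for the min-dt allreduce
    ReductionContext shadow;  // e.g. used only for the diagnostics reduction
};

// True if the two contexts agree on every position both have reached.
bool agree(const ReductionContext& x, const ReductionContext& y) {
    std::size_t n = std::min(x.callbacks.size(), y.callbacks.size());
    return std::equal(x.callbacks.begin(),
                      x.callbacks.begin() + static_cast<std::ptrdiff_t>(n),
                      y.callbacks.begin());
}

// Checks each stream separately: main contexts against main contexts,
// shadow contexts against shadow contexts.
bool streamsConsistent(const std::vector<MainElement>& elems) {
    for (const MainElement& e : elems) {
        if (!agree(e.self, elems[0].self)) return false;
        if (!agree(e.shadow, elems[0].shadow)) return false;
    }
    return true;
}
```

In this model, two elements that issue the "advance" and "diagnostics" contributions in opposite orders still pass streamsConsistent as long as the diagnostics go through shadow. Pushing both through self in opposite orders fails the check.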

If you still get the error when running under randomized queues, there may be more going on than is apparent in the code you presented. I'm putting together a patch that will provide deeper diagnostic information for you.

Phil

On Thu, Nov 2, 2017 at 12:24 PM, Jozsef Bakosi <jbakosi AT lanl.gov> wrote:
Hi Phil,

Sorry for the whining, but this error is giving me way too much trouble and I
don't think my understanding is getting better.

So I am successfully using shadow arrays and they do appear to work around this
problem. (So far, though, I have only tried this successfully with groups.)

Since I have mainly been getting this problem with multiple reductions using the
randomized-queue build of Charm++, I wonder whether my requirement that logic
involving SDAG and multiple reductions execute correctly (i.e., without this
error) even makes sense with randomized queues. I am thinking that randomized
queues will most likely fire off multiple reductions in a different (i.e.,
random) order, effectively taking the ordering out of my hands. Do you think
that's true? Aren't multiple reductions inherently incompatible with randomized
queues?

To make it more concrete, I have the following simplified scenario in
pseudo-code:

class ChareArray : public CBase_ChareArray {

/*entry*/ void dt() {
  // compute some dt specific to this array element
  double dt = ...
  // allreduce:
  contribute( to all elements of ChareArray targeting advance(mindt) delivering
              the minimum of some dt to all elements )
}

/*entry*/ void advance(double mindt) {
  contribute( to some single chare collecting some diagnostics )
  if (continue time stepping)
    dt();
  else
    contribute( to some single chare eventually calling ckExit() )
}

}

So during time stepping there are really two contribute calls and I'm pretty
sure these two generate the "mis-matched client callbacks in reduction messages"
error. (I don't think the logic gets to the contribute that will eventually get
to ckExit().)

When I start one of them from a bound/shadow array, I still get the error but
only with randomized queues. The order of contributions to the two reductions
(per single chare), I believe, is guaranteed here. But won't randomized queues
screw up the order? Can that even be done? Do I want too much?

Jozsef

On 10.29.2017 17:22, Jozsef Bakosi wrote:
> On 10.27.2017 11:38, Jozsef Bakosi wrote:
> > On 10.27.2017 11:02, Phil Miller wrote:
> > >    We use an approach of creating bound 'shadow' arrays to act as
> > >    independent reduction (sequencing) contexts to address this limitation.
> > >    We've used this approach in a few places in our code, including the
> > >    LiveViz in-situ visualization library and the collision detection
> > >    library.
> > >    In a little more detail, when constructing a chare array, it's possible
> > >    to specify that it should be bound to another existing chare array.
> > >    That means that elements of the same index will always live on the same
> > >    PE. So, you can instantiate some auxiliary arrays, one per reduction
> > >    stream, and bind them to your main computation arrays. Since elements
> > >    with corresponding indices are guaranteed to be co-located, the main
> > >    element can get a pointer to each auxiliary via a ckLocal() call, and
> > >    then call aux->contribute(...) rather than implicitly
> > >    this->contribute(). So, the setup code gets a bit more complicated, and
> > >    the code actually invoking the reductions gets just a little more
> > >    involved.
> > >    Is that a clear description? Does that approach work for you?
> >
> > I think that would work and I do use bound arrays for a different purpose.
> >
> > So how would I have to use this? Here is what I think I need to do: I have to
> > identify all reductions that can happen in an order that is not necessarily
> > guaranteed to be always the same and fire them from bound arrays instead (each
> > from a different chare array)?
>
> Is there a way to tell which two reductions caused the "mis-matched client
> callbacks in reduction messages" error? I do get a traceback from one, but can I
> get one from the other one somehow so I know which reduction I have to initiate
> from a shadow array?
>
> Thanks,
> J



