
charm - Re: [charm] mis-matched client callbacks in reduction messages

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system


  • From: Jozsef Bakosi <jbakosi AT lanl.gov>
  • To: "Kale, Laxmikant V" <kale AT illinois.edu>
  • Cc: "Miller, Philip B" <mille121 AT illinois.edu>, "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] mis-matched client callbacks in reduction messages
  • Date: Mon, 6 Nov 2017 07:42:01 -0700
  • Authentication-results: illinois.edu; spf=pass smtp.mailfrom=jbakosi AT lanl.gov

Thanks, Sanjay, for the analysis. This is very helpful.

I attempted (and failed) to reproduce the problem in the attached minimal
example. It runs fine with randomized queues. This, together with Phil's help,
pointed me to look for the error somewhere else, which I believe I have finally
found, as discussed in the previous emails.

Here is the example (which fails to reproduce the problem because it is
correct), in case anyone finds it useful.

Jozsef

On 11.04.2017 02:09, Kale, Laxmikant V wrote:
> A sanity check:
>
> In your example, there is no chance that the contributions will get out of
> order *unless* the “if (continue time stepping)” condition evaluates to
> different values on different chares in the same iteration. Are you sure it
> is identical on all chares?
>
> (E.g., if it is a floating-point comparison, it could be off. If it is a
> convergence test, it's better to do a Boolean reduction. More generally, if
> the expression being evaluated is not identical across all participating
> chares, you would get some chares calling dt() while others don’t, in some
> iteration.)
>
> Of course, this will be a red herring since you say that no chare is going
> to the “else” clause leading to CkExit() path. But just worth confirming.
>
> Also, if you had “thisProxy.dt();” (sending the invocation via the
> scheduler’s queue), the randomized queue might enter the reasoning, but only
> if another advance() call happens before dt() is processed. But I assume the
> next iteration cannot start until the current one is finished.
>
> -Sanjay
>
> On 11/2/17, 12:24 PM, "Jozsef Bakosi" <jbakosi AT lanl.gov> wrote:
>
> Hi Phil,
>
> Sorry for the whining, but this error is giving me way too much trouble, and
> I don't think my understanding is getting better.
>
> So I am successfully using shadow arrays, and they do appear to work around
> this problem. (So far I have only tried this, successfully, with groups.)
>
> Since I have mainly been getting this problem with multiple reductions using
> the randomized-queue build of Charm++, I wonder whether my requirement, that
> logic involving SDAG and multiple reductions execute correctly (i.e., without
> this error), makes sense even with randomized queues. I am thinking that
> randomized queues will most likely fire off multiple reductions in a
> different (i.e., random) order, effectively taking the ordering out of my
> hands. Do you think that's true? Aren't multiple reductions inherently
> incompatible with randomized queues?
>
> To make it more concrete, I have the following simplified scenario in
> pseudo-code:
>
> class ChareArray : public CBase_ChareArray {
>
>   /*entry*/ void dt() {
>     // compute some dt specific to this array element
>     double dt = ...
>     // allreduce:
>     contribute( to all elements of ChareArray targeting advance(mindt),
>                 delivering the minimum of some dt to all elements )
>   }
>
>   /*entry*/ void advance(double mindt) {
>     contribute( to some single chare collecting some diagnostics )
>     if (continue time stepping)
>       dt();
>     else
>       contribute( to some single chare eventually calling CkExit() )
>   }
>
> }
>
> So during time stepping there are really two contribute calls, and I'm
> pretty sure these two generate the "mis-matched client callbacks in
> reduction messages" error. (I don't think the logic gets to the contribute
> that will eventually get to CkExit().)
>
> When I start one of them from a bound/shadow array, I still get the error,
> but only with randomized queues. The order of contributions to the two
> reductions (per single chare), I believe, is guaranteed here. But won't
> randomized queues screw up the order? Can that even be done? Do I want too
> much?
>
> Jozsef
>
> On 10.29.2017 17:22, Jozsef Bakosi wrote:
> > On 10.27.2017 11:38, Jozsef Bakosi wrote:
> > > On 10.27.2017 11:02, Phil Miller wrote:
> > > > We use an approach of creating bound 'shadow' arrays to act as
> > > > independent reduction (sequencing) contexts to address this
> > > > limitation. We've used this approach in a few places in our code,
> > > > including the LiveViz in-situ visualization library and the collision
> > > > detection library.
> > > >
> > > > In a little more detail: when constructing a chare array, it's
> > > > possible to specify that it should be bound to another existing chare
> > > > array. That means that elements with the same index will always live
> > > > on the same PE. So, you can instantiate some auxiliary arrays, one per
> > > > reduction stream, and bind them to your main computation arrays. Since
> > > > elements with corresponding indices are guaranteed to be co-located,
> > > > the main element can get a pointer to each auxiliary via a ckLocal()
> > > > call, and then call aux->contribute(...) rather than implicitly
> > > > this->contribute(). So, the setup code gets a bit more complicated,
> > > > and the code actually invoking the reductions gets just a little more
> > > > involved.
> > > >
> > > > Is that a clear description? Does that approach work for you?
> > >
> > > I think that would work, and I do use bound arrays for a different
> > > purpose.
> > >
> > > So how would I have to use this? Here is what I think I need to do: I
> > > have to identify all reductions that can happen in an order that is not
> > > guaranteed to always be the same, and fire them from bound arrays
> > > instead (each from a different chare array)?
> >
> > Is there a way to tell which two reductions caused the "mis-matched
> > client callbacks in reduction messages" error? I do get a traceback from
> > one, but can I get one from the other somehow, so I know which reduction I
> > have to initiate from a shadow array?
> >
> > Thanks,
> > J

// insert.ci
mainmodule insert {

  mainchare main {
    entry main( CkArgMsg* );
    entry [reductiontarget] void finish();
    entry [reductiontarget] void diagnostics();
  };

  array [1D] ChArray {
    entry ChArray( const CProxy_main& h );
    entry void dt();
    entry [reductiontarget] void advance();
  };

};
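
// insert.h
#include "insert.decl.h"

// 1D chare array whose elements iterate via reductions.
class ChArray : public CBase_ChArray {
  public:
    explicit ChArray( const CProxy_main& host );
    explicit ChArray( CkMigrateMessage* ) {}
    void dt();        // contributes to the reduction that targets advance()
    void advance();   // reduction target: one iteration step
  private:
    CProxy_main host; // proxy to the main chare
    int it;           // current iteration count
    int maxit;        // number of iterations to run
};

// Main chare: receives the diagnostics and finish reductions.
class main : public CBase_main {
  public:
    main(CkMigrateMessage *m) {}
    main(CkArgMsg *m);
    void diagnostics();
    void finish() { CkExit(); }
};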
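
// insert.C
#include "insert.h"

// Initializers listed in member declaration order (host, it, maxit).
ChArray::ChArray( const CProxy_main& h ) : host( h ), it( 0 ), maxit( 100 ) {
  dt();   // start the first iteration
}

// Empty reduction over the array; its result is broadcast to advance().
void ChArray::dt() {
  contribute( CkCallback( CkReductionTarget( ChArray, advance ), thisProxy ) );
}

// One iteration: report to main, then either finish or iterate again.
void ChArray::advance() {
  contribute( CkCallback( CkReductionTarget( main, diagnostics ), host ) );
  if (++it == maxit)
    contribute( CkCallback( CkReductionTarget( main, finish ), host ) );
  else
    dt();
}

void main::diagnostics() {}

// Create two array elements per PE; their constructors start the iteration.
main::main(CkArgMsg *m)
{
  CProxy_ChArray::ckNew( thisProxy, 2*CkNumPes() );
  delete m;
}

#include "insert.def.h"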

# Makefile
# modify the path of CHARMC
CHARMC=<charm-install>/bin/charmc -lc++ $(OPTS)

all: insert

insert: insert.o
	$(CHARMC) insert.o -o insert -language charm++

insert.o: insert.C insert.h insert.decl.h insert.def.h
	$(CHARMC) -c insert.C

insert.decl.h insert.def.h: insert.ci
	$(CHARMC) insert.ci

clean:
	rm -f insert *.o *.decl.h *.def.h *~ charmrun


