
charm - Re: [charm] mis-matched client callbacks in reduction messages

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Jozsef Bakosi <jbakosi AT lanl.gov>
  • To: Phil Miller <mille121 AT illinois.edu>
  • Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>, Evan Ramos <evan AT hpccharm.com>
  • Subject: Re: [charm] mis-matched client callbacks in reduction messages
  • Date: Tue, 7 Nov 2017 15:35:32 -0700
  • Authentication-results: illinois.edu; spf=pass smtp.mailfrom=jbakosi AT lanl.gov

On 11.07.2017 16:31, Phil Miller wrote:
> On Tue, Nov 7, 2017 at 4:30 PM, Jozsef Bakosi <[1]jbakosi AT lanl.gov>
> wrote:
>
> Hi Phil,
> I'm having a hard time with that checkout. Here is what I do:
>
>   git clone [2]https://charm.cs.illinois.edu/gerrit/charm && cd charm
>   git fetch [3]https://charm.cs.illinois.edu/gerrit/charm refs/changes/27/3227/1 && git checkout FETCH_HEAD
>   ./build charm++ mpi-linux-x86_64 --with-prio-type=int --enable-randomized-msgq --suffix randq-debug --build-shared -j36 -g
>
> This is fine, then when I build my code, I get the link error:
>
>   /usr/bin/ld: cannot find -lhwloc_embedded
>
> Does that ring some bells for you? Is that being pulled in by my mpi?
> I'm probably screwing something up...
>
> That's related to some recent changes we've made, though that
> particular failure is kinda surprising to me.
> What machine is this, and what output do you get from the following
> commands?
>
>   which mpicxx
>   mpicxx -show


This is my custom-installed OpenMPI, built with gcc-7 (which is the system
install on debian/testing):

$ which mpicxx
/opt/openmpi/2.0.2/gnu/7/bin/mpicxx

$ mpicxx -show
g++-7 -I/opt/openmpi/2.0.2/gnu/7/include -pthread -Wl,-rpath
-Wl,/opt/openmpi/2.0.2/gnu/7/lib -Wl,--enable-new-dtags
-L/opt/openmpi/2.0.2/gnu/7/lib -lmpi_cxx -lmpi

$ g++-7 --version
g++-7 (Debian 7.2.0-12) 7.2.1 20171025
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ which g++-7
/usr/bin/g++-7


>
> Thanks,
> Jozsef
> On 11.03.2017 15:09, Phil Miller wrote:
> > You can try out a rough patch to print basic details of the
> > mis-matched reductions here:
> >
> > [4]https://charm.cs.illinois.edu/gerrit/3227
> >
> > Right now, it will just say what the reducers and callback types are
> > numerically - deeper information would require a bunch more code,
> > and those bits should be enough to distinguish among a couple of
> > suspect contribute() calls.
> >
> > On Fri, Nov 3, 2017 at 2:58 PM, Phil Miller
> > <[5]mille121 AT illinois.edu> wrote:
> >
> > Hi Jozsef,
> >
> > It's not whining at all. This is a bothersome problem to address.
> >
> > Randomized queues will change the order in which available messages
> > get delivered to individual chares. If there is some perverse order
> > it creates that leads to an inconsistent reduction sequence, that
> > order is entirely possible to occur by chance in non-randomized
> > execution as well. Note that it only operates on messages queued for
> > delivery to objects - the objects themselves can (and must)
> > structure and sequence their processing to ensure consistent
> > operation. Multiple reductions are thus *not* inconsistent with
> > randomized queueing. If the abort is triggered, there's a message
> > delivery order that causes different elements in an array to make a
> > different sequence of contribute calls.
> >
> > In the particular case you present, I would recommend sparing
> > yourself the bulk of the frustrating reasoning about message
> > ordering, and moving the contribution of the diagnostics onto a
> > bound array.
> >
> > When running under randomized queues and still getting the error,
> > there may be more going on than is apparent in the code you
> > presented. I'm putting together a patch that will provide deeper
> > diagnostic information for you.
> > Phil
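
The point about different elements making "a different sequence of contribute calls" can be illustrated with a small toy model. This is a sketch under a simplifying assumption, not Charm++ code and not a description of its internals: each element numbers its own contribute calls, and the k-th call of every element is assumed to feed the k-th reduction. The names `Element`, `deliver`, and `firstMismatch` are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model (NOT Charm++ code): each array element records the callback
// named by each of its contribute() calls, in call order.
struct Element {
    std::vector<std::string> calls;
    void contribute(const std::string& callback) {
        assert(!callback.empty());
        calls.push_back(callback);
    }
};

// An element that handles messages in the given delivery order:
// 'a' triggers a contribution to "advance", anything else to "diagnostics".
Element deliver(const std::string& order) {
    Element e;
    for (char msg : order)
        e.contribute(msg == 'a' ? "advance" : "diagnostics");
    return e;
}

// Index of the first reduction whose contributions name different
// callbacks across elements, or -1 if all elements agree.
int firstMismatch(const std::vector<Element>& elems) {
    std::size_t rounds = 0;
    for (const Element& e : elems) rounds = std::max(rounds, e.calls.size());
    for (std::size_t k = 0; k < rounds; ++k) {
        const std::string* seen = nullptr;
        for (const Element& e : elems) {
            if (k >= e.calls.size()) continue;  // no k-th contribution yet
            if (seen && *seen != e.calls[k]) return static_cast<int>(k);
            seen = &e.calls[k];
        }
    }
    return -1;
}
```

In this model, `firstMismatch({deliver("ad"), deliver("ad")})` reports no mismatch, while `firstMismatch({deliver("ad"), deliver("da")})` flags reduction 0, because the two elements issued their contributions in different orders. If an element's contribute order is fixed by its own control flow rather than by message arrival, the delivery order cannot produce such a mismatch.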
> >
>
> > On Thu, Nov 2, 2017 at 12:24 PM, Jozsef Bakosi wrote:
> >
> > Hi Phil,
> >
> > Sorry for the whining, but this error is giving me way too much
> > trouble and I don't think my understanding is getting better.
> >
> > So I am successfully using shadow arrays and they do appear to work
> > around this problem. (So far I have only tried this successfully
> > with groups, though.)
> >
> > Since I have been mainly getting this problem with multiple
> > reductions using the randomized-queue build of Charm++, I wonder
> > whether my requirement that logic involving SDAG and multiple
> > reductions execute correctly (i.e., without this error) makes sense
> > even with randomized queues. I am thinking that randomized queues
> > will most likely fire off multiple reductions in different (i.e.,
> > random) order, effectively taking the ordering out of my hands. Do
> > you think that's true? Aren't multiple reductions inherently
> > incompatible with randomized queues?
> >
> > To make it more concrete, I have the following simplified scenario
> > in pseudo-code:
> >
> >   class ChareArray : public CBase_ChareArray {
> >     /*entry*/ void dt() {
> >       // compute some dt specific to this array element
> >       double dt = ...
> >       // allreduce:
> >       contribute( to all elements of ChareArray targeting
> >                   advance(mindt), delivering the minimum of some dt
> >                   to all elements )
> >     }
> >     /*entry*/ void advance(double mindt) {
> >       contribute( to some single chare collecting some diagnostics )
> >       if (continue time stepping)
> >         dt();
> >       else
> >         contribute( to some single chare eventually calling ckExit() )
> >     }
> >   }
> >
> > So during time stepping there are really two contribute calls and
> > I'm pretty sure these two generate the "mis-matched client callbacks
> > in reduction messages" error. (I don't think the logic gets to the
> > contribute that will eventually get to ckExit().)
> >
> > When I start one of them from a bound/shadow array, I still get the
> > error but only with randomized queues. The order of contributions to
> > the two reductions (per single chare), I believe, is guaranteed
> > here. But won't randomized queues screw up the order? Can that even
> > be done? Do I want too much?
> >
> > Jozsef
> >
> > On 10.29.2017 17:22, Jozsef Bakosi wrote:
> > > On 10.27.2017 11:38, Jozsef Bakosi wrote:
> > > > On 10.27.2017 11:02, Phil Miller wrote:
> > > > > We use an approach of creating bound 'shadow' arrays to act
> > > > > as independent reduction (sequencing) contexts to address
> > > > > this limitation. We've used this approach in a few places in
> > > > > our code, including the LiveViz in-situ visualization library
> > > > > and the collision detection library.
> > > > >
> > > > > In a little more detail, when constructing a chare array,
> > > > > it's possible to specify that it should be bound to another
> > > > > existing chare array. That means that elements of the same
> > > > > index will always live on the same PE. So, you can
> > > > > instantiate some auxiliary arrays, one per reduction stream,
> > > > > and bind them to your main computation arrays. Since elements
> > > > > with corresponding indices are guaranteed to be co-located,
> > > > > the main element can get a pointer to each auxiliary via a
> > > > > ckLocal() call, and then call aux->contribute(...) rather
> > > > > than implicitly this->contribute(). So, the setup code gets a
> > > > > bit more complicated, and the code actually invoking the
> > > > > reductions gets just a little more involved.
> > > > >
> > > > > Is that a clear description? Does that approach work for you?
> > > >
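
The idea of one reduction stream per auxiliary array can be sketched with a toy model. It is plain C++, not Charm++ code, and it assumes (as a simplification) that each array keeps its own independent sequence of contributions per element. The names `Stream`, `Element`, `contribute`, `deliver`, and `consistent` are hypothetical; the "main" and "shadow" labels stand in for the main computation array and a bound auxiliary array.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy model (NOT Charm++ code). A context stands for one array's
// reduction stream: "main" or a bound "shadow" array.
using Stream = std::vector<std::string>;        // callbacks in call order
using Element = std::map<std::string, Stream>;  // context name -> stream

void contribute(Element& e, const std::string& context,
                const std::string& callback) {
    assert(!callback.empty());
    e[context].push_back(callback);
}

// Handles messages in the given order: 'a' contributes to "advance",
// anything else to "diagnostics". With useShadow, diagnostics go
// through the separate "shadow" context instead of "main".
Element deliver(const std::string& order, bool useShadow) {
    Element e;
    for (char msg : order) {
        if (msg == 'a') contribute(e, "main", "advance");
        else contribute(e, useShadow ? "shadow" : "main", "diagnostics");
    }
    return e;
}

// True if, within every context, the k-th contribution of all elements
// names the same callback.
bool consistent(const std::vector<Element>& elems) {
    std::map<std::string, Stream> reference;  // longest stream seen per context
    for (const Element& e : elems) {
        for (const auto& kv : e) {
            Stream& ref = reference[kv.first];
            const Stream& stream = kv.second;
            for (std::size_t k = 0; k < stream.size(); ++k) {
                if (k < ref.size()) {
                    if (ref[k] != stream[k]) return false;
                } else {
                    ref.push_back(stream[k]);
                }
            }
        }
    }
    return true;
}
```

With a single context, `consistent({deliver("ad", false), deliver("da", false)})` is false, since the two elements interleave their contributions differently. Routing diagnostics through the shadow context makes the same delivery orders consistent, because each stream is now sequenced on its own.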
> > > > I think that would work and I do use bound arrays for a
> > > > different purpose.
> > > >
> > > > So how would I have to use this? Here is what I think I need to
> > > > do: I have to identify all reductions that can happen in an
> > > > order that is not necessarily guaranteed to be always the same
> > > > and fire them from bound arrays instead (each from a different
> > > > chare array)?
> > >
> > > Is there a way to tell which two reductions caused the
> > > "mis-matched client callbacks in reduction messages" error? I do
> > > get a traceback from one, but can I get one from the other one
> > > somehow so I know which reduction I have to initiate from a shadow
> > > array?
> > >
> > > Thanks,
> > > J
>
> References
>
> 1. mailto:jbakosi AT lanl.gov
> 2. https://charm.cs.illinois.edu/gerrit/charm
> 3. https://charm.cs.illinois.edu/gerrit/charm
> 4. https://charm.cs.illinois.edu/gerrit/3227
> 5. mailto:mille121 AT illinois.edu

--
Jozsef Bakosi
Computational Physics and Methods (CCS-2)
MS D413, Los Alamos National Laboratory
Los Alamos, NM 87545, USA
email: jbakosi AT lanl.gov
phone: 505-665-0950
fax: 505-665-4972


