charm - RE: [charm] mis-matched client callbacks in reduction messages

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: "White, Samuel T" <white67 AT illinois.edu>
  • To: Jozsef Bakosi <jbakosi AT lanl.gov>, "Miller, Philip B" <mille121 AT illinois.edu>
  • Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>, Evan Ramos <evan AT hpccharm.com>
  • Subject: RE: [charm] mis-matched client callbacks in reduction messages
  • Date: Mon, 30 Apr 2018 14:22:55 +0000

To enable CMK_ERROR_CHECKING you can either not build with
"--with-production" or you can build with "--with-production
--enable-error-checking".
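
As a rough illustration of what these flags control: code behind the guard
is compiled in only when CMK_ERROR_CHECKING is nonzero. The sketch below is
a stand-alone assumption, not code from ckreduction.C. reportMismatch() is a
hypothetical helper, and the fallback #define exists only so the snippet
compiles outside a Charm++ build, where the build configuration would set
the macro.

```cpp
#include <cstdio>

// Assumption: stand-in value so this compiles without Charm++ headers.
// In a real build the Charm++ configuration defines this macro.
#ifndef CMK_ERROR_CHECKING
#define CMK_ERROR_CHECKING 1
#endif

// Hypothetical helper: returns 1 if a diagnostic was printed, 0 otherwise.
int reportMismatch(int callbackTypeA, int callbackTypeB) {
#if CMK_ERROR_CHECKING
  // This block exists in the binary only when error checking is enabled.
  if (callbackTypeA != callbackTypeB) {
    std::fprintf(stderr, "callback types differ: %d vs %d\n",
                 callbackTypeA, callbackTypeB);
    return 1;
  }
#else
  // Checking disabled: silence unused-parameter warnings.
  (void)callbackTypeA;
  (void)callbackTypeB;
#endif
  return 0;
}
```

With the macro set to 0, reportMismatch() always returns 0 and prints
nothing, which is why a plain production build stays silent.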

-Sam
________________________________________
From: Jozsef Bakosi [jbakosi AT lanl.gov]
Sent: Monday, April 30, 2018 9:19 AM
To: Miller, Philip B
Cc: charm AT lists.cs.illinois.edu; Evan Ramos
Subject: Re: [charm] mis-matched client callbacks in reduction messages

Hi folks,

Regarding the diff at
https://charm.cs.illinois.edu/gerrit/#/c/charm/+/3227/3/src/ck-core/ckreduction.C

Could someone please tell me how I can ensure that the error-checking code
behind "#if CMK_ERROR_CHECKING" runs?

Do I have to build Charm++ in debug or some other way?

Thanks,
Jozsef

On 11.20.2017 11:39, Phil Miller wrote:
> Hi Jozsef,
>
> Could you please try building mainline charm and your code now, and show us
> the full link command and resulting output/errors?
>
> Phil
>
> On Tue, Nov 7, 2017 at 4:35 PM, Jozsef Bakosi <jbakosi AT lanl.gov> wrote:
>
> > On 11.07.2017 16:31, Phil Miller wrote:
> > > > On Tue, Nov 7, 2017 at 4:30 PM, Jozsef Bakosi <jbakosi AT lanl.gov> wrote:
> > > >
> > > > Hi Phil,
> > > > I'm having a hard time with that checkout. Here is what I do:
> > > >   git clone https://charm.cs.illinois.edu/gerrit/charm && cd charm
> > > >   git fetch https://charm.cs.illinois.edu/gerrit/charm refs/changes/27/3227/1 && git checkout FETCH_HEAD
> > > >   ./build charm++ mpi-linux-x86_64 --with-prio-type=int --enable-randomized-msgq --suffix randq-debug --build-shared -j36 -g
> > > > This is fine, then when I build my code, I get the link error:
> > > >   /usr/bin/ld: cannot find -lhwloc_embedded
> > > > Does that ring some bells for you? Is that being pulled in by my mpi?
> > > > I'm probably screwing something up...
> > > >
> > > > That's related to some recent changes we've made, though that
> > > > particular failure is kinda surprising to me.
> > > > What machine is this, and what output do you get from the following
> > > > commands?
> > > >   which mpicxx
> > > >   mpicxx -show
> >
> >
> > This is my custom-installed openmpi based on gcc-7 (which is
> > system-installed on debian/testing):
> >
> > $ which mpicxx
> > /opt/openmpi/2.0.2/gnu/7/bin/mpicxx
> >
> > $ mpicxx -show
> > g++-7 -I/opt/openmpi/2.0.2/gnu/7/include -pthread -Wl,-rpath
> > -Wl,/opt/openmpi/2.0.2/gnu/7/lib -Wl,--enable-new-dtags
> > -L/opt/openmpi/2.0.2/gnu/7/lib -lmpi_cxx -lmpi
> >
> > $ g++-7 --version
> > g++-7 (Debian 7.2.0-12) 7.2.1 20171025
> > Copyright (C) 2017 Free Software Foundation, Inc.
> > This is free software; see the source for copying conditions. There is NO
> > warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
> > PURPOSE.
> >
> > $ which g++-7
> > /usr/bin/g++-7
> >
> >
> > >
> > > Thanks,
> > > Jozsef
> > > On 11.03.2017 15:09, Phil Miller wrote:
> > > > > You can try out a rough patch to print basic details of the
> > > > > mis-matched reductions here:
> > > > >
> > > > > https://charm.cs.illinois.edu/gerrit/3227
> > > > >
> > > > > Right now, it will just say what the reducers and callback types are
> > > > > numerically - deeper information would require a bunch more code, and
> > > > > those bits should be enough to distinguish among a couple of suspect
> > > > > contribute() calls.
> > > > >
> > > > > On Fri, Nov 3, 2017 at 2:58 PM, Phil Miller <mille121 AT illinois.edu> wrote:
> > > >
> > > > > Hi Jozsef,
> > > > >
> > > > > It's not whining at all. This is a bothersome problem to address.
> > > > >
> > > > > Randomized queues will change the order in which available messages
> > > > > get delivered to individual chares. If there is some perverse order it
> > > > > creates that leads to an inconsistent reduction sequence, that order
> > > > > is entirely possible to occur by chance in non-randomized execution as
> > > > > well. Note that it only operates on messages queued for delivery to
> > > > > objects - the objects themselves can (and must) structure and sequence
> > > > > their processing to ensure consistent operation. Multiple reductions
> > > > > are thus *not* inconsistent with randomized queueing. If the abort is
> > > > > triggered, there's a message delivery order that causes different
> > > > > elements in an array to make a different sequence of contribute calls.
> > > > >
> > > > > In the particular case you present, I would recommend sparing yourself
> > > > > the bulk of the frustrating reasoning about message ordering, and
> > > > > moving the contribution of the diagnostics onto a bound array.
> > > > >
> > > > > When running under randomized queues and still getting the error,
> > > > > there may be more going on than is apparent in the code you presented.
> > > > > I'm putting together a patch that will provide deeper diagnostic
> > > > > information for you.
> > > > >
> > > > > Phil
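
The point about differing contribute sequences can be illustrated with a toy
model in plain C++. It assumes, as a simplification, that each element
numbers its own contributions and that contributions carrying the same
number get combined. ReductionContext and the callback names are
hypothetical and are not Charm++ API.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy stand-in for one reduction context (hypothetical, not Charm++ API).
struct ReductionContext {
  std::map<int, int> nextSeq;  // element index -> next sequence number
  std::map<int, std::vector<std::string>> callbacksBySeq;

  // Record a contribution under the element's next sequence number.
  void contribute(int element, const std::string& callback) {
    int seq = nextSeq[element]++;
    callbacksBySeq[seq].push_back(callback);
  }

  // True if any sequence number collected two different callbacks.
  bool hasMismatch() const {
    for (const auto& entry : callbacksBySeq)
      for (const auto& cb : entry.second)
        if (cb != entry.second.front()) return true;
    return false;
  }
};

// Both elements contribute in the same order: no mismatch.
bool consistentOrderMismatch() {
  ReductionContext ctx;
  ctx.contribute(0, "advance");
  ctx.contribute(0, "diagnostics");
  ctx.contribute(1, "advance");
  ctx.contribute(1, "diagnostics");
  return ctx.hasMismatch();
}

// Element 1 contributes in the opposite order: mismatch.
bool swappedOrderMismatch() {
  ReductionContext ctx;
  ctx.contribute(0, "advance");
  ctx.contribute(0, "diagnostics");
  ctx.contribute(1, "diagnostics");
  ctx.contribute(1, "advance");
  return ctx.hasMismatch();
}
```

In this model the queue order between elements does not matter. Only the
order of contribute calls within each element does, which is what the
elements themselves have to keep consistent.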
> > > >
> > >
> > > > On Thu, Nov 2, 2017 at 12:24 PM, Jozsef Bakosi wrote:
> > > >
> > > > > Hi Phil,
> > > > >
> > > > > Sorry for the whining, but this error is giving me way too much
> > > > > trouble and I don't think my understanding is getting better.
> > > > >
> > > > > So I am successfully using shadow arrays and they do appear to work
> > > > > around this problem. (So far I have only tried this successfully with
> > > > > groups.)
> > > > >
> > > > > Since I have mainly been getting this problem with multiple reductions
> > > > > using the randomized-queue build of Charm++, I wonder if my
> > > > > requirement that logic involving SDAG and multiple reductions execute
> > > > > correctly (i.e., without this error) makes sense even with randomized
> > > > > queues. I am thinking that randomized queues will most likely fire off
> > > > > multiple reductions in a different (i.e., random) order, effectively
> > > > > taking the ordering out of my hands. Do you think that's true? Aren't
> > > > > multiple reductions inherently incompatible with randomized queues?
> > > > >
> > > > > To make it more concrete, I have the following simplified scenario in
> > > > > pseudo-code:
> > > > > class ChareArray : public CBase_ChareArray {
> > > > >   /*entry*/ void dt() {
> > > > >     // compute some dt specific to this array element
> > > > >     double dt = ...
> > > > >     // allreduce:
> > > > >     contribute( to all elements of ChareArray targeting advance(mindt),
> > > > >                 delivering the minimum of some dt to all elements )
> > > > >   }
> > > > >   /*entry*/ void advance(double mindt) {
> > > > >     contribute( to some single chare collecting some diagnostics )
> > > > >     if (continue time stepping)
> > > > >       dt();
> > > > >     else
> > > > >       contribute( to some single chare eventually calling ckExit() )
> > > > >   }
> > > > > }
> > > > > So during time stepping there are really two contribute calls and I'm
> > > > > pretty sure these two generate the "mis-matched client callbacks in
> > > > > reduction messages" error. (I don't think the logic gets to the
> > > > > contribute that will eventually get to ckExit().)
> > > > >
> > > > > When I start one of them from a bound/shadow array, I still get the
> > > > > error but only with randomized queues. The order of contributions to
> > > > > the two reductions (per single chare), I believe, is guaranteed here.
> > > > > But won't randomized queues screw up the order? Can that even be done?
> > > > > Do I want too much?
> > > > >
> > > > > Jozsef
> > > >
> > > > On 10.29.2017 17:22, Jozsef Bakosi wrote:
> > > > > On 10.27.2017 11:38, Jozsef Bakosi wrote:
> > > > > > On 10.27.2017 11:02, Phil Miller wrote:
> > > > > > > > We use an approach of creating bound 'shadow' arrays to act as
> > > > > > > > independent reduction (sequencing) contexts to address this
> > > > > > > > limitation. We've used this approach in a few places in our
> > > > > > > > code, including the LiveViz in-situ visualization library and
> > > > > > > > the collision detection library.
> > > > > > > >
> > > > > > > > In a little more detail, when constructing a chare array, it's
> > > > > > > > possible to specify that it should be bound to another existing
> > > > > > > > chare array. That means that elements of the same index will
> > > > > > > > always live on the same PE. So, you can instantiate some
> > > > > > > > auxiliary arrays, one per reduction stream, and bind them to
> > > > > > > > your main computation arrays. Since elements with corresponding
> > > > > > > > indices are guaranteed to be co-located, the main element can
> > > > > > > > get a pointer to each auxiliary via a ckLocal() call, and then
> > > > > > > > call aux->contribute(...) rather than implicitly
> > > > > > > > this->contribute(). So, the setup code gets a bit more
> > > > > > > > complicated, and the code actually invoking the reductions gets
> > > > > > > > just a little more involved.
> > > > > > > >
> > > > > > > > Is that a clear description? Does that approach work for you?
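
The "one auxiliary array per reduction stream" idea can be sketched with a
toy model in plain C++. This is only an assumption-laden illustration:
ReductionContext, Element, and the callback names are hypothetical and are
not Charm++ API. Real code would use bound arrays and ckLocal() as described
above. The model assumes each context numbers contributions per element and
combines those carrying the same number.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy stand-in for one reduction context (hypothetical, not Charm++ API).
struct ReductionContext {
  std::map<int, int> nextSeq;  // element index -> next sequence number
  std::map<int, std::vector<std::string>> callbacksBySeq;

  void contribute(int element, const std::string& callback) {
    int seq = nextSeq[element]++;
    callbacksBySeq[seq].push_back(callback);
  }

  // True if any sequence number collected two different callbacks.
  bool hasMismatch() const {
    for (const auto& entry : callbacksBySeq)
      for (const auto& cb : entry.second)
        if (cb != entry.second.front()) return true;
    return false;
  }
};

// A main element holding one auxiliary context per reduction stream,
// loosely mirroring aux->contribute(...) instead of this->contribute().
struct Element {
  int index;
  ReductionContext* dtStream;    // stream for the minimum-dt reduction
  ReductionContext* diagStream;  // stream for the diagnostics reduction

  void contributeDt() { dtStream->contribute(index, "advance"); }
  void contributeDiag() { diagStream->contribute(index, "diagnostics"); }
};

// Two elements issue their contributions in opposite orders, yet each
// stream sees a consistent sequence, so neither reports a mismatch.
bool shadowContextsMismatch() {
  ReductionContext dt, diag;
  Element a{0, &dt, &diag};
  Element b{1, &dt, &diag};
  a.contributeDt();
  a.contributeDiag();
  b.contributeDiag();
  b.contributeDt();
  return dt.hasMismatch() || diag.hasMismatch();
}
```

The design choice here is that sequencing is tracked per context, so the
relative order of calls across different streams no longer has to agree
between elements.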
> > > > > >
> > > > > > > I think that would work and I do use bound arrays for a different
> > > > > > > purpose.
> > > > > > >
> > > > > > > So how would I have to use this? Here is what I think I need to
> > > > > > > do: I have to identify all reductions that can happen in an order
> > > > > > > that is not necessarily guaranteed to be always the same and fire
> > > > > > > them from bound arrays instead (each from a different chare
> > > > > > > array)?
> > > > >
> > > > > > Is there a way to tell which two reductions caused the "mis-matched
> > > > > > client callbacks in reduction messages" error? I do get a traceback
> > > > > > from one, but can I get one from the other one somehow so I know
> > > > > > which reduction I have to initiate from a shadow array?
> > > > >
> > > > > Thanks,
> > > > > J


