charm - Re: [charm] mis-matched client callbacks in reduction messages

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Jozsef Bakosi <jbakosi AT lanl.gov>
  • To: "White, Samuel T" <white67 AT illinois.edu>
  • Cc: "Miller, Philip B" <mille121 AT illinois.edu>, "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>, Evan Ramos <evan AT hpccharm.com>
  • Subject: Re: [charm] mis-matched client callbacks in reduction messages
  • Date: Mon, 30 Apr 2018 08:31:27 -0600

Excellent! Thanks, Sam.
Jozsef

On 04.30.2018 14:22, White, Samuel T wrote:
> To enable CMK_ERROR_CHECKING you can either not build with
> "--with-production" or you can build with "--with-production
> --enable-error-checking".
>
> -Sam
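
For readers unfamiliar with the guard discussed here: `#if CMK_ERROR_CHECKING` is an ordinary preprocessor conditional, so the guarded checks exist in the binary only if the macro was nonzero when the code was compiled. A minimal plain C++ sketch of that pattern follows. The function names and the fallback `#define` are illustrative assumptions, not code from ckreduction.C.

```cpp
#include <cassert>
#include <string>

// Fallback for this standalone sketch only (an assumption). In a real build
// the value comes from the Charm++ build configuration, not from user code.
#ifndef CMK_ERROR_CHECKING
#define CMK_ERROR_CHECKING 1
#endif

// Hypothetical helper: reports whether the guarded checks were compiled in.
constexpr bool errorChecksCompiledIn() {
#if CMK_ERROR_CHECKING
  return true;
#else
  return false;
#endif
}

// Hypothetical check in the style of a guarded diagnostic: the two callback
// labels are compared only when error checking is enabled at compile time.
bool callbacksAccepted(const std::string& a, const std::string& b) {
#if CMK_ERROR_CHECKING
  return a == b;  // a mismatch is detected and reported to the caller
#else
  (void)a;
  (void)b;
  return true;    // check compiled out: a mismatch goes unnoticed here
#endif
}

// Usage example: matching labels are accepted in either configuration.
void exampleGuard() {
  assert(callbacksAccepted("advance", "advance"));
  assert(errorChecksCompiledIn() == (CMK_ERROR_CHECKING != 0));
}
```

With the macro set to 0 the comparison disappears entirely, which is why a production build without "--enable-error-checking" would not print the extra diagnostics.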
> ________________________________________
> From: Jozsef Bakosi
> [jbakosi AT lanl.gov]
> Sent: Monday, April 30, 2018 9:19 AM
> To: Miller, Philip B
> Cc:
> charm AT lists.cs.illinois.edu;
> Evan Ramos
> Subject: Re: [charm] mis-matched client callbacks in reduction messages
>
> Hi folks,
>
> Regarding the diff at
> https://charm.cs.illinois.edu/gerrit/#/c/charm/+/3227/3/src/ck-core/ckreduction.C
>
> Could someone please tell me how I can ensure that the error-checking code
> behind "#if CMK_ERROR_CHECKING" actually runs?
>
> Do I have to build Charm++ in debug or some other way?
>
> Thanks,
> Jozsef
>
> On 11.20.2017 11:39, Phil Miller wrote:
> > Hi Jozsef,
> >
> > Could you please try building mainline charm and your code now, and show
> > us the full link command and resulting output/errors?
> >
> > Phil
> >
> > On Tue, Nov 7, 2017 at 4:35 PM, Jozsef Bakosi <jbakosi AT lanl.gov> wrote:
> >
> > > On 11.07.2017 16:31, Phil Miller wrote:
> > > > On Tue, Nov 7, 2017 at 4:30 PM, Jozsef Bakosi <jbakosi AT lanl.gov>
> > > > wrote:
> > > >
> > > > Hi Phil,
> > > > I'm having a hard time with that checkout. Here is what I do:
> > > >
> > > > git clone https://charm.cs.illinois.edu/gerrit/charm && cd charm
> > > > git fetch https://charm.cs.illinois.edu/gerrit/charm refs/changes/27/3227/1 && git checkout FETCH_HEAD
> > > > ./build charm++ mpi-linux-x86_64 --with-prio-type=int --enable-randomized-msgq --suffix randq-debug --build-shared -j36 -g
> > > >
> > > > This is fine, then when I build my code, I get the link error:
> > > >
> > > > /usr/bin/ld: cannot find -lhwloc_embedded
> > > >
> > > > Does that ring some bells for you? Is that being pulled in by my mpi?
> > > > I'm probably screwing something up...
> > > >
> > > > That's related to some recent changes we've made, though that
> > > > particular failure is kinda surprising to me.
> > > > What machine is this, and what output do you get from the following
> > > > commands?
> > > > which mpicxx
> > > > mpicxx -show
> > >
> > >
> > > This is my custom-installed openmpi based on gcc-7 (which is
> > > system-installed on debian/testing):
> > >
> > > $ which mpicxx
> > > /opt/openmpi/2.0.2/gnu/7/bin/mpicxx
> > >
> > > $ mpicxx -show
> > > g++-7 -I/opt/openmpi/2.0.2/gnu/7/include -pthread -Wl,-rpath
> > > -Wl,/opt/openmpi/2.0.2/gnu/7/lib -Wl,--enable-new-dtags
> > > -L/opt/openmpi/2.0.2/gnu/7/lib -lmpi_cxx -lmpi
> > >
> > > $ g++-7 --version
> > > g++-7 (Debian 7.2.0-12) 7.2.1 20171025
> > > Copyright (C) 2017 Free Software Foundation, Inc.
> > > This is free software; see the source for copying conditions. There is NO
> > > warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> > >
> > > $ which g++-7
> > > /usr/bin/g++-7
> > >
> > >
> > > >
> > > > Thanks,
> > > > Jozsef
> > > > On 11.03.2017 15:09, Phil Miller wrote:
> > > > > You can try out a rough patch to print basic details of the
> > > > > mis-matched reductions here:
> > > > >
> > > > > https://charm.cs.illinois.edu/gerrit/3227
> > > > >
> > > > > Right now, it will just say what the reducers and callback types are
> > > > > numerically - deeper information would require a bunch more code, and
> > > > > those bits should be enough to identify among a couple suspect
> > > > > contribute() calls.
> > > > >
> > > > > On Fri, Nov 3, 2017 at 2:58 PM, Phil Miller <mille121 AT illinois.edu>
> > > > > wrote:
> > > > >
> > > > > Hi Jozsef,
> > > > >
> > > > > It's not whining at all. This is a bothersome problem to address.
> > > > >
> > > > > Randomized queues will change the order in which available messages
> > > > > get delivered to individual chares. If there is some perverse order
> > > > > it creates that leads to an inconsistent reduction sequence, that
> > > > > order is entirely possible to occur by chance in non-randomized
> > > > > execution as well. Note that it only operates on messages queued for
> > > > > delivery to objects - the objects themselves can (and must) structure
> > > > > and sequence their processing to ensure consistent operation.
> > > > > Multiple reductions are thus *not* inconsistent with randomized
> > > > > queueing. If the abort is triggered, there's a message delivery order
> > > > > that causes different elements in an array to make a different
> > > > > sequence of contribute calls.
> > > > >
> > > > > In the particular case you present, I would recommend sparing
> > > > > yourself the bulk of the frustrating reasoning about message
> > > > > ordering, and moving the contribution of the diagnostics onto a bound
> > > > > array.
> > > > >
> > > > > When running under randomized queues and still getting the error,
> > > > > there may be more going on than is apparent in the code you
> > > > > presented. I'm putting together a patch that will provide deeper
> > > > > diagnostic information for you.
> > > > >
> > > > > Phil
> > > > >
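
The point about elements making different sequences of contribute calls can be illustrated with a toy model in plain C++. It assumes, as a simplification, that the n-th contribute() call of every element is combined into the same reduction, and it labels each call with a callback name. None of the names below are Charm++ API; this is a sketch of the failure mode, not of the runtime's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model, not Charm++ code: one element's contribute() calls, recorded
// as callback names in the order the element issued them.
using ContributeSequence = std::vector<std::string>;

// Returns the index of the first reduction whose callback names disagree
// across elements, or -1 if all recorded sequences are consistent.
long firstMismatch(const std::vector<ContributeSequence>& elements) {
  std::size_t longest = 0;
  for (const auto& seq : elements) {
    longest = std::max(longest, seq.size());
  }
  for (std::size_t i = 0; i < longest; ++i) {
    const std::string* expected = nullptr;
    for (const auto& seq : elements) {
      if (i >= seq.size()) {
        continue;  // this element has not made its i-th contribution yet
      }
      if (expected == nullptr) {
        expected = &seq[i];
      } else if (seq[i] != *expected) {
        return static_cast<long>(i);
      }
    }
  }
  return -1;
}

// Usage example: two elements agree, then one swaps its two contributions.
void exampleMismatch() {
  ContributeSequence inOrder = {"advance", "diagnostics"};
  ContributeSequence swapped = {"diagnostics", "advance"};
  assert(firstMismatch({inOrder, inOrder}) == -1);
  assert(firstMismatch({inOrder, swapped}) == 0);
}
```

In this model the delivery order of messages matters only insofar as it changes the order in which a single element issues its own calls, which matches the explanation above.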
> > > >
> > > > > On Thu, Nov 2, 2017 at 12:24 PM, Jozsef Bakosi wrote:
> > > > >
> > > > > Hi Phil,
> > > > >
> > > > > Sorry for the whining, but this error is giving me way too much
> > > > > trouble and I don't think my understanding is getting better.
> > > > >
> > > > > So I am successfully using shadow arrays and they do appear to work
> > > > > around this problem. (So far I have only tried this successfully
> > > > > with groups.)
> > > > >
> > > > > Since I have been mainly getting this problem with multiple
> > > > > reductions using the randomized-queue build of Charm++, I wonder if
> > > > > my requirement that logic involving SDAG and multiple reductions
> > > > > execute correctly (i.e. without this error) makes sense even with
> > > > > randomized queues. I am thinking that randomized queues will most
> > > > > likely fire off multiple reductions in different (i.e., random)
> > > > > order, effectively taking the ordering out of my hands. Do you think
> > > > > that's true? Aren't multiple reductions inherently incompatible with
> > > > > randomized queues?
> > > > >
> > > > > To make it more concrete, I have the following simplified scenario
> > > > > in pseudo-code:
> > > > >
> > > > > class ChareArray : public CBase_ChareArray {
> > > > >   /*entry*/ void dt() {
> > > > >     // compute some dt specific to this array element
> > > > >     double dt = ...
> > > > >     // allreduce:
> > > > >     contribute( to all elements of ChareArray targeting advance(mindt),
> > > > >                 delivering the minimum of some dt to all elements )
> > > > >   }
> > > > >   /*entry*/ void advance(double mindt) {
> > > > >     contribute( to some single chare collecting some diagnostics )
> > > > >     if (continue time stepping)
> > > > >       dt();
> > > > >     else
> > > > >       contribute( to some single chare eventually calling ckExit() )
> > > > >   }
> > > > > };
> > > > >
> > > > > So during time stepping there are really two contribute calls and
> > > > > I'm pretty sure these two generate the "mis-matched client callbacks
> > > > > in reduction messages" error. (I don't think the logic gets to the
> > > > > contribute that will eventually get to ckExit().)
> > > > >
> > > > > When I start one of them from a bound/shadow array, I still get the
> > > > > error but only with randomized queues. The order of contributions to
> > > > > the two reductions (per single chare), I believe, is guaranteed here.
> > > > > But won't randomized queues screw up the order? Can that even be
> > > > > done? Do I want too much?
> > > > >
> > > > > Jozsef
> > > > >
> > > > > On 10.29.2017 17:22, Jozsef Bakosi wrote:
> > > > > > On 10.27.2017 11:38, Jozsef Bakosi wrote:
> > > > > > > On 10.27.2017 11:02, Phil Miller wrote:
> > > > > > > > We use an approach of creating bound 'shadow' arrays to act as
> > > > > > > > independent reduction (sequencing) contexts to address this
> > > > > > > > limitation. We've used this approach in a few places in our
> > > > > > > > code, including the LiveViz in-situ visualization library and
> > > > > > > > the collision detection library.
> > > > > > > >
> > > > > > > > In a little more detail, when constructing a chare array, it's
> > > > > > > > possible to specify that it should be bound to another existing
> > > > > > > > chare array. That means that elements of the same index will
> > > > > > > > always live on the same PE. So, you can instantiate some
> > > > > > > > auxiliary arrays, one per reduction stream, and bind them to
> > > > > > > > your main computation arrays. Since elements with corresponding
> > > > > > > > indices are guaranteed to be co-located, the main element can
> > > > > > > > get a pointer to each auxiliary via a ckLocal() call, and then
> > > > > > > > call aux->contribute(...) rather than implicitly
> > > > > > > > this->contribute(). So, the setup code gets a bit more
> > > > > > > > complicated, and the code actually invoking the reductions gets
> > > > > > > > just a little more involved.
> > > > > > > >
> > > > > > > > Is that a clear description? Does that approach work for you?
> > > > > > >
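
The effect of giving each reduction stream its own bound array can be sketched with a small toy model in plain C++. The assumption here is that each array keeps its own reduction sequence, so only the order of contribute() calls within one array has to agree across elements. The types and function names are hypothetical and stand in for, rather than use, the Charm++ API.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model, not Charm++ code. A call is (context name, callback name); the
// context stands in for the array whose contribute() was invoked.
using Call = std::pair<std::string, std::string>;
using CallLog = std::vector<Call>;
using PerContext = std::map<std::string, std::vector<std::string>>;

// Split one element's call log into per-context callback sequences.
PerContext perContext(const CallLog& log) {
  PerContext out;
  for (const auto& call : log) {
    out[call.first].push_back(call.second);
  }
  return out;
}

// True if, within every context, all elements issued identical sequences.
// Assumes each log is complete, i.e. every element has made all its calls.
bool consistent(const std::vector<CallLog>& elements) {
  if (elements.empty()) {
    return true;
  }
  const PerContext reference = perContext(elements.front());
  for (const auto& log : elements) {
    if (perContext(log) != reference) {
      return false;
    }
  }
  return true;
}

// Usage example: the same interleaving is a mismatch on one context but is
// harmless once the diagnostics stream has its own ("aux") context.
void exampleShadow() {
  CallLog e0 = {Call{"main", "advance"}, Call{"main", "diagnostics"}};
  CallLog e1 = {Call{"main", "diagnostics"}, Call{"main", "advance"}};
  assert(!consistent({e0, e1}));

  CallLog s0 = {Call{"main", "advance"}, Call{"aux", "diagnostics"}};
  CallLog s1 = {Call{"aux", "diagnostics"}, Call{"main", "advance"}};
  assert(consistent({s0, s1}));
}
```

The model says nothing about co-location or ckLocal(); it only shows why separate sequencing contexts remove the ordering constraint between the two streams.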
> > > > > > > I think that would work and I do use bound arrays for a different
> > > > > > > purpose.
> > > > > > >
> > > > > > > So how would I have to use this? Here is what I think I need to
> > > > > > > do: I have to identify all reductions that can happen in an order
> > > > > > > that is not necessarily guaranteed to be always the same and fire
> > > > > > > them from bound arrays instead (each from a different chare
> > > > > > > array)?
> > > > > >
> > > > > > Is there a way to tell which two reductions caused the "mis-matched
> > > > > > client callbacks in reduction messages" error? I do get a traceback
> > > > > > from one, but can I get one from the other one somehow so I know
> > > > > > which reduction I have to initiate from a shadow array?
> > > > > >
> > > > > > Thanks,
> > > > > > J


