charm - Re: [charm] mis-matched client callbacks in reduction messages

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Sam White <white67 AT illinois.edu>
  • To: Jozsef Bakosi <jbakosi AT lanl.gov>
  • Cc: "Miller, Philip B" <mille121 AT illinois.edu>, "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>, Evan Ramos <evan AT hpccharm.com>
  • Subject: Re: [charm] mis-matched client callbacks in reduction messages
  • Date: Thu, 3 May 2018 10:32:54 -0500

I just looked at 'CkReduction::initReducerTable()' in charm/src/ck-core/ckreduction.C to see the ordering of the built-in reducers as they are emplaced in the reducer table. For the callback type I looked at the definition of 'enum callbackType' in charm/src/ck-core/ckcallback.h.

We could add a string argument to 'addReducer()', which when building with CMK_ERROR_CHECKING stores the name as part of our internal reducerStruct. That way when we detect a collision between reductions we could print out a string like "CkReduction::min_double" instead of an integer index into the reducer table. Same could go for callback types as well.

-Sam
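
The name-carrying reducer table proposed above could look roughly like the following. This is a minimal standalone sketch, not Charm++ code: `ToyReducerTable`, `ToyReducerEntry`, `ToyReducerFn`, `describeMismatch`, `toyNop` and `toyMin` are hypothetical names, the reducer signature is a toy one, and the name is stored unconditionally rather than only under CMK_ERROR_CHECKING.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a reducer function; the real Charm++ signature differs.
using ToyReducerFn = double (*)(double, double);

// Hypothetical table entry: the reducer plus an optional debug name.
struct ToyReducerEntry {
    ToyReducerFn fn;
    std::string name;  // empty when no name was supplied
};

class ToyReducerTable {
public:
    // Registers a reducer and returns its index in the table.
    int addReducer(ToyReducerFn fn, const std::string& name = "") {
        assert(fn != nullptr);
        entries_.push_back(ToyReducerEntry{fn, name});
        return static_cast<int>(entries_.size()) - 1;
    }

    // Returns the stored name, falling back to the numeric index.
    std::string describe(int index) const {
        if (index < 0 || static_cast<std::size_t>(index) >= entries_.size())
            return "unknown reducer #" + std::to_string(index);
        const std::string& name = entries_[static_cast<std::size_t>(index)].name;
        return name.empty() ? "reducer #" + std::to_string(index) : name;
    }

private:
    std::vector<ToyReducerEntry> entries_;
};

// Builds a readable diagnostic for two colliding reducer indices.
std::string describeMismatch(const ToyReducerTable& table, int a, int b) {
    return "Mismatched reducers: " + table.describe(a) + " vs " + table.describe(b);
}

// Two toy reducers used only for illustration.
inline double toyNop(double a, double) { return a; }
inline double toyMin(double a, double b) { return a < b ? a : b; }
```

Registering `toyNop` as "CkReduction::nop" and `toyMin` as "CkReduction::min_double" and then calling `describeMismatch` on their indices yields a message with both names in place of bare integers.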

On Thu, May 3, 2018 at 10:15 AM, Jozsef Bakosi <jbakosi AT lanl.gov> wrote:
Alright, this is pretty useful. I believe I have found the colliding reductions
and have changed one of them to originate from a bound array, which should fix
this problem.

Can you point me to how you identified the reduction types based on the callback
types (so I can do it myself next time)?

Thanks,
Jozsef

On 05.02.2018 09:35, Sam White wrote:
> It looks like you have two reductions colliding, one with CkReduction::nop
> and one with CkReduction::min_double, both targeted to a chare.
> Note that CkReduction::nop is implicitly used when calling
> "contribute(CkCallback cb)" without specifying a reducer.
>
> -Sam
>
> On Wed, May 2, 2018 at 9:20 AM, Jozsef Bakosi <jbakosi AT lanl.gov> wrote:
>
> > Alright, I finally got an error message out of this:
> >
> > Mismatched callback details: reducers 49, 1; callback types 6, 6;
> >
> > Can you guys please help me with how I can find the reductions that
> > collide here?
> >
> > Thanks,
> > Jozsef
> >
> > On 11.20.2017 11:39, Phil Miller wrote:
> > > Hi Jozsef,
> > >
> > > Could you please try building mainline charm and your code now, and show us
> > > the full link command and resulting output/errors?
> > >
> > > Phil
> > >
> > > On Tue, Nov 7, 2017 at 4:35 PM, Jozsef Bakosi <jbakosi AT lanl.gov> wrote:
> > >
> > > > On 11.07.2017 16:31, Phil Miller wrote:
> > > > >    On Tue, Nov 7, 2017 at 4:30 PM, Jozsef Bakosi <[1]
> > jbakosi AT lanl.gov>
> > > > >    wrote:
> > > > >
> > > > >      Hi Phil,
> > > > >      I'm having a hard time with that checkout. Here is what I do:
> > > > >      git clone [2]https://charm.cs.illinois.edu/gerrit/charm && cd
> > charm
> > > > >      git fetch [3]https://charm.cs.illinois.edu/gerrit/charm
> > > > >      refs/changes/27/3227/1 && git checkout FETCH_HEAD
> > > > >      ./build charm++ mpi-linux-x86_64 --with-prio-type=int
> > > > >      --enable-randomized-msgq --suffix randq-debug --build-shared
> > -j36 -g
> > > > >      This is fine, then when I build my code, I get the link error:
> > > > >      /usr/bin/ld: cannot find -lhwloc_embedded
> > > > >      Does that ring some bells for you? Is that being pulled in by my
> > > > >      mpi? I'm
> > > > >      probably screwing something up...
> > > > >
> > > > >    That's related to some recent changes we've made, though that
> > > > >    particular failure is kinda surprising to me.
> > > > >    What machine is this, and what output do you get from the
> > following
> > > > >    commands?
> > > > >    which mpicxx
> > > > >    mpicxx -show
> > > >
> > > >
> > > > This is my custom-installed openmpi based on gcc-7 (which is
> > > > system-installed on debian/testing):
> > > >
> > > > $ which mpicxx
> > > > /opt/openmpi/2.0.2/gnu/7/bin/mpicxx
> > > >
> > > > $ mpicxx -show
> > > > g++-7 -I/opt/openmpi/2.0.2/gnu/7/include -pthread -Wl,-rpath
> > > > -Wl,/opt/openmpi/2.0.2/gnu/7/lib -Wl,--enable-new-dtags
> > > > -L/opt/openmpi/2.0.2/gnu/7/lib -lmpi_cxx -lmpi
> > > >
> > > > $ g++-7 --version
> > > > g++-7 (Debian 7.2.0-12) 7.2.1 20171025
> > > > Copyright (C) 2017 Free Software Foundation, Inc.
> > > > This is free software; see the source for copying conditions.  There is NO
> > > > warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> > > >
> > > > $ which g++-7
> > > > /usr/bin/g++-7
> > > >
> > > >
> > > > >
> > > > >      Thanks,
> > > > >      Jozsef
> > > > >      On 11.03.2017 15:09, Phil Miller wrote:
> > > > > >      >    You can try out a rough patch to print basic details of the
> > > > > >      >    mis-matched reductions here:
> > > > > >      >
> > > > > >      >    https://charm.cs.illinois.edu/gerrit/3227
> > > > > >      >
> > > > > >      >    Right now, it will just say what the reducers and callback types are
> > > > > >      >    numerically - deeper information would require a bunch more code, and
> > > > > >      >    those bits should be enough to distinguish among a couple of
> > > > > >      >    suspect contribute() calls.
> > > > > >      >
> > > > > >      >    On Fri, Nov 3, 2017 at 2:58 PM, Phil Miller <mille121 AT illinois.edu>
> > > > > >      >    wrote:
> > > > >      >
> > > > > >      >    Hi Jozsef,
> > > > > >      >    It's not whining at all. This is a bothersome problem to address.
> > > > > >      >    Randomized queues will change the order in which available messages get
> > > > > >      >    delivered to individual chares. If there is some perverse order it
> > > > > >      >    creates that leads to an inconsistent reduction sequence, that order is
> > > > > >      >    entirely possible to occur by chance in non-randomized execution as
> > > > > >      >    well. Note that it only operates on messages queued for delivery to
> > > > > >      >    objects - the objects themselves can (and must) structure and sequence
> > > > > >      >    their processing to ensure consistent operation. Multiple reductions
> > > > > >      >    are thus *not* inconsistent with randomized queueing. If the abort is
> > > > > >      >    triggered, there's a message delivery order that causes different
> > > > > >      >    elements in an array to make a different sequence of contribute calls.
> > > > > >      >    In the particular case you present, I would recommend sparing yourself
> > > > > >      >    the bulk of the frustrating reasoning about message ordering, and
> > > > > >      >    moving the contribution of the diagnostics onto a bound array.
> > > > > >      >    When running under randomized queues and still getting the error, there
> > > > > >      >    may be more going on than is apparent in the code you presented. I'm
> > > > > >      >    putting together a patch that will provide deeper diagnostic
> > > > > >      >    information for you.
> > > > > >      >    Phil
> > > > >      >
> > > > >
> > > > >    >    On Thu, Nov 2, 2017 at 12:24 PM, Jozsef Bakosi wrote:
> > > > >    >
> > > > > >    >      Hi Phil,
> > > > > >    >      Sorry for the whining, but this error is giving me way too much
> > > > > >    >      trouble and I don't think my understanding is getting better.
> > > > > >    >      So I am successfully using shadow arrays and they do appear to work
> > > > > >    >      around this problem. (So far I have tried this successfully only with
> > > > > >    >      groups.)
> > > > > >    >      Since I have been mainly getting this problem with multiple
> > > > > >    >      reductions using the randomized-queue build of Charm++, I wonder if my
> > > > > >    >      requirement that logic involving SDAG and multiple reductions execute
> > > > > >    >      correctly (i.e., without this error) makes sense even with randomized
> > > > > >    >      queues. I am thinking that randomized queues will most likely fire off
> > > > > >    >      multiple reductions in different (i.e., random) order, effectively
> > > > > >    >      taking the ordering out of my hands. Do you think that's true?
> > > > > >    >      Aren't multiple reductions inherently incompatible with randomized
> > > > > >    >      queues?
> > > > > >    >      To make it more concrete, I have the following simplified scenario in
> > > > > >    >      pseudo-code:
> > > > > >    >      class ChareArray : public CBase_ChareArray {
> > > > > >    >      /*entry*/ void dt() {
> > > > > >    >        // compute some dt specific to this array element
> > > > > >    >        double dt = ...
> > > > > >    >        // allreduce:
> > > > > >    >        contribute( to all elements of ChareArray targeting advance(mindt),
> > > > > >    >                    delivering the minimum of some dt to all elements )
> > > > > >    >      }
> > > > > >    >      /*entry*/ void advance(double mindt) {
> > > > > >    >        contribute( to some single chare collecting some diagnostics )
> > > > > >    >        if (continue time stepping)
> > > > > >    >          dt();
> > > > > >    >        else
> > > > > >    >          contribute( to some single chare eventually calling ckExit() )
> > > > > >    >      }
> > > > > >    >      }
> > > > > >    >      So during time stepping there are really two contribute calls and
> > > > > >    >      I'm pretty sure these two generate the "mis-matched client callbacks in
> > > > > >    >      reduction messages" error. (I don't think the logic gets to the
> > > > > >    >      contribute that will eventually get to ckExit().)
> > > > > >    >      When I start one of them from a bound/shadow array, I still get the
> > > > > >    >      error but only with randomized queues. The order of contributions to the
> > > > > >    >      two reductions (per single chare), I believe, is guaranteed here. But
> > > > > >    >      won't randomized queues screw up the order? Can that even be done? Do I
> > > > > >    >      want too much?
> > > > > >    >      Jozsef
> > > > >    >
> > > > >    >    On 10.29.2017 17:22, Jozsef Bakosi wrote:
> > > > >    >    > On 10.27.2017 11:38, Jozsef Bakosi wrote:
> > > > >    >    > > On 10.27.2017 11:02, Phil Miller wrote:
> > > > > >    >    > > >    We use an approach of creating bound 'shadow' arrays to act as
> > > > > >    >    > > >    independent reduction (sequencing) contexts to address this limitation.
> > > > > >    >    > > >    We've used this approach in a few places in our code, including the
> > > > > >    >    > > >    LiveViz in-situ visualization library and the collision detection
> > > > > >    >    > > >    library.
> > > > > >    >    > > >    In a little more detail, when constructing a chare array, it's possible
> > > > > >    >    > > >    to specify that it should be bound to another existing chare array.
> > > > > >    >    > > >    That means that elements of the same index will always live on the same
> > > > > >    >    > > >    PE. So, you can instantiate some auxiliary arrays, one per reduction
> > > > > >    >    > > >    stream, and bind them to your main computation arrays. Since elements
> > > > > >    >    > > >    with corresponding indices are guaranteed to be co-located, the main
> > > > > >    >    > > >    element can get a pointer to each auxiliary via a ckLocal() call, and
> > > > > >    >    > > >    then call aux->contribute(...) rather than implicitly
> > > > > >    >    > > >    this->contribute(). So, the setup code gets a bit more complicated, and
> > > > > >    >    > > >    the code actually invoking the reductions gets just a little more
> > > > > >    >    > > >    involved.
> > > > > >    >    > > >    Is that a clear description? Does that approach work for you?
> > > > >    >    > >
> > > > > >    >    > > I think that would work and I do use bound arrays for a different purpose.
> > > > > >    >    > >
> > > > > >    >    > > So how would I have to use this? Here is what I think I need to do: I have to
> > > > > >    >    > > identify all reductions that can happen in an order that is not necessarily
> > > > > >    >    > > guaranteed to be always the same and fire them from bound arrays instead (each
> > > > > >    >    > > from a different chare array)?
> > > > >    >    >
> > > > > >    >    > Is there a way to tell which two reductions caused the "mis-matched client
> > > > > >    >    > callbacks in reduction messages" error? I do get a traceback from one, but can I
> > > > > >    >    > get one from the other one somehow so I know which reduction I have to initiate
> > > > > >    >    > from a shadow array?
> > > > >    >    >
> > > > >    >    > Thanks,
> > > > >    >    > J
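
The earliest messages quoted above describe bound 'shadow' arrays as independent reduction (sequencing) contexts. The toy model below illustrates why that helps, under a stated assumption: contributions within one context are matched purely by the order in which each element calls contribute(). It is a plain C++ sketch and not Charm++ code; `ToyReductionContext`, `sharedContextIsConsistent` and `shadowContextIsConsistent` are hypothetical names, and the target strings merely stand in for reduction callbacks.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One reduction "stream": the n-th contribute() of every element is
// assumed to belong to the same reduction.
class ToyReductionContext {
public:
    explicit ToyReductionContext(int numElements)
        : nextSeq_(static_cast<std::size_t>(numElements), 0) {}

    // Records a contribution from `element` aimed at `target`.
    // Returns false if this sequence number was already used with a
    // different target, i.e. the toy analogue of a mismatched callback.
    bool contribute(int element, const std::string& target) {
        assert(element >= 0 &&
               static_cast<std::size_t>(element) < nextSeq_.size());
        const int seq = nextSeq_[static_cast<std::size_t>(element)]++;
        auto it = targetBySeq_.find(seq);
        if (it == targetBySeq_.end()) {
            targetBySeq_.emplace(seq, target);
            return true;
        }
        return it->second == target;
    }

private:
    std::vector<int> nextSeq_;                 // per-element counter
    std::map<int, std::string> targetBySeq_;   // first target seen per number
};

// Both reductions share one context; element 1 happens to issue them
// in the opposite order from element 0, so sequence number 0 collides.
bool sharedContextIsConsistent() {
    ToyReductionContext primary(2);
    bool ok = true;
    ok = primary.contribute(0, "advance") && ok;
    ok = primary.contribute(0, "diagnostics") && ok;
    ok = primary.contribute(1, "diagnostics") && ok;
    ok = primary.contribute(1, "advance") && ok;
    return ok;
}

// Same call order, but diagnostics go through a separate context,
// playing the role of the bound shadow array.
bool shadowContextIsConsistent() {
    ToyReductionContext primary(2), shadow(2);
    bool ok = true;
    ok = primary.contribute(0, "advance") && ok;
    ok = shadow.contribute(0, "diagnostics") && ok;
    ok = shadow.contribute(1, "diagnostics") && ok;
    ok = primary.contribute(1, "advance") && ok;
    return ok;
}
```

In this model `sharedContextIsConsistent()` returns false and `shadowContextIsConsistent()` returns true: once each reduction stream has its own counter, the relative order of the two contribute calls on any one element no longer matters.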