
charm - Re: [charm] mis-matched client callbacks in reduction messages

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Sam White <white67 AT illinois.edu>
  • To: Jozsef Bakosi <jbakosi AT lanl.gov>
  • Cc: "Miller, Philip B" <mille121 AT illinois.edu>, "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>, Evan Ramos <evan AT hpccharm.com>
  • Subject: Re: [charm] mis-matched client callbacks in reduction messages
  • Date: Wed, 2 May 2018 09:35:08 -0500

It looks like you have two reductions colliding, one with CkReduction::nop and one with CkReduction::min_double, both targeted to a chare.
Note that CkReduction::nop is implicitly used when calling "contribute(CkCallback cb)" without specifying a reducer.

-Sam
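
To illustrate how two such reductions can collide, here is a minimal toy model in plain C++. It is not Charm++ code and does not claim to show how the runtime is implemented. It assumes that each array element numbers its own contribute() calls and that contributions carrying the same number are combined into one reduction. The names Contribution, Element, find_mismatches, and flipped_order_example are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model only (assumption): one contribution, tagged with the
// per-element counter value and the name of the reducer it requested.
struct Contribution {
    int seq;              // per-element contribution counter
    std::string reducer;  // e.g. "nop" or "min_double"
};

// Toy array element: every contribute() call takes the next number.
class Element {
public:
    Contribution contribute(const std::string& reducer) {
        return Contribution{next_seq_++, reducer};
    }
private:
    int next_seq_ = 0;
};

// Returns the reduction numbers at which elements disagree on the reducer.
std::vector<int> find_mismatches(const std::vector<Contribution>& all) {
    std::map<int, std::string> first_seen;
    std::vector<int> bad;
    for (const Contribution& c : all) {
        assert(c.seq >= 0);
        auto it = first_seen.find(c.seq);
        if (it == first_seen.end()) {
            first_seen[c.seq] = c.reducer;   // first contribution sets the expectation
        } else if (it->second != c.reducer) {
            bad.push_back(c.seq);            // a later one asked for a different reducer
        }
    }
    return bad;
}

// Two elements issue the same two reductions in opposite orders.
std::vector<int> flipped_order_example() {
    Element a, b;
    std::vector<Contribution> all;
    all.push_back(a.contribute("nop"));         // a: diagnostics first
    all.push_back(a.contribute("min_double"));  // a: then the min-dt allreduce
    all.push_back(b.contribute("min_double"));  // b: delivery order flipped the calls
    all.push_back(b.contribute("nop"));
    return find_mismatches(all);
}
```

In this model, flipped_order_example() reports both reduction numbers as mismatched, because the two elements issued their nop and min_double contributions in opposite orders.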

On Wed, May 2, 2018 at 9:20 AM, Jozsef Bakosi <jbakosi AT lanl.gov> wrote:
Alright, I finally got an error message out of this:

Mismatched callback details: reducers 49, 1; callback types 6, 6;

Can you guys please help me with how I can find the reductions that collide here?

Thanks,
Jozsef

On 11.20.2017 11:39, Phil Miller wrote:
> Hi Jozsef,
>
> Could you please try building mainline charm and your code now, and show us
> the full link command and resulting output/errors?
>
> Phil
>
> On Tue, Nov 7, 2017 at 4:35 PM, Jozsef Bakosi <jbakosi AT lanl.gov> wrote:
>
> > On 11.07.2017 16:31, Phil Miller wrote:
> > > >    On Tue, Nov 7, 2017 at 4:30 PM, Jozsef Bakosi <jbakosi AT lanl.gov>
> > > >    wrote:
> > >
> > >      Hi Phil,
> > >      I'm having a hard time with that checkout. Here is what I do:
> > > >      git clone https://charm.cs.illinois.edu/gerrit/charm && cd charm
> > > >      git fetch https://charm.cs.illinois.edu/gerrit/charm refs/changes/27/3227/1 && git checkout FETCH_HEAD
> > > >      ./build charm++ mpi-linux-x86_64 --with-prio-type=int --enable-randomized-msgq --suffix randq-debug --build-shared -j36 -g
> > > >      This is fine, then when I build my code, I get the link error:
> > > >      /usr/bin/ld: cannot find -lhwloc_embedded
> > > >      Does that ring some bells for you? Is that being pulled in by my
> > > >      mpi? I'm probably screwing something up...
> > >
> > >    That's related to some recent changes we've made, though that
> > >    particular failure is kinda surprising to me.
> > >    What machine is this, and what output do you get from the following
> > >    commands?
> > >    which mpicxx
> > >    mpicxx -show
> >
> >
> > > This is my custom-installed openmpi based on gcc-7 (which is
> > > system-install on debian/testing):
> >
> > $ which mpicxx
> > /opt/openmpi/2.0.2/gnu/7/bin/mpicxx
> >
> > $ mpicxx -show
> > g++-7 -I/opt/openmpi/2.0.2/gnu/7/include -pthread -Wl,-rpath
> > -Wl,/opt/openmpi/2.0.2/gnu/7/lib -Wl,--enable-new-dtags
> > -L/opt/openmpi/2.0.2/gnu/7/lib -lmpi_cxx -lmpi
> >
> > $ g++-7 --version
> > g++-7 (Debian 7.2.0-12) 7.2.1 20171025
> > Copyright (C) 2017 Free Software Foundation, Inc.
> > This is free software; see the source for copying conditions.  There is NO
> > warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> >
> > $ which g++-7
> > /usr/bin/g++-7
> >
> >
> > >
> > >      Thanks,
> > >      Jozsef
> > >      On 11.03.2017 15:09, Phil Miller wrote:
> > > >      >    You can try out a rough patch to print basic details of the
> > > >      >    mis-matched reductions here:
> > > >      >
> > > >      >    https://charm.cs.illinois.edu/gerrit/3227
> > > >      >
> > > >      >    Right now, it will just say what the reducers and callback
> > > >      >    types are numerically - deeper information would require a
> > > >      >    bunch more code, and those bits should be enough to identify
> > > >      >    among a couple suspect contribute() calls.
> > > >      >
> > > >      >    On Fri, Nov 3, 2017 at 2:58 PM, Phil Miller
> > > >      >    <mille121 AT illinois.edu> wrote:
> > >      >
> > > >      >    Hi Jozsef,
> > > >      >    It's not whining at all. This is a bothersome problem to
> > > >      >    address.
> > > >      >    Randomized queues will change the order in which available
> > > >      >    messages get delivered to individual chares. If there is some
> > > >      >    perverse order it creates that leads to an inconsistent
> > > >      >    reduction sequence, that order is entirely possible to occur
> > > >      >    by chance in non-randomized execution as well. Note that it
> > > >      >    only operates on messages queued for delivery to objects - the
> > > >      >    objects themselves can (and must) structure and sequence their
> > > >      >    processing to ensure consistent operation. Multiple reductions
> > > >      >    are thus *not* inconsistent with randomized queueing. If the
> > > >      >    abort is triggered, there's a message delivery order that
> > > >      >    causes different elements in an array to make a different
> > > >      >    sequence of contribute calls.
> > > >      >    In the particular case you present, I would recommend sparing
> > > >      >    yourself the bulk of the frustrating reasoning about message
> > > >      >    ordering, and moving the contribution of the diagnostics onto
> > > >      >    a bound array.
> > > >      >    When running under randomized queues and still getting the
> > > >      >    error, there may be more going on than is apparent in the code
> > > >      >    you presented. I'm putting together a patch that will provide
> > > >      >    deeper diagnostic information for you.
> > > >      >    Phil
> > >      >
> > >
> > >    >    On Thu, Nov 2, 2017 at 12:24 PM, Jozsef Bakosi wrote:
> > >    >
> > > >    >      Hi Phil,
> > > >    >      Sorry for the whining, but this error is giving me way too much
> > > >    >      trouble and I don't think my understanding is getting better.
> > > >    >      So I am successfully using shadow arrays and they do appear to
> > > >    >      work around this problem. (So far I have only tried this
> > > >    >      successfully with groups, though.)
> > > >    >      Since I have been mainly getting this problem with multiple
> > > >    >      reductions using the randomized-queue build of Charm++, I wonder
> > > >    >      if my requirement that logic involving SDAG and multiple
> > > >    >      reductions execute correctly (i.e., without this error) makes
> > > >    >      sense even with randomized queues. I am thinking that randomized
> > > >    >      queues will most likely fire off multiple reductions in
> > > >    >      different (i.e., random) order, effectively taking the ordering
> > > >    >      out of my hands. Do you think that's true? Aren't multiple
> > > >    >      reductions inherently incompatible with randomized queues?
> > > >    >      To make it more concrete, I have the following simplified
> > > >    >      scenario in pseudo-code:
> > > >    >      class ChareArray : public CBase_ChareArray {
> > > >    >      /*entry*/ void dt() {
> > > >    >        // compute some dt specific to this array element
> > > >    >        double dt = ...
> > > >    >        // allreduce:
> > > >    >        contribute( to all elements of ChareArray targeting advance(mindt)
> > > >    >                    delivering the minimum of some dt to all elements )
> > > >    >      }
> > > >    >      /*entry*/ void advance(double mindt) {
> > > >    >        contribute( to some single chare collecting some diagnostics )
> > > >    >        if (continue time stepping)
> > > >    >          dt();
> > > >    >        else
> > > >    >          contribute( to some single chare eventually calling CkExit() )
> > > >    >      }
> > > >    >      }
> > > >    >      So during time stepping there are really two contribute calls
> > > >    >      and I'm pretty sure these two generate the "mis-matched client
> > > >    >      callbacks in reduction messages" error. (I don't think the logic
> > > >    >      gets to the contribute that will eventually get to CkExit().)
> > > >    >      When I start one of them from a bound/shadow array, I still get
> > > >    >      the error but only with randomized queues. The order of
> > > >    >      contributions to the two reductions (per single chare), I
> > > >    >      believe, is guaranteed here. But won't randomized queues screw
> > > >    >      up the order? Can that even be done? Do I want too much?
> > > >    >      Jozsef
> > >    >
> > >    >    On 10.29.2017 17:22, Jozsef Bakosi wrote:
> > >    >    > On 10.27.2017 11:38, Jozsef Bakosi wrote:
> > >    >    > > On 10.27.2017 11:02, Phil Miller wrote:
> > > >    >    > > >    We use an approach of creating bound 'shadow' arrays to
> > > >    >    > > >    act as independent reduction (sequencing) contexts to
> > > >    >    > > >    address this limitation. We've used this approach in a
> > > >    >    > > >    few places in our code, including the LiveViz in-situ
> > > >    >    > > >    visualization library and the collision detection
> > > >    >    > > >    library.
> > > >    >    > > >    In a little more detail, when constructing a chare
> > > >    >    > > >    array, it's possible to specify that it should be bound
> > > >    >    > > >    to another existing chare array. That means that
> > > >    >    > > >    elements of the same index will always live on the same
> > > >    >    > > >    PE. So, you can instantiate some auxiliary arrays, one
> > > >    >    > > >    per reduction stream, and bind them to your main
> > > >    >    > > >    computation arrays. Since elements with corresponding
> > > >    >    > > >    indices are guaranteed to be co-located, the main
> > > >    >    > > >    element can get a pointer to each auxiliary via a
> > > >    >    > > >    ckLocal() call, and then call aux->contribute(...)
> > > >    >    > > >    rather than implicitly this->contribute(). So, the
> > > >    >    > > >    setup code gets a bit more complicated, and the code
> > > >    >    > > >    actually invoking the reductions gets just a little
> > > >    >    > > >    more involved.
> > > >    >    > > >    Is that a clear description? Does that approach work for
> > > >    >    > > >    you?
> > >    >    > >
> > > >    >    > > I think that would work and I do use bound arrays for a
> > > >    >    > > different purpose.
> > > >    >    > >
> > > >    >    > > So how would I have to use this? Here is what I think I need
> > > >    >    > > to do: I have to identify all reductions that can happen in
> > > >    >    > > an order that is not necessarily guaranteed to be always the
> > > >    >    > > same and fire them from bound arrays instead (each from a
> > > >    >    > > different chare array)?
> > > >    >    >
> > > >    >    > Is there a way to tell which two reductions caused the
> > > >    >    > "mis-matched client callbacks in reduction messages" error? I
> > > >    >    > do get a traceback from one, but can I get one from the other
> > > >    >    > one somehow so I know which reduction I have to initiate from
> > > >    >    > a shadow array?
> > >    >    >
> > >    >    > Thanks,
> > >    >    > J
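
The bound shadow-array remedy discussed in this thread can be pictured with a small toy model in plain C++. It is not Charm++ code. It assumes that every reduction stream keeps its own contribution counter, which is the role a separate bound array would play, so contributions from different streams are never compared against each other. ElementWithShadows, Tagged, consistent, and shadow_example are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model only (assumption): a contribution tagged with the stream it
// belongs to, its number within that stream, and the requested reducer.
struct Tagged {
    std::string stream;
    int seq;
    std::string reducer;
};

// Toy element that keeps one counter per stream, standing in for an
// element that contributes through one bound shadow array per stream.
class ElementWithShadows {
public:
    Tagged contribute(const std::string& stream, const std::string& reducer) {
        int seq = next_seq_[stream]++;  // operator[] starts a new stream at 0
        return Tagged{stream, seq, reducer};
    }
private:
    std::map<std::string, int> next_seq_;
};

// True if, within every stream, all elements agree on the reducer
// for each reduction number.
bool consistent(const std::vector<Tagged>& all) {
    std::map<std::pair<std::string, int>, std::string> first_seen;
    for (const Tagged& c : all) {
        assert(c.seq >= 0);
        const std::pair<std::string, int> key(c.stream, c.seq);
        auto it = first_seen.find(key);
        if (it == first_seen.end()) {
            first_seen[key] = c.reducer;
        } else if (it->second != c.reducer) {
            return false;
        }
    }
    return true;
}

// Two elements issue the two reductions in opposite orders, but each
// reduction goes through its own stream, so no collision is detected.
bool shadow_example() {
    ElementWithShadows a, b;
    std::vector<Tagged> all;
    all.push_back(a.contribute("diagnostics", "nop"));
    all.push_back(a.contribute("mindt", "min_double"));
    all.push_back(b.contribute("mindt", "min_double"));
    all.push_back(b.contribute("diagnostics", "nop"));
    return consistent(all);
}
```

In this model shadow_example() returns true even though the two elements contribute in opposite orders. Sending both reducers through a single shared stream would bring the mismatch back.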



