
charm - Re: [charm] mis-matched client callbacks in reduction messages

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Phil Miller <mille121 AT illinois.edu>
  • To: Jozsef Bakosi <jbakosi AT lanl.gov>
  • Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>, Evan Ramos <evan AT hpccharm.com>
  • Subject: Re: [charm] mis-matched client callbacks in reduction messages
  • Date: Mon, 20 Nov 2017 11:39:10 -0600
  • Authentication-results: illinois.edu; spf=pass smtp.mailfrom=unmobile AT gmail.com

Hi Jozsef,

Could you please try building mainline charm and your code now, and show us the full link command and resulting output/errors?

Phil

On Tue, Nov 7, 2017 at 4:35 PM, Jozsef Bakosi <jbakosi AT lanl.gov> wrote:
On 11.07.2017 16:31, Phil Miller wrote:
>    On Tue, Nov 7, 2017 at 4:30 PM, Jozsef Bakosi <[1]jbakosi AT lanl.gov> wrote:
>
>      Hi Phil,
>      I'm having a hard time with that checkout. Here is what I do:
>      git clone [2]https://charm.cs.illinois.edu/gerrit/charm && cd charm
>      git fetch [3]https://charm.cs.illinois.edu/gerrit/charm refs/changes/27/3227/1 && git checkout FETCH_HEAD
>      ./build charm++ mpi-linux-x86_64 --with-prio-type=int --enable-randomized-msgq --suffix randq-debug --build-shared -j36 -g
>      This is fine, then when I build my code, I get the link error:
>      /usr/bin/ld: cannot find -lhwloc_embedded
>      Does that ring some bells for you? Is that being pulled in by my mpi? I'm
>      probably screwing something up...
>
>    That's related to some recent changes we've made, though that
>    particular failure is kinda surprising to me.
>    What machine is this, and what output do you get from the following
>    commands?
>    which mpicxx
>    mpicxx -show


This is my custom-installed OpenMPI, built with gcc-7 (which is the system
install on debian/testing):

$ which mpicxx
/opt/openmpi/2.0.2/gnu/7/bin/mpicxx

$ mpicxx -show
g++-7 -I/opt/openmpi/2.0.2/gnu/7/include -pthread -Wl,-rpath -Wl,/opt/openmpi/2.0.2/gnu/7/lib -Wl,--enable-new-dtags -L/opt/openmpi/2.0.2/gnu/7/lib -lmpi_cxx -lmpi

$ g++-7 --version
g++-7 (Debian 7.2.0-12) 7.2.1 20171025
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ which g++-7
/usr/bin/g++-7


>
>      Thanks,
>      Jozsef
>      On 11.03.2017 15:09, Phil Miller wrote:
>      >    You can try out a rough patch to print basic details of the
>      >    mis-matched reductions here:
>      >
>      >    [4]https://charm.cs.illinois.edu/gerrit/3227
>      >
>      >    Right now, it will just say what the reducers and callback types
>      >    are numerically - deeper information would require a bunch more
>      >    code, and those bits should be enough to identify among a couple
>      >    suspect contribute() calls.
>      >
>      >    On Fri, Nov 3, 2017 at 2:58 PM, Phil Miller <[5]mille121 AT illinois.edu> wrote:
>      >
>      >    Hi Jozsef,
>      >
>      >    It's not whining at all. This is a bothersome problem to address.
>      >
>      >    Randomized queues will change the order in which available messages
>      >    get delivered to individual chares. If there is some perverse order
>      >    it creates that leads to an inconsistent reduction sequence, that
>      >    order could equally occur by chance in non-randomized execution.
>      >    Note that randomization only operates on messages queued for
>      >    delivery to objects - the objects themselves can (and must)
>      >    structure and sequence their processing to ensure consistent
>      >    operation. Multiple reductions are thus *not* inconsistent with
>      >    randomized queueing. If the abort is triggered, there's a message
>      >    delivery order that causes different elements in an array to make
>      >    a different sequence of contribute calls.
>      >
>      >    In the particular case you present, I would recommend sparing
>      >    yourself the bulk of the frustrating reasoning about message
>      >    ordering, and moving the contribution of the diagnostics onto a
>      >    bound array.
>      >
>      >    If you are still getting the error when running under randomized
>      >    queues, there may be more going on than is apparent in the code
>      >    you presented. I'm putting together a patch that will provide
>      >    deeper diagnostic information for you.
>      >
>      >    Phil
>      >
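[Editorial sketch] The failure mode Phil describes, different elements of one array making a different sequence of contribute calls, can be illustrated with a small toy model in plain C++. This is not Charm++ code and does not reflect the runtime's actual bookkeeping. It only assumes that each element numbers its own contributions and that all contributions carrying the same number are expected to target the same callback. The names `Contribution`, `Element`, and `firstMismatch` are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for one contribute() call: the element-local sequence
// number it was issued under and the callback it targets.
struct Contribution {
    int seq;
    std::string callback;
};

// Toy stand-in for one array element. Every contribute() call takes the
// next sequence number from the element's own counter.
struct Element {
    int nextSeq = 0;
    std::vector<Contribution> sent;
    void contribute(const std::string& callback) {
        sent.push_back(Contribution{nextSeq++, callback});
    }
};

// Returns the lowest sequence number at which two elements disagree on
// the target callback, or -1 if all elements agree.
int firstMismatch(const std::vector<Element>& elems) {
    std::map<int, std::string> expected;
    int lowest = -1;
    for (const Element& e : elems) {
        for (const Contribution& c : e.sent) {
            auto res = expected.emplace(c.seq, c.callback);
            bool clash = !res.second && res.first->second != c.callback;
            if (clash && (lowest == -1 || c.seq < lowest)) lowest = c.seq;
        }
    }
    return lowest;
}
```

In this model, if element 0 calls `contribute("mindt")` and then `contribute("diagnostics")` while element 1 makes the same two calls in the opposite order, `firstMismatch` reports a disagreement at sequence number 0. If both elements use the same order, it returns -1.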
>
>    >    On Thu, Nov 2, 2017 at 12:24 PM, Jozsef Bakosi wrote:
>    >
>    >      Hi Phil,
>    >
>    >      Sorry for the whining, but this error is giving me way too much
>    >      trouble and I don't think my understanding is getting better.
>    >
>    >      So I am successfully using shadow arrays and they do appear to
>    >      work around this problem. (So far I have only tried this
>    >      successfully with groups.)
>    >
>    >      Since I have been mainly getting this problem with multiple
>    >      reductions using the randomized-queue build of Charm++, I wonder
>    >      if my requirement that logic involving SDAG and multiple
>    >      reductions execute correctly (i.e., without this error) makes
>    >      sense even with randomized queues. I am thinking that randomized
>    >      queues will most likely fire off multiple reductions in different
>    >      (i.e., random) order, effectively taking the ordering out of my
>    >      hands. Do you think that's true? Aren't multiple reductions
>    >      inherently incompatible with randomized queues?
>    >
>    >      To make it more concrete, I have the following simplified
>    >      scenario in pseudo-code:
>    >
>    >      class ChareArray : public CProxy_ChareArray {
>    >      /*entry*/ void dt() {
>    >        // compute some dt specific to this array element
>    >        double dt = ...
>    >        // allreduce:
>    >        contribute( to all elements of ChareArray targeting advance(mindt)
>    >                    delivering the minimum of some dt to all elements )
>    >      }
>    >      /*entry*/ void advance(double mindt) {
>    >        contribute( to some single chare collecting some diagnostics )
>    >        if (continue time stepping)
>    >          dt();
>    >        else
>    >          contribute( to some single chare eventually calling ckExit() )
>    >      }
>    >      }
>    >
>    >      So during time stepping there are really two contribute calls and
>    >      I'm pretty sure these two generate the "mis-matched client
>    >      callbacks in reduction messages" error. (I don't think the logic
>    >      gets to the contribute that will eventually get to ckExit().)
>    >
>    >      When I start one of them from a bound/shadow array, I still get
>    >      the error but only with randomized queues. The order of
>    >      contributions to the two reductions (per single chare), I believe,
>    >      is guaranteed here. But won't randomized queues screw up the
>    >      order? Can that even be done? Do I want too much?
>    >
>    >      Jozsef
>    >
>    >    On 10.29.2017 17:22, Jozsef Bakosi wrote:
>    >    > On 10.27.2017 11:38, Jozsef Bakosi wrote:
>    >    > > On 10.27.2017 11:02, Phil Miller wrote:
>    >    > > >    We use an approach of creating bound 'shadow' arrays to act
>    >    > > >    as independent reduction (sequencing) contexts to address
>    >    > > >    this limitation. We've used this approach in a few places
>    >    > > >    in our code, including the LiveViz in-situ visualization
>    >    > > >    library and the collision detection library.
>    >    > > >
>    >    > > >    In a little more detail, when constructing a chare array,
>    >    > > >    it's possible to specify that it should be bound to another
>    >    > > >    existing chare array. That means that elements of the same
>    >    > > >    index will always live on the same PE. So, you can
>    >    > > >    instantiate some auxiliary arrays, one per reduction stream,
>    >    > > >    and bind them to your main computation arrays. Since
>    >    > > >    elements with corresponding indices are guaranteed to be
>    >    > > >    co-located, the main element can get a pointer to each
>    >    > > >    auxiliary via a ckLocal() call, and then call
>    >    > > >    aux->contribute(...) rather than implicitly
>    >    > > >    this->contribute(). So, the setup code gets a bit more
>    >    > > >    complicated, and the code actually invoking the reductions
>    >    > > >    gets just a little more involved.
>    >    > > >
>    >    > > >    Is that a clear description? Does that approach work for
>    >    > > >    you?
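[Editorial sketch] The idea of giving each reduction stream its own sequencing context can be illustrated with a small toy model in plain C++. This is not Charm++ code: `ReductionContext` and `consistent` are hypothetical names, and the only assumption is that contributions are matched up by their position within one context. In a real Charm++ program the auxiliary context would be an element of a bound array reached through ckLocal(), as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for the reduction state of one element of one array:
// callbacks[i] is the target of that element's i-th contribute() call.
struct ReductionContext {
    std::vector<std::string> callbacks;
    void contribute(const std::string& callback) {
        callbacks.push_back(callback);
    }
};

// True if, at every position, all elements of this (toy) array
// contributed to the same callback.
bool consistent(const std::vector<ReductionContext>& elems) {
    std::map<std::size_t, std::string> expected;
    for (const ReductionContext& e : elems) {
        for (std::size_t i = 0; i < e.callbacks.size(); ++i) {
            auto res = expected.emplace(i, e.callbacks[i]);
            if (!res.second && res.first->second != e.callbacks[i]) {
                return false;
            }
        }
    }
    return true;
}
```

In this model, two elements that issue a "mindt" and a "diagnostics" contribution in opposite orders through one shared context make `consistent` return false. If the "diagnostics" contributions go through a second, separate vector of contexts instead, each vector sees only one kind of contribution and both pass the check, regardless of the order in which each element made its two calls.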
>    >    > >
>    >    > > I think that would work and I do use bound arrays for a
>    >    > > different purpose.
>    >    > >
>    >    > > So how would I have to use this? Here is what I think I need to
>    >    > > do: I have to identify all reductions that can happen in an
>    >    > > order that is not necessarily guaranteed to be always the same
>    >    > > and fire them from bound arrays instead (each from a different
>    >    > > chare array)?
>    >    >
>    >    > Is there a way to tell which two reductions caused the
>    >    > "mis-matched client callbacks in reduction messages" error? I do
>    >    > get a traceback from one, but can I get one from the other one
>    >    > somehow so I know which reduction I have to initiate from a shadow
>    >    > array?
>    >    >
>    >    > Thanks,
>    >    > J
>
> References
>
>    1. mailto:jbakosi AT lanl.gov
>    2. https://charm.cs.illinois.edu/gerrit/charm
>    3. https://charm.cs.illinois.edu/gerrit/charm
>    4. https://charm.cs.illinois.edu/gerrit/3227
>    5. mailto:mille121 AT illinois.edu

--
Jozsef Bakosi
Computational Physics and Methods (CCS-2)
MS D413, Los Alamos National Laboratory
Los Alamos, NM 87545, USA
email: jbakosi AT lanl.gov
phone: 505-665-0950
fax: 505-665-4972



