charm - Re: [charm] Use of uninitialised value of size 8 from CkReductionMsg::buildNew()

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Jozsef Bakosi <jbakosi AT gmail.com>
  • To: Phil Miller <mille121 AT illinois.edu>, "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] Use of uninitialised value of size 8 from CkReductionMsg::buildNew()
  • Date: Tue, 22 Nov 2016 22:12:20 -0700

On Tue, Nov 22, 2016 at 9:40 PM, Phil Miller <mille121 AT illinois.edu> wrote:
Some clarifying/exploratory questions (which should be pretty generally applicable):
- Which system do you observe this on?

LANL's Trinitite (a smaller version of Trinity). This is a Cray XC40 with KNLs. I don't have all the details handy, but I suspect its hardware and software stack is very similar to Cori's.

- What build configuration are you using on the Cray system? gni-crayxc? mpi-crayxc? smp or no? The full ./build command line would be useful, along with output from 'module list'

No smp. The build command and modules:

$ build charm++ mpi-crayxc --with-production -j40 -O3 -DNDEBUG

Currently Loaded Modulefiles:
  1) modules/3.2.10.4                  6) craype/2.5.6                     11) pmi/5.0.10-1.0000.11050.0.0.ari  16) dvs/2.7_0.9.0-2.201              21) cmake/3.6.2
  2) eswrap/2.0.6-2.9                  7) cray-mpich/7.4.2                 12) dmapp/7.1.0-12.37                17) alps/6.1.6-20.1                  22) craype-hugepages8M
  3) gcc/6.1.0                         8) cray-libsci/16.07.1              13) gni-headers/5.0.7-3.1            18) rca/1.0.0-6.21                   23) cray-netcdf-hdf5parallel/4.4.1
  4) craype-haswell                    9) udreg/2.3.2-4.6                  14) xpmem/0.1-4.5                    19) atp/2.0.2                        24) cray-hdf5-parallel/1.10.0
  5) craype-network-aries             10) ugni/6.0.12-2.1                  15) job/1.5.5-3.58                   20) PrgEnv-gnu/6.0.3
 
- Outside of valgrind, do you otherwise observe failures on the Cray?

No, though I have only tried one other application: my unit-test harness, which is also parallel Charm++ code but does not use reductions or groups.
 
- Can you reproduce this with a maximally simplified build on the Cray? E.g. without smp, and on whichever network layer (gni or mpi) you're not currently using?

I'm using a non-smp cray-mpi build.
 
- How many nodes and PEs does this take to reproduce? How few can you use? Does it reproduce on just 1 PE?

I can reproduce it using a single core with charmrun +p1.
 
- Can you reproduce this in a smaller, simplified test program? Alternately, can you point us to the code in your repository and a set of inputs and command line arguments that reproduces it?

I will try to reproduce this tomorrow on a simpler example. I'll get back to you on this.

I have also noticed a similar problem, this time producing an invalid write of size 8 instead of a read, when contributing not an array of doubles but a uint64_t using CkReduction::sum_int. I guess this latter case with uint64_t is already a type-length problem, so I would expect it to be a problem everywhere, yet it shows up only on Cray.
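To illustrate the type-length concern above, here is a minimal sketch, not Charm++ code, of what can happen when an 8-byte value is summed as two independent 32-bit elements. The function name sum_as_32bit_elements is hypothetical, and the sketch assumes a reducer that treats the buffer as 32-bit elements (std::uint32_t stands in for a 32-bit int so that wraparound is well defined). It shows only the arithmetic mismatch and does not explain the invalid write reported by valgrind.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical model of an element-wise 32-bit sum over two 8-byte buffers.
// This is NOT the Charm++ reducer; it only illustrates the size mismatch.
std::uint64_t sum_as_32bit_elements(std::uint64_t a, std::uint64_t b) {
  std::uint32_t ea[2], eb[2];
  std::memcpy(ea, &a, sizeof a);   // view each 8-byte value as two 4-byte elements
  std::memcpy(eb, &b, sizeof b);
  for (int i = 0; i < 2; ++i)
    ea[i] += eb[i];                // each half is summed on its own, so no carry propagates
  std::uint64_t result;
  std::memcpy(&result, ea, sizeof result);
  return result;
}
```

With small values the result matches a true 64-bit sum, but once the low 32 bits overflow the carry is lost, so the two disagree.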

More tomorrow. Thanks, Phil.
 
On Tue, Nov 22, 2016 at 10:25 AM, Jozsef Bakosi <jbakosi AT gmail.com> wrote:
Hi folks,

I'm getting the following valgrind message only on Cray (no problem on, e.g., linux/mac):

==48771== Use of uninitialised value of size 8 
==48771==    at 0x21696E61: memcpy (memcpy.S:201)
==48771==    by 0x212E12E0: CkReductionMsg::buildNew(int, void const*, CkReduction::reducerType, CkReductionMsg*) 
==48771==    by 0x212EF68A: Group::contribute(int, void const*, CkReduction::reducerType, CkCallback const&, unsigned short)

This is from a chare group reduction of an array of doubles with CkReduction::sum_double.  There is a single memcpy() in src/ck-core/ckreduction.C:1501:

    memcpy(ret->data,srcData,NdataSize);

I suspect the memory allocated behind srcData is smaller (by a single double) than NdataSize, probably because I'm feeding the wrong data size. I pass the data size to the contribute call as static_cast<int>( vec.size() * sizeof(double) ), and the data pointer is vec.data(), which I assume ends up being passed on as srcData. Here vec is a std::vector<double>. I believe this should be correct, but for some reason this is only a segfault on Cray; valgrind does not even complain on Linux or Mac.
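For reference, the size computation described above can be sketched in isolation as follows. The functions copy_contribution and contribution_bytes are hypothetical stand-ins written for this illustration. They are not part of Charm++ and only mirror the memcpy(ret->data,srcData,NdataSize) call quoted above, so they say nothing about what buildNew() does elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Byte count as described in the message: element count times element size.
int contribution_bytes(const std::vector<double>& vec) {
  return static_cast<int>(vec.size() * sizeof(double));
}

// Hypothetical stand-in for the copy inside CkReductionMsg::buildNew().
// It allocates exactly nDataSize bytes and copies that many from srcData.
std::vector<double> copy_contribution(int nDataSize, const void* srcData) {
  assert(nDataSize >= 0);
  assert(static_cast<std::size_t>(nDataSize) % sizeof(double) == 0);
  std::vector<double> ret(static_cast<std::size_t>(nDataSize) / sizeof(double));
  if (nDataSize > 0)  // avoid calling memcpy with a possibly null pointer
    std::memcpy(ret.data(), srcData, static_cast<std::size_t>(nDataSize));
  return ret;
}
```

Calling copy_contribution(contribution_bytes(vec), vec.data()) reads exactly vec.size() doubles, so with this size and pointer pair the copy stays within the vector's storage.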

Does anyone have an idea how I can debug this?

Thanks,
Jozsef






