charm - Re: [charm] memory management errors after ckexit called

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Phil Miller <mille121 AT illinois.edu>
  • To: Scott Field <sfield AT astro.cornell.edu>
  • Cc: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] memory management errors after ckexit called
  • Date: Tue, 16 Jun 2015 15:25:45 -0500
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Hi Scott,

This list is definitely an appropriate place to post about potential bugs. Thanks for bringing it up.

The build you're using is a little bit weird, relative to the system you're running on. That's a 32-bit build of the RTS, running on a 64-bit processor and OS. You also don't need to specify gcc, as it is the platform default. So, I'd go with 'multicore-linux64' instead.

When debugging a problem, it's often a good idea to leave assertions enabled, by dropping the --with-production option. Sometimes, the RTS can catch that things have gone wrong much sooner that way.

Now, for your particular issue: how many PEs are you running? Does the reported error reproduce when run with +p 1 and MALLOC_CHECK_ set to 3? Can you compile the whole thing with -g and run it under gdb to see which ostensibly invalid pointer is actually being freed at that point?
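
A rough sketch of those steps, under some assumptions: the binary name is taken from the backtrace in the quoted message, the rebuild line in the comments combines the suggestions above with the -j3 and -std=c++11 options from the original build command, and the gdb options (-ex, --args) are just one convenient way to get a backtrace at the point of the abort. Adjust paths and options to your setup.

```shell
#!/bin/sh
# Sketch only. APP is the test binary named in the quoted backtrace;
# change the path if yours lives elsewhere.
APP=./Evolve1DScalarWave

# Rebuild the runtime first, from the Charm++ source tree, as a 64-bit
# build with debug symbols and without --with-production:
#   ./build charm++ multicore-linux64 -j3 -g -std=c++11
# Then rebuild the application with -g as well.

if [ -x "$APP" ]; then
    # Single PE, with glibc's heap checks aborting on the first bad free().
    MALLOC_CHECK_=3 "$APP" +p1

    # Same run under gdb: start the program, then print a backtrace
    # once it stops, to see which pointer is being freed and by whom.
    MALLOC_CHECK_=3 gdb -ex run -ex bt --args "$APP" +p1
else
    echo "binary not found: $APP" >&2
fi
```

If the failure does not reproduce on one PE, rerunning with the PE count used by the regression tests would help narrow down whether it depends on the number of worker threads.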

Thanks.

Phil





On Tue, Jun 16, 2015 at 3:14 PM, Scott Field <sfield AT astro.cornell.edu> wrote:
Hi,

  Recently, after pulling a bleeding-edge version of the charm++ code, all of our regression tests now fail with either a segmentation fault or "double free or corruption (!prev): 0x0000000001c4de20 ***". The error appears to occur after ckexit is called. Charm++ was built on my laptop with 

 >>> ./build charm++ multicore-linux32 gcc --with-production -j3 -std=c++11

  Using git's bisect utility, I was able to track down the first commit version where things go wrong. The git hash and commit messages are c96750026bbc7a9190f1381e7ac9ea56ae86f80e and "Bug #695:  disable comm thread in multicore builds". More specifically, if I edit line 200 of the file src/arch/util/machine-common-core.c from "#define CMK_SMP_NO_COMMTHD CMK_MULTICORE" to "#define CMK_SMP_NO_COMMTHD 0" the error message goes away and all tests pass again. 

  Honestly, I don't really know why this change fixed the problem; it's pretty far under the hood.

  A few questions:

1) Is this list an appropriate place to post information about potential bugs?

2) Does this seem to be a charm++ bug introduced by that commit? Or a fix which has simply broken our code? I had a hard time tracking down the source of the error. Oddly enough, I could not reproduce the same error when using valgrind (although it did report an "Uninitialised value was created by a stack allocation" which it tracked to one of the declaration files created by charmc). With MALLOC_CHECK_ set to 3 I get the following:

*** Error in `./Evolve1DScalarWave': free(): invalid pointer: 0x000000000203c920 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7f4cebc2e38f]
/lib/x86_64-linux-gnu/libc.so.6(+0x81fb6)[0x7f4cebc3cfb6]
/lib/x86_64-linux-gnu/libc.so.6(+0x3c280)[0x7f4cebbf7280]
/lib/x86_64-linux-gnu/libc.so.6(+0x3c2a5)[0x7f4cebbf72a5]
./Evolve1DScalarWave[0x670b4a]
./Evolve1DScalarWave[0x5e39ed]
./Evolve1DScalarWave(CsdScheduleForever+0x48)[0x673e88]
./Evolve1DScalarWave(CsdScheduler+0x2d)[0x67413d]
./Evolve1DScalarWave(_ZN12ElementChareI16ScalarWaveSystemILi1EEE11endTimeStepEv+0x448)[0x580d3c]
./Evolve1DScalarWave(_ZN12ElementChareI16ScalarWaveSystemILi1EEE13endComputeRhsEv+0x5331DScalarWave': free(): invalid pointer: 0x000000000203c920 ***


Best,
Scott
