
charm - Re: [charm] memory management errors after ckexit called

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Scott Field <sfield AT astro.cornell.edu>
  • To: Phil Miller <mille121 AT illinois.edu>
  • Cc: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] memory management errors after ckexit called
  • Date: Wed, 17 Jun 2015 16:01:00 -0400
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Hi Phil,

On Tue, Jun 16, 2015 at 4:25 PM, Phil Miller <mille121 AT illinois.edu> wrote:
Hi Scott,

This list is definitely an appropriate place to post about potential bugs. Thanks for bringing it up.

The build you're using is a little bit weird, relative to the system you're running on. That's a 32-bit build of the RTS, running on a 64-bit processor and OS. You also don't need to specify gcc, as it is the platform default. So, I'd go with 'multicore-linux64' instead.

When debugging a problem, it's often a good idea to leave assertions enabled, by dropping the --with-production option. Sometimes, the RTS can catch that things have gone wrong much sooner that way.

Thank you for the tips. Disabling production didn't catch anything, unfortunately.
 

Now, for your particular issue: how many PEs are you running? Does the reported error reproduce when run with +p 1 and MALLOC_CHECK_ set to 3?

Running with +p1 is fine. Errors occur only when more than 1 thread is used.
 
Can you compile the whole thing with -g and run it under gdb to see which ostensibly invalid pointer is actually being freed at that point?

Sure. Charm++ is now built with: ./build charm++ multicore-linux64 -g -std=c++11

I set my breakpoint at CkExit and get the following output with +p1 (reported here to compare with +p2):

Breakpoint 1, CkExit () at init.C:895
895  envelope *env = _allocEnv(StartExitMsg);
(gdb) 
(gdb) n
896  env->setSrcPe(CkMyPe());
(gdb) 
897  CmiSetHandler(env, _exitHandlerIdx);
(gdb) 
898  CmiSyncSendAndFree(0, env->getTotalsize(), (char *)env);
(gdb) 
903 if(!CharmLibInterOperate)
(gdb) 
904  CsdScheduler(-1);
(gdb) 

Breakpoint 1, CkExit () at init.C:895
895  envelope *env = _allocEnv(StartExitMsg);
(gdb) 
896  env->setSrcPe(CkMyPe());
(gdb) 
897  CmiSetHandler(env, _exitHandlerIdx);
(gdb) 
898  CmiSyncSendAndFree(0, env->getTotalsize(), (char *)env);
(gdb) 
903 if(!CharmLibInterOperate)
(gdb) 
904  CsdScheduler(-1);
(gdb) 
[Partition 0][Node 0] End of program
[Inferior 1 (process 7163) exited normally]

I was a bit surprised that CkExit was called twice. Anyway, running with +p2 produces:

Breakpoint 1, CkExit () at init.C:895
895  envelope *env = _allocEnv(StartExitMsg);
(gdb) n
896  env->setSrcPe(CkMyPe());
(gdb) 
897  CmiSetHandler(env, _exitHandlerIdx);
(gdb) 
898  CmiSyncSendAndFree(0, env->getTotalsize(), (char *)env);
(gdb) 
903 if(!CharmLibInterOperate)
(gdb) 
904  CsdScheduler(-1);
(gdb) 

Breakpoint 1, CkExit () at init.C:895
895  envelope *env = _allocEnv(StartExitMsg);
(gdb) 
896  env->setSrcPe(CkMyPe());
(gdb) 
897  CmiSetHandler(env, _exitHandlerIdx);
(gdb) 
898  CmiSyncSendAndFree(0, env->getTotalsize(), (char *)env);
(gdb) 
903 if(!CharmLibInterOperate)
(gdb) 
904  CsdScheduler(-1);
(gdb) step
CsdScheduler (maxmsgs=-1) at convcore.c:1797
1797 if (maxmsgs<0) CsdScheduleForever();
(gdb) 
CsdScheduleForever () at convcore.c:1848
1848  int isIdle=0;
(gdb) 
CsdSchedulerState_new (s=0x7fffffffd5b0) at convcore.c:1660
1660 s->localQ=CpvAccess(CmiLocalQueue);
(gdb) n

Program received signal SIGSEGV, Segmentation fault.
CsdSchedulerState_new (s=0x7fffffffd5b0) at convcore.c:1660
1660 s->localQ=CpvAccess(CmiLocalQueue);


Best,
Scott 


Thanks.

Phil





On Tue, Jun 16, 2015 at 3:14 PM, Scott Field <sfield AT astro.cornell.edu> wrote:
Hi,

  After I recently pulled a bleeding-edge version of the Charm++ code, all of our regression tests fail with either a segmentation fault or "double free or corruption (!prev): 0x0000000001c4de20 ***". The error appears to occur after CkExit is called. Charm++ was built on my laptop with

 >>> ./build charm++ multicore-linux32 gcc --with-production -j3 -std=c++11

  Using git's bisect utility, I was able to track down the first commit where things go wrong. The git hash and commit message are c96750026bbc7a9190f1381e7ac9ea56ae86f80e and "Bug #695:  disable comm thread in multicore builds", respectively. More specifically, if I edit line 200 of the file src/arch/util/machine-common-core.c from "#define CMK_SMP_NO_COMMTHD CMK_MULTICORE" to "#define CMK_SMP_NO_COMMTHD 0", the error message goes away and all tests pass again.
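
To make the effect of that one-line edit concrete, here is a minimal standalone sketch of how the macro resolves at preprocessing time. It assumes that a multicore build defines CMK_MULTICORE as 1 (the value is not shown in this thread), and both DEMO_FORCE_COMMTHD and commThreadDisabled() are hypothetical names introduced only for this illustration. How the runtime actually consumes CMK_SMP_NO_COMMTHD is not covered here.

```cpp
#include <cassert>

// Assumption: multicore builds define CMK_MULTICORE as 1. It is defined
// here only so that this sketch compiles on its own.
#ifndef CMK_MULTICORE
#define CMK_MULTICORE 1
#endif

// DEMO_FORCE_COMMTHD is a hypothetical switch that mimics the local edit
// described above (hard-coding the macro to 0).
#ifdef DEMO_FORCE_COMMTHD
#define CMK_SMP_NO_COMMTHD 0
#else
#define CMK_SMP_NO_COMMTHD CMK_MULTICORE  // the line quoted from the commit
#endif

// Hypothetical helper: reports which branch the preprocessor selected.
constexpr bool commThreadDisabled() {
#if CMK_SMP_NO_COMMTHD
    return true;   // "no comm thread" path is compiled in
#else
    return false;  // comm thread path is compiled in
#endif
}

// Compile-time sanity check that the helper mirrors the macro.
static_assert(commThreadDisabled() == (CMK_SMP_NO_COMMTHD != 0),
              "helper must agree with the macro");
```

Compiled as-is, commThreadDisabled() is true; compiled with -DDEMO_FORCE_COMMTHD it is false, which corresponds to the workaround of setting the macro to 0.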

  Honestly, I don't really know why this change fixed the problem; it's pretty far under the hood.

  A few questions:

1) Is this list an appropriate place to post information about potential bugs?

2) Does this seem to be a Charm++ bug introduced by that commit, or a fix that has simply broken our code? I had a hard time tracking down the source of the error. Oddly enough, I could not reproduce the same error under valgrind (although it did report an "Uninitialised value was created by a stack allocation", which it tracked to one of the declaration files created by charmc). With MALLOC_CHECK_ set to 3 I get the following:

*** Error in `./Evolve1DScalarWave': free(): invalid pointer: 0x000000000203c920 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7f4cebc2e38f]
/lib/x86_64-linux-gnu/libc.so.6(+0x81fb6)[0x7f4cebc3cfb6]
/lib/x86_64-linux-gnu/libc.so.6(+0x3c280)[0x7f4cebbf7280]
/lib/x86_64-linux-gnu/libc.so.6(+0x3c2a5)[0x7f4cebbf72a5]
./Evolve1DScalarWave[0x670b4a]
./Evolve1DScalarWave[0x5e39ed]
./Evolve1DScalarWave(CsdScheduleForever+0x48)[0x673e88]
./Evolve1DScalarWave(CsdScheduler+0x2d)[0x67413d]
./Evolve1DScalarWave(_ZN12ElementChareI16ScalarWaveSystemILi1EEE11endTimeStepEv+0x448)[0x580d3c]
./Evolve1DScalarWave(_ZN12ElementChareI16ScalarWaveSystemILi1EEE13endComputeRhsEv+0x5331DScalarWave': free(): invalid pointer: 0x000000000203c920 ***
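
The application frames in the backtrace above are C++ mangled names, which are hard to read as printed. As a small sketch, they can be decoded with abi::__cxa_demangle from <cxxabi.h> (available with GCC and Clang); the helper name demangle is just a choice made for this example. The c++filt command-line tool performs the same decoding.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>  // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI)
#include <string>

// Demangle an Itanium-ABI symbol name.
// Returns the input unchanged if it is not a valid mangled name.
std::string demangle(const std::string& mangled) {
    int status = 0;
    char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);  // the buffer is allocated with malloc by __cxa_demangle
    return result;
}
```

Passing "_ZN12ElementChareI16ScalarWaveSystemILi1EEE11endTimeStepEv" to demangle() shows that the frame is the endTimeStep() member function of ElementChare instantiated on ScalarWaveSystem<1>; the following frame is endComputeRhs() of the same class.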


Best,
Scott

_______________________________________________
charm mailing list
charm AT cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/charm





