charm - Re: [charm] memory management errors after ckexit called

  • From: Scott Field <sfield AT astro.cornell.edu>
  • To: Eric Bohm <ebohm AT illinois.edu>
  • Cc: charm AT lists.cs.illinois.edu
  • Subject: Re: [charm] memory management errors after ckexit called
  • Date: Sun, 18 Oct 2015 15:54:16 -0400

Hi Eric and Phil,

  I didn't have a bug ticket to follow, but I was wondering whether this issue has been resolved. Today I pulled a fresh copy of Charm++, and my tests now pass without any segfaults.

Best,
Scott

On Fri, Aug 21, 2015 at 4:59 PM, Scott Field <sfield AT astro.cornell.edu> wrote:
Hi Eric,

  Sorry about that. After checking more carefully, I can confirm that the segfault occurs during startup. I typically see about one failed startup per 15 tries.

  Let me know if I can be of any help in tracking down the issue. I've confirmed that the problem first appears at git hash c96750026bbc7a9190f1381e7ac9ea56ae86f80e ("Bug #695: disable comm thread in multicore builds").

Best,
Scott

On Thu, Aug 20, 2015 at 3:00 PM, Eric Bohm <ebohm AT illinois.edu> wrote:
I can confirm that there is an occasional crash. However, its manifestation differs from the one previously reported under this subject line. In this case, the crash I can reproduce occurs during startup, though it may take hundreds of repetitions to trigger.

Does that match your experience?

So far, the most sensible stack trace I've been able to extract implicates string handling of command-line arguments:

#5  0x000000000052914d in std::set<std::string, std::less<std::string>, std::allocator<std::string> >::insert(std::string&&) ()
#6  0x00000000005223ae in _registerCommandLineOpt(char const*) ()
#7  0x0000000000593a73 in LBDatabase::initnodeFn() ()
#8  0x0000000000523e51 in InitCallTable::enumerateInitCalls() ()
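
For illustration, here is a minimal sketch of the kind of race those frames suggest: several threads inserting into one process-wide std::set during startup without synchronization. The names below are hypothetical stand-ins, not the actual Charm++ internals, and the mutex is the generic fix for this pattern, not necessarily the one that was applied.

#include <mutex>
#include <set>
#include <string>

// Hypothetical stand-ins for the registration path in frames #5-#7.
static std::set<std::string> commandLineOpts;
static std::mutex commandLineOptsMutex;

void registerCommandLineOpt(const char* name) {
    // Without this lock, two PEs' initnode calls can run insert()
    // concurrently and corrupt the set's internal tree, producing
    // exactly this kind of intermittent startup crash.
    std::lock_guard<std::mutex> lock(commandLineOptsMutex);
    commandLineOpts.insert(name);
}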


On 08/19/2015 01:37 PM, Scott Field wrote:
Hi Phil,

  Thank you (and the Charm++ development team) for taking a look at this issue.

  There might be a remaining problem, since the segfaults continue to appear. I'm using the most recent version (git hash 28284bb5e62196febdf7a72fce5cbba1e3613639), built with './build charm++ multicore-linux64 gcc -j3 -std=c++11'. The segfaults appear in some of the included examples, for instance the one in 'examples/charm++/hello/1darray': 50 executions of './hello +p2' produced three such errors. As before, no errors are produced when running with +p1.

Best,
Scott

On Tue, Aug 18, 2015 at 12:01 PM, Phil Miller <mille121 AT illinois.edu> wrote:
There's been a longer delay than we'd hoped, but we believe this issue is now resolved:
https://charm.cs.illinois.edu/redmine/issues/761
Please let us know if you encounter further issues.

Thanks

Phil

On Sun, Jun 21, 2015 at 5:37 PM, Phil Miller <mille121 AT illinois.edu> wrote:
The case I have seems to reproduce the issue with 100% reliability. I've entered it in our bug tracker here: https://charm.cs.illinois.edu/redmine/issues/761

Hopefully, we'll have this fixed in the next week or so.

On Sun, Jun 21, 2015 at 5:33 PM, Scott Field <sfield AT astro.cornell.edu> wrote:
Hi Phil,

  Thanks for taking a closer look at this. It sounds like you have some good leads, but if there's anything I can do on my end, please let me know. I did have a look at the jacobi3d example (using "./jacobi3d 10 10 +p3") and found the error to occur more frequently than with simplearrayhello. On my machine, I needed to run "MALLOC_CHECK_=3 ./hello +p4 10" 50 times before encountering an error; by comparison, our regression tests fail about 50% of the time.

Best,
Scott

On Sat, Jun 20, 2015 at 4:44 PM, Phil Miller <mille121 AT illinois.edu> wrote:
I just tried to reproduce your report on a simple example program, tests/charm++/simplearrayhello. Running that with the command line "MALLOC_CHECK_=3 ./hello +p4 10" seems to show the issue. I think we can take it from here - no need for you to do further testing on your end.

On Fri, Jun 19, 2015 at 5:42 PM, Phil Miller <mille121 AT illinois.edu> wrote:
Could you please post the backtrace from the two calls to CkExit()?

Could you also try your test on these alternative builds of Charm++:
net-linux-x86_64-smp
netlrts-linux-x86_64-smp

Instead of just +p P, you'll need to pass +p P +ppn P when launching your program (with P=4, for example: ./hello +p4 +ppn 4). This is to test the hypothesis that the switch from the older (net) to the newer (netlrts) generation of the underlying infrastructure broke the code in some subtle way. Alternatively, you could revert commit 61718f94316d22087075f213e9f1d60d9efbdb95 and build/run multicore-linux64 against that.

Thank you for your help in hunting this down. If you have a reasonably small test case that you'd rather just pass us to see if it's a runtime system bug, we'd be happy to take a look.

On Wed, Jun 17, 2015 at 3:01 PM, Scott Field <sfield AT astro.cornell.edu> wrote:
Hi Phil,

On Tue, Jun 16, 2015 at 4:25 PM, Phil Miller <mille121 AT illinois.edu> wrote:
Hi Scott,

This list is definitely an appropriate place to post about potential bugs. Thanks for bringing it up.

The build you're using is a little unusual relative to the system you're running on: it's a 32-bit build of the RTS running on a 64-bit processor and OS. You also don't need to specify gcc, as it is the platform default. So I'd go with 'multicore-linux64' instead.

When debugging a problem, it's often a good idea to leave assertions enabled by dropping the --with-production option. Sometimes the RTS can catch that things have gone wrong much sooner that way.

Thank you for the tips. Disabling --with-production didn't catch anything, unfortunately.

Now, for your particular issue: how many PEs are you running? Does the reported error reproduce when run with +p1 and MALLOC_CHECK_ set to 3?

Running with +p1 is fine. Errors occur only when more than one thread is used.

Can you compile the whole thing with -g and run it under gdb to see which ostensibly invalid pointer is actually being freed at that point?

Sure. The Charm++ build is now './build charm++ multicore-linux64 -g -std=c++11'.

I set my breakpoint at CkExit and get the following output with +p1 (shown here for comparison with +p2):

Breakpoint 1, CkExit () at init.C:895
895  envelope *env = _allocEnv(StartExitMsg);
(gdb) 
(gdb) n
896  env->setSrcPe(CkMyPe());
(gdb) 
897  CmiSetHandler(env, _exitHandlerIdx);
(gdb) 
898  CmiSyncSendAndFree(0, env->getTotalsize(), (char *)env);
(gdb) 
903 if(!CharmLibInterOperate)
(gdb) 
904  CsdScheduler(-1);
(gdb) 

Breakpoint 1, CkExit () at init.C:895
895  envelope *env = _allocEnv(StartExitMsg);
(gdb) 
896  env->setSrcPe(CkMyPe());
(gdb) 
897  CmiSetHandler(env, _exitHandlerIdx);
(gdb) 
898  CmiSyncSendAndFree(0, env->getTotalsize(), (char *)env);
(gdb) 
903 if(!CharmLibInterOperate)
(gdb) 
904  CsdScheduler(-1);
(gdb) 
[Partition 0][Node 0] End of program
[Inferior 1 (process 7163) exited normally]

I was a bit surprised that CkExit was called twice. Anyway, running with +p2 produces:

Breakpoint 1, CkExit () at init.C:895
895  envelope *env = _allocEnv(StartExitMsg);
(gdb) n
896  env->setSrcPe(CkMyPe());
(gdb) 
897  CmiSetHandler(env, _exitHandlerIdx);
(gdb) 
898  CmiSyncSendAndFree(0, env->getTotalsize(), (char *)env);
(gdb) 
903 if(!CharmLibInterOperate)
(gdb) 
904  CsdScheduler(-1);
(gdb) 

Breakpoint 1, CkExit () at init.C:895
895  envelope *env = _allocEnv(StartExitMsg);
(gdb) 
896  env->setSrcPe(CkMyPe());
(gdb) 
897  CmiSetHandler(env, _exitHandlerIdx);
(gdb) 
898  CmiSyncSendAndFree(0, env->getTotalsize(), (char *)env);
(gdb) 
903 if(!CharmLibInterOperate)
(gdb) 
904  CsdScheduler(-1);
(gdb) step
CsdScheduler (maxmsgs=-1) at convcore.c:1797
1797 if (maxmsgs<0) CsdScheduleForever();
(gdb) 
CsdScheduleForever () at convcore.c:1848
1848  int isIdle=0;
(gdb) 
CsdSchedulerState_new (s=0x7fffffffd5b0) at convcore.c:1660
1660 s->localQ=CpvAccess(CmiLocalQueue);
(gdb) n

Program received signal SIGSEGV, Segmentation fault.
CsdSchedulerState_new (s=0x7fffffffd5b0) at convcore.c:1660
1660 s->localQ=CpvAccess(CmiLocalQueue);
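
(Editorial note: in SMP builds, Converse process-private "Cpv" variables behave roughly like arrays with one slot per thread rank, indexed by the calling thread's rank. The sketch below illustrates that idea, and why a thread whose rank and Cpv storage were never initialized would fault at exactly this access; it is not the real converse.h machinery, and all names besides CmiLocalQueue are invented.)

#include <cstdio>
#include <cstdlib>

// Illustrative sketch only; the real Charm++ macros differ in detail.
static const int kRanks = 2;
static void* Cpv_CmiLocalQueue[kRanks];   // one local queue per rank
static thread_local int myRank = -1;      // assigned when a PE thread starts

void* myLocalQueue() {
    // A thread whose rank was never assigned (myRank still -1) indexes
    // outside the array -- the same class of failure as the
    // CpvAccess(CmiLocalQueue) segfault in the trace above.
    return Cpv_CmiLocalQueue[myRank];
}

int main() {
    myRank = 0;
    Cpv_CmiLocalQueue[0] = std::malloc(16);
    std::printf("rank %d queue at %p\n", myRank, myLocalQueue());
    std::free(Cpv_CmiLocalQueue[0]);
    return 0;
}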


Best,
Scott 


Thanks.

Phil

On Tue, Jun 16, 2015 at 3:14 PM, Scott Field <sfield AT astro.cornell.edu> wrote:
Hi,

  Recently, after pulling a bleeding-edge version of the Charm++ code, all of our regression tests fail with either a segmentation fault or "double free or corruption (!prev): 0x0000000001c4de20 ***". The error appears to occur after CkExit is called. Charm++ was built on my laptop with

 >>> ./build charm++ multicore-linux32 gcc --with-production -j3 -std=c++11

  Using git's bisect utility, I was able to track down the first commit where things go wrong: git hash c96750026bbc7a9190f1381e7ac9ea56ae86f80e, commit message "Bug #695: disable comm thread in multicore builds". More specifically, if I edit line 200 of src/arch/util/machine-common-core.c from "#define CMK_SMP_NO_COMMTHD CMK_MULTICORE" to "#define CMK_SMP_NO_COMMTHD 0", the error message goes away and all tests pass again.
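
(Editorial note: the sketch below shows one plausible way such a compile-time guard gates comm-thread startup. It is a hypothetical illustration; the actual logic in machine-common-core.c is more involved, and startCommThread/startNode are invented names.)

/* CMK_MULTICORE is 1 in multicore builds, so after the commit above
 * the guard compiles the comm thread out entirely. */
#define CMK_MULTICORE 1
#define CMK_SMP_NO_COMMTHD CMK_MULTICORE

void startCommThread(void);   /* hypothetical helper */

void startNode(void) {
#if !CMK_SMP_NO_COMMTHD
    startCommThread();        /* skipped when the guard is nonzero */
#endif
}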

  Honestly, I don't really know why this change fixed the problem -- it's pretty far under the hood.

  A few questions:

1) Is this list an appropriate place to post information about potential bugs?

2) Does this seem to be a Charm++ bug introduced by that commit, or a fix that has simply broken our code? I had a hard time tracking down the source of the error. Oddly enough, I could not reproduce the same error under valgrind (although it did report an "Uninitialised value was created by a stack allocation", which it traced to one of the declaration files created by charmc). With MALLOC_CHECK_ set to 3, I get the following:

*** Error in `./Evolve1DScalarWave': free(): invalid pointer: 0x000000000203c920 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7f4cebc2e38f]
/lib/x86_64-linux-gnu/libc.so.6(+0x81fb6)[0x7f4cebc3cfb6]
/lib/x86_64-linux-gnu/libc.so.6(+0x3c280)[0x7f4cebbf7280]
/lib/x86_64-linux-gnu/libc.so.6(+0x3c2a5)[0x7f4cebbf72a5]
./Evolve1DScalarWave[0x670b4a]
./Evolve1DScalarWave[0x5e39ed]
./Evolve1DScalarWave(CsdScheduleForever+0x48)[0x673e88]
./Evolve1DScalarWave(CsdScheduler+0x2d)[0x67413d]
./Evolve1DScalarWave(_ZN12ElementChareI16ScalarWaveSystemILi1EEE11endTimeStepEv+0x448)[0x580d3c]
./Evolve1DScalarWave(_ZN12ElementChareI16ScalarWaveSystemILi1EEE13endComputeRhsEv+0x533... [frame truncated; the "free(): invalid pointer: 0x000000000203c920" error line repeats here]
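
(Aside: MALLOC_CHECK_ is a glibc feature; at 3, the allocator prints a diagnostic and aborts on heap misuse such as an invalid or double free. A standalone demonstration, unrelated to Charm++ itself:)

// Compile with `g++ demo.cpp` and run as `MALLOC_CHECK_=3 ./a.out`;
// glibc reports the invalid free and aborts, yielding a backtrace of
// the same shape as the one above.
#include <cstdlib>

int main() {
    char* p = static_cast<char*>(std::malloc(32));
    std::free(p);
    std::free(p);   // double free -> "free(): invalid pointer" + abort
    return 0;
}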


Best,
Scott
