charm - Re: [charm] segfault upon migration

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Phil Miller <mille121 AT illinois.edu>
  • To: Nicolas Bock <nicolasbock AT gmail.com>
  • Cc: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] segfault upon migration
  • Date: Wed, 21 Aug 2013 19:56:00 -0500
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Hi Nick,

There was a Charm++ bug exposed by your test code, filed as #270 in
our issue tracker [1]. Essentially, threaded entry methods were being
executed in two phases, thread creation and thread activation. If the
underlying object migrated between those two phases, then the thread
would not follow it and would run on a 'NULL' object, causing
undefined behavior and ultimately a crash.
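
To illustrate the failure mode, here is a toy model in plain C++. It is only a sketch: it is not Charm++ runtime code, and every name in it (Pe, PendingThread, activateStale, activateFollowing) is hypothetical. It assumes a thread created for an element remembers where the element lived at creation time, and it shows how activating that thread after a migration finds no object unless the element's current home is looked up again.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Toy model only: none of these names are Charm++ APIs.
struct Element {
    int value = 0;
};

// One "processor" holding the array elements that currently live on it.
struct Pe {
    std::map<int, Element> elements;  // keyed by array index

    Element* find(int index) {
        auto it = elements.find(index);
        return it == elements.end() ? nullptr : &it->second;
    }
};

// Phase 1 (creation): the thread records where its element lived.
struct PendingThread {
    int createdOnPe;
    int index;
};

// Move element `index` from PE `from` to PE `to`, as a load balancer might.
void migrate(std::vector<Pe>& pes, int from, int to, int index) {
    assert(from != to);
    auto it = pes[from].elements.find(index);
    if (it == pes[from].elements.end()) return;
    pes[to].elements.insert({index, it->second});
    pes[from].elements.erase(it);
}

// Phase 2 (activation), buggy variant: assumes the element never moved.
// Yields nullptr if the element migrated between creation and activation.
Element* activateStale(std::vector<Pe>& pes, const PendingThread& t) {
    return pes[t.createdOnPe].find(t.index);
}

// Phase 2 (activation), following variant: looks up the element's
// current home by index, so the thread "follows" the object.
Element* activateFollowing(std::vector<Pe>& pes, const PendingThread& t) {
    for (Pe& pe : pes) {
        if (Element* e = pe.find(t.index)) return e;
    }
    return nullptr;
}
```

In this model, creating a PendingThread for an element on PE 0, migrating the element to PE 1, and then calling activateStale returns a null pointer, while activateFollowing still finds the element.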

The particular bug is now fixed on both our stable branch and the mainline.

Note, however, that the same problem will occur for threaded entry
methods that are suspended while their underlying object migrates.
This can still occur for array elements with periodic load balancing
active, or if the object calls migrateMe for some reason while an
associated thread is suspended. I'll open a separate issue about that,
but it's less likely to get fixed quickly.
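
Until then, one way to think about avoiding the problem is to let an object move only when none of its threads is suspended. The following is a toy model in plain C++ of that discipline, not Charm++ code: GuardedElement and all of its members are hypothetical names, and the assumption that a migration request can simply be deferred is mine, not a description of what the runtime does.

```cpp
#include <cassert>
#include <optional>

// Toy model only: none of these names are Charm++ APIs.
struct GuardedElement {
    int pe = 0;                        // PE the element currently lives on
    int suspendedThreads = 0;          // threads of this element now suspended
    std::optional<int> pendingTarget;  // deferred migration target, if any

    void suspendThread() { ++suspendedThreads; }

    void resumeThread() {
        assert(suspendedThreads > 0);
        --suspendedThreads;
        applyPendingMigration();
    }

    // Record the request; it takes effect only once no thread is suspended.
    void requestMigration(int targetPe) {
        pendingTarget = targetPe;
        applyPendingMigration();
    }

    void applyPendingMigration() {
        if (suspendedThreads == 0 && pendingTarget) {
            pe = *pendingTarget;
            pendingTarget.reset();
        }
    }
};
```

In this model, a migration requested while a thread is suspended leaves pe unchanged, and the move happens when the last suspended thread resumes.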

Thank you for the report, and for taking the time to produce a focused
test case.

Phil


[1] https://charm.cs.illinois.edu/redmine/issues/270

On Wed, Aug 21, 2013 at 2:26 PM, Nicolas Bock <nicolasbock AT gmail.com> wrote:
> Hi,
>
> I changed the load balancing strategy from periodic to "at sync" and now
> everything seems to work. The code runs through 50 iterations without
> segfaulting. The only difference between the code I posted previously and
> this one is the timing of the load balancing step. My guess is that since
> periodic load balancing can migrate chares during the reduction, while some
> messages are still in the queue, there is some issue with rerouting queued
> messages such that from time to time a chare thread is restarted that
> doesn't exist anymore because it was migrated, leading to a segfault. This
> only happens if the entry method called for the reduction is [threaded],
> otherwise I never saw a segfault.
>
> Can you reproduce this behavior? Do you see any errors in the "periodic"
> version? If not, should I file a bug report?
>
> Thanks,
>
> nick
>
>
>
> On Tue, Aug 20, 2013 at 1:53 PM, Nicolas Bock <nicolasbock AT gmail.com> wrote:
>>
>> Hi,
>>
>> I wrote too soon. The program is still segfaulting. In order to isolate
>> what is ultimately causing this behavior, I trimmed the program further.
>> The attached version is, I guess, as basic as it gets in terms of a
>> reduction on a chare array. When I declare Work::doSomething() as
>> [threaded] in migration.ci, the program segfaults after a few iterations.
>> When that method is declared as a simple entry method, the code runs
>> fine. Since Work::doSomething() is not suspending itself, the [threaded]
>> attribute is not necessary, but is it harmful?
>>
>> Thanks,
>>
>> nick
>>
>>
>>
>> On Mon, Aug 19, 2013 at 12:33 PM, Nicolas Bock <nicolasbock AT gmail.com> wrote:
>>>
>>> Hi Jonathan and Nikhil,
>>>
>>> Thanks, that did it. Although I had read section 10.2 many times, it never
>>> occurred to me that [threaded] is also necessary for migration.
>>>
>>> Thanks again,
>>>
>>> nick
>>>
>>>
>>> On Fri, Aug 16, 2013 at 7:35 PM, Jonathan Lifflander <jliffl2 AT illinois.edu> wrote:
>>>>
>>>> Hey,
>>>>
>>>> Nikhil and I looked over your code and noticed another problem. It's
>>>> actually not related to the load balancing. For an entry method to call
>>>> a "sync" method, it must be threaded (see manual section 12.2).
>>>> Imagine the scenario where the object whose "sync" method it calls is
>>>> on the same processor: the caller must suspend so that the method can
>>>> execute. We need to add more runtime error checking to make sure this
>>>> is the case and print out a useful error message.
>>>>
>>>> So the fix is to make "doSomething" threaded. Then the code seems to
>>>> work fine.
>>>>
>>>> What was happening (as far as I can tell) is that the method was
>>>> waiting for the return value, and the load balancer was trying to move
>>>> it. This was not valid because of the lack of a thread, hence the
>>>> stack state was not migratable. I'm surprised the code didn't hang
>>>> before you encountered this problem.
>>>>
>>>> Jonathan
>>>>
>>>> On Fri, Aug 16, 2013 at 6:02 PM, Nicolas Bock <nicolasbock AT gmail.com> wrote:
>>>> > Hi,
>>>> >
>>>> > please have a look at the attached code. The code consists of two
>>>> > chare arrays, one holding some data, one doing some work. The main
>>>> > code calls a reduction on the Work array which gets information from
>>>> > the Data array to do something. When I run this (make run) on more
>>>> > than one PE with the GreedyCommLB load balancer the code segfaults at
>>>> > random points when the load balancer kicks in. I think what's going on
>>>> > is that the Data::info() call in Work::doSomething() suspends the
>>>> > chare and it just so happens that it sometimes is migrated while being
>>>> > suspended. If I comment out the code block that calls Data::info() the
>>>> > program executes just fine.
>>>> >
>>>> > Is what I am thinking correct, or is there another problem in the code
>>>> > that I have overlooked?
>>>> >
>>>> > Thanks already,
>>>> >
>>>> > nick
>>>> >
>>>> >
>>>> > _______________________________________________
>>>> > charm mailing list
>>>> > charm AT cs.uiuc.edu
>>>> > http://lists.cs.uiuc.edu/mailman/listinfo/charm
>>>> >
>>>
>>>
>>
>



