charm - Re: [charm] segfault upon migration

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Nicolas Bock <nicolasbock AT gmail.com>
  • To: Jonathan Lifflander <jliffl2 AT illinois.edu>, Phil Miller <mille121 AT illinois.edu>
  • Cc: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] segfault upon migration
  • Date: Thu, 22 Aug 2013 09:55:33 -0600
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Thanks for the detailed description of the problem. That makes sense. By the way, which branch is the "master" branch? Is it "charm"?

Thanks for your help,

nick


On Wed, Aug 21, 2013 at 6:56 PM, Jonathan Lifflander <jliffl2 AT illinois.edu> wrote:
Hey,

So the bug you were experiencing in the prior version of your test
code (without the AtSync()) has nothing to do with the reduction; it
comes from how the thread is started. The thread was created but not
started immediately, so if the load balancer happened to run in
between, the thread began with an invalid pointer to the object. Phil
and I have fixed this problem (see issue #270 on Redmine); the change
is on master and will be backported to the release. With this change
your code runs correctly.

However, we have realized there is a deeper problem with combining
periodic load balancing and threads. If a thread suspends, the thread
is not migratable (which is the default), and the object is migrated,
then there will be a crash when the thread is resumed. There are two
ways to solve this: either make the thread migratable by using
iso_malloc, or use AtSync() and call it only when you are sure that
the thread is not suspended.
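
The second option can be illustrated with a toy model in plain C++.
This is only a sketch of the idea, not Charm++ code: ToyChare,
requestSync(), balance() and demo() are hypothetical names invented
for the illustration, and requestSync() merely stands in for the point
where a real chare would call AtSync(). The model assumes the balancer
moves an object only after the object itself has announced a sync
point, and that an object never announces one while its thread is
suspended.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for a chare. All names here are hypothetical, not Charm++ API.
struct ToyChare {
    int pe = 0;                // processor the object currently lives on
    bool suspended = false;    // true while a non-migratable thread is parked
    bool atSyncPoint = false;  // true once the object has announced a sync point

    void suspendThread() { suspended = true; }
    void resumeThread() { suspended = false; }

    // Analogue of calling AtSync(): refused while the thread is suspended.
    bool requestSync() {
        if (suspended) return false;
        atSyncPoint = true;
        return true;
    }
};

// Toy balancer: moves only objects that have announced a sync point.
// Returns the number of objects it migrated.
int balance(std::vector<ToyChare>& chares, int targetPe) {
    int moved = 0;
    for (ToyChare& c : chares) {
        if (!c.atSyncPoint) continue;  // never migrate an unannounced object
        c.pe = targetPe;
        c.atSyncPoint = false;
        ++moved;
    }
    return moved;
}

// Small scenario: chare 0 has a suspended thread, chare 1 is idle.
int demo() {
    std::vector<ToyChare> chares(2);
    chares[0].suspendThread();
    bool first = chares[0].requestSync();   // refused: thread is suspended
    bool second = chares[1].requestSync();  // accepted: nothing is parked
    assert(!first && second);
    int moved = balance(chares, 1);
    assert(chares[0].pe == 0 && chares[1].pe == 1);
    return moved;
}
```

In this model the suspended object stays where it is until its thread
has resumed and it asks for a sync point again, which is the ordering
the AtSync() approach relies on.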

Hope this helps. We will try to resolve the deeper problem, but it may
be a while until we have a solid solution implemented.

Jonathan

On Wed, Aug 21, 2013 at 2:26 PM, Nicolas Bock <nicolasbock AT gmail.com> wrote:
> Hi,
>
> I changed the load balancing strategy from periodic to "at sync" and now
> everything seems to work. The code runs through 50 iterations without
> segfaulting. The only difference between the code I posted previously and
> this one is the timing of the load balancing step. My guess is that since
> periodic load balancing can migrate chares during the reduction, while some
> messages are still in the queue, there is some issue with rerouting queued
> messages such that from time to time a chare thread is restarted that
> doesn't exist anymore because it was migrated, leading to a segfault. This
> only happens if the entry method called for the reduction is [threaded],
> otherwise I never saw a segfault.
>
> Can you reproduce this behavior? Do you see any errors in the "periodic"
> version? If not, should I file a bug report?
>
> Thanks,
>
> nick
>
>
>
> On Tue, Aug 20, 2013 at 1:53 PM, Nicolas Bock <nicolasbock AT gmail.com> wrote:
>>
>> Hi,
>>
>> I wrote too soon. The program is still segfaulting. In order to isolate
>> what is ultimately causing this behavior, I trimmed the program further. The
>> attached version is, I guess, as basic as it gets in terms of a reduction on a
>> chare array. When I declare Work::doSomething() as [threaded] in
>> migration.ci, the program segfaults after a few iterations. When that method
>> is declared a simple entry method, the code runs fine. Since
>> Work::doSomething() is not suspending itself, the [threaded] attribute is not
>> necessary, but is it harmful?
>>
>> Thanks,
>>
>> nick
>>
>>
>>
>> On Mon, Aug 19, 2013 at 12:33 PM, Nicolas Bock <nicolasbock AT gmail.com>
>> wrote:
>>>
>>> Hi Jonathan and Nikhil,
>>>
>>> thanks that did it. Although I had read section 10.2 many times, it never
>>> occurred to me that [threaded] is also necessary for migration.
>>>
>>> Thanks again,
>>>
>>> nick
>>>
>>>
>>> On Fri, Aug 16, 2013 at 7:35 PM, Jonathan Lifflander
>>> <jliffl2 AT illinois.edu> wrote:
>>>>
>>>> Hey,
>>>>
>>>> Nikhil and I looked over your code and noticed another problem. It's
>>>> actually not related to the load balancing. For an entry method to call
>>>> a "sync" method, it must be threaded (see manual section 12.2).
>>>> Imagine the scenario where the object whose "sync" method it is
>>>> calling is on the same processor: the caller must suspend so that
>>>> the method can execute. We need to add more runtime error checking
>>>> to make sure this is the case and print out a useful error message.
>>>>
>>>> So the fix is to make "doSomething" threaded. Then the code seems to
>>>> work fine.
>>>>
>>>> What was happening (as far as I can tell) is that the method was
>>>> waiting for the return value, and the load balancer was trying to move
>>>> it. This was not valid because of the lack of a thread, hence the
>>>> stack state was not migratable. I'm surprised the code didn't hang
>>>> before you encountered this problem.
>>>>
>>>> Jonathan
>>>>
>>>> On Fri, Aug 16, 2013 at 6:02 PM, Nicolas Bock <nicolasbock AT gmail.com>
>>>> wrote:
>>>> > Hi,
>>>> >
>>>> > please have a look at the attached code. The code consists of two chare
>>>> > arrays, one holding some data, one doing some work. The main code calls a
>>>> > reduction on the Work array which gets information from the Data array to do
>>>> > something. When I run this (make run) on more than one PE with the
>>>> > GreedyCommLB load balancer the code segfaults at random points when the load
>>>> > balancer kicks in. I think what's going on is that the Data::info() call in
>>>> > Work::doSomething() suspends the chare and it just so happens that it
>>>> > sometimes is migrated while being suspended. If I comment out the code block
>>>> > that calls Data::info() the program executes just fine.
>>>> >
>>>> > Is what I am thinking correct, or is there another problem in the code that
>>>> > I have overlooked?
>>>> >
>>>> > Thanks already,
>>>> >
>>>> > nick
>>>> >
>>>> >
>>>> > _______________________________________________
>>>> > charm mailing list
>>>> > charm AT cs.uiuc.edu
>>>> > http://lists.cs.uiuc.edu/mailman/listinfo/charm
>>>> >
>>>
>>>
>>
>



