charm - Re: [charm] segfault upon migration

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Jonathan Lifflander <jliffl2 AT illinois.edu>
  • To: Nicolas Bock <nicolasbock AT gmail.com>
  • Cc: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] segfault upon migration
  • Date: Thu, 22 Aug 2013 10:57:26 -0500
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Yes, the master branch is named "charm".

On Thu, Aug 22, 2013 at 10:55 AM, Nicolas Bock <nicolasbock AT gmail.com> wrote:
> Thanks for the detailed description of the problem. That makes sense. By the
> way, which branch is the "master" branch? Is it "charm"?
>
> Thanks for your help,
>
> nick
>
>
> On Wed, Aug 21, 2013 at 6:56 PM, Jonathan Lifflander <jliffl2 AT illinois.edu> wrote:
>>
>> Hey,
>>
>> So the bug you were experiencing in the prior version of your test
>> code (without the AtSync()) has nothing to do with the reduction; it
>> comes from how the thread is started. The thread was being created
>> but not started immediately, so if the load balancer happened to run
>> in the meantime, the thread would begin with an invalid pointer to
>> the object. Phil and I have fixed this problem (see issue #270 on
>> redmine); the change is on master and will be backported to the
>> release. With this change your code runs correctly.
>>
>> However, we have realized there is a deeper problem with combining
>> periodic load balancing and threads. If a thread suspends, the thread
>> is not migratable (which is the default), and the object is migrated,
>> then there will be a crash when the thread is resumed. There are two
>> ways to solve this: either make the thread migratable by using
>> iso_malloc, or use AtSync() and call it only when you are sure that
>> the thread is not suspended.
>>
>> Hope this helps. We will try to resolve the deeper problem, but it may
>> be a while before we have a solid solution implemented.
>>
>> Jonathan
>>
>> On Wed, Aug 21, 2013 at 2:26 PM, Nicolas Bock <nicolasbock AT gmail.com> wrote:
>> > Hi,
>> >
>> > I changed the load balancing strategy from periodic to "at sync" and now
>> > everything seems to work. The code runs through 50 iterations without
>> > segfaulting. The only difference between the code I posted previously and
>> > this one is the timing of the load balancing step. My guess is that since
>> > periodic load balancing can migrate chares during the reduction, while some
>> > messages are still in the queue, there is some issue with rerouting queued
>> > messages such that from time to time a chare thread is restarted that
>> > doesn't exist anymore because it was migrated, leading to a segfault. This
>> > only happens if the entry method called for the reduction is [threaded],
>> > otherwise I never saw a segfault.
>> >
>> > Can you reproduce this behavior? Do you see any errors in the "periodic"
>> > version? If not, should I file a bug report?
>> >
>> > Thanks,
>> >
>> > nick
>> >
>> >
>> >
>> > On Tue, Aug 20, 2013 at 1:53 PM, Nicolas Bock <nicolasbock AT gmail.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I wrote too soon. The program is still segfaulting. In order to isolate
>> >> what is ultimately causing this behavior I trimmed the program further.
>> >> The attached version is, I guess, as basic as it gets in terms of a
>> >> reduction on a chare array. When I declare Work::doSomething() as
>> >> [threaded] in migration.ci, the program segfaults after a few iterations.
>> >> When that method is declared a simple entry method, then the code runs
>> >> fine. Since Work::doSomething() is not suspending itself, the [threaded]
>> >> attribute is not necessary, but is it harmful?
>> >>
>> >> Thanks,
>> >>
>> >> nick
>> >>
>> >>
>> >>
>> >> On Mon, Aug 19, 2013 at 12:33 PM, Nicolas Bock <nicolasbock AT gmail.com> wrote:
>> >>>
>> >>> Hi Jonathan and Nikhil,
>> >>>
>> >>> Thanks, that did it. Although I had read section 10.2 many times, it
>> >>> never occurred to me that [threaded] is also necessary for migration.
>> >>>
>> >>> Thanks again,
>> >>>
>> >>> nick
>> >>>
>> >>>
>> >>> On Fri, Aug 16, 2013 at 7:35 PM, Jonathan Lifflander <jliffl2 AT illinois.edu> wrote:
>> >>>>
>> >>>> Hey,
>> >>>>
>> >>>> Nikhil and I looked over your code and noticed another problem. It's
>> >>>> actually not related to the load balancing. For an entry method to
>> >>>> call a "sync" method, it must be threaded (see manual section 12.2).
>> >>>> Imagine the scenario where the object on which it is calling "sync"
>> >>>> is on the same processor: the caller must suspend so that the method
>> >>>> can execute. We need to add more runtime error checking to make sure
>> >>>> this is the case and print out a useful error message.
>> >>>>
>> >>>> So the fix is to make "doSomething" threaded. Then the code seems to
>> >>>> work fine.
>> >>>>
>> >>>> What was happening (as far as I can tell) is that the method was
>> >>>> waiting for the return value, and the load balancer was trying to
>> >>>> move it. This was not valid because of the lack of a thread, hence
>> >>>> the stack state was not migratable. I'm surprised the code didn't
>> >>>> hang before you encountered this problem.
>> >>>>
>> >>>> Jonathan
>> >>>>
>> >>>> On Fri, Aug 16, 2013 at 6:02 PM, Nicolas Bock <nicolasbock AT gmail.com> wrote:
>> >>>> > Hi,
>> >>>> >
>> >>>> > please have a look at the attached code. The code consists of two
>> >>>> > chare arrays, one holding some data, one doing some work. The main
>> >>>> > code calls a reduction on the Work array which gets information from
>> >>>> > the Data array to do something. When I run this (make run) on more
>> >>>> > than one PE with the GreedyCommLB load balancer the code segfaults at
>> >>>> > random points when the load balancer kicks in. I think what's going
>> >>>> > on is that the Data::info() call in Work::doSomething() suspends the
>> >>>> > chare and it just so happens that it sometimes is migrated while
>> >>>> > being suspended. If I comment out the code block that calls
>> >>>> > Data::info() the program executes just fine.
>> >>>> >
>> >>>> > Is what I am thinking correct, or is there another problem in the
>> >>>> > code that I have overlooked?
>> >>>> >
>> >>>> > Thanks already,
>> >>>> >
>> >>>> > nick
>> >>>> >
>> >>>
>> >>>
>> >>
>> >
>
>
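
A note on the migration hazard described in the Aug 21 message above: what follows is a minimal toy sketch in plain C++. It is not Charm++ code and does not show how the runtime actually works. It assumes a hypothetical ToyObject that lives on one PE and has at most one non-migratable thread; the names suspend_thread, resume_thread, periodic_migrate and at_sync_migrate are invented for illustration. The sketch only models the rule stated in that message: resuming a non-migratable thread after its object has moved is an error, while migrating only when no thread is suspended avoids it.

```cpp
#include <cassert>

// Toy model only: these types and functions are hypothetical and are
// not part of the Charm++ API.
struct ToyObject {
    int pe = 0;             // PE that currently hosts the object
    int suspended_on = -1;  // PE holding a parked thread stack, or -1 if none
};

// Park the object's (non-migratable) thread on the PE it is running on.
void suspend_thread(ToyObject& obj) {
    assert(obj.suspended_on == -1 && "at most one suspended thread in this model");
    obj.suspended_on = obj.pe;
}

// Resume the parked thread. Returns true if the object is still on the PE
// where the thread was parked; false stands in for the crash on resume.
bool resume_thread(ToyObject& obj) {
    assert(obj.suspended_on != -1 && "no thread is suspended");
    const bool safe = (obj.suspended_on == obj.pe);
    obj.suspended_on = -1;
    return safe;
}

// Periodic-style balancing: moves the object regardless of thread state.
void periodic_migrate(ToyObject& obj, int dest_pe) {
    obj.pe = dest_pe;
}

// Stand-in for the AtSync() discipline: the move happens only when no
// thread is suspended. Returns true if the object was migrated.
bool at_sync_migrate(ToyObject& obj, int dest_pe) {
    if (obj.suspended_on != -1) {
        return false;  // object is not at a safe point; skip migration
    }
    obj.pe = dest_pe;
    return true;
}
```

In this model, calling suspend_thread and then periodic_migrate makes resume_thread return false, whereas at_sync_migrate declines to move the object until the thread has been resumed.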



