
charm - Re: [charm] code crash when run with migration based LB - charm++ 6.7.1

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system


  • From: Fouzhan Hosseini <F.Hosseini AT leeds.ac.uk>
  • To: "Vipul Harsh, -" <vharsh2 AT illinois.edu>
  • Cc: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] code crash when run with migration based LB - charm++ 6.7.1
  • Date: Thu, 6 Oct 2016 11:31:45 +0000

Hi Vipul,


Thanks a lot for your reply. I cannot pup CkSectionInfo objects. As I mentioned in my first email, I've defined the CkSectionInfo objects in the local scope of the entry methods that contribute to a section reduction, and they are updated by calling CkGetSectionInfo(). The Charm++ manual, sec. 14.3, says the cookie should not be used as a one-time local variable. However, if I define the CkSectionInfo objects as data members of the chare elements, the program does not finish (I guess something goes wrong in the underlying message passing). Moving the definition of the CkSectionInfo objects to the local scope of the functions seems to work, and the program always finishes successfully as long as I do not use migration-based LB.


I was confused by the Charm++ manual's suggestion. In each iteration, I define new array sections, broadcast/multicast to each section, and then each section contributes to only one reduction operation. I guess the CkSectionInfo objects must be defined locally, since a new one is needed from one iteration to the next. I do not know many implementation details, and could obviously be wrong! I can put together an example code set that reproduces the non-terminating run, if anybody is happy to look at it.


I still get the "corrupted double-linked list" error when using LB.


Thanks,

Fouzhan



From: Vipul Harsh, - <vharsh2 AT illinois.edu>
Sent: Wednesday, October 5, 2016 6:37:36 PM
To: Fouzhan Hosseini
Cc: charm AT cs.uiuc.edu
Subject: RE: code crash when run with migration based LB - charm++ 6.7.1
 
Hi Fouzhan,

The CkMulticast library handles migrations. Contributions should work even with old values of the cookie (CkSectionInfo object) after the chare has migrated. But make sure that you pup the CkSectionInfo object in your pup routine.
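
The advice above depends on the cookie travelling with the chare when it migrates. The following is a toy model of that pack/unpack round trip, not Charm++ code: it uses no Charm++ headers, and ToyPacker, ToyCookie, ToyChare and migrate() are hypothetical stand-ins invented for illustration. It only sketches why a member left out of the pup routine arrives on the destination in its default state.

```cpp
// Toy model only: ToyPacker, ToyCookie and ToyChare are hypothetical
// stand-ins and are NOT Charm++ types or APIs.
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Minimal serializer: appends bytes when packing, reads them back when unpacking.
class ToyPacker {
public:
    ToyPacker(std::vector<unsigned char>& buf, bool unpacking)
        : buf_(buf), unpacking_(unpacking), pos_(0) {}

    // Copy a trivially copyable value into or out of the buffer.
    template <typename T>
    void bytes(T& value) {
        if (unpacking_) {
            std::memcpy(&value, buf_.data() + pos_, sizeof(T));
            pos_ += sizeof(T);
        } else {
            const unsigned char* p = reinterpret_cast<const unsigned char*>(&value);
            buf_.insert(buf_.end(), p, p + sizeof(T));
        }
    }

private:
    std::vector<unsigned char>& buf_;
    bool unpacking_;
    std::size_t pos_;
};

// Stand-in for a section cookie.
struct ToyCookie {
    int sectionId = -1;  // -1 means "never initialised"
    int redNo = 0;       // reduction counter
};

struct ToyChare {
    int iteration = 0;
    ToyCookie cookie;

    // Serialise the members that must survive migration.
    // includeCookie = false models a pup routine that forgets the cookie.
    void pup(ToyPacker& p, bool includeCookie) {
        p.bytes(iteration);
        if (includeCookie) p.bytes(cookie);
    }
};

// Simulate a migration: pack on the source, unpack into a fresh object.
ToyChare migrate(ToyChare& src, bool includeCookie) {
    std::vector<unsigned char> wire;
    ToyPacker pack(wire, false);
    src.pup(pack, includeCookie);

    ToyChare dst;
    ToyPacker unpack(wire, true);
    dst.pup(unpack, includeCookie);
    return dst;
}

// Small self-check showing both outcomes.
void demo() {
    ToyChare c;
    c.iteration = 7;
    c.cookie.sectionId = 3;

    ToyChare kept = migrate(c, true);
    assert(kept.cookie.sectionId == 3);   // cookie was serialised

    ToyChare lost = migrate(c, false);
    assert(lost.cookie.sectionId == -1);  // cookie reset to its default
}
```

In this model the cookie has to be a data member for the pup routine to see it at all; a cookie held in a local variable of an entry method is outside the reach of any pup routine.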

Thanks and Regards,
Vipul Harsh

From: Fouzhan Hosseini [F.Hosseini AT leeds.ac.uk]
Sent: 03 October 2016 18:36:46
To: charm AT cs.uiuc.edu
Subject: [charm] code crash when run with migration based LB - charm++ 6.7.1

Dear All,


I have written a Charm++ program that works fine on either a multi-core machine or a cluster. However, when this program is linked and executed with the available migration-based load balancing strategies (e.g. +balancer GreedyLB), it usually crashes with the error message "corrupted double-linked list" or a segmentation fault. I have been trying to track down the problem but am not sure where it is coming from. I have a few questions.

I am new to the Charm++ community; I hope this is the right place to raise questions, ask for help, and report bugs.


1) There are two chare arrays in my code, and a PUP method is implemented for both. I only have simple entry methods (no threaded or sync methods), but I make heavy use of Structured Dagger (SDAG) to express coordination between entry methods (for, if and when statements, and matching on reference numbers). "__sdag_pup(p);" is added in the PUP methods. Is there anything else I am supposed to add to my code to be able to use migration-based LB?


2) I am using the CkMulticast library with array sections and section reductions. Each array section contributes to only one reduction, so I've defined a local variable of type CkSectionInfo in the relevant chare member functions, which is updated by calling "CkGetSectionInfo()". I do not quite understand how CkSectionInfo objects are updated in the CkMulticast library in case of migration, so I wondered whether this could cause a problem.


3) There is an entry method called Merger() which is expressed in SDAG. In this method there is a when statement waiting on another entry method called RecvBSlabSet1(). RecvBSlabSet1() is called when a section reduction on the other array completes. These two entry methods are often mentioned in the error message stack trace. I am including the stack trace in case it is useful. Both of these entry methods belong to a chare array called JointContourNet.


** Error in `JCN': *** Error in `JCN': corrupted double-linked list: 0x000000000298f860 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7b184)[0x7ffff6957184]
/lib64/libc.so.6(+0x7d235)[0x7ffff6959235]
JCN(_ZN4SDAG10MsgClosureD0Ev+0x24)[0x60e0b4]
JCN(_ZN4SDAG6BufferD0Ev+0x46)[0x60e126]
JCN(_ZN15JointContourNet7_when_0EPN23Closure_JointContourNet16Merger_4_closureEi+0x2bc)[0x4c82ac]
JCN(_ZN15JointContourNet13RecvBSlabSet1EP14CkReductionMsg+0x188)[0x4c8af8]
JCN(CkDeliverMessageFree+0x22)[0x530652]
JCN(_ZN14CkLocRec_local11invokeEntryEP12CkMigratablePvib+0x240)[0x54a570]
JCN(_ZN14CkLocRec_local7deliverEP14CkArrayMessage11CkDeliver_ti+0x314)[0x54b504]
JCN(_ZN8CkLocMgr7deliverEP9CkMessage11CkDeliver_ti+0xec)[0x546fdc]
JCN(_Z15_processHandlerPvP11CkCoreState+0x437)[0x537327]
JCN(CsdScheduleForever+0x48)[0x5f9ff8]
JCN(CsdScheduler+0x2d)[0x5fa28d]
JCN(ConverseInit+0x3ea)[0x5f8f6a]
JCN(main+0x2c)[0x4bcd5c]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff68fdb15]
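
The frames in the backtrace above are mangled C++ symbols. A minimal sketch for decoding them follows, assuming a GCC or Clang toolchain that provides abi::__cxa_demangle in <cxxabi.h>; the helper name demangle() is hypothetical. Decoded, the frames just below the libc entries are the destructors of SDAG::MsgClosure and SDAG::Buffer, called from JointContourNet::_when_0 and JointContourNet::RecvBSlabSet1(CkReductionMsg*). That suggests the corruption was detected while a buffered SDAG message was being freed, although the damage itself may have happened earlier and elsewhere.

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle (GCC/Clang specific)
#include <cstdlib>    // std::free
#include <string>

// Hypothetical helper: return the readable form of an Itanium-ABI mangled
// symbol, or the input unchanged if it cannot be demangled.
std::string demangle(const std::string& mangled) {
    int status = 0;
    char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);  // the demangler allocates the result with malloc
    return result;
}
```

For example, demangle("_ZN4SDAG10MsgClosureD0Ev") yields "SDAG::MsgClosure::~MsgClosure()". The c++filt command-line tool performs the same decoding on a whole trace.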


Regards, 

Fouzhan 



