charm - Re: [charm] how to suspend/resume a chare ?

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Gengbin Zheng <zhenggb AT gmail.com>
  • To: Christian Perez <christian.perez AT inria.fr>
  • Cc: Filippo Gioachin <gioachin AT uiuc.edu>, "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>, "Kale, Laxmikant V" <kale AT cs.uiuc.edu>, "kale AT illinois.edu" <kale AT illinois.edu>
  • Subject: Re: [charm] how to suspend/resume a chare ?
  • Date: Fri, 28 May 2010 09:57:02 -0500
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

ok, the virtual base class seems to be a problem here. If you remove
it, it works.
This exposes a bug in charm on how a plain chare is created and
registered in the chare table. I checked in a fix now, please try it
and let us know if it fixes the problem.

Gengbin
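
The thread does not say why the virtual base class triggers the failure. One general C++ effect that may be relevant: with virtual inheritance, a pointer converted to the base class need not hold the same address as the object itself, so code that stores one and later treats it as the other can misbehave. A minimal standalone sketch, assuming hypothetical types RuntimeBase and UserObject (they are not Charm++ classes and are not taken from the checked-in fix):

```cpp
#include <cassert>
#include <iostream>

// Hypothetical stand-in for a runtime-provided base class (NOT a Charm++ type).
struct RuntimeBase {
    virtual ~RuntimeBase() = default;
    int slot = 0;
};

// Hypothetical user class that inherits virtually, as in the failing test program.
struct UserObject : virtual RuntimeBase {
    int payload = 0;
};

// Recover the address of the complete (most-derived) object from a base pointer.
// dynamic_cast to a void pointer is defined for polymorphic types and yields
// the start of the complete object, whatever the base subobject's offset is.
const void* complete_object(const RuntimeBase* b) {
    return dynamic_cast<const void*>(b);
}

// Print both addresses; whether they differ depends on the compiler's ABI.
void print_addresses(const UserObject& obj) {
    const RuntimeBase* as_base = &obj;  // implicit derived-to-base conversion
    std::cout << "object:         " << static_cast<const void*>(&obj) << '\n'
              << "base subobject: " << static_cast<const void*>(as_base) << '\n';
}
```

Calling print_addresses on a UserObject shows whether the two addresses differ with a given compiler, and complete_object returns the object's own address in either case.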

On Fri, May 28, 2010 at 4:03 AM, Christian Perez
<christian.perez AT inria.fr>
wrote:
> Here is a simplified test program that fails (line 31), while the version
> with line 30 works (with one process).
> After compiling, it may happen that the 1st run works ... so it may be a
> race condition.
> Running it with 2 processes seems to be ok on my machine.
>
> Christian
>
> On 05/28/2010 04:03 AM, Gengbin Zheng wrote:
>>
>> I wrote a simple test program passing either thishandle, or CProxy, it
>> seems to work both ways.
>> I think it would be best if you can write a simple standalone program
>> that can reproduce your problem and send to us for investigation.
>>
>> Gengbin
>>
>> On Thu, May 27, 2010 at 8:51 AM, Christian Perez
>> <christian.perez AT inria.fr>
>>  wrote:
>>
>>>
>>> But, why does it work if I use the parameter 'psi'?
>>>
>>> If I remove the 'sync', there is a synchronization issue :(
>>>
>>> Christian
>>>
>>> On 05/26/2010 09:54 PM, Gengbin Zheng wrote:
>>>
>>>>
>>>> remove [sync] from your ci file. Calling sync will block the caller,
>>>> which in your case is not a threaded entry method.
>>>>
>>>> Gengbin
>>>>
>>>> On Wed, May 26, 2010 at 10:29 AM, Christian Perez
>>>> <christian.perez AT inria.fr>
>>>>    wrote:
>>>>
>>>>
>>>>>
>>>>> On 05/26/2010 05:00 PM, Gengbin Zheng wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> On Wed, May 26, 2010 at 6:37 AM, Christian Perez
>>>>>> <christian.perez AT inria.fr>
>>>>>>      wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> On 05/25/2010 07:22 PM, Gengbin Zheng wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Is the caller of connect() a thread (I mean is connect() a threaded
>>>>>>>> entry function)?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> It is not a threaded charm method: when I tried to turn it into a
>>>>>>> [threaded] entry method, I got an error when invoking a JNI method
>>>>>>> from this chare. I shall investigate it later.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> another way of doing this without using threads is to send the
>>>>>>>> component_B proxy to component_A, and let component_A directly send
>>>>>>>> its proxy to component_B.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> That is the alternative solution: the good news is that it works(*),
>>>>>>> the not-so-good news is that it makes the mapping between components
>>>>>>> and charm objects a little more complex.
>>>>>>>
>>>>>>> (*) I need to use a hack that seems weird to me:
>>>>>>>
>>>>>>> Chare component_A : SimpleInterface {
>>>>>>>  void set_s(CProxy_SetSimpleInterface& pssi,
>>>>>>>             CProxy_SimpleInterface& psi) {
>>>>>>>     pssi.set_si(psi);
>>>>>>>  }
>>>>>>>
>>>>>>> works while the following version does not work (psi is changed to
>>>>>>> thishandle)
>>>>>>>
>>>>>>>
>>>>>>> Chare component_A : SimpleInterface {
>>>>>>>  void set_s(CProxy_SetSimpleInterface& pssi,
>>>>>>>             CProxy_SimpleInterface& psi) {
>>>>>>>     pssi.set_si(thishandle);
>>>>>>>  }
>>>>>>>
>>>>>>> psi is obtained from a call to CProxy_component_A::ckNew.
>>>>>>>
>>>>>>> Question: how can I create a Proxy from within a chare?
>>>>>>>
>>>>>>> I tried
>>>>>>>
>>>>>>>     pssi.set_si(CProxy_SimpleInterface(thishandle));
>>>>>>>
>>>>>>> but it fails with the same error :
>>>>>>>
>>>>>>> [0] Assertion "n<len" failed in file cklists.h line 221.
>>>>>>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>>>>>>> Reason:
>>>>>>> [0] Stack Traceback:
>>>>>>>  [0:0] CmiAbort+0x89  [0x506ff2]
>>>>>>>  [0:1] __cmi_assert+0x4b  [0x514a0c]
>>>>>>>  [0:2] _ZN5CkVecIPvEixEm+0x36  [0x49c7a8]
>>>>>>>  [0:3] /opt/usr/stow/ULCMi/libexec/Charm-launcher [0x4966f9]
>>>>>>>  [0:4] _Z15_processHandlerPvP11CkCoreState+0x30d  [0x49712e]
>>>>>>>  [0:5] CmiHandleMessage+0x84  [0x50f5f1]
>>>>>>>  [0:6] CsdSchedulePoll+0x96  [0x50fb5f]
>>>>>>>  [0:7] CsdScheduler+0x23  [0x50f8bf]
>>>>>>>  [0:8] CthStandinCode+0xe  [0x50fc5a]
>>>>>>>  [0:9] CthStartThread+0x59  [0x4830b7]
>>>>>>>  [0:10] /lib/libc.so.6 [0x7f98d9aaf370]
>>>>>>> Fatal error on PE 0>
>>>>>>>
>>>>>>> Thanks for your help!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> It is hard to eyeball errors from segmented code.
>>>>>> What is in set_si?
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> I produce this bug with this piece of code, with only one process.
>>>>>
>>>>> ci file:
>>>>>    chare SingleProvider : SimpleInterface {
>>>>>        ...
>>>>>        entry [sync] void provider_set_s(CProxy_SetSimpleInterface& pssi,
>>>>>            CProxy_SimpleInterface& psi, int n, char name[n]);
>>>>>        ...
>>>>>    }
>>>>>
>>>>> C file:
>>>>>
>>>>>    class SingleProvider : virtual public CBase_SingleProvider {
>>>>>        ...
>>>>>
>>>>>     void provider_set_s(CProxy_SetSimpleInterface& pssi,
>>>>>             CProxy_SimpleInterface& psi, int n, char* name) {
>>>>>         pssi.ulcmi_set_si(CProxy_SimpleInterface(thishandle), n, name);
>>>>>     }
>>>>>
>>>>>
>>>>> It also fails if I use thishandle instead of
>>>>> CProxy_SimpleInterface(thishandle).
>>>>> It works fine if I use pssi (which has been obtained with ckNew of
>>>>> CProxy_SingleProvider).
>>>>>
>>>>> | Can you build your program with -g, when it crashes again, get stack
>>>>> | trace like:
>>>>>
>>>>> Here the output within gdb:
>>>>>
>>>>> [0] Assertion "n<len" failed in file cklists.h line 221.
>>>>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>>>>> Reason:
>>>>> [0] Stack Traceback:
>>>>>  [0:0] CmiAbort+0x89  [0x506f02]
>>>>>  [0:1] __cmi_assert+0x4b  [0x51491c]
>>>>>  [0:2] _ZN5CkVecIPvEixEm+0x36  [0x49c6b8]
>>>>>  [0:3] /opt/usr/stow/ULCMi/libexec/Charm-launcher [0x496609]
>>>>>  [0:4] _Z15_processHandlerPvP11CkCoreState+0x30d  [0x49703e]
>>>>>  [0:5] CmiHandleMessage+0x84  [0x50f501]
>>>>>  [0:6] CsdSchedulePoll+0x96  [0x50fa6f]
>>>>>  [0:7] CsdScheduler+0x23  [0x50f7cf]
>>>>>  [0:8] CthStandinCode+0xe  [0x50fb6a]
>>>>>  [0:9] CthStartThread+0x59  [0x482fc7]
>>>>>  [0:10] /lib/libc.so.6 [0x7ffff5729370]
>>>>> CHARM++ FATAL ERROR:
>>>>>
>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>> 0x0000000000506f31 in CmiAbort (message=0x53d785 "") at machine.c:612
>>>>> 612         *(int *)NULL = 0; /*Write to null, causing bus error*/
>>>>> (gdb) bt
>>>>> #0  0x0000000000506f31 in CmiAbort (message=0x53d785 "") at
>>>>> machine.c:612
>>>>> #1  0x000000000051491c in __cmi_assert (expr=0x534e87 "n<len",
>>>>>    file=0x534e7d "cklists.h", line=221) at convcore.c:3399
>>>>> #2  0x000000000049c6b8 in CkVec<void*>::operator[]
>>>>> (this=0x7ffff0067aa0,
>>>>>    n=140737226103192) at cklists.h:221
>>>>> #3  0x0000000000496609 in _processForPlainChareMsg (ck=0x7ffff006c640,
>>>>>    env=0x7ffff007e3e8) at ck.C:844
>>>>> #4  0x000000000049703e in _processHandler (converseMsg=0x7ffff007e3e8,
>>>>>    ck=0x7ffff006c640) at ck.C:1117
>>>>> #5  0x000000000050f501 in CmiHandleMessage (msg=0x7ffff007e3e8)
>>>>>    at convcore.c:1393
>>>>> #6  0x000000000050fa6f in CsdSchedulePoll () at convcore.c:1610
>>>>> #7  0x000000000050f7cf in CsdScheduler (maxmsgs=0) at convcore.c:1491
>>>>> #8  0x000000000050fb6a in CthStandinCode () at convcore.c:1674
>>>>> #9  0x0000000000482fc7 in CthStartThread (fn1=0, fn2=5307228, arg1=0,
>>>>> arg2=0)
>>>>>    at threads.c:1579
>>>>> #10 0x00007ffff5729370 in ?? () from /lib/libc.so.6
>>>>> #11 0x0000000000000000 in ?? ()
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> addr2line -e ./your_binary   0x4966f9
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Can you send me the output of that for lines:
>>>>>>
>>>>>>  [0:2] _ZN5CkVecIPvEixEm+0x36  [0x49c7a8]
>>>>>>  [0:3] /opt/usr/stow/ULCMi/libexec/Charm-launcher [0x4966f9]
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> addr2line -e /opt/usr/stow/ULCMi/libexec/Charm-launcher 0x49c6b8
>>>>>
>>>>> /home/cperez/Research/charm-6.2.0/net-linux-x86_64-smp/tmp/cklists.h:222
>>>>>
>>>>> addr2line -e /opt/usr/stow/ULCMi/libexec/Charm-launcher  0x496609
>>>>> /home/cperez/Research/charm-6.2.0/net-linux-x86_64-smp/tmp/ck.C:844
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> btw, did you ever run megatest (in charm/tests/charm++/megatest)
>>>>>> successfully? Just wanted to make sure charm threaded entry method
>>>>>> indeed works on your system.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> I ran it up to "++p 16" and it works fine.
>>>>> My system is a linux box (Intel(R) Core(TM)2 Duo CPU) running
>>>>> debian/unstable (gcc (Debian 4.4.4-2) 4.4.4).
>>>>>
>>>>> Thank you
>>>>>
>>>>> Christian
>>>>>
>>>>>
>>>>>
>>>
>>>
>
>




