charm - Re: [charm] [ppl] load balancer question (freeze/crash)

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Evghenii Gaburov <e-gaburov AT northwestern.edu>
  • To: Gengbin Zheng <zhenggb AT gmail.com>, Pritish Jetley <pjetley2 AT illinois.edu>
  • Cc: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>, "Kale, Laxmikant V" <kale AT illinois.edu>
  • Subject: Re: [charm] [ppl] load balancer question (freeze/crash)
  • Date: Wed, 5 Oct 2011 15:10:45 +0000
  • Accept-language: en-US
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Hi,

> It looks like your program hangs during object migration. What is the
> relevant code in ResumeFromSync(), and what does your pup() function
> look like?
Attached is loadBalancer.cpp, which does the load balancing.
void System::loadBalancer(CkCallback&) is called from the Main chare with
CkCallbackResumeThread() as its argument.
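
For readers following the thread, the control flow described here can be modelled outside Charm++. What follows is a minimal sketch in plain C++, not Charm++ code: `ReductionModel`, `contributeFrom`, `missing`, and `runStep` are hypothetical names introduced only for illustration. It assumes, as in the snippet quoted further down in this message, that each array element contributes exactly once from ResumeFromSync() and that the Main chare's thread resumes only after every element has contributed. Under that assumption a single missing contribution is enough to leave the caller waiting forever, which would look like the freeze reported here. It does not identify why a contribution would go missing.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Plain C++ model (NOT the Charm++ API): a reduction over `n` elements whose
// completion callback fires only after every element has contributed once.
class ReductionModel {
public:
    ReductionModel(std::size_t n, std::function<void()> onComplete)
        : contributed_(n, false), pending_(n), onComplete_(std::move(onComplete)) {}

    // Stands in for contribute() called from an element's ResumeFromSync().
    // Duplicate or out-of-range contributions are ignored in this model.
    void contributeFrom(std::size_t element) {
        if (element >= contributed_.size() || contributed_[element]) return;
        contributed_[element] = true;
        if (--pending_ == 0 && onComplete_) onComplete_();
    }

    // Indices of the elements that have not contributed yet.
    std::vector<std::size_t> missing() const {
        std::vector<std::size_t> out;
        for (std::size_t i = 0; i < contributed_.size(); ++i)
            if (!contributed_[i]) out.push_back(i);
        return out;
    }

private:
    std::vector<bool> contributed_;
    std::size_t pending_;
    std::function<void()> onComplete_;
};

// Runs one modelled load-balancing step over `n` elements. Element `skip`
// never contributes (pass skip >= n to let every element contribute).
// Returns true if the completion callback fired, i.e. the caller resumed.
bool runStep(std::size_t n, std::size_t skip) {
    bool resumed = false;
    ReductionModel red(n, [&resumed] { resumed = true; });
    for (std::size_t i = 0; i < n; ++i)
        if (i != skip) red.contributeFrom(i);
    return resumed;
}
```

In this model `runStep(8, 8)` resumes the caller, while `runStep(8, 3)` does not, because element 3 never contributes. In a real run, a per-element log line inside ResumeFromSync() would play the role of `missing()` in narrowing down which elements failed to report.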


> You may have to send us your code for us to take a look.
I would like to give you my code and help you run it, because I really
want to solve this problem, which may well be caused by my lack of
understanding of some charm feature.
I will also provide instructions on how to compile it.

Please let me know the most convenient way for me to send the code
(e.g. a tarball as an attachment; it is about 10k lines in total).

Cheers,
Evghenii


Attachment: loadBalancer.cpp
Description: loadBalancer.cpp




>
> Gengbin
>
> On Wed, Oct 5, 2011 at 7:44 AM, Evghenii Gaburov
> <e-gaburov AT northwestern.edu>
> wrote:
>>> Try rotatelb, which migrates every object (without consideration of load).
>>> This may help shake out bugs in migration code.
>> After running with RotateLB, I got a freeze, this time inside the load
>> balancer itself and after a much longer running time:
>>
>> Typical log output was:
>>
>> *** System::loadBalancer() call
>> [RotateLB] Load balancing step 180 starting at 26848.548410 in PE0
>> Memory:1505.542969MB
>> RotateLB: 1024 objects migrating.
>> Strategy took 0.000474 seconds.
>> [RotateLB] memUsage: LBManager:924KB CentralLB:989KB
>> [RotateLB] Load balancing step 180 finished at 26852.223625 duration
>> 3.675215
>> *** System::loadBalancer() done in 8.312 sec
>>
>> and the freeze happened at this point:
>>
>> *** System::loadBalancer() call
>> [RotateLB] Load balancing step 181 starting at 26998.477718 in PE0
>> Memory:1509.742188MB
>> RotateLB: 1024 objects migrating.
>> Strategy took 0.000470 seconds.
>> [RotateLB] memUsage: LBManager:924KB CentralLB:819KB
>>
>> so the line
>>
>> [RotateLB] Load balancing step 181 finished at xxxxx
>>
>> never showed up.
>>
>> I had similar freezes with GreedyCommLB as well, but with it the freeze
>> usually happened after
>> [GreedyCommLB] Load balancing step xxx finished at xxx, where I suspect
>> that either not all chares call ResumeFromSync(), or the contribute to
>> the Main chare within ResumeFromSync() somehow failed (even though I PUP
>> the callback as well). And it happens sporadically...
>>
>> This is not very comforting behaviour, and I would be happy if you could
>> help me fix it.
>> Thanks!
>>
>> Cheers,
>> Evghenii
>>
>>
>>
>>
>>
>>>
>>> Sanjay
>>>
>>> --
>>> Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu
>>> <http://charm.cs.uiuc.edu/>
>>> Professor, Computer Science
>>> kale AT illinois.edu
>>> 201 N. Goodwin Avenue Ph: (217) 244-0094
>>> Urbana, IL 61801-2302 FAX: (217) 265-6582
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 10/4/11 7:34 AM, "Evghenii Gaburov"
>>> <e-gaburov AT northwestern.edu>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have been using the load balancer functionality in charm for a few
>>>> days, and while I have been able to find workarounds for several issues
>>>> which I described in previous posts, I have one left that I cannot solve
>>>> or find a way around: recurrent non-reproducible freezes, and sometimes
>>>> even crashes with LB.
>>>>
>>>> These freezes occur anywhere from a few minutes up to 3-4h after the
>>>> beginning of the run. Some further research showed that ResumeFromSync(),
>>>> while called, never returns via contribute, and so the program fails to
>>>> resume. The behaviour is recurrent but non-reproducible, and it has
>>>> affected all the runs I have tried so far.
>>>>
>>>> I also tried MetisLB, and the program crashed instead of freezing, with
>>>> a traceback into the charm libraries. See the output and my simple LB
>>>> code below.
>>>>
>>>> Any help to sort out this problem will be highly appreciated!
>>>>
>>>> Thanks,
>>>> Evghenii
>>>>
>>>>
>>>> [threaded] void Main::startSimulation()
>>>> {
>>>>   const double t0 = CkWallTimer();
>>>>   CkPrintf(" *** System::loadBalancer() call \n");
>>>>   systemProxy.loadBalancer(CkCallbackResumeThread());
>>>>   CkPrintf(" *** System::loadBalancer() done in %g sec \n",
>>>>            CkWallTimer() - t0);
>>>> }
>>>>
>>>> void System::loadBalancer(CkCallback &cb)
>>>> {
>>>>   loadBalancer_completeCb = cb;
>>>>   AtSync();
>>>> }
>>>>
>>>> void System::ResumeFromSync()
>>>> {
>>>>   contribute(loadBalancer_completeCb);
>>>> }
>>>>
>>>> void System::pup(PUP::er &p)
>>>> {
>>>>   do PUPs;
>>>> }
>>>>
>>>> -------- output from LB
>>>>
>>>> The first and last lines are CkPrintf output from my program, printed
>>>> just before and just after my LB routine:
>>>> systemProxy.loadBalancer(CkCallbackResumeThread());
>>>> In between is the LB output produced by the +LBDebug 1 flag on the
>>>> command line.
>>>>
>>>> 8192 chares on 32 procs, no freeze here
>>>>>>
>>>> *** System::loadBalancer() call
>>>> [GreedyLB] Load balancing step 17 starting at 1449.034961 in PE0
>>>> Memory:2108.660156MB
>>>> [0] 7953 objects migrating.
>>>> GreedyLB: 7953 objects migrating.
>>>> Strategy took 0.010584 seconds.
>>>> [GreedyLB] memUsage: LBManager:953KB CentralLB:6655KB
>>>> [GreedyLB] Load balancing step 17 finished at 1456.170554 duration
>>>> 7.135593
>>>> *** System::loadBalancer() done in 8.26459 sec
>>>>
>>>> 8192 chares on 32 procs, freezes here, notice lack of ***
>>>> System::loadBalancer() done in ... sec line at the end
>>>>>>
>>>> *** System::loadBalancer() call
>>>> [GreedyLB] Load balancing step 18 starting at 1522.402579 in PE0
>>>> Memory:2106.847656MB
>>>> [0] 7948 objects migrating.
>>>> GreedyLB: 7948 objects migrating.
>>>> Strategy took 0.012154 seconds.
>>>> [GreedyLB] memUsage: LBManager:953KB CentralLB:4754KB
>>>>
>>>> 8192 chares on 128 procs, freezes as well
>>>>>>
>>>> *** System::loadBalancer() call
>>>> [GreedyLB] Load balancing step 178 starting at 11123.465707 in PE0
>>>> Memory:1158.570312MB
>>>> [0] 8129 objects migrating.
>>>> GreedyLB: 8129 objects migrating.
>>>> Strategy took 0.012672 seconds.
>>>> [GreedyLB] memUsage: LBManager:928KB CentralLB:21402KB
>>>> [GreedyLB] Load balancing step 178 finished at 11124.862594 duration
>>>> 1.396887
>>>>
>>>>
>>>> 8192 chares on 32 proc, crash
>>>>>>
>>>> *** System::loadBalancer() call
>>>> [MetisLB] Load balancing step 3 starting at 411.261122 in PE0
>>>> Memory:2088.796875MB
>>>> [0] calling METIS_PartGraphRecursive.
>>>> [0] after calling Metis functions.
>>>> [0] MetisLB done!
>>>> MetisLB: 8192 objects migrating.
>>>> Strategy took 1.223386 seconds.
>>>> [MetisLB] memUsage: LBManager:1882KB CentralLB:19217KB
>>>> [MetisLB] Load balancing step 3 finished at 425.898042 duration 14.636920
>>>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>>>> Reason: No reduction client!
>>>> You must register a client with either SetReductionClient or during
>>>> contribute.
>>>>
>>>> [0] Stack Traceback:
>>>> [0:0] CmiAbort+0x65 [0x6d7a19]
>>>> [0:1] _ZN14CkReductionMgr17endArrayReductionEv+0x422 [0x671c04]
>>>> [0:2] _ZN14CkReductionMgr21ArrayReductionHandlerEP14CkReductionMsg+0x64
>>>> [0x671e40]
>>>> [0:3]
>>>> _ZN22CkIndex_CkReductionMgr42_call_ArrayReductionHandler_CkReductionMsgEPv
>>>> P14CkReductionMgr+0x1d [0x671e65]
>>>> [0:4] CkDeliverMessageFree+0x43 [0x637e0f]
>>>> [0:5] ./fvmhd3d [0x637e9e]
>>>> [0:6] ./fvmhd3d [0x637f4f]
>>>> [0:7] ./fvmhd3d [0x638601]
>>>> [0:8] ./fvmhd3d [0x638a51]
>>>> [0:9] _Z15_processHandlerPvP11CkCoreState+0x113 [0x639986]
>>>> [0:10] CmiHandleMessage+0x7c [0x6d90f0]
>>>> [0:11] CsdScheduleForever+0x81 [0x6d938e]
>>>> [0:12] CsdScheduler+0x16 [0x6d92e5]
>>>> [0:13] ./fvmhd3d [0x6d7512]
>>>> [0:14] ConverseInit+0x458 [0x6d79ad]
>>>> [0:15] main+0x42 [0x640992]
>>>> [0:16] __libc_start_main+0xf4 [0x2ac22a6db974]
>>>> [0:17] __gxx_personality_v0+0x219 [0x4c7ec9]
>>>> --------------------------------------------------------------------------
>>>> MPI_ABORT was invoked on rank 23 in communicator MPI_COMM_WORLD
>>>> with errorcode 1.
>>>>
>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>> You may or may not see output from other processes, depending on
>>>> exactly when Open MPI kills them.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpirun has exited due to process rank 0 with PID 30426 on
>>>> node qnode0312 exiting without calling "finalize". This may
>>>> have caused other processes in the application to be
>>>> terminated by signals sent by mpirun (as reported here).
>>>> --------------------------------------------------------------------------
>>>> [qnode0312:30424] 31 more processes have sent help message
>>>> help-mpi-api.txt / mpi-abort
>>>> [qnode0312:30424] Set MCA parameter "orte_base_help_aggregate" to 0 to
>>>> see all help / error messages
>>>>
>>>>
>>>>
>>>> On Oct 2, 2011, at 2:33 PM, Pritish Jetley wrote:
>>>>
>>>>> Hello Evghenii,
>>>>>
>>>>> CkCache is a software caching module for Charm++ applications. It is
>>>>> used to improve remote data reuse when several objects on a processor
>>>>> make requests to remote objects for the same data. In ChaNGa, it is used
>>>>> to share tree nodes and particles requested by TreePieces on a
>>>>> processor. Unfortunately, we do not have a manual entry for this yet,
>>>>> but the source file itself is well-documented.
>>>>>
>>>>> Look in src/libs/ck-libs/cache to find the CkCache.ci and CkCache.h
>>>>> files.
>>>>>
>>>>> Let me know if you need help with the code, or if you are trying to use
>>>>> it in your own application.
>>>>>
>>>>> Pritish
>>>>>
>>>>> On Sun, Oct 2, 2011 at 11:54 AM, Evghenii Gaburov
>>>>> <e-gaburov AT northwestern.edu>
>>>>> wrote:
>>>>> Hi All,
>>>>>
>>>>> Looking at the ChaNGa source code, I came across something called
>>>>> CkCache. However, I am unable to find instructions on how to use it in
>>>>> the charm++ manual or examples (outside ChaNGa).
>>>>>
>>>>> Is there a short description available of how to use CkCache and
>>>>> what purpose it serves?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Cheers,
>>>>> Evghenii
>>>>>
>>>>> --
>>>>> Evghenii Gaburov,
>>>>> e-gaburov AT northwestern.edu
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> charm mailing list
>>>>> charm AT cs.uiuc.edu
>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/charm
>>>>> _______________________________________________
>>>>> ppl mailing list
>>>>> ppl AT cs.uiuc.edu
>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/ppl
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Pritish Jetley
>>>>> Doctoral Candidate, Computer Science
>>>>> University of Illinois at Urbana-Champaign
>>>>
>>>> --
>>>> Evghenii Gaburov,
>>>> e-gaburov AT northwestern.edu
>>>
>>
>> --
>> Evghenii Gaburov,
>> e-gaburov AT northwestern.edu
>>

--
Evghenii Gaburov,
e-gaburov AT northwestern.edu
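
The +LBDebug output quoted in this thread follows a regular pattern: each step prints a "Load balancing step N starting at ..." line and, when it completes, a matching "Load balancing step N finished at ..." line. In a long run, a small scanner can flag steps that started but never finished. The following is a hedged sketch in plain C++; `unfinishedSteps` is a hypothetical helper name, and the parsing assumes only the two line shapes visible in the logs above, not any documented Charm++ log format.

```cpp
#include <cstddef>
#include <cstdio>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Scans +LBDebug-style output for lines shaped like
//   [<LB>] Load balancing step <N> starting at ...
//   [<LB>] Load balancing step <N> finished at ...
// and returns, in ascending order, the step numbers that started but
// have no matching "finished" line.
std::vector<int> unfinishedSteps(const std::string& log) {
    static const std::string marker = "Load balancing step ";
    std::set<int> open;
    std::istringstream in(log);
    std::string line;
    while (std::getline(in, line)) {
        // Searching for the marker anywhere in the line tolerates prefixes
        // such as "[RotateLB] " or mail-quote characters like ">> ".
        const std::size_t pos = line.find(marker);
        if (pos == std::string::npos) continue;
        int step = 0;
        char word[16] = {0};
        const char* rest = line.c_str() + pos + marker.size();
        if (std::sscanf(rest, "%d %15s", &step, word) != 2) continue;
        const std::string state(word);
        if (state == "starting") {
            open.insert(step);
        } else if (state == "finished") {
            open.erase(step);
        }
    }
    return std::vector<int>(open.begin(), open.end());
}
```

Applied to the RotateLB excerpt above, this would report step 181. It only catches freezes inside the load-balancing step itself: in the GreedyLB run on 128 procs the "finished" line for step 178 was printed and the program still did not resume, so that case would need a check on the application's own "done" message instead.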








