charm - Re: [charm] [ppl] load balancer question (freeze/crash)

  • From: "Kale, Laxmikant V" <kale AT illinois.edu>
  • To: Evghenii Gaburov <e-gaburov AT northwestern.edu>, "Jetley, Pritish" <pjetley2 AT illinois.edu>
  • Cc: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] [ppl] load balancer question (freeze/crash)
  • Date: Tue, 4 Oct 2011 13:08:36 +0000
  • Accept-language: en-US
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Try RotateLB, which migrates every object (without consideration of load).
This may help shake out bugs in the migration code.
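For reference, a balancer is normally selected at run time with the +balancer flag, provided its module was linked in. A minimal sketch (the binary name, object file, and processor count are placeholders, not taken from the thread):

```shell
# Link the application with the balancer modules it may need
charmc -language charm++ -o app app.o -module RotateLB -module GreedyLB

# Select RotateLB at startup; +LBDebug 1 prints balancer activity
./charmrun +p32 ./app +balancer RotateLB +LBDebug 1
```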

Sanjay

--
Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu
Professor, Computer Science
kale AT illinois.edu
201 N. Goodwin Avenue Ph: (217) 244-0094
Urbana, IL 61801-2302 FAX: (217) 265-6582

On 10/4/11 7:34 AM, "Evghenii Gaburov"
<e-gaburov AT northwestern.edu>
wrote:

>Hi All,
>
>I have been using the load-balancing functionality in Charm++ for a
>few days, and while I have found workarounds for several issues that I
>described in previous posts, one remains that I cannot solve or work
>around: recurrent, non-reproducible freezes, and sometimes even
>crashes, with LB.
>
>These freezes occur anywhere from a few minutes up to 3-4 hours after
>the beginning of the run. Further investigation showed that
>ResumeFromSync(), while called, never returns via contribute, and so
>the program fails to resume. The behaviour is recurrent but
>non-reproducible, and it has affected every run I have tried so far.
>
>I also tried MetisLB; the program then crashed instead of freezing,
>with a traceback into the Charm++ libraries. See the output and my
>simple LB code below.
>
>Any help to sort out this problem will be highly appreciated!
>
>Thanks,
> Evghenii
>
>
>[threaded] void Main::startSimulation()
>{
> const double t0 = CkWallTimer();
> CkPrintf(" *** System::loadBalancer() call \n");
> systemProxy.loadBalancer(CkCallbackResumeThread());
> CkPrintf(" *** System::loadBalancer() done in %g sec \n", CkWallTimer() - t0);
>}
>
>
>void System::loadBalancer(CkCallback &cb)
>
>{
>
> loadBalancer_completeCb = cb;
>
> AtSync();
>}
>
>void System::ResumeFromSync()
>
>{
> contribute(loadBalancer_completeCb);
>
>}
>
>
>void System::pup(PUP::er &p)
>{
> // ... PUP member data here ...
>}
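One thing worth checking with the AtSync pattern above: the array element must be declared AtSync-aware, and the saved callback must itself be PUP'ed, since a callback that is not PUP'ed is lost on migration and a later contribute() on it can abort with exactly a "No reduction client!" message. A minimal sketch of the usual pattern, reusing the member names from the code above (an illustration under those assumptions, not the poster's actual fix):

```cpp
// In the chare-array element's constructor: opt in to AtSync balancing.
System::System()
{
  usesAtSync = true;             // must be set before AtSync() is called
}

void System::loadBalancer(CkCallback &cb)
{
  loadBalancer_completeCb = cb;  // state that must survive migration
  AtSync();                      // suspend until the balancer finishes
}

void System::ResumeFromSync()
{
  contribute(loadBalancer_completeCb);
}

void System::pup(PUP::er &p)
{
  // The stored callback must migrate with the element; an element that
  // migrates without it resumes with an empty callback, and a later
  // contribute() can then abort with "No reduction client!".
  p | loadBalancer_completeCb;
  // ... PUP the rest of the member data ...
}
```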
>
>-------- output from LB
>
>The first and last lines are CkPrintf output from my program, printed
>just before and just after my LB routine:
> systemProxy.loadBalancer(CkCallbackResumeThread());
>and in between is the LB output produced by the +LBDebug 1 flag on the
>execution line.
>
>8192 chares on 32 procs, no freeze here
>>>
> *** System::loadBalancer() call
>[GreedyLB] Load balancing step 17 starting at 1449.034961 in PE0
>Memory:2108.660156MB
>[0] 7953 objects migrating.
>GreedyLB: 7953 objects migrating.
>Strategy took 0.010584 seconds.
>[GreedyLB] memUsage: LBManager:953KB CentralLB:6655KB
>[GreedyLB] Load balancing step 17 finished at 1456.170554 duration
>7.135593
> *** System::loadBalancer() done in 8.26459 sec
>
>8192 chares on 32 procs, freezes here; note the missing "***
>System::loadBalancer() done in ... sec" line at the end
>>>
> *** System::loadBalancer() call
>[GreedyLB] Load balancing step 18 starting at 1522.402579 in PE0
>Memory:2106.847656MB
>[0] 7948 objects migrating.
>GreedyLB: 7948 objects migrating.
>Strategy took 0.012154 seconds.
>[GreedyLB] memUsage: LBManager:953KB CentralLB:4754KB
>
>8192 chares on 128 procs, freezes as well
>>>
> *** System::loadBalancer() call
>[GreedyLB] Load balancing step 178 starting at 11123.465707 in PE0
>Memory:1158.570312MB
>[0] 8129 objects migrating.
>GreedyLB: 8129 objects migrating.
>Strategy took 0.012672 seconds.
>[GreedyLB] memUsage: LBManager:928KB CentralLB:21402KB
>[GreedyLB] Load balancing step 178 finished at 11124.862594 duration
>1.396887
>
>
>8192 chares on 32 procs, crash
>>>
> *** System::loadBalancer() call
>[MetisLB] Load balancing step 3 starting at 411.261122 in PE0
>Memory:2088.796875MB
>[0] calling METIS_PartGraphRecursive.
>[0] after calling Metis functions.
>[0] MetisLB done!
>MetisLB: 8192 objects migrating.
>Strategy took 1.223386 seconds.
>[MetisLB] memUsage: LBManager:1882KB CentralLB:19217KB
>[MetisLB] Load balancing step 3 finished at 425.898042 duration 14.636920
>------------- Processor 0 Exiting: Called CmiAbort ------------
>Reason: No reduction client!
>You must register a client with either SetReductionClient or during
>contribute.
>
>[0] Stack Traceback:
> [0:0] CmiAbort+0x65 [0x6d7a19]
> [0:1] _ZN14CkReductionMgr17endArrayReductionEv+0x422 [0x671c04]
> [0:2] _ZN14CkReductionMgr21ArrayReductionHandlerEP14CkReductionMsg+0x64
> [0x671e40]
> [0:3]
>_ZN22CkIndex_CkReductionMgr42_call_ArrayReductionHandler_CkReductionMsgEPv
>P14CkReductionMgr+0x1d [0x671e65]
> [0:4] CkDeliverMessageFree+0x43 [0x637e0f]
> [0:5] ./fvmhd3d [0x637e9e]
> [0:6] ./fvmhd3d [0x637f4f]
> [0:7] ./fvmhd3d [0x638601]
> [0:8] ./fvmhd3d [0x638a51]
> [0:9] _Z15_processHandlerPvP11CkCoreState+0x113 [0x639986]
> [0:10] CmiHandleMessage+0x7c [0x6d90f0]
> [0:11] CsdScheduleForever+0x81 [0x6d938e]
> [0:12] CsdScheduler+0x16 [0x6d92e5]
> [0:13] ./fvmhd3d [0x6d7512]
> [0:14] ConverseInit+0x458 [0x6d79ad]
> [0:15] main+0x42 [0x640992]
> [0:16] __libc_start_main+0xf4 [0x2ac22a6db974]
> [0:17] __gxx_personality_v0+0x219 [0x4c7ec9]
>--------------------------------------------------------------------------
>MPI_ABORT was invoked on rank 23 in communicator MPI_COMM_WORLD
>with errorcode 1.
>
>NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>You may or may not see output from other processes, depending on
>exactly when Open MPI kills them.
>--------------------------------------------------------------------------
>--------------------------------------------------------------------------
>mpirun has exited due to process rank 0 with PID 30426 on
>node qnode0312 exiting without calling "finalize". This may
>have caused other processes in the application to be
>terminated by signals sent by mpirun (as reported here).
>--------------------------------------------------------------------------
>[qnode0312:30424] 31 more processes have sent help message
>help-mpi-api.txt / mpi-abort
>[qnode0312:30424] Set MCA parameter "orte_base_help_aggregate" to 0 to
>see all help / error messages
>
>
>
>On Oct 2, 2011, at 2:33 PM, Pritish Jetley wrote:
>
>> Hello Evghenii,
>>
>> CkCache is a software caching module for Charm++ applications. It is
>>used to improve remote data reuse when several objects on a processor
>>make requests to remote objects for the same data. In ChaNGa, it is used
>>to share tree nodes and particles requested by TreePieces on a
>>processor. Unfortunately, we do not have a manual entry for this yet,
>>but the source file itself is well-documented.
>>
>> Look in src/libs/ck-libs/cache to find the CkCache.ci and CkCache.h
>>files.
>>
>> Let me know if you need help with the code, or if you are trying to use
>>it in your own application.
>>
>> Pritish
>>
>> On Sun, Oct 2, 2011 at 11:54 AM, Evghenii Gaburov
>><e-gaburov AT northwestern.edu>
>> wrote:
>> Hi All,
>>
>> Looking at the ChaNGa source code, I came across something called
>>CkCache. However, I cannot find instructions on how to use it in the
>>Charm++ manual or in examples outside ChaNGa.
>>
>> Is there a short description available of how to use CkCache and
>>what purpose it serves?
>>
>> Thanks!
>>
>> Cheers,
>> Evghenii
>>
>> --
>> Evghenii Gaburov,
>> e-gaburov AT northwestern.edu
>>
>> _______________________________________________
>> charm mailing list
>> charm AT cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/charm
>> _______________________________________________
>> ppl mailing list
>> ppl AT cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/ppl
>>
>>
>>
>> --
>> Pritish Jetley
>> Doctoral Candidate, Computer Science
>> University of Illinois at Urbana-Champaign
>
>--
>Evghenii Gaburov,
>e-gaburov AT northwestern.edu
>

Archive powered by MHonArc 2.6.16.
