
charm - Re: [charm] [ppl] load balancer question (freeze/crash)

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Evghenii Gaburov <e-gaburov AT northwestern.edu>
  • To: "Kale, Laxmikant V" <kale AT illinois.edu>
  • Cc: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>, "Jetley, Pritish" <pjetley2 AT illinois.edu>
  • Subject: Re: [charm] [ppl] load balancer question (freeze/crash)
  • Date: Tue, 4 Oct 2011 13:34:27 +0000
  • Accept-language: en-US
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Hi All,

I was able to write a small program that reproduces this behaviour and
crashes with RotateLB. It is shown below, and the full source code can be
downloaded from:
http://darkstar.astro.northwestern.edu/charm/lbtest-crash.tar.gz

Just untar, make, and run with RotateLB. It crashes with charm-6.2
(mpi-linux-x86_64).

$ charmrun +p4 ./lbtest +balancer RotateLB +LBDebug 1

It should crash. (I tried 32, 256, 512, and 2048 chares; all of them crash.)

With GreedyLB it works, but I wouldn't trust it since in my production code
it sometimes freezes.

Also, if I enable the following block (by changing #if 0 to #if 1)

#if 1
void ckAboutToMigrate() {}
void ckJustMigrated() {}
#endif

the program freezes.


I hope you can help me fix this bug or find a way around it, because I now
rely on this functionality.

Thanks!

Cheers,
Evghenii

#include "lbtest.decl.h"

namespace fv
{
  /*readonly*/ CProxy_Main mainProxy;

  class Main : public CBase_Main
  {
    public:
      CProxy_LB_Test arrayProxy;

      Main(CkArgMsg* m) {
        CkAssert(CkMyPe() == 0);
        mainProxy = thisProxy;
        arrayProxy = CProxy_LB_Test::ckNew(2048);
        mainProxy.doSimulation();
      }

      // Blocks on CkCallbackResumeThread, so this has to run as a
      // threaded entry method.
      void doSimulation()
      {
        {
          const double t0 = CkWallTimer();
          CkPrintf(" starting LB \n");
          arrayProxy.lb(CkCallbackResumeThread());
          CkPrintf("LB done in %g sec \n", CkWallTimer() - t0);
        }
        CkExit();
      }
  };

  class LB_Test : public CBase_LB_Test
  {
    public:
      CkCallback MainCB;

      LB_Test()
      {
        usesAtSync = CmiTrue;
      }

      LB_Test(CkMigrateMessage *m)
      {
      }

      // Store the completion callback, then enter the load balancing step.
      void lb(CkCallback &cb)
      {
        MainCB = cb;
        AtSync();
      }

      // Only the base class state is packed here.
      void pup(PUP::er &p)
      {
        CBase_LB_Test::pup(p);
      }

      // Called on every element once load balancing has finished.
      void ResumeFromSync()
      {
        contribute(MainCB);
      }

#if 0 // if set #if 1 freezes
      void ckAboutToMigrate() {}
      void ckJustMigrated() {}
#endif
  };

}

#include "lbtest.def.h"
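
A possible explanation, which nothing in this thread confirms: LB_Test::pup()
packs only the base class, so MainCB would not travel with a migrating element,
and ResumeFromSync() would then call contribute() with a default-constructed
callback. That would be consistent with the "No reduction client!" abort quoted
further down. The standalone C++ sketch below imitates that mechanism with
hypothetical types (ToyCallback, ToyPuper, ToyElement, migrate); it does not use
Charm++ at all and only illustrates how a member left out of a pack/unpack
routine comes back in its default state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a callback handle (not CkCallback):
// target == 0 plays the role of a default-constructed, invalid callback.
struct ToyCallback {
    int target = 0;
    bool valid() const { return target != 0; }
};

// Hypothetical stand-in for a PUP::er: appends bytes when packing and
// reads them back in the same order when unpacking.
class ToyPuper {
public:
    ToyPuper(std::vector<char>& buf, bool unpacking)
        : buf_(buf), unpacking_(unpacking), pos_(0) {}

    void bytes(void* data, std::size_t n) {
        char* p = static_cast<char*>(data);
        if (unpacking_) {
            assert(pos_ + n <= buf_.size());
            std::memcpy(p, buf_.data() + pos_, n);
            pos_ += n;
        } else {
            buf_.insert(buf_.end(), p, p + n);
        }
    }

private:
    std::vector<char>& buf_;
    bool unpacking_;
    std::size_t pos_;
};

// Toy array element. pupCallback selects whether pup() includes the callback;
// in real code this is a source-level choice, not a runtime flag.
struct ToyElement {
    ToyCallback cb;
    bool pupCallback = false;

    void pup(ToyPuper& p) {
        if (pupCallback) p.bytes(&cb.target, sizeof(cb.target));
    }
};

// Imitates a migration: pack on the source, default-construct on the
// destination, then unpack into the fresh object.
ToyElement migrate(ToyElement src) {
    std::vector<char> wire;
    ToyPuper packer(wire, false);
    src.pup(packer);

    ToyElement dst;
    dst.pupCallback = src.pupCallback;
    ToyPuper unpacker(wire, true);
    dst.pup(unpacker);
    return dst;
}
```

Calling migrate() on an element whose callback target is set and whose
pupCallback is false returns an element with an invalid callback; with
pupCallback set to true the target survives. If this is indeed what happens in
the reproducer, the analogous change would be to add `p | MainCB;` after the
base-class call in LB_Test::pup(). That is an untested suggestion.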


On Oct 4, 2011, at 8:08 AM, Kale, Laxmikant V wrote:

> Try RotateLB, which migrates every object (without consideration of load).
> This may help shake out bugs in migration code.
>
> Sanjay
>
> --
> Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu
> <http://charm.cs.uiuc.edu/>
> Professor, Computer Science
> kale AT illinois.edu
> 201 N. Goodwin Avenue Ph: (217) 244-0094
> Urbana, IL 61801-2302 FAX: (217) 265-6582
>
>
>
>
>
>
> On 10/4/11 7:34 AM, "Evghenii Gaburov"
> <e-gaburov AT northwestern.edu>
> wrote:
>
>> Hi All,
>>
>> I have been using the load balancer functionality in Charm++ for a few
>> days, and while I have been able to work around several issues that I
>> described in previous posts, one remains that I can neither solve nor
>> work around: recurrent, non-reproducible freezes, and sometimes even
>> crashes, with LB.
>>
>> These freezes occur anywhere from a few minutes to 3-4 hours after the
>> start of the run. Further investigation showed that ResumeFromSync(),
>> while called, never returns control via contribute, so the program fails
>> to resume. The behaviour is recurrent but non-reproducible, and it has
>> affected every run I have tried so far.
>>
>> I also tried MetisLB, and the program crashed instead of freezing, with
>> a traceback into the charm libraries. See the output and my simple LB
>> code below.
>>
>> Any help to sort out this problem will be highly appreciated!
>>
>> Thanks,
>> Evghenii
>>
>>
>> [threaded] void Main::startSimulation()
>> {
>>   const double t0 = CkWallTimer();
>>   CkPrintf(" *** System::loadBalancer() call \n");
>>   systemProxy.loadBalancer(CkCallbackResumeThread());
>>   CkPrintf(" *** System::loadBalancer() done in %g sec \n",
>>            CkWallTimer() - t0);
>> }
>>
>> void System::loadBalancer(CkCallback &cb)
>> {
>>   loadBalancer_completeCb = cb;
>>   AtSync();
>> }
>>
>> void System::ResumeFromSync()
>> {
>>   contribute(loadBalancer_completeCb);
>> }
>>
>> void System::pup(PUP::er &p)
>> {
>>   do PUPs;
>> }
>>
>> -------- output from LB
>>
>> The first and last lines are CkPrintf output from my program, printed
>> just before and just after my LB routine:
>> systemProxy.loadBalancer(CkCallbackResumeThread());
>> In between is the LB output produced by the +LBDebug 1 flag on the
>> command line.
>>
>> 8192 chares on 32 procs, no freeze here
>>>>
>> *** System::loadBalancer() call
>> [GreedyLB] Load balancing step 17 starting at 1449.034961 in PE0
>> Memory:2108.660156MB
>> [0] 7953 objects migrating.
>> GreedyLB: 7953 objects migrating.
>> Strategy took 0.010584 seconds.
>> [GreedyLB] memUsage: LBManager:953KB CentralLB:6655KB
>> [GreedyLB] Load balancing step 17 finished at 1456.170554 duration
>> 7.135593
>> *** System::loadBalancer() done in 8.26459 sec
>>
>> 8192 chares on 32 procs, freezes here, notice lack of ***
>> System::loadBalancer() done in ... sec line at the end
>>>>
>> *** System::loadBalancer() call
>> [GreedyLB] Load balancing step 18 starting at 1522.402579 in PE0
>> Memory:2106.847656MB
>> [0] 7948 objects migrating.
>> GreedyLB: 7948 objects migrating.
>> Strategy took 0.012154 seconds.
>> [GreedyLB] memUsage: LBManager:953KB CentralLB:4754KB
>>
>> 8192 chares on 128 procs, freezes as well
>>>>
>> *** System::loadBalancer() call
>> [GreedyLB] Load balancing step 178 starting at 11123.465707 in PE0
>> Memory:1158.570312MB
>> [0] 8129 objects migrating.
>> GreedyLB: 8129 objects migrating.
>> Strategy took 0.012672 seconds.
>> [GreedyLB] memUsage: LBManager:928KB CentralLB:21402KB
>> [GreedyLB] Load balancing step 178 finished at 11124.862594 duration
>> 1.396887
>>
>>
>> 8192 chares on 32 proc, crash
>>>>
>> *** System::loadBalancer() call
>> [MetisLB] Load balancing step 3 starting at 411.261122 in PE0
>> Memory:2088.796875MB
>> [0] calling METIS_PartGraphRecursive.
>> [0] after calling Metis functions.
>> [0] MetisLB done!
>> MetisLB: 8192 objects migrating.
>> Strategy took 1.223386 seconds.
>> [MetisLB] memUsage: LBManager:1882KB CentralLB:19217KB
>> [MetisLB] Load balancing step 3 finished at 425.898042 duration 14.636920
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: No reduction client!
>> You must register a client with either SetReductionClient or during
>> contribute.
>>
>> [0] Stack Traceback:
>> [0:0] CmiAbort+0x65 [0x6d7a19]
>> [0:1] _ZN14CkReductionMgr17endArrayReductionEv+0x422 [0x671c04]
>> [0:2] _ZN14CkReductionMgr21ArrayReductionHandlerEP14CkReductionMsg+0x64
>> [0x671e40]
>> [0:3]
>> _ZN22CkIndex_CkReductionMgr42_call_ArrayReductionHandler_CkReductionMsgEPv
>> P14CkReductionMgr+0x1d [0x671e65]
>> [0:4] CkDeliverMessageFree+0x43 [0x637e0f]
>> [0:5] ./fvmhd3d [0x637e9e]
>> [0:6] ./fvmhd3d [0x637f4f]
>> [0:7] ./fvmhd3d [0x638601]
>> [0:8] ./fvmhd3d [0x638a51]
>> [0:9] _Z15_processHandlerPvP11CkCoreState+0x113 [0x639986]
>> [0:10] CmiHandleMessage+0x7c [0x6d90f0]
>> [0:11] CsdScheduleForever+0x81 [0x6d938e]
>> [0:12] CsdScheduler+0x16 [0x6d92e5]
>> [0:13] ./fvmhd3d [0x6d7512]
>> [0:14] ConverseInit+0x458 [0x6d79ad]
>> [0:15] main+0x42 [0x640992]
>> [0:16] __libc_start_main+0xf4 [0x2ac22a6db974]
>> [0:17] __gxx_personality_v0+0x219 [0x4c7ec9]
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 23 in communicator MPI_COMM_WORLD
>> with errorcode 1.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 0 with PID 30426 on
>> node qnode0312 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>> [qnode0312:30424] 31 more processes have sent help message
>> help-mpi-api.txt / mpi-abort
>> [qnode0312:30424] Set MCA parameter "orte_base_help_aggregate" to 0 to
>> see all help / error messages
>>
>>
>>
>> On Oct 2, 2011, at 2:33 PM, Pritish Jetley wrote:
>>
>>> Hello Evghenii,
>>>
>>> CkCache is a software caching module for Charm++ applications. It is
>>> used to improve remote data reuse when several objects on a processor
>>> make requests to remote objects for the same data. In ChaNGa, it is used
>>> to share tree nodes and particles requested by TreePieces on a
>>> processor. Unfortunately, we do not have a manual entry for this yet,
>>> but the source file itself is well-documented.
>>>
>>> Look in src/libs/ck-libs/cache to find the CkCache.ci and CkCache.h
>>> files.
>>>
>>> Let me know if you need help with the code, or if you are trying to use
>>> it in your own application.
>>>
>>> Pritish
>>>
>>> On Sun, Oct 2, 2011 at 11:54 AM, Evghenii Gaburov
>>> <e-gaburov AT northwestern.edu>
>>> wrote:
>>> Hi All,
>>>
>>> Looking at the ChaNGa source code, I came across something called CkCache.
>>> However, I am unable to find instructions on how to use it in the
>>> Charm++ manual or examples (outside ChaNGa).
>>>
>>> Is there a short description available of how to use CkCache and what
>>> purpose it serves?
>>>
>>> Thanks!
>>>
>>> Cheers,
>>> Evghenii
>>>
>>> --
>>> Evghenii Gaburov,
>>> e-gaburov AT northwestern.edu
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> charm mailing list
>>> charm AT cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/charm
>>> _______________________________________________
>>> ppl mailing list
>>> ppl AT cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/ppl
>>>
>>>
>>>
>>> --
>>> Pritish Jetley
>>> Doctoral Candidate, Computer Science
>>> University of Illinois at Urbana-Champaign
>>
>> --
>> Evghenii Gaburov,
>> e-gaburov AT northwestern.edu
>

--
Evghenii Gaburov,
e-gaburov AT northwestern.edu










