[charm] load balancer question (freeze/crash)


  • From: Evghenii Gaburov <e-gaburov AT northwestern.edu>
  • To: Pritish Jetley <pjetley2 AT illinois.edu>
  • Cc: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: [charm] load balancer question (freeze/crash)
  • Date: Tue, 4 Oct 2011 12:34:19 +0000

Hi All,

I have been using the load-balancer functionality in Charm++ for a few days,
and while I have found workarounds for several issues described in my
previous posts, one remains that I cannot solve or work around: recurrent,
non-reproducible freezes, and sometimes even crashes, with the LB.

These freezes occur anywhere from a few minutes to 3-4 hours into the run.
Further investigation showed that ResumeFromSync() is called, but its
contribute() never fires the callback, so the program never resumes. The
behaviour is recurrent but non-reproducible, and it has affected every run I
have tried so far.

I also tried MetisLB; instead of freezing, the program crashed with a
traceback into the Charm++ libraries. See the output and my simple LB code
below.

Any help sorting out this problem would be highly appreciated!

Thanks,
Evghenii


// Declared [threaded] in the .ci file, so CkCallbackResumeThread() may block.
void Main::startSimulation()
{
  const double t0 = CkWallTimer();
  CkPrintf(" *** System::loadBalancer() call \n");
  // Broadcast to the System array and suspend this thread until every
  // element has contributed to the matching reduction.
  systemProxy.loadBalancer(CkCallbackResumeThread());
  CkPrintf(" *** System::loadBalancer() done in %g sec \n", CkWallTimer() - t0);
}

// Each array element stores the completion callback and enters the
// load balancer.
void System::loadBalancer(CkCallback &cb)
{
  loadBalancer_completeCb = cb;
  AtSync();
}

// Called by the runtime on every element once migration is complete.
void System::ResumeFromSync()
{
  contribute(loadBalancer_completeCb);
}

void System::pup(PUP::er &p)
{
  // ... PUP all member data here ...
}
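
For completeness, the parts not shown above are the interface declarations
and the constructor that opts into AtSync-based balancing. This is a
reconstruction, so the exact declarations in my code may differ:

// system.ci (interface file), assumed declarations:
//   array [1D] System {
//     entry System();
//     entry void loadBalancer(const CkCallback &cb);
//   };

// The element constructor must opt in, or AtSync()/ResumeFromSync()
// will not be driven by the runtime:
System::System()
{
  usesAtSync = true;   // "usesAtSync = CmiTrue;" on older Charm++ versions
}

// Migration constructor, required for migratable array elements.
System::System(CkMigrateMessage *m) {}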

-------- output from LB

The first and last lines below are CkPrintf output from my program, printed
just before and just after the LB routine:
systemProxy.loadBalancer(CkCallbackResumeThread());
In between is the load balancer's output, enabled by the +LBDebug 1 flag on
the execution line.
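
The execution line for these runs is along these lines (reconstructed here;
the binary name is taken from the traceback below and the exact arguments
may differ):

mpirun -np 32 ./fvmhd3d +balancer GreedyLB +LBDebug 1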

8192 chares on 32 procs, no freeze here
>>
*** System::loadBalancer() call
[GreedyLB] Load balancing step 17 starting at 1449.034961 in PE0
Memory:2108.660156MB
[0] 7953 objects migrating.
GreedyLB: 7953 objects migrating.
Strategy took 0.010584 seconds.
[GreedyLB] memUsage: LBManager:953KB CentralLB:6655KB
[GreedyLB] Load balancing step 17 finished at 1456.170554 duration 7.135593
*** System::loadBalancer() done in 8.26459 sec

8192 chares on 32 procs, freezes here; note the missing
"*** System::loadBalancer() done in ... sec" line at the end
>>
*** System::loadBalancer() call
[GreedyLB] Load balancing step 18 starting at 1522.402579 in PE0
Memory:2106.847656MB
[0] 7948 objects migrating.
GreedyLB: 7948 objects migrating.
Strategy took 0.012154 seconds.
[GreedyLB] memUsage: LBManager:953KB CentralLB:4754KB

8192 chares on 128 procs, freezes as well
>>
*** System::loadBalancer() call
[GreedyLB] Load balancing step 178 starting at 11123.465707 in PE0
Memory:1158.570312MB
[0] 8129 objects migrating.
GreedyLB: 8129 objects migrating.
Strategy took 0.012672 seconds.
[GreedyLB] memUsage: LBManager:928KB CentralLB:21402KB
[GreedyLB] Load balancing step 178 finished at 11124.862594 duration 1.396887


8192 chares on 32 procs, crash
>>
*** System::loadBalancer() call
[MetisLB] Load balancing step 3 starting at 411.261122 in PE0
Memory:2088.796875MB
[0] calling METIS_PartGraphRecursive.
[0] after calling Metis functions.
[0] MetisLB done!
MetisLB: 8192 objects migrating.
Strategy took 1.223386 seconds.
[MetisLB] memUsage: LBManager:1882KB CentralLB:19217KB
[MetisLB] Load balancing step 3 finished at 425.898042 duration 14.636920
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: No reduction client!
You must register a client with either SetReductionClient or during
contribute.

[0] Stack Traceback:
[0:0] CmiAbort+0x65 [0x6d7a19]
[0:1] _ZN14CkReductionMgr17endArrayReductionEv+0x422 [0x671c04]
[0:2] _ZN14CkReductionMgr21ArrayReductionHandlerEP14CkReductionMsg+0x64 [0x671e40]
[0:3] _ZN22CkIndex_CkReductionMgr42_call_ArrayReductionHandler_CkReductionMsgEPvP14CkReductionMgr+0x1d [0x671e65]
[0:4] CkDeliverMessageFree+0x43 [0x637e0f]
[0:5] ./fvmhd3d [0x637e9e]
[0:6] ./fvmhd3d [0x637f4f]
[0:7] ./fvmhd3d [0x638601]
[0:8] ./fvmhd3d [0x638a51]
[0:9] _Z15_processHandlerPvP11CkCoreState+0x113 [0x639986]
[0:10] CmiHandleMessage+0x7c [0x6d90f0]
[0:11] CsdScheduleForever+0x81 [0x6d938e]
[0:12] CsdScheduler+0x16 [0x6d92e5]
[0:13] ./fvmhd3d [0x6d7512]
[0:14] ConverseInit+0x458 [0x6d79ad]
[0:15] main+0x42 [0x640992]
[0:16] __libc_start_main+0xf4 [0x2ac22a6db974]
[0:17] __gxx_personality_v0+0x219 [0x4c7ec9]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 23 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 30426 on
node qnode0312 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[qnode0312:30424] 31 more processes have sent help message help-mpi-api.txt / mpi-abort
[qnode0312:30424] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
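
For what it's worth, my reading of the abort above is that a reduction
completed without a client callback attached. In Charm++ a client is
attached either per contribution or once on the array proxy; a sketch of
both follows, where Main::balanced() and mainProxy are hypothetical:

// Option 1 (what my code does): hand the callback to each contribute();
// every element must supply a matching callback in the same step.
contribute(loadBalancer_completeCb);

// Option 2: register a persistent client once on the array proxy, after
// which elements may call contribute() with no arguments.
// Main::balanced() is a hypothetical entry method; mainProxy a readonly
// proxy to the main chare.
systemProxy.ckSetReductionClient(
    new CkCallback(CkIndex_Main::balanced(), mainProxy));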



On Oct 2, 2011, at 2:33 PM, Pritish Jetley wrote:

> Hello Evghenii,
>
> CkCache is a software caching module for Charm++ applications. It is used
> to improve remote data reuse when several objects on a processor make
> requests to remote objects for the same data. In ChaNGa, it is used to
> share tree nodes and particles requested by TreePieces on a processor.
> Unfortunately, we do not have a manual entry for this yet, but the source
> file itself is well-documented.
>
> Look in src/libs/ck-libs/cache to find the CkCache.ci and CkCache.h files.
>
> Let me know if you need help with the code, or if you are trying to use it
> in your own application.
>
> Pritish
>
> On Sun, Oct 2, 2011 at 11:54 AM, Evghenii Gaburov
> <e-gaburov AT northwestern.edu>
> wrote:
> Hi All,
>
> Looking at the ChaNGa source code, I came across something called CkCache.
> However, I am unable to find instructions on how to use it in the Charm++
> manual or in examples outside of ChaNGa.
>
> Is there a short description available of how to use CkCache and what
> purpose it serves?
>
> Thanks!
>
> Cheers,
> Evghenii
>
> --
> Evghenii Gaburov,
> e-gaburov AT northwestern.edu
>
>
> --
> Pritish Jetley
> Doctoral Candidate, Computer Science
> University of Illinois at Urbana-Champaign

--
Evghenii Gaburov,
e-gaburov AT northwestern.edu