[charm] Questions about distributed load balancing tests


  • From: Vinicius Freitas <vinicius.mct.freitas AT gmail.com>
  • To: charm AT cs.uiuc.edu
  • Cc: Laércio Lima Pilla <laercio.pilla AT ufsc.br>
  • Subject: [charm] Questions about distributed load balancing tests
  • Date: Wed, 23 Nov 2016 17:47:21 -0200

Hello, Charm++ team,

I have been having issues running distributed load balancing strategies in Charm++.
My two main questions are: first, the total execution time of the lb_test benchmark under these distributed strategies varies far too much between runs of the same benchmark with the same strategy; and second, I tried to expose the migrations from inside the load balancing strategy, but the point in the execution where they happen never seems to be reached.

This is my setup:

Compute nodes: 8 octa-core nodes

Benchmark: lb_test
            10,000 elements
            150 iterations
            print every 10 timesteps
            load balancing every 30 timesteps
            30 ms minimum task time
            1,000 ms maximum task time

Charm++ 6.7 compiled for netlrts-linux-x86_64 with --with-production
The operating system is Debian Jessie, with GCC 4.9.4.


This is a sample of the reported times that the strategies I'm testing took to execute:

DistributedLB (Available w/ Charm++)>       Strategy took 0.002938s memory usage: 3.102875 MB.
DistributedLB (Available w/ Charm++)>       Strategy took 0.003182s memory usage: 3.103256 MB.
DistributedLB (Available w/ Charm++)>       Strategy took 0.003182s memory usage: 3.103592 MB.
DistributedLB (Available w/ Charm++)>       Strategy took 0.003357s memory usage: 3.106171 MB.

Sample Strategy 1>                          Strategy took 0.002934s memory usage: 3.101395 MB.
Sample Strategy 1>                          Strategy took 0.003095s memory usage: 3.105225 MB.
Sample Strategy 1>                          Strategy took 0.003319s memory usage: 3.105164 MB.
Sample Strategy 1>                          Strategy took 0.003221s memory usage: 3.107315 MB.

Sample Strategy 2>                          Strategy took 0.002996s memory usage: 3.102005 MB.
Sample Strategy 2>                          Strategy took 0.003107s memory usage: 3.103088 MB.
Sample Strategy 2>                          Strategy took 0.003184s memory usage: 3.105621 MB.
Sample Strategy 2>                          Strategy took 0.003255s memory usage: 3.107681 MB.

Sample Strategy 3>                          Strategy took 0.002904s memory usage: 3.104156 MB.
Sample Strategy 3>                          Strategy took 0.003186s memory usage: 3.104294 MB.
Sample Strategy 3>                          Strategy took 0.003310s memory usage: 3.107574 MB.
Sample Strategy 3>                          Strategy took 0.003397s memory usage: 3.108109 MB.

All strategies show very similar strategy execution times across their runs, but the final results (total elapsed times, obtained with perf from the Linux tools) differ:

DistributedLB:      15.886613586 seconds time elapsed
Sample Strategy 1:  15.622918170 seconds time elapsed
Sample Strategy 2:  11.997714095 seconds time elapsed
Sample Strategy 3:  15.749101873 seconds time elapsed
NullLB:             15.317063856 seconds time elapsed

In other samples, the strategy times were similar to these results, but the total execution times kept floating between 11 and 16 seconds, apparently exhibiting no pattern. Sometimes NullLB would be the only one at ~11 seconds; sometimes all of them would be around that time.

The execution line was:
    perf stat $(call run, +p64 ./lb_test 10000 150 10 30 30 1000 mesh3d ++nodelist ~/charm/nodefile.dat +balancer DistributedLB +LBDebug 2)    

I also inserted prints inside each of the strategies at the point where a migration message is registered to be sent, at the end of the "RecvAck" method in DistributedLB.C, around line 450:
        CkPrintf("[%d] Sending load to %d", CkMyPe(), item->to_pe);

This print never seems to execute, as the message was not found in the output. It was supposed to print whenever a task migrates, but unfortunately it didn't, which leads me to believe that tasks are not actually migrating.
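For context, here is a minimal sketch of the instrumentation I have in mind (not the verbatim DistributedLB code: the item/to_pe names follow my reading of the source, and the migrations_sent counter is only added here for illustration; CkPrintf() and CkMyPe() are the only actual Charm++ calls). I also append a trailing '\n' in the sketch, since CkPrintf output without one may be buffered and show up late:

    // Hypothetical instrumentation inside the strategy, on each PE:
    int migrations_sent = 0;                   // local counter, added for illustration

    // ... at the point where an outgoing transfer is registered (end of RecvAck):
    CkPrintf("[%d] Sending load to %d\n",      // trailing '\n' so the line is flushed promptly
             CkMyPe(), item->to_pe);
    ++migrations_sent;

    // ... once the strategy has finished on this PE:
    CkPrintf("[%d] %d migrations scheduled\n", CkMyPe(), migrations_sent);

If even the final counter line printed zero on every PE, that would confirm that no migrations are being scheduled at all, rather than the original print simply being lost in the output.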

My nodefile:
    group main ++cpus=8
        host edel-14.grenoble.grid5000.fr
        host edel-15.grenoble.grid5000.fr
        host edel-16.grenoble.grid5000.fr
        host edel-19.grenoble.grid5000.fr
        host edel-2.grenoble.grid5000.fr
        host genepi-18.grenoble.grid5000.fr
        host genepi-20.grenoble.grid5000.fr
        host genepi-23.grenoble.grid5000.fr

If you need any more information about the system or the execution report, just reply to this e-mail.

Thanks for your help,



