
RE: [charm] Questions about distributed load balancing tests


  • From: "Galvez Garcia, Juan Jose" <jjgalvez AT illinois.edu>
  • To: Vinicius Freitas <vinicius.mct.freitas AT gmail.com>
  • Cc: Laércio Lima Pilla <laercio.pilla AT ufsc.br>, "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: RE: [charm] Questions about distributed load balancing tests
  • Date: Tue, 29 Nov 2016 18:54:36 +0000
  • Accept-language: en-US


Also, I wanted to add that one possible cause of the variation is the OS moving processes between cores on each host. I suggest using the +pemap option if you aren't using it already, so that processes stay pinned to separate cores. You can do this by adding +pemap 0-7 to the command-line options, which tells Charm++ to set CPU affinity and use cores 0 to 7. There is more information in the Charm++ manual if you are interested (http://charm.cs.illinois.edu/manuals/html/charm++/C.html).
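
For example (just a sketch, reusing the lb_test invocation and nodefile from your earlier message, and assuming a plain charmrun launch rather than the make wrapper you used with perf), the run line with affinity set would look like:

    ./charmrun +p64 ./lb_test 10000 150 10 30 30 1000 mesh3d ++nodelist ~/charm/nodefile.dat +balancer DistributedLB +LBDebug 2 +pemap 0-7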

Hope this helps,

-Juan


From: vinimmbb AT gmail.com [vinimmbb AT gmail.com] on behalf of Vinicius Freitas [vinicius.mct.freitas AT gmail.com]
Sent: Tuesday, November 29, 2016 12:33 PM
To: Galvez Garcia, Juan Jose
Cc: Laércio Lima Pilla; charm AT cs.uiuc.edu
Subject: Re: [charm] Questions about distributed load balancing tests

Hey, Juan

I'll run those tests you mentioned as soon as possible. But yes, the problem was the same with NullLB: the benchmark time would vary apparently at random within that same interval, sometimes showing high variance as in the example I first presented. Sometimes NullLB would be the fastest, sometimes DistributedLB, sometimes one of my own implementations.
As for the imbalance, I've run several tests with lb_test configurations, and this one was supposed to present high imbalance. I'll run the tests you suggested just to make sure, and reply as soon as I have the results.

Thank you for the reply,
Vinicius
-- 
Vinicius Marino Calvo Torres de Freitas
Computer Science Undergraduate Student
Research Assistant at the Embedded Computing Laboratory at UFSC
UFSC - CTC - INE - ECL, Brazil
Email: vinicius.mctf AT grad.ufsc.br or vinicius.mct.freitas AT gmail.com 
Tel: +55 (48) 96163803

2016-11-28 17:59 GMT-02:00 Galvez Garcia, Juan Jose <jjgalvez AT illinois.edu>:
Hi Vinicius,

I'll look into lb_test to see if I can find an explanation for the variation in total execution time. Does the NullLB strategy also present variation in total execution time?

As for migrations, this might be related to some issues we have found in DistributedLB that we are currently looking at, or maybe the actual test case doesn't present much imbalance. It seems like you have approximately 156 chares per PE (10,000 elements across 64 PEs). With that many chares per PE, the load may already be balanced fairly evenly.

To verify this, I would suggest testing with a centralized load balancer, such as RefineLB, if you have not done so, together with +LBDebug 2 or +LBDebug 3. Centralized load balancers provide more output, such as the number of migrations performed in each LB step and the processor loads. RefineLB moves only a few objects (from overloaded to underloaded PEs), so looking at the number of migrations (and the processor loads with +LBDebug 3) can give you an idea of whether there is imbalance and how much load balancing can improve performance. You can also use GreedyLB, which will balance load better than RefineLB, but note that GreedyLB migrates most objects regardless of the actual level of imbalance. As a baseline for comparison with the centralized load balancers, DummyLB is probably best, because with DummyLB PE0 still receives the load balancing stats (same as any centralized LB) but does nothing else. An example run line is sketched below.
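
For example (again only a sketch, substituting the balancer in the same invocation; I have not run this on your machines):

    ./charmrun +p64 ./lb_test 10000 150 10 30 30 1000 mesh3d ++nodelist ~/charm/nodefile.dat +balancer RefineLB +LBDebug 3

Comparing the migrations and per-PE loads it reports against a DummyLB run (+balancer DummyLB) should show how much imbalance the benchmark really has.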

-Juan


From: Vinicius Freitas [vinicius.mct.freitas AT gmail.com]
Sent: Wednesday, November 23, 2016 1:47 PM
To: charm AT cs.uiuc.edu
Cc: Laércio Lima Pilla
Subject: [charm] Questions about distributed load balancing tests

Hello, Charm++ team,

I have been having issues with the execution of distributed load balancing solutions in Charm++. 
My two main questions are about the total execution time of the lb_test benchmark with these distributed strategies, which varies far too much across runs of the same benchmark with the same strategy; and about the migrations, which I tried to observe by printing from inside the load balancing strategy, but the execution never seems to reach that point.

This is my setup:

Computational Nodes: 8 Octa-core nodes

Benchmark: lb_test
            10,000 Elements
            150 iterations
            10 time/print
            30 load balancing interval
            30 ms min task time
            1,000 ms max task time
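
In case the mapping is not obvious, these parameters correspond, in order, to the arguments of the lb_test command shown further down (inferred from that command line, with mesh3d as the communication topology):

    ./lb_test <elements> <iterations> <time/print> <lb_interval> <min_task_ms> <max_task_ms> <topology>
    ./lb_test  10000      150          10           30            30            1000          mesh3d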

Charm++ 6.7 compiled for netlrts-linux-x86_64 with --with-production
The operating system is Debian Jessie, with GCC 4.9.4.


This is a sample of the times reported by the strategies I'm testing:

DistributedLB (Available w/ Charm++)>       Strategy took 0.002938s memory usage: 3.102875 MB.
DistributedLB (Available w/ Charm++)>       Strategy took 0.003182s memory usage: 3.103256 MB.
DistributedLB (Available w/ Charm++)>       Strategy took 0.003182s memory usage: 3.103592 MB.
DistributedLB (Available w/ Charm++)>       Strategy took 0.003357s memory usage: 3.106171 MB.

Sample Strategy 1>                          Strategy took 0.002934s memory usage: 3.101395 MB.
Sample Strategy 1>                          Strategy took 0.003095s memory usage: 3.105225 MB.
Sample Strategy 1>                          Strategy took 0.003319s memory usage: 3.105164 MB.
Sample Strategy 1>                          Strategy took 0.003221s memory usage: 3.107315 MB.

Sample Strategy 2>                          Strategy took 0.002996s memory usage: 3.102005 MB.
Sample Strategy 2>                          Strategy took 0.003107s memory usage: 3.103088 MB.
Sample Strategy 2>                          Strategy took 0.003184s memory usage: 3.105621 MB.
Sample Strategy 2>                          Strategy took 0.003255s memory usage: 3.107681 MB.

Sample Strategy 3>                          Strategy took 0.002904s memory usage: 3.104156 MB.
Sample Strategy 3>                          Strategy took 0.003186s memory usage: 3.104294 MB.
Sample Strategy 3>                          Strategy took 0.003310s memory usage: 3.107574 MB.
Sample Strategy 3>                          Strategy took 0.003397s memory usage: 3.108109 MB.

All strategies show very similar strategy execution times in each of their runs, but when we look at the total execution times (obtained with the Linux perf tool):

DistributedLB:      15.886613586 seconds time elapsed
Sample Strategy 1:  15.622918170 seconds time elapsed
Sample Strategy 2:  11.997714095 seconds time elapsed
Sample Strategy 3:  15.749101873 seconds time elapsed
NullLB:             15.317063856 seconds time elapsed

In other samples, strategy times were similar to these results, but the total execution times kept floating between 11 and 16 seconds, apparently exhibiting no pattern. Sometimes NullLB would be the only one at ~11 seconds, sometimes all of them would be around that time.

The execution line was:
    perf stat $(call run, +p64 ./lb_test 10000 150 10 30 30 1000 mesh3d ++nodelist ~/charm/nodefile.dat +balancer DistributedLB +LBDebug 2)    

I also inserted prints inside each of the strategies, at the location where a message is registered to be sent at the end of the "RecvAck" method in DistributedLB.C (around line 450):
        CkPrintf("[%d] Sending load to %d\n", CkMyPe(), item->to_pe);

This print never seems to execute: the message does not appear anywhere in the output. It was supposed to print whenever a task migrated, so its absence makes me believe that tasks aren't actually migrating.
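
To double-check this independently of the strategy code, I could also add a print to the array element's migration constructor, which (as far as I understand) runs on the destination PE whenever an element actually moves. A rough sketch of what I mean, with a hypothetical class name standing in for lb_test's own element class:

    class Elem : public CBase_Elem {        // hypothetical element class name
    public:
      Elem() { }
      Elem(CkMigrateMessage *m) : CBase_Elem(m) {
        // runs on the destination PE when an element arrives after migration
        CkPrintf("[%d] element %d migrated here\n", CkMyPe(), thisIndex);
      }
      void pup(PUP::er &p) {
        CBase_Elem::pup(p);
        // pack/unpack the element's own state here
      }
    };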

My nodefile:
    group main ++cpus=8

If you need any more information about the system or the execution report, just reply to this e-mail.

Thanks for your help,




