Re: [charm] Questions about distributed load balancing tests


  • From: Vinicius Freitas <vinicius.mct.freitas AT gmail.com>
  • To: "Galvez Garcia, Juan Jose" <jjgalvez AT illinois.edu>
  • Cc: Laércio Lima Pilla <laercio.pilla AT ufsc.br>, "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] Questions about distributed load balancing tests
  • Date: Wed, 7 Dec 2016 15:35:42 -0200

Hi Juan and Bilge,

Sorry for taking so long; I've been busy with my coursework, but now my semester is over and I was able to run the last tests you suggested. I executed 10 iterations of my samples with "constant" task time (setting min and max to the same value), and the difference still appears. I changed the parameters a bit, but the behavior seems to be the same.

I believe that load balancing is happening, since the NullLB results don't seem to vary as much as the LB executions, but I still can't find my prints. Take a look at these numbers:

NullLB:
      16.341916542 seconds time elapsed
      16.187558437 seconds time elapsed
      19.087430376 seconds time elapsed
      19.001815082 seconds time elapsed
      11.103533348 seconds time elapsed
      11.075133062 seconds time elapsed
      16.227937035 seconds time elapsed
      11.156110680 seconds time elapsed
      11.139616997 seconds time elapsed
      16.131115276 seconds time elapsed
      11.117537504 seconds time elapsed

DistributedLB:
      17.107595710 seconds time elapsed
      16.329215829 seconds time elapsed
      28.898516650 seconds time elapsed
      23.684155264 seconds time elapsed
      21.462363162 seconds time elapsed
      15.396506188 seconds time elapsed
      15.392838136 seconds time elapsed
      20.692176469 seconds time elapsed
      16.129658334 seconds time elapsed
      21.441109094 seconds time elapsed
      23.699239281 seconds time elapsed

My Strat 1:
      21.278574306 seconds time elapsed
      20.855337860 seconds time elapsed
      15.873945821 seconds time elapsed
      15.663619550 seconds time elapsed
      16.074061053 seconds time elapsed
      21.435567805 seconds time elapsed
      20.531390415 seconds time elapsed
      23.817225212 seconds time elapsed
      15.910386988 seconds time elapsed
      21.231555725 seconds time elapsed
      21.155889616 seconds time elapsed

My Strat 2:
      12.011413926 seconds time elapsed
      21.729497624 seconds time elapsed
      15.562260526 seconds time elapsed
      15.712264065 seconds time elapsed
      16.253601938 seconds time elapsed
      16.811928883 seconds time elapsed
      24.014315056 seconds time elapsed
      21.212371898 seconds time elapsed
      16.304281984 seconds time elapsed
      16.022030368 seconds time elapsed
      16.284860273 seconds time elapsed

My Strat 3:
      13.838796292 seconds time elapsed
      16.024521016 seconds time elapsed
      15.523160867 seconds time elapsed
      15.992639755 seconds time elapsed
      16.215731342 seconds time elapsed
      15.924460211 seconds time elapsed
      11.416297220 seconds time elapsed
      16.094931218 seconds time elapsed
      15.932088443 seconds time elapsed
      39.665654400 seconds time elapsed
      15.820001914 seconds time elapsed

Unfortunately, this indicates that a considerable amount of the standard deviation comes from the strategies themselves. All of them produce similar results, since they are performing load balancing, but the environment with 256 PEs doesn't seem to present enough imbalance to make the load balancing worthwhile.
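
For reference, here is a minimal sketch of how the mean and sample standard deviation of one of these series can be computed (the array holds the NullLB times above, rounded to two decimals; nothing beyond those numbers is assumed):

    // Minimal sketch: mean and sample standard deviation of one timing series.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
      // NullLB "seconds time elapsed" values from above, rounded to two decimals.
      std::vector<double> t = {16.34, 16.19, 19.09, 19.00, 11.10, 11.08,
                               16.23, 11.16, 11.14, 16.13, 11.12};
      double sum = 0.0;
      for (double x : t) sum += x;
      double mean = sum / t.size();
      double sq = 0.0;
      for (double x : t) sq += (x - mean) * (x - mean);
      double stddev = std::sqrt(sq / (t.size() - 1));  // sample standard deviation
      std::printf("mean = %.2f s, stddev = %.2f s\n", mean, stddev);
      return 0;
    }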

This is the command line I ran: 
   ./charmrun +p256 ./lb_test 8096 150 10 30 150 150 mesh3d +pemap 0-7 ++nodelist /root/charm/nodefile.dat +balancer NullLB +LBDebug 3

Thanks for your help, 

Vinicius

-- 
Vinicius Marino Calvo Torres de Freitas
Computer Science Undergraduate Student
Research Assistant at the Embedded Computing Laboratory at UFSC
UFSC - CTC - INE - ECL, Brazil
Email: vinicius.mctf AT grad.ufsc.br or vinicius.mct.freitas AT gmail.com 
Tel: +55 (48) 96163803

2016-11-30 17:59 GMT-02:00 Galvez Garcia, Juan Jose <jjgalvez AT illinois.edu>:
Hi Vinicius,

Did you take a look at the processor loads to determine if there is imbalance? For example, is there a big difference between the max processor load and the average?
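
For concreteness, here is a minimal sketch of that check; the per-PE loads are assumed to come from the +LBDebug output, and the 1.05 threshold is only an illustrative cutoff, not a Charm++ default:

    // Minimal sketch: flag imbalance when the max/average PE load ratio is high.
    #include <algorithm>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    bool significant_imbalance(const std::vector<double>& pe_load) {
      double max_load = *std::max_element(pe_load.begin(), pe_load.end());
      double avg_load = std::accumulate(pe_load.begin(), pe_load.end(), 0.0)
                        / pe_load.size();
      return max_load / avg_load > 1.05;  // ratio near 1 means the load is already even
    }

    int main() {
      std::vector<double> loads = {1.2, 0.9, 1.0, 1.1};  // illustrative per-PE loads (s)
      std::printf("imbalanced: %s\n", significant_imbalance(loads) ? "yes" : "no");
      return 0;
    }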

If the execution time of lb_test is random, it's going to be a bit hard to tell just by looking at the results you have shown, but I do see that most strategies (including NullLB) have roughly the same execution time on average, and RefineLB is only migrating a very small percentage of objects, which could indicate that there is low imbalance.

-Juan

From: bilgeacun AT gmail.com [bilgeacun AT gmail.com] on behalf of Bilge Acun [acun2 AT illinois.edu]
Sent: Wednesday, November 30, 2016 1:31 PM
To: Vinicius Freitas
Cc: Galvez Garcia, Juan Jose; Laércio Lima Pilla; charm AT cs.uiuc.edu

Subject: Re: [charm] Questions about distributed load balancing tests

Hi Vinicius,

I think this variation is caused by the application's own random behavior. The elements choose their work time randomly between the minimum and maximum task times provided in the arguments.

Line 127 in Topo.C: work = min_us + (max_us-min_us)*pow((double)rand()/RAND_MAX,4.);    

Can you make the min and max times the same and verify whether the variation disappears then? I know you don't want to do this for testing load balancing, but it could explain why you see performance variations even with NullLB.
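
For reference, a minimal standalone sketch of the expression quoted above (Topo.C computes this inside the benchmark; this is only to illustrate why equal min and max give constant work):

    // Minimal sketch of the work-time expression quoted from Topo.C.
    #include <cmath>
    #include <cstdlib>

    double work_time(double min_us, double max_us) {
      double r = static_cast<double>(std::rand()) / RAND_MAX;   // in [0, 1]
      return min_us + (max_us - min_us) * std::pow(r, 4.0);     // equals min_us when min_us == max_us
    }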

Thanks,
~Bilge


On 30 November 2016 at 13:21, Vinicius Freitas <vinicius.mct.freitas AT gmail.com> wrote:
Hello,

I have sampled 4 tests following the advice you provided. Everything still seems a bit random. I've made sure everything is executing in the same cluster environment and ran the following command line:

perf stat -o A.st --append ../../../../bin/testrun  +p256 ./lb_test 10000 150 10 30 30 1000 mesh3d +pemap 0-7 ++nodelist ~/charm/nodefile.dat +balancer ALB +LBDebug 2  >> A.lbt
where "A" is the name of a given load balancing strategy. This are the timing results obtained after 4 executions:

Distributed LB:
      13.747420506 seconds time elapsed
      18.527592025 seconds time elapsed
      21.548604275 seconds time elapsed
      13.438776048 seconds time elapsed

Null LB:
      11.054318397 seconds time elapsed
      11.079078756 seconds time elapsed
      18.884361482 seconds time elapsed
      16.313307654 seconds time elapsed

My strategy 1:
      13.386794114 seconds time elapsed
      13.328720616 seconds time elapsed
      13.479837916 seconds time elapsed
      18.260640795 seconds time elapsed

My strategy 2:
      21.297518349 seconds time elapsed
      13.476602838 seconds time elapsed
      18.431436065 seconds time elapsed
      14.547045739 seconds time elapsed

My strategy 3:
       9.380154967 seconds time elapsed
      13.317408238 seconds time elapsed
      13.333818487 seconds time elapsed
      13.204306938 seconds time elapsed

Refine LB:
      19.141983479 seconds time elapsed
      14.020267341 seconds time elapsed
      14.186652090 seconds time elapsed
      14.400265432 seconds time elapsed

Greedy LB:
      49.204311107 seconds time elapsed
      40.146161190 seconds time elapsed
      43.880617686 seconds time elapsed
      43.717026033 seconds time elapsed


As for the load imbalance in the system, this is how RefineLB migrated objects in each of its executions:

CharmLB> RefineLB: PE [0] #Objects migrating: 158, LBMigrateMsg size: 0.01 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 76, LBMigrateMsg size: 0.01 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 25, LBMigrateMsg size: 0.00 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 17, LBMigrateMsg size: 0.00 MB
----
CharmLB> RefineLB: PE [0] #Objects migrating: 155, LBMigrateMsg size: 0.01 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 76, LBMigrateMsg size: 0.01 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 24, LBMigrateMsg size: 0.00 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 14, LBMigrateMsg size: 0.00 MB
----
CharmLB> RefineLB: PE [0] #Objects migrating: 153, LBMigrateMsg size: 0.01 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 76, LBMigrateMsg size: 0.01 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 28, LBMigrateMsg size: 0.00 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 16, LBMigrateMsg size: 0.00 MB
----
CharmLB> RefineLB: PE [0] #Objects migrating: 154, LBMigrateMsg size: 0.01 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 72, LBMigrateMsg size: 0.01 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 25, LBMigrateMsg size: 0.00 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 19, LBMigrateMsg size: 0.00 MB
----

All of these results were obtained using the "+pemap 0-7" option Juan suggested, but the total time still shows higher variance than I'm used to (almost 20% variance with GreedyLB and 30% with some of my strategies), even though the strategies' own execution times weren't very different between runs.

My Strategy 2> Strategy took 0.647128s memory usage: 3.156113 MB.
My Strategy 2> Strategy took 0.588585s memory usage: 3.158844 MB.
My Strategy 2> Strategy took 0.626369s memory usage: 3.158798 MB.
My Strategy 2> Strategy took 0.486786s memory usage: 3.166199 MB.
----
My Strategy 2> Strategy took 0.746285s memory usage: 3.145523 MB.
My Strategy 2> Strategy took 0.567603s memory usage: 3.149231 MB.
My Strategy 2> Strategy took 0.566366s memory usage: 3.163010 MB.
My Strategy 2> Strategy took 0.526576s memory usage: 3.167007 MB.
----
My Strategy 2> Strategy took 0.666044s memory usage: 3.155136 MB.
My Strategy 2> Strategy took 0.646271s memory usage: 3.159821 MB.
My Strategy 2> Strategy took 0.626858s memory usage: 3.163010 MB.
My Strategy 2> Strategy took 0.467476s memory usage: 3.166199 MB.
----
My Strategy 2> Strategy took 0.667776s memory usage: 3.156113 MB.
My Strategy 2> Strategy took 0.627341s memory usage: 3.159454 MB.
My Strategy 2> Strategy took 0.546260s memory usage: 3.162247 MB.
My Strategy 2> Strategy took 0.546704s memory usage: 3.166214 MB.
----


-- 
Vinicius Marino Calvo Torres de Freitas
Computer Science Undergraduate Student
Research Assistant at the Embedded Computing Laboratory at UFSC
UFSC - CTC - INE - ECL, Brazil
Email: vinicius.mctf AT grad.ufsc.br or vinicius.mct.freitas AT gmail.com 
Tel: +55 (48) 96163803

2016-11-29 16:54 GMT-02:00 Galvez Garcia, Juan Jose <jjgalvez AT illinois.edu>:

Also, I wanted to add that a possible cause for the variation might be the OS moving processes around cores on each host. I suggest using the +pemap option if you aren't using it already, so that processes stay on separate cores. You can do so by adding +pemap 0-7 to the command-line options, telling Charm++ to use cores 0 to 7 by setting CPU affinity. There is more information in the Charm++ manual if you are interested (http://charm.cs.illinois.edu/manuals/html/charm++/C.html).
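
For example, appended to the lb_test invocation from your first message (assuming the usual charmrun launcher; the paths and parameters are the ones from your command):

    ./charmrun +p64 ./lb_test 10000 150 10 30 30 1000 mesh3d ++nodelist ~/charm/nodefile.dat +balancer DistributedLB +LBDebug 2 +pemap 0-7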

Hope this helps,

-Juan


From: vinimmbb AT gmail.com [vinimmbb AT gmail.com] on behalf of Vinicius Freitas [vinicius.mct.freitas AT gmail.com]
Sent: Tuesday, November 29, 2016 12:33 PM
To: Galvez Garcia, Juan Jose
Cc: Laércio Lima Pilla; charm AT cs.uiuc.edu
Subject: Re: [charm] Questions about distributed load balancing tests

Hey, Juan

I'll run the tests you mentioned ASAP, but yes, the problem was the same with NullLB: the benchmark time would apparently vary randomly within that same interval, sometimes presenting high variance as in the example I first showed. Sometimes NullLB would be the fastest, sometimes DistributedLB, sometimes one of my own implementations.
As for the imbalance, I've run several tests with different lb_test configurations, and this one was supposed to present high imbalance. I'll run the tests you suggested just to make sure, and will answer as soon as I have the results.

Thank you for the reply,
Vinicius
-- 
Vinicius Marino Calvo Torres de Freitas
Computer Science Undergraduate Student
Research Assistant at the Embedded Computing Laboratory at UFSC
UFSC - CTC - INE - ECL, Brazil
Email: vinicius.mctf AT grad.ufsc.br or vinicius.mct.freitas AT gmail.com 
Tel: +55 (48) 96163803

2016-11-28 17:59 GMT-02:00 Galvez Garcia, Juan Jose <jjgalvez AT illinois.edu>:
Hi Vinicius,

I'll look into lb_test to see if I can find an explanation for the variation in total execution time. Does the NullLB strategy also present variation in total execution time?

As for migrations, this might be related to some issues we have found in DistributedLB that we are currently looking at, or maybe the actual test case doesn't present much imbalance. It seems like you have approximately 156 chares per PE. Such a large number may mean that the load is already evenly balanced.

To verify this, I would suggest testing with a centralized load balancer like RefineLB, if you have not done so, together with +LBDebug 2 or +LBDebug 3. Centralized load balancers provide more output, such as the number of migrations performed in each LB step and the processor loads. RefineLB will move only a few objects (from overloaded to underloaded PEs), so looking at the number of migrations (and at the processor loads with +LBDebug 3) can give you an idea of whether there is imbalance and how much load balancing can improve performance. You can also use GreedyLB, which will balance load better than RefineLB, but note that GreedyLB will migrate most objects regardless of the actual level of imbalance. For a baseline comparison of centralized load balancers, DummyLB will probably be best, because with DummyLB PE0 will still receive the load balancing stats (same as any centralized LB) but won't do anything else.
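
For example (adapting the command line from your first message; the launcher and paths are assumed to match your setup):

    ./charmrun +p64 ./lb_test 10000 150 10 30 30 1000 mesh3d ++nodelist ~/charm/nodefile.dat +balancer RefineLB +LBDebug 3
    ./charmrun +p64 ./lb_test 10000 150 10 30 30 1000 mesh3d ++nodelist ~/charm/nodefile.dat +balancer GreedyLB +LBDebug 3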

-Juan


From: Vinicius Freitas [vinicius.mct.freitas AT gmail.com]
Sent: Wednesday, November 23, 2016 1:47 PM
To: charm AT cs.uiuc.edu
Cc: Laércio Lima Pilla
Subject: [charm] Questions about distributed load balancing tests

Hello, Charm++ team,

I have been having issues with the execution of distributed load balancing solutions in Charm++. 
My two main questions are about the total execution time of the lb_test benchmark using these distributed strategies, which varies far too much even when running the same benchmark with the same strategy every time, and about the migrations, which I tried to expose with a print inside the load balancing strategy, but that point in the execution never seems to be reached.

This is my setup:

Computational Nodes: 8 Octa-core nodes

Benchmark: lb_test
            10,000 Elements
            150 iterations
            10 time/print
            30 load balancing interval
            30 ms min task time
            1,000 ms max task time

Charm++ 6.7 compiled for netlrts-linux-x86_64 --with_production
The Operating System is Debian Jessie, with GCC 4.9.4


This is a sample of the measured times that the strategies I'm testing took to execute:

DistributedLB (Available w/ Charm++)>       Strategy took 0.002938s memory usage: 3.102875 MB.
DistributedLB (Available w/ Charm++)>       Strategy took 0.003182s memory usage: 3.103256 MB.
DistributedLB (Available w/ Charm++)>       Strategy took 0.003182s memory usage: 3.103592 MB.
DistributedLB (Available w/ Charm++)>       Strategy took 0.003357s memory usage: 3.106171 MB.

Sample Strategy 1>                          Strategy took 0.002934s memory usage: 3.101395 MB.
Sample Strategy 1>                          Strategy took 0.003095s memory usage: 3.105225 MB.
Sample Strategy 1>                          Strategy took 0.003319s memory usage: 3.105164 MB.
Sample Strategy 1>                          Strategy took 0.003221s memory usage: 3.107315 MB.

Sample Strategy 2>                          Strategy took 0.002996s memory usage: 3.102005 MB.
Sample Strategy 2>                          Strategy took 0.003107s memory usage: 3.103088 MB.
Sample Strategy 2>                          Strategy took 0.003184s memory usage: 3.105621 MB.
Sample Strategy 2>                          Strategy took 0.003255s memory usage: 3.107681 MB.

Sample Strategy 3>                          Strategy took 0.002904s memory usage: 3.104156 MB.
Sample Strategy 3>                          Strategy took 0.003186s memory usage: 3.104294 MB.
Sample Strategy 3>                          Strategy took 0.003310s memory usage: 3.107574 MB.
Sample Strategy 3>                          Strategy took 0.003397s memory usage: 3.108109 MB.

All strategies have a very similar execution time in each of their runs, but when we look at the final results (obtained with perf from the Linux tools):

DistributedLB:      15.886613586 seconds time elapsed
Sample Strategy 1:  15.622918170 seconds time elapsed
Sample Strategy 2:  11.997714095 seconds time elapsed
Sample Strategy 3:  15.749101873 seconds time elapsed
NullLB:             15.317063856 seconds time elapsed

In other samples, strategy times were similar to these results, but total execution time kept floating between 11 and 16 seconds, apparently exhibiting no pattern. Sometimes NullLB would be the only one at ~11 seconds, sometimes all of them would be around that time.

The execution line was:
    perf stat $(call run, +p64 ./lb_test 10000 150 10 30 30 1000 mesh3d ++nodelist ~/charm/nodefile.dat +balancer DistributedLB +LBDebug 2)    

I also inserted prints inside each of the strategies at the location where a migration message is registered to be sent, at the end of the "RecvAck" method in DistributedLB.C, around line 450.
        CkPrintf("[%d] Sending load to %d", CkMyPe(), item->to_pe);

This print never seems to execute, as its message was not found in the output. It was supposed to print whenever a task migrated, but unfortunately it didn't, which makes me believe that tasks aren't actually migrating.
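
For reference, this is the same diagnostic with an explicit newline, which keeps the output on its own line (item->to_pe and CkMyPe() are assumed to be in scope at that point, exactly as in the snippet above):

        CkPrintf("[%d] Sending load to %d\n", CkMyPe(), item->to_pe);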

My nodefile:
    group main ++cpus=8

If you need any more information about the system or the execution report, just reply to this e-mail.

Thanks for your help,






--
Bilge Acun
PhD Candidate, Computer Science Department
University of Illinois at Urbana-Champaign



