Re: [charm] Questions about distributed load balancing tests


  • From: Bilge Acun <acun2 AT illinois.edu>
  • To: Vinicius Freitas <vinicius.mct.freitas AT gmail.com>
  • Cc: "Galvez Garcia, Juan Jose" <jjgalvez AT illinois.edu>, Laércio Lima Pilla <laercio.pilla AT ufsc.br>, "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] Questions about distributed load balancing tests
  • Date: Wed, 30 Nov 2016 14:31:50 -0500

Hi Vinicius,

I think this variation is caused by the application's random behavior itself. The elements set their work time randomly between the minimum and maximum task times provided in the arguments.

Line 127 in Topo.C: work = min_us + (max_us-min_us)*pow((double)rand()/RAND_MAX,4.);    

Can you make the min and max time same and verify if the variation disappears then? I know you don't want to do this for testing load balancing, but this could explain why you see performance variations even for NullLB.
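
For illustration, here is a standalone sketch (not part of lb_test itself; it simply re-samples the Topo.C formula above with the element count and min/max task times from the run line quoted below) showing how skewed the resulting work times are:

    // Standalone sketch of the work-time formula from Topo.C line 127. Raising a
    // uniform sample to the 4th power skews the distribution toward min_us:
    // roughly half the elements fall in the bottom tenth of the [min,max] range,
    // while a minority of heavy elements carries a large share of the total load.
    #include <cstdio>
    #include <cstdlib>
    #include <cmath>

    int main() {
      const double min_us = 30.0, max_us = 1000.0;  // task times from the run line; the unit does not affect the shape
      const int n = 10000;                          // element count from the same run line
      double total = 0.0;
      int heavy = 0;
      for (int i = 0; i < n; ++i) {
        double work = min_us + (max_us - min_us) * pow((double)rand() / RAND_MAX, 4.);
        total += work;
        if (work > 0.5 * max_us) ++heavy;           // elements heavier than half the max
      }
      printf("mean work %.1f, elements above %.0f: %d of %d\n",
             total / n, 0.5 * max_us, heavy, n);
      return 0;
    }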

Thanks,
~Bilge


On 30 November 2016 at 13:21, Vinicius Freitas <vinicius.mct.freitas AT gmail.com> wrote:
Hello,

I have run 4 sample tests following the advice you provided. Everything still seems a bit random. I've made sure everything is executing in the same cluster environment, using the following command line:

perf stat -o A.st --append ../../../../bin/testrun  +p256 ./lb_test 10000 150 10 30 30 1000 mesh3d +pemap 0-7 ++nodelist ~/charm/nodefile.dat +balancer ALB +LBDebug 2  >> A.lbt
where "A" is the name of a given load balancing strategy. This are the timing results obtained after 4 executions:

Distributed LB:
      13.747420506 seconds time elapsed
      18.527592025 seconds time elapsed
      21.548604275 seconds time elapsed
      13.438776048 seconds time elapsed

Null LB:
      11.054318397 seconds time elapsed
      11.079078756 seconds time elapsed
      18.884361482 seconds time elapsed
      16.313307654 seconds time elapsed

My strategy 1:
      13.386794114 seconds time elapsed
      13.328720616 seconds time elapsed
      13.479837916 seconds time elapsed
      18.260640795 seconds time elapsed

My strategy 2:
      21.297518349 seconds time elapsed
      13.476602838 seconds time elapsed
      18.431436065 seconds time elapsed
      14.547045739 seconds time elapsed

My strategy 3:
       9.380154967 seconds time elapsed
      13.317408238 seconds time elapsed
      13.333818487 seconds time elapsed
      13.204306938 seconds time elapsed

Refine LB:
      19.141983479 seconds time elapsed
      14.020267341 seconds time elapsed
      14.186652090 seconds time elapsed
      14.400265432 seconds time elapsed

Greedy LB:
      49.204311107 seconds time elapsed
      40.146161190 seconds time elapsed
      43.880617686 seconds time elapsed
      43.717026033 seconds time elapsed


As for the load imbalance in the system, this is how many objects RefineLB migrated in each of its executions:

CharmLB> RefineLB: PE [0] #Objects migrating: 158, LBMigrateMsg size: 0.01 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 76, LBMigrateMsg size: 0.01 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 25, LBMigrateMsg size: 0.00 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 17, LBMigrateMsg size: 0.00 MB
----
CharmLB> RefineLB: PE [0] #Objects migrating: 155, LBMigrateMsg size: 0.01 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 76, LBMigrateMsg size: 0.01 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 24, LBMigrateMsg size: 0.00 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 14, LBMigrateMsg size: 0.00 MB
----
CharmLB> RefineLB: PE [0] #Objects migrating: 153, LBMigrateMsg size: 0.01 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 76, LBMigrateMsg size: 0.01 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 28, LBMigrateMsg size: 0.00 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 16, LBMigrateMsg size: 0.00 MB
----
CharmLB> RefineLB: PE [0] #Objects migrating: 154, LBMigrateMsg size: 0.01 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 72, LBMigrateMsg size: 0.01 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 25, LBMigrateMsg size: 0.00 MB
CharmLB> RefineLB: PE [0] #Objects migrating: 19, LBMigrateMsg size: 0.00 MB
----

All of these results were obtained using the "+pemap 0-7" option Juan suggested, but the total time still shows higher variance than I'm used to (almost 20% in Greedy and 30% in some of my strategies), even though the strategy execution time itself wasn't very different between executions:

My Strategy 2> Strategy took 0.647128s memory usage: 3.156113 MB.
My Strategy 2> Strategy took 0.588585s memory usage: 3.158844 MB.
My Strategy 2> Strategy took 0.626369s memory usage: 3.158798 MB.
My Strategy 2> Strategy took 0.486786s memory usage: 3.166199 MB.
----
My Strategy 2> Strategy took 0.746285s memory usage: 3.145523 MB.
My Strategy 2> Strategy took 0.567603s memory usage: 3.149231 MB.
My Strategy 2> Strategy took 0.566366s memory usage: 3.163010 MB.
My Strategy 2> Strategy took 0.526576s memory usage: 3.167007 MB.
----
My Strategy 2> Strategy took 0.666044s memory usage: 3.155136 MB.
My Strategy 2> Strategy took 0.646271s memory usage: 3.159821 MB.
My Strategy 2> Strategy took 0.626858s memory usage: 3.163010 MB.
My Strategy 2> Strategy took 0.467476s memory usage: 3.166199 MB.
----
My Strategy 2> Strategy took 0.667776s memory usage: 3.156113 MB.
My Strategy 2> Strategy took 0.627341s memory usage: 3.159454 MB.
My Strategy 2> Strategy took 0.546260s memory usage: 3.162247 MB.
My Strategy 2> Strategy took 0.546704s memory usage: 3.166214 MB.
----


-- 
Vinicius Marino Calvo Torres de Freitas
Computer Science Undergraduate Student
Research Assistant at the Embedded Computing Laboratory at UFSC
UFSC - CTC - INE - ECL, Brazil
Email: vinicius.mctf AT grad.ufsc.br or vinicius.mct.freitas AT gmail.com 
Tel: +55 (48) 96163803

2016-11-29 16:54 GMT-02:00 Galvez Garcia, Juan Jose <jjgalvez AT illinois.edu>:

Also, I wanted to add that the variation might be due to the OS moving processes around cores on each host. I suggest using the +pemap option if you aren't already, so that processes stay on separate cores. You can do so by adding +pemap 0-7 to the command-line options, telling Charm++ to use cores 0 to 7 by setting CPU affinity. There is more information in the Charm++ manual if you are interested (http://charm.cs.illinois.edu/manuals/html/charm++/C.html).

Hope this helps,

-Juan


From: vinimmbb AT gmail.com [vinimmbb AT gmail.com] on behalf of Vinicius Freitas [vinicius.mct.freitas AT gmail.com]
Sent: Tuesday, November 29, 2016 12:33 PM
To: Galvez Garcia, Juan Jose
Cc: Laércio Lima Pilla; charm AT cs.uiuc.edu
Subject: Re: [charm] Questions about distributed load balancing tests

Hey, Juan

I'll run those tests you mentioned ASAP, but yes, the problem was the same with NullLB: the benchmark time would apparently vary randomly within that same interval, sometimes presenting high variance as in the example I first presented. Sometimes NullLB would be the fastest, sometimes DistributedLB, sometimes one of my own implementations.
As for the imbalance, I've run several tests with different lb_test configurations and this one was supposed to present high imbalance. I'll run the tests you suggested, just to make sure, and answer as soon as I have the results.

Thank you for the reply,
Vinicius
-- 
Vinicius Marino Calvo Torres de Freitas
Computer Science Undergraduate Student
Research Assistant at the Embedded Computing Laboratory at UFSC
UFSC - CTC - INE - ECL, Brazil
Email: vinicius.mctf AT grad.ufsc.br or vinicius.mct.freitas AT gmail.com 
Tel: +55 (48) 96163803

2016-11-28 17:59 GMT-02:00 Galvez Garcia, Juan Jose <jjgalvez AT illinois.edu>:
Hi Vinicius,

I'll look into lb_test to see if I can find an explanation for the variation in total execution time. Does the NullLB strategy also present variation in total execution time?

As for migrations, this might be related to some issues we have found in DistributedLB which we are currently looking at, or maybe the actual test case doesn't present much imbalance. It seems like you have approximately 156 chares per PE; this large number may mean that the load is already evenly balanced.

To verify this, I would suggest testing with a centralized load balancer if you have not done so, like RefineLB, with +LBDebug 2 or +LBDebug 3. Centralized load balancers provide more output, like the number of migrations performed in each LB step and the processor load. RefineLB will move only a few objects (from overloaded to underloaded PEs), so looking at the number of migrations (and processor loads with +LBDebug 3) can give you an idea of whether there is imbalance and how much load balancing can improve performance. You can also use GreedyLB, which will balance load better than RefineLB, but note that GreedyLB will migrate most objects regardless of the actual level of imbalance. For a baseline comparison against centralized load balancers, DummyLB will probably be best, because with DummyLB PE0 will still receive the load balancing stats (same as any centralized LB) but won't do anything else.
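
For example, adapting the run line from the first message in this thread, the RefineLB test could look something like:

    perf stat $(call run, +p64 ./lb_test 10000 150 10 30 30 1000 mesh3d ++nodelist ~/charm/nodefile.dat +balancer RefineLB +LBDebug 3)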

-Juan


From: Vinicius Freitas [vinicius.mct.freitas AT gmail.com]
Sent: Wednesday, November 23, 2016 1:47 PM
To: charm AT cs.uiuc.edu
Cc: Laércio Lima Pilla
Subject: [charm] Questions about distributed load balancing tests

Hello, Charm++ team,

I have been having issues with the execution of distributed load balancing solutions in Charm++. 
My two main questions are about the total execution time of the lb_test benchmark using these distributed strategies, which varies far too much when executing the same benchmark with the same strategy every time; and about the migrations, which I tried to expose from inside the load balancing strategy, but the code never seems to reach that point in the execution.

This is my setup:

Computational Nodes: 8 Octa-core nodes

Benchmark: lb_test
            10,000 Elements
            150 iterations
            10 time/print
            30 load balancing interval
            30 ms min task time
            1,000 ms max task time

Charm++ 6.7 compiled for netlrts-linux-x86_64 --with-production
The operating system is Debian Jessie, with GCC 4.9.4.


This is a sample of the measured time that the strategies I'm testing took to execute:

DistributedLB (Available w/ Charm++)>       Strategy took 0.002938s memory usage: 3.102875 MB.
DistributedLB (Available w/ Charm++)>       Strategy took 0.003182s memory usage: 3.103256 MB.
DistributedLB (Available w/ Charm++)>       Strategy took 0.003182s memory usage: 3.103592 MB.
DistributedLB (Available w/ Charm++)>       Strategy took 0.003357s memory usage: 3.106171 MB.

Sample Strategy 1>                          Strategy took 0.002934s memory usage: 3.101395 MB.
Sample Strategy 1>                          Strategy took 0.003095s memory usage: 3.105225 MB.
Sample Strategy 1>                          Strategy took 0.003319s memory usage: 3.105164 MB.
Sample Strategy 1>                          Strategy took 0.003221s memory usage: 3.107315 MB.

Sample Strategy 2>                          Strategy took 0.002996s memory usage: 3.102005 MB.
Sample Strategy 2>                          Strategy took 0.003107s memory usage: 3.103088 MB.
Sample Strategy 2>                          Strategy took 0.003184s memory usage: 3.105621 MB.
Sample Strategy 2>                          Strategy took 0.003255s memory usage: 3.107681 MB.

Sample Strategy 3>                          Strategy took 0.002904s memory usage: 3.104156 MB.
Sample Strategy 3>                          Strategy took 0.003186s memory usage: 3.104294 MB.
Sample Strategy 3>                          Strategy took 0.003310s memory usage: 3.107574 MB.
Sample Strategy 3>                          Strategy took 0.003397s memory usage: 3.108109 MB.

All strategies have a very similar execution time in each of their executions, but when we look at the final results (obtained with the Linux perf tool):

DistributedLB:      15.886613586 seconds time elapsed
Sample Strategy 1:  15.622918170 seconds time elapsed
Sample Strategy 2:  11.997714095 seconds time elapsed
Sample Strategy 3:  15.749101873 seconds time elapsed
NullLB:             15.317063856 seconds time elapsed

In other samples, strategy times were similar to these results, but total execution time kept floating between 11 and 16 seconds, apparently exhibiting no pattern. Sometimes NullLB would be the only one at ~11 seconds, sometimes all of them would take about that time.

The execution line was:
    perf stat $(call run, +p64 ./lb_test 10000 150 10 30 30 1000 mesh3d ++nodelist ~/charm/nodefile.dat +balancer DistributedLB +LBDebug 2)    

I also inserted prints inside each of the strategies at the location where a message is registered to be sent, at the end of the "RecvAck" method in DistributedLB.C, around line 450.
        CkPrintf("[%d] Sending load to %d", CkMyPe(), item->to_pe);

This print never seems to execute, as the message was not found in the output. It was supposed to print whenever a task migrated, but unfortunately it didn't, which makes me believe that tasks aren't actually migrating.
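
As a cross-check that does not depend on the strategy code, printing from the element's own pup() method also reveals migrations, since the runtime packs an element when it leaves a PE and unpacks it when it arrives. A rough sketch, with illustrative class and member names (the real lb_test element class differs), that would slot into the existing element code:

    void Elem::pup(PUP::er &p) {
      CBase_Elem::pup(p);            // pup the generated base class state first
      p | workTime;                  // existing member data (illustrative name)
      if (p.isPacking())
        CkPrintf("[%d] element %d leaving\n", CkMyPe(), thisIndex);   // source PE
      if (p.isUnpacking())
        CkPrintf("[%d] element %d arrived\n", CkMyPe(), thisIndex);   // destination PE
    }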

My nodefile:
    group main ++cpus=8

If you need any more information about the system or the execution report, just reply to this e-mail.

Thanks for your help,






--
Bilge Acun
PhD Candidate, Computer Science Department
University of Illinois at Urbana-Champaign


