
charm - RE: [charm] Adaptive MPI

charm AT lists.cs.illinois.edu


RE: [charm] Adaptive MPI


  • From: "Van Der Wijngaart, Rob F" <rob.f.van.der.wijngaart AT intel.com>
  • To: Phil Miller <mille121 AT illinois.edu>
  • Cc: Sam White <white67 AT illinois.edu>, "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: RE: [charm] Adaptive MPI
  • Date: Tue, 29 Nov 2016 00:07:47 +0000
  • Accept-language: en-US

No luck. The same type of string-processing error occurs at some point, this time when trying to read the key itself; see below.
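For reference, the hint that ends up garbled is set on the application side roughly like this (a sketch based on the usage shown in this thread, not the exact example code):

    #include <mpi.h>
    /* Sketch: how the load-balancing hint reaches AMPI_Migrate. */
    static void request_migration(void) {
      MPI_Info hints;
      MPI_Info_create(&hints);
      MPI_Info_set(hints, "ampi_load_balance", "sync");
      AMPI_Migrate(hints);   /* called at a synchronization point in the loop */
      MPI_Info_free(&hints);
    }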

 

rfvander@klondike:~/charm-6.7.1/examples/ampi/Cjacobi3D$ $HOME/charm-6.7.1/bin/charmrun ./jacobi 2 2 2 1000 +p 2 +vp 8 +isomalloc_sync +balancer RotateLB +LBDebug 1              

Running command: ./jacobi 2 2 2 1000 +p 2 +vp 8 +isomalloc_sync +balancer RotateLB +LBDebug 1

Charm++: standalone mode (not using charmrun)

Charm++> Running in Multicore mode:  2 threads

Converse/Charm++ Commit ID:

Warning> Randomization of stack pointer is turned on in kernel.

Charm++> synchronizing isomalloc memory region...

[0] consolidated Isomalloc memory region: 0x440000000 - 0x7f1a00000000 (133258240 megs)

CharmLB> Verbose level 1, load balancing period: 0.5 seconds

CharmLB> Load balancer assumes all CPUs are same.

Charm++> Running on 1 unique compute nodes (16-way SMP).

Charm++> cpu topology info is gathered in 0.000 seconds.

[0] RotateLB created

iter 1 time: 0.079971 maxerr: 2020.200000

iter 2 time: 0.059791 maxerr: 1696.968000

iter 3 time: 0.050566 maxerr: 1477.170240

iter 4 time: 0.046094 maxerr: 1319.433024

iter 5 time: 0.045918 maxerr: 1200.918072

iter 6 time: 0.045842 maxerr: 1108.425519

iter 7 time: 0.045895 maxerr: 1033.970839

iter 8 time: 0.045871 maxerr: 972.509242

iter 9 time: 0.045872 maxerr: 920.721889

iter 10 time: 0.045870 maxerr: 876.344030

CharmLB> RotateLB: PE [0] step 0 starting at 0.758304 Memory: 72.253906 MB

CharmLB> RotateLB: PE [0] strategy starting at 0.758354

CharmLB> RotateLB: PE [0] Memory: LBManager: 920 KB CentralLB: 3 KB

CharmLB> RotateLB: PE [0] #Objects migrating: 8, LBMigrateMsg size: 0.00 MB

CharmLB> RotateLB: PE [0] strategy finished at 0.758360 duration 0.000006 s

CharmLB> RotateLB: PE [0] step 0 finished at 0.786232 duration 0.027928 s

iter 11 time: 0.063298 maxerr: 837.779089

iter 12 time: 0.045806 maxerr: 803.868831

iter 13 time: 0.045729 maxerr: 773.751705

iter 14 time: 0.045843 maxerr: 746.772667

iter 15 time: 0.045770 maxerr: 722.424056

iter 16 time: 0.045805 maxerr: 700.305763

iter 17 time: 0.045858 maxerr: 680.097726

iter 18 time: 0.045809 maxerr: 661.540528

iter 19 time: 0.044910 maxerr: 644.421422

iter 20 time: 0.041548 maxerr: 628.564089

iter 21 time: 0.040014 maxerr: 613.821009

iter 22 time: 0.039945 maxerr: 600.067696

iter 23 time: 0.039926 maxerr: 587.198273

iter 24 time: 0.039924 maxerr: 575.122054

iter 25 time: 0.039885 maxerr: 563.760848

iter 26 time: 0.040128 maxerr: 553.046836

iter 27 time: 0.040071 maxerr: 542.920870

iter 28 time: 0.039904 maxerr: 533.331094

iter 29 time: 0.039919 maxerr: 524.231833

iter 30 time: 0.039921 maxerr: 515.582675

CharmLB> RotateLB: PE [0] step 1 starting at 1.648019 Memory: 75.172928 MB

CharmLB> RotateLB: PE [0] strategy starting at 1.648106

CharmLB> RotateLB: PE [0] Memory: LBManager: 920 KB CentralLB: 3 KB

CharmLB> RotateLB: PE [0] #Objects migrating: 8, LBMigrateMsg size: 0.00 MB

CharmLB> RotateLB: PE [0] strategy finished at 1.648112 duration 0.000006 s

CharmLB> RotateLB: PE [0] step 1 finished at 1.665523 duration 0.017504 s

iter 31 time: 0.050692 maxerr: 507.347718

iter 32 time: 0.040078 maxerr: 499.494943

iter 33 time: 0.040256 maxerr: 491.995690

iter 34 time: 0.040043 maxerr: 484.824219

iter 35 time: 0.040006 maxerr: 477.957338

iter 36 time: 0.040048 maxerr: 471.374089

iter 37 time: 0.040035 maxerr: 465.055477

iter 38 time: 0.040001 maxerr: 458.984241

iter 39 time: 0.040005 maxerr: 453.144656

iter 40 time: 0.040110 maxerr: 447.522361

iter 41 time: 0.040379 maxerr: 442.104210

iter 42 time: 0.040126 maxerr: 436.878145

iter 43 time: 0.040149 maxerr: 431.833082

iter 44 time: 0.040228 maxerr: 426.958810

iter 45 time: 0.040168 maxerr: 422.245909

iter 46 time: 0.040041 maxerr: 417.685669

iter 47 time: 0.040055 maxerr: 413.270025

iter 48 time: 0.040096 maxerr: 408.991494

iter 49 time: 0.039997 maxerr: 404.843126

iter 50 time: 0.040021 maxerr: 400.818454

CharmLB> RotateLB: PE [0] step 2 starting at 2.476987 Memory: 75.238968 MB

CharmLB> RotateLB: PE [0] strategy starting at 2.477029

CharmLB> RotateLB: PE [0] Memory: LBManager: 920 KB CentralLB: 3 KB

CharmLB> RotateLB: PE [0] #Objects migrating: 8, LBMigrateMsg size: 0.00 MB

CharmLB> RotateLB: PE [0] strategy finished at 2.477035 duration 0.000006 s

CharmLB> RotateLB: PE [0] step 2 finished at 2.493661 duration 0.016674 s

iter 51 time: 0.050363 maxerr: 396.911452

iter 52 time: 0.040102 maxerr: 393.116496

iter 53 time: 0.039939 maxerr: 389.428332

iter 54 time: 0.039998 maxerr: 385.842045

iter 55 time: 0.040045 maxerr: 382.353031

iter 56 time: 0.040046 maxerr: 378.956970

iter 57 time: 0.040027 maxerr: 375.649808

iter 58 time: 0.039957 maxerr: 372.427733

iter 59 time: 0.040017 maxerr: 369.287159

iter 60 time: 0.040044 maxerr: 366.224708

iter 61 time: 0.040012 maxerr: 363.237194

iter 62 time: 0.039956 maxerr: 360.321610

iter 63 time: 0.039989 maxerr: 357.475116

iter 64 time: 0.040022 maxerr: 354.695025

iter 65 time: 0.039989 maxerr: 351.978797

iter 66 time: 0.040025 maxerr: 349.324022

iter 67 time: 0.039996 maxerr: 346.728419

iter 68 time: 0.039968 maxerr: 344.189822

iter 69 time: 0.040082 maxerr: 341.706174

iter 70 time: 0.040181 maxerr: 339.275521

CharmLB> RotateLB: PE [0] step 3 starting at 3.302705 Memory: 75.305084 MB

CharmLB> RotateLB: PE [0] strategy starting at 3.302795

CharmLB> RotateLB: PE [0] Memory: LBManager: 920 KB CentralLB: 3 KB

CharmLB> RotateLB: PE [0] #Objects migrating: 8, LBMigrateMsg size: 0.00 MB

CharmLB> RotateLB: PE [0] strategy finished at 3.302802 duration 0.000007 s

CharmLB> RotateLB: PE [0] step 3 finished at 3.318951 duration 0.016246 s

iter 71 time: 0.049915 maxerr: 336.896006

iter 72 time: 0.040021 maxerr: 334.565860

iter 73 time: 0.040179 maxerr: 332.283400

iter 74 time: 0.040051 maxerr: 330.047020

iter 75 time: 0.040005 maxerr: 327.855193

iter 76 time: 0.040029 maxerr: 325.706456

iter 77 time: 0.040045 maxerr: 323.599418

iter 78 time: 0.040035 maxerr: 321.532746

iter 79 time: 0.040319 maxerr: 319.505169

iter 80 time: 0.040152 maxerr: 317.515469

iter 81 time: 0.040000 maxerr: 315.562481

iter 82 time: 0.040090 maxerr: 313.645090

iter 83 time: 0.040004 maxerr: 311.762228

iter 84 time: 0.040049 maxerr: 309.912871

iter 85 time: 0.040071 maxerr: 308.096037

iter 86 time: 0.039998 maxerr: 306.310783

iter 87 time: 0.040066 maxerr: 304.556206

iter 88 time: 0.039985 maxerr: 302.831437

iter 89 time: 0.040058 maxerr: 301.135641

iter 90 time: 0.040069 maxerr: 299.468016

WARNING: Unknown MPI_Info key given to AMPI_Migrate: ampi_load_balanceÿÿÿÿÿÿÿ%

 

From: Van Der Wijngaart, Rob F
Sent: Monday, November 28, 2016 3:59 PM
To: 'Phil Miller' <mille121 AT illinois.edu>
Cc: 'Sam White' <white67 AT illinois.edu>; 'charm AT cs.uiuc.edu' <charm AT cs.uiuc.edu>
Subject: RE: [charm] Adaptive MPI

 

For now I am overriding the value test in the code that reads the load-balancer key, and am just executing TCHARM_Migrate() whenever the key is found, regardless of its value. Fingers crossed.
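In other words, the temporary change amounts to something like the sketch below (the surrounding code and variable names are assumptions; only TCHARM_Migrate() is the real call):

    #include <mpi.h>
    void TCHARM_Migrate(void);   /* provided by AMPI/TCharm */

    /* Sketch of the workaround: migrate whenever the expected key is
       present, ignoring its (possibly corrupted) value. */
    static void migrate_if_hinted(MPI_Info hints) {
      int valuelen, flag;
      MPI_Info_get_valuelen(hints, "ampi_load_balance", &valuelen, &flag);
      if (flag) TCHARM_Migrate();
    }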

 

From: Van Der Wijngaart, Rob F
Sent: Monday, November 28, 2016 3:36 PM
To: 'Phil Miller' <mille121 AT illinois.edu>
Cc: Sam White <white67 AT illinois.edu>; charm AT cs.uiuc.edu
Subject: RE: [charm] Adaptive MPI

 

Hi Phil,

 

So far I had been using Charm 6.7.0, but I started to notice errors that appeared to be caused by the migration routines in AMPI, so I tried out the new version, 6.7.1. The way the load-balancing hints are read appears to be corrupted. Please see below for a run of the example in examples/ampi/Cjacobi3D. The first time the value of the load-balancer key is read it is correct, but every subsequent time it is actually used, the library attaches a random character. I inserted the debug line:

key 0 equals ampi_load_balance with value sync
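That debug line comes from a read-back of the hints along the lines of the sketch below (not the actual AMPI source). Note the explicit termination of the value buffer; a missing terminator is exactly the kind of detail that would explain the stray trailing character, though that is only my assumption:

    #include <mpi.h>
    #include <stdio.h>

    /* Sketch: defensively read back the 0th hint and print it. */
    static void print_hint0(MPI_Info hints) {
      char key[MPI_MAX_INFO_KEY+1], value[MPI_MAX_INFO_VAL+1];
      int valuelen, flag;
      MPI_Info_get_nthkey(hints, 0, key);
      MPI_Info_get_valuelen(hints, key, &valuelen, &flag);
      if (flag && valuelen <= MPI_MAX_INFO_VAL) {
        MPI_Info_get(hints, key, valuelen, value, &flag);
        value[valuelen] = '\0';   /* terminate defensively */
        printf("key 0 equals %s with value %s\n", key, value);
      }
    }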

 

Rob

 

rfvander@klondike:~/charm-6.7.1/examples/ampi/Cjacobi3D$ $HOME/charm-6.7.1/bin/charmrun ./jacobi 2 2 2 30 +p 2 +vp 8 +isomalloc_sync +balancer RotateLB +LBDebug 1                

Running command: ./jacobi 2 2 2 30 +p 2 +vp 8 +isomalloc_sync +balancer RotateLB +LBDebug 1

 

Charm++: standalone mode (not using charmrun)

Charm++> Running in Multicore mode:  2 threads

Converse/Charm++ Commit ID:

Warning> Randomization of stack pointer is turned on in kernel.

Charm++> synchronizing isomalloc memory region...

[0] consolidated Isomalloc memory region: 0x440000000 - 0x7f5d00000000 (133532672 megs)

CharmLB> Verbose level 1, load balancing period: 0.5 seconds

CharmLB> Load balancer assumes all CPUs are same.

Charm++> Running on 1 unique compute nodes (16-way SMP).

Charm++> cpu topology info is gathered in 0.000 seconds.

[0] RotateLB created

iter 1 time: 0.078998 maxerr: 2020.200000

iter 2 time: 0.059326 maxerr: 1696.968000

iter 3 time: 0.050306 maxerr: 1477.170240

iter 4 time: 0.045964 maxerr: 1319.433024

iter 5 time: 0.045959 maxerr: 1200.918072

iter 6 time: 0.045985 maxerr: 1108.425519

iter 7 time: 0.045932 maxerr: 1033.970839

iter 8 time: 0.045992 maxerr: 972.509242

iter 9 time: 0.045941 maxerr: 920.721889

iter 10 time: 0.045945 maxerr: 876.344030

key 0 equals ampi_load_balance with value sync

key 0 equals ampi_load_balance with value sync

key 0 equals ampi_load_balance with value sync

key 0 equals ampi_load_balance with value sync

key 0 equals ampi_load_balance with value sync

key 0 equals ampi_load_balance with value sync

key 0 equals ampi_load_balance with value sync

key 0 equals ampi_load_balance with value sync

 

CharmLB> RotateLB: PE [0] step 0 starting at 0.853504 Memory: 72.253906 MB

CharmLB> RotateLB: PE [0] strategy starting at 0.853559

CharmLB> RotateLB: PE [0] Memory: LBManager: 920 KB CentralLB: 3 KB

CharmLB> RotateLB: PE [0] #Objects migrating: 8, LBMigrateMsg size: 0.00 MB

CharmLB> RotateLB: PE [0] strategy finished at 0.853564 duration 0.000005 s

CharmLB> RotateLB: PE [0] step 0 finished at 0.882196 duration 0.028692 s

 

iter 11 time: 0.063316 maxerr: 837.779089

iter 12 time: 0.046134 maxerr: 803.868831

iter 13 time: 0.046079 maxerr: 773.751705

iter 14 time: 0.046063 maxerr: 746.772667

iter 15 time: 0.046088 maxerr: 722.424056

iter 16 time: 0.046083 maxerr: 700.305763

iter 17 time: 0.046087 maxerr: 680.097726

iter 18 time: 0.046047 maxerr: 661.540528

iter 19 time: 0.044149 maxerr: 644.421422

iter 20 time: 0.040968 maxerr: 628.564089

iter 21 time: 0.040264 maxerr: 613.821009

iter 22 time: 0.040429 maxerr: 600.067696

iter 23 time: 0.040471 maxerr: 587.198273

iter 24 time: 0.040278 maxerr: 575.122054

iter 25 time: 0.040325 maxerr: 563.760848

iter 26 time: 0.040425 maxerr: 553.046836

iter 27 time: 0.040186 maxerr: 542.920870

iter 28 time: 0.040066 maxerr: 533.331094

iter 29 time: 0.040020 maxerr: 524.231833

key 0 equals ampi_load_balance with value synca

WARNING: Unknown MPI_Info value (synca) given to AMPI_Migrate for key: ampi_load_balance

key 0 equals ampi_load_balance with value synca

WARNING: Unknown MPI_Info value (synca) given to AMPI_Migrate for key: ampi_load_balance

key 0 equals ampi_load_balance with value synca

WARNING: Unknown MPI_Info value (synca) given to AMPI_Migrate for key: ampi_load_balance

key 0 equals ampi_load_balance with value synca

WARNING: Unknown MPI_Info value (synca) given to AMPI_Migrate for key: ampi_load_balance

key 0 equals ampi_load_balance with value synca

WARNING: Unknown MPI_Info value (synca) given to AMPI_Migrate for key: ampi_load_balance

key 0 equals ampi_load_balance with value synca

WARNING: Unknown MPI_Info value (synca) given to AMPI_Migrate for key: ampi_load_balance

key 0 equals ampi_load_balance with value synca

WARNING: Unknown MPI_Info value (synca) given to AMPI_Migrate for key: ampi_load_balance

iter 30 time: 0.040080 maxerr: 515.582675

key 0 equals ampi_load_balance with value synca

WARNING: Unknown MPI_Info value (synca) given to AMPI_Migrate for key: ampi_load_balance

[Partition 0][Node 0] End of program

 

From: unmobile AT gmail.com [mailto:unmobile AT gmail.com] On Behalf Of Phil Miller
Sent: Friday, November 25, 2016 2:09 PM
To: Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com>
Cc: Sam White <white67 AT illinois.edu>; charm AT cs.uiuc.edu
Subject: Re: [charm] Adaptive MPI

 

Sam: It seems like it should be straightforward to add an assertion in our API entry/exit tracking sentries to catch this kind of issue. Essentially, it would need to check that the calling thread is actually an AMPI process thread that's supposed to be running. We should also document that PUP routines for AMPI code can't call MPI routines.
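A purely hypothetical sketch of such a sentry check (all names below are invented; the real sentries live inside AMPI's API entry/exit handling):

    #include <assert.h>
    extern int ampi_thread_is_active_rank(void);  /* hypothetical query */

    /* Hypothetical entry-point assertion: an MPI call made outside a
       running AMPI rank thread (e.g. from inside a PUP routine) would
       abort loudly instead of silently corrupting state. */
    #define AMPI_API_ENTRY()                                       \
      assert(ampi_thread_is_active_rank() &&                       \
             "MPI call made outside a running AMPI rank thread")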

 

On Thu, Nov 24, 2016 at 5:36 PM, Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com> wrote:

Hi Sam,

 

I put the code away for a bit and just started looking at it again. I identified one major (and vexing) source of errors: I tried to get ranks to print what they were doing (using MPI_Comm_rank) inside the PUP routine, and also to synchronize (MPI_Barrier) to order the output. But that is evidently not valid inside the routine, depending on the mode in which it is called. The first two entries are fine, but once migration takes place, errors result. I took all MPI calls out of the PUP routine, and now the code progresses a lot further. It still bombs, but I am pretty sure I can track down the segmentation violation.

Happy Thanksgiving!

 

Rob

 

From: samt.white AT gmail.com [mailto:samt.white AT gmail.com] On Behalf Of Sam White
Sent: Wednesday, November 23, 2016 1:30 PM


To: Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com>
Cc: charm AT cs.uiuc.edu
Subject: Re: Adaptive MPI

 

Your code is failing inside the call to pup_isPacking(p)? Or is it failing while packing? A pup_er is indeed a pointer.
Also, you should still be using '+isomalloc_sync' whenever Charm gives you that warning during startup: even though you aren't using Isomalloc Heaps, AMPI is using Isomalloc Stacks for its user-level threads.

-Sam

 

On Wed, Nov 23, 2016 at 3:06 PM, Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com> wrote:

Thanks, Sam. The code crashes inside AMPI_Migrate, so it doesn’t reach any print statements after that. I tracked down the statement that causes the crash. It is this one: pup_isPacking(p), where p is of type pup_er. I presume that is a pointer, so I printed it as such. They all look like reasonable addresses to me. None of the ranks prints NULL.

 

Rob

 

From: samt.white AT gmail.com [mailto:samt.white AT gmail.com] On Behalf Of Sam White
Sent: Wednesday, November 23, 2016 12:21 PM
To: Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com>
Cc: charm AT cs.uiuc.edu
Subject: Re: Adaptive MPI

 

The Isomalloc failure appears to be a locking issue during Charm/Converse startup in SMP/multicore builds when running with Isomalloc. We are looking at this now: https://charm.cs.illinois.edu/redmine/issues/1310. If you switch to a non-SMP/multicore build it will work.
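For example, a non-SMP AMPI build would be built with something like the following (assuming the standard Charm build script; pick the network layer appropriate for your platform):

    ./build AMPI netlrts-linux-x86_64 -g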

To debug the issue with your PUP code, I would suggest adding print statements before/after your AMPI_Migrate() call and inside the PUP routine. For these types of issues it often helps to see where in the PUP process (sizing, packing, deleting, unpacking) the runtime is when it fails.
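A minimal version of that instrumentation, with no MPI calls inside the PUP routine, could look like the sketch below (using the C PUP query functions from pup_c.h; your routine's name and signature will differ):

    #include <stdio.h>
    #include "pup_c.h"   /* pup_er, pup_isSizing(), pup_isPacking(), ... */

    /* Sketch: report which PUP phase the runtime is in, without
       making any MPI calls (those are not valid inside a PUP routine). */
    void dchunkpup(pup_er p, void *chunk) {
      if      (pup_isSizing(p))    printf("dchunkpup: sizing\n");
      else if (pup_isPacking(p))   printf("dchunkpup: packing\n");
      else if (pup_isUnpacking(p)) printf("dchunkpup: unpacking\n");
      else if (pup_isDeleting(p))  printf("dchunkpup: deleting\n");
      fflush(stdout);
      /* ... pup_bytes() calls for the actual chunk data go here ... */
    }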

-Sam

 

On Wed, Nov 23, 2016 at 11:28 AM, Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com> wrote:

Hi Sam,

 

The first experiment was successful, but the isomalloc example hangs; see below. Unless it is a symptom of something bigger, I am not going to worry about the latter, since I wasn’t planning to use isomalloc for heap migration anyway. My regular MPI code, on which the AMPI version is based, runs fine for all the parameters I have tried, but I reckon it may contain a memory bug that manifests itself only with load balancing.

 

Rob

 

rfvander@klondike:~/Cjacobi3D$ make

/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicxx  -c jacobi.C

/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicxx  -o jacobi jacobi.o -module CommonLBs -lm

/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicxx  -c -DNO_PUP jacobi.C -o jacobi.iso.o

/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicxx  -o jacobi.iso jacobi.iso.o -module CommonLBs -memory isomalloc

/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicxx  -c -tlsglobal jacobi.C -o jacobi.tls.o

/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicxx  -o jacobi.tls jacobi.tls.o -tlsglobal -module CommonLBs #-memory isomalloc

/opt/charm/charm-6.7.0/multicore-linux64/bin/../lib/libconv-util.a(sockRoutines.o): In function `skt_lookup_ip':

sockRoutines.c:(.text+0x334): warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking

/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicxx  -c jacobi-get.C

/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicxx  -o jacobi-get jacobi-get.o -module CommonLBs -lm

rfvander@klondike:~/Cjacobi3D$ ./charmrun +p3 ./jacobi 2 2 2 +vp8 +balancer RotateLB +LBDebug 1

Running command: ./jacobi 2 2 2 +vp8 +balancer RotateLB +LBDebug 1 +p3

Charm++: standalone mode (not using charmrun)

Charm++> Running in Multicore mode:  3 threads

Converse/Charm++ Commit ID: v6.7.0-1-gca55e1d

Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'. 

CharmLB> Verbose level 1, load balancing period: 0.5 seconds

CharmLB> Load balancer assumes all CPUs are same.

Charm++> Running on 1 unique compute nodes (16-way SMP).

Charm++> cpu topology info is gathered in 0.000 seconds.

[0] RotateLB created

iter 1 time: 0.142733 maxerr: 2020.200000

iter 2 time: 0.157225 maxerr: 1696.968000

iter 3 time: 0.172039 maxerr: 1477.170240

iter 4 time: 0.146178 maxerr: 1319.433024

iter 5 time: 0.123098 maxerr: 1200.918072

iter 6 time: 0.131063 maxerr: 1108.425519

iter 7 time: 0.138213 maxerr: 1033.970839

iter 8 time: 0.138295 maxerr: 972.509242

iter 9 time: 0.138113 maxerr: 920.721889

iter 10 time: 0.121553 maxerr: 876.344030

CharmLB> RotateLB: PE [0] step 0 starting at 1.489509 Memory: 72.253906 MB

CharmLB> RotateLB: PE [0] strategy starting at 1.489573

CharmLB> RotateLB: PE [0] Memory: LBManager: 920 KB CentralLB: 3 KB

CharmLB> RotateLB: PE [0] #Objects migrating: 8, LBMigrateMsg size: 0.00 MB

CharmLB> RotateLB: PE [0] strategy finished at 1.489592 duration 0.000019 s

CharmLB> RotateLB: PE [0] step 0 finished at 1.507922 duration 0.018413 s

iter 11 time: 0.152840 maxerr: 837.779089

iter 12 time: 0.136401 maxerr: 803.868831

iter 13 time: 0.138095 maxerr: 773.751705

iter 14 time: 0.139319 maxerr: 746.772667

iter 15 time: 0.139327 maxerr: 722.424056

iter 16 time: 0.141794 maxerr: 700.305763

iter 17 time: 0.142484 maxerr: 680.097726

iter 18 time: 0.141056 maxerr: 661.540528

iter 19 time: 0.153895 maxerr: 644.421422

iter 20 time: 0.198588 maxerr: 628.564089

[Partition 0][Node 0] End of program

rfvander@klondike:~/Cjacobi3D$ ./charmrun +p3 ./jacobi.iso 2 2 2 +vp8 +balancer RotateLB +LBDebug 1                                                                               

Running command: ./jacobi.iso 2 2 2 +vp8 +balancer RotateLB +LBDebug 1 +p3

Charm++: standalone mode (not using charmrun)

Charm++> Running in Multicore mode:  3 threads

^C

rfvander@klondike:~/Cjacobi3D$ ./charmrun +p3 ./jacobi.iso 2 2 2 +vp8 +balancer RotateLB +LBDebug 1 +isomalloc_sync

Running command: ./jacobi.iso 2 2 2 +vp8 +balancer RotateLB +LBDebug 1 +isomalloc_sync +p3

Charm++: standalone mode (not using charmrun)

Charm++> Running in Multicore mode:  3 threads

 

From: samt.white AT gmail.com [mailto:samt.white AT gmail.com] On Behalf Of Sam White
Sent: Wednesday, November 23, 2016 7:10 AM
To: Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com>
Cc: charm AT cs.uiuc.edu
Subject: Re: Adaptive MPI

 

Can you try an example AMPI program with load balancing? You can try charm/examples/ampi/Cjacobi3D/, running with something like './charmrun +p3 ./jacobi 2 2 2 +vp8 +balancer RotateLB +LBDebug 1'. You can also test that example with Isomalloc by running jacobi.iso (and, as the warning in the Charm preamble output suggests, run with +isomalloc_sync). It also might help to build Charm++/AMPI with '-g' to get stacktraces.

-Sam

 

On Wed, Nov 23, 2016 at 2:19 AM, Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com> wrote:

Hello Team,

 

I am trying to troubleshoot my Adaptive MPI code that uses dynamic load balancing. It crashes with a segmentation fault in AMPI_Migrate. I checked, and dchunkpup (which I supplied) is called within AMPI_Migrate and finishes on all ranks. That is not to say it is correct, but the crash is not happening there. It could have corrupted memory elsewhere, though, so I gutted it such that it only asks for and prints the MPI rank of the ranks entering it. I added graceful exit code after the call to AMPI_Migrate, but that is evidently not reached. I understand that this information is not enough for you to identify the problem, but at present I don’t know where to start, since the error occurs in code that I did not write. Could you give me some pointers on where to start? Thanks!

Below is some relevant output. If I replace the RotateLB load balancer with RefineLB, some ranks do pass the AMPI_Migrate call, but that is evidently because the load balancer left them alone.

 

Rob

 

rfvander@klondike:~/esg-prk-devel/AMPI/AMR$ make clean; make amr USE_PUPER=1

rm -f amr.o MPI_bail_out.o wtime.o  amr *.optrpt *~ charmrun stats.json amr.decl.h amr.def.h

/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicc -O3 -std=c99  -DADAPTIVE_MPI -DRESTRICT_KEYWORD=0 -DVERBOSE=0  -DDOUBLE=1   -DRADIUS=2  -DSTAR=1 -DLOOPGEN=0  -DUSE_PUPER=1  -I../../include -c amr.c

In file included from amr.c:66:0:

../../include/par-res-kern_general.h: In function ‘prk_malloc’:

../../include/par-res-kern_general.h:136:11: warning: implicit declaration of function ‘posix_memalign’ [-Wimplicit-function-declaration]

     ret = posix_memalign(&ptr,alignment,bytes);

           ^

amr.c: In function ‘AMPI_Main’:

amr.c:842:14: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘long int’ [-Wformat=]

       printf("ERROR: rank %d's BG work tile smaller than stencil radius: %d\n",

              ^

amr.c:1080:14: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=]

       printf("ERROR: rank %d's work tile %d smaller than stencil radius: %d\n",

              ^

amr.c:1518:14: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘long int’ [-Wformat=]

       printf("Rank %d about to call AMPI_Migrate in iter %d\n", my_ID, iter);

              ^

amr.c:1520:14: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘long int’ [-Wformat=]

       printf("Rank %d called AMPI_Migrate in iter %d\n", my_ID, iter);

              ^

/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicc -O3 -std=c99  -DADAPTIVE_MPI -DRESTRICT_KEYWORD=0 -DVERBOSE=0  -DDOUBLE=1   -DRADIUS=2  -DSTAR=1 -DLOOPGEN=0  -DUSE_PUPER=1  -I../../include -c ../../common/MPI_bail_out.c

In file included from ../../common/MPI_bail_out.c:51:0:

../../include/par-res-kern_general.h: In function ‘prk_malloc’:

../../include/par-res-kern_general.h:136:11: warning: implicit declaration of function ‘posix_memalign’ [-Wimplicit-function-declaration]

     ret = posix_memalign(&ptr,alignment,bytes);

           ^

/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicc -O3 -std=c99  -DADAPTIVE_MPI -DRESTRICT_KEYWORD=0 -DVERBOSE=0  -DDOUBLE=1   -DRADIUS=2  -DSTAR=1 -DLOOPGEN=0  -DUSE_PUPER=1  -I../../include -c ../../common/wtime.c

/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicc -language ampi -o amr   -O3 -std=c99  -DADAPTIVE_MPI amr.o MPI_bail_out.o wtime.o  -lm -module CommonLBs

cc1plus: warning: command line option ‘-std=c99’ is valid for C/ObjC but not for C++

 

rfvander@klondike:~/esg-prk-devel/AMPI/AMR$ /opt/charm/charm-6.7.0/bin/charmrun ./amr 20 1000 500 3 10 5 1 FINE_GRAIN +p 8 +vp 16 +balancer RotateLB +LBDebug 1           

Running command: ./amr 20 1000 500 3 10 5 1 FINE_GRAIN +p 8 +vp 16 +balancer RotateLB +LBDebug 1

Charm++: standalone mode (not using charmrun)

Charm++> Running in Multicore mode:  8 threads

Converse/Charm++ Commit ID: v6.7.0-1-gca55e1d

Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'. 

CharmLB> Verbose level 1, load balancing period: 0.5 seconds

CharmLB> Load balancer assumes all CPUs are same.

Charm++> Running on 1 unique compute nodes (16-way SMP).

Charm++> cpu topology info is gathered in 0.001 seconds.

[0] RotateLB created

Parallel Research Kernels Version 2.17

MPI AMR stencil execution on 2D grid

Number of ranks                 = 16

Background grid size            = 1000

Radius of stencil               = 2

Tiles in x/y-direction on BG    = 4/4

Tiles in x/y-direction on ref 0 = 4/4

Tiles in x/y-direction on ref 1 = 4/4

Tiles in x/y-direction on ref 2 = 4/4

Tiles in x/y-direction on ref 3 = 4/4

Type of stencil                 = star

Data type                       = double precision

Compact representation of stencil loop body

Number of iterations            = 20

Load balancer                   = FINE_GRAIN

Refinement rank spread          = 16

Refinements:

   Background grid points       = 500

   Grid size                    = 3993

   Refinement level             = 3

   Period                       = 10

   Duration                     = 5

   Sub-iterations               = 1

Rank 12 about to call AMPI_Migrate in iter 0

Rank 12 entered dchunkpup

Rank 7 about to call AMPI_Migrate in iter 0

Rank 7 entered dchunkpup

Rank 8 about to call AMPI_Migrate in iter 0

Rank 8 entered dchunkpup

Rank 4 about to call AMPI_Migrate in iter 0

Rank 4 entered dchunkpup

Rank 15 about to call AMPI_Migrate in iter 0

Rank 15 entered dchunkpup

Rank 11 about to call AMPI_Migrate in iter 0

Rank 11 entered dchunkpup

Rank 3 about to call AMPI_Migrate in iter 0

Rank 1 about to call AMPI_Migrate in iter 0

Rank 1 entered dchunkpup

Rank 3 entered dchunkpup

Rank 13 about to call AMPI_Migrate in iter 0

Rank 13 entered dchunkpup

Rank 6 about to call AMPI_Migrate in iter 0

Rank 6 entered dchunkpup

Rank 0 about to call AMPI_Migrate in iter 0

Rank 0 entered dchunkpup

Rank 9 about to call AMPI_Migrate in iter 0

Rank 9 entered dchunkpup

Rank 5 about to call AMPI_Migrate in iter 0

Rank 5 entered dchunkpup

Rank 2 about to call AMPI_Migrate in iter 0

Rank 2 entered dchunkpup

Rank 10 about to call AMPI_Migrate in iter 0

Rank 10 entered dchunkpup

Rank 14 about to call AMPI_Migrate in iter 0

Rank 14 entered dchunkpup

CharmLB> RotateLB: PE [0] step 0 starting at 0.507547 Memory: 990.820312 MB

CharmLB> RotateLB: PE [0] strategy starting at 0.511685

CharmLB> RotateLB: PE [0] Memory: LBManager: 920 KB CentralLB: 19 KB

CharmLB> RotateLB: PE [0] #Objects migrating: 16, LBMigrateMsg size: 0.00 MB

CharmLB> RotateLB: PE [0] strategy finished at 0.511696 duration 0.000011 s

Segmentation fault (core dumped)

 

 

 

 

 



