[charm] Adaptive MPI


  • From: "Van Der Wijngaart, Rob F" <rob.f.van.der.wijngaart AT intel.com>
  • To: Sam White <white67 AT illinois.edu>
  • Cc: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: [charm] Adaptive MPI
  • Date: Wed, 23 Nov 2016 08:19:33 +0000

Hello Team,

 

I am trying to troubleshoot my Adaptive MPI code, which uses dynamic load balancing. It crashes with a segmentation fault in AMPI_Migrate. I verified that dchunkpup (the pup routine I supplied) is called from within AMPI_Migrate and finishes on all ranks; that does not prove it is correct, but the crash is not happening there. Since it could still have corrupted memory elsewhere, I gutted it so that it only queries and prints the MPI rank of each process entering it. I also added graceful exit code after the call to AMPI_Migrate, but that is evidently never reached. I understand this information is not enough for you to identify the problem, but at present I don't know where to start, since the error occurs in code that I did not write. Could you give me some pointers? Thanks!
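For concreteness, here is a minimal sketch of the gutted routine and how it is hooked up. This is illustrative only, not the actual code: it assumes the MPI_Register/pup_er C interface of AMPI from that era, and chunk_t is a placeholder for the real state struct.

    #include <stdio.h>
    #include <mpi.h>   /* AMPI's mpi.h; it also declares the pup_er C interface */

    typedef struct { int dummy; } chunk_t;  /* stand-in for the real chunk state */
    static chunk_t chunk;

    /* Gutted puper: only queries and prints the calling rank. */
    void dchunkpup(pup_er p, void *data) {
      chunk_t *c = (chunk_t *)data;
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      printf("Rank %d entered dchunkpup\n", rank);
      pup_int(p, &c->dummy);   /* nothing else is sized, packed, or unpacked */
    }

    /* Registered once at startup: MPI_Register(&chunk, dchunkpup);
       then in the iteration loop: AMPI_Migrate(); followed by a graceful
       exit (MPI_Finalize and return) that is evidently never reached. */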

Below is some relevant output. If I replace the RotateLB load balancer with RefineLB, some ranks do get past the AMPI_Migrate call, but evidently only because the load balancer left those ranks alone.
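For reference, the RefineLB run differs only in the balancer name on the command line, e.g.:

    /opt/charm/charm-6.7.0/bin/charmrun ./amr 20 1000 500 3 10 5 1 FINE_GRAIN +p 8 +vp 16 +balancer RefineLB +LBDebug 1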

 

Rob

 

rfvander@klondike:~/esg-prk-devel/AMPI/AMR$ make clean; make amr USE_PUPER=1

rm -f amr.o MPI_bail_out.o wtime.o  amr *.optrpt *~ charmrun stats.json amr.decl.h amr.def.h

/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicc -O3 -std=c99  -DADAPTIVE_MPI -DRESTRICT_KEYWORD=0 -DVERBOSE=0  -DDOUBLE=1   -DRADIUS=2  -DSTAR=1 -DLOOPGEN=0  -DUSE_PUPER=1  -I../../include -c amr.c

In file included from amr.c:66:0:

../../include/par-res-kern_general.h: In function ‘prk_malloc’:

../../include/par-res-kern_general.h:136:11: warning: implicit declaration of function ‘posix_memalign’ [-Wimplicit-function-declaration]

     ret = posix_memalign(&ptr,alignment,bytes);

           ^

amr.c: In function ‘AMPI_Main’:

amr.c:842:14: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘long int’ [-Wformat=]

       printf("ERROR: rank %d's BG work tile smaller than stencil radius: %d\n",

              ^

amr.c:1080:14: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=]

       printf("ERROR: rank %d's work tile %d smaller than stencil radius: %d\n",

              ^

amr.c:1518:14: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘long int’ [-Wformat=]

       printf("Rank %d about to call AMPI_Migrate in iter %d\n", my_ID, iter);

              ^

amr.c:1520:14: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘long int’ [-Wformat=]

       printf("Rank %d called AMPI_Migrate in iter %d\n", my_ID, iter);

              ^
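(Aside: these -Wformat warnings flag a genuine mismatch; my_ID and iter are evidently declared long int, so the conversions should be %ld rather than %d, e.g.:

    printf("Rank %ld about to call AMPI_Migrate in iter %ld\n", my_ID, iter);

On LP64 Linux this typically still prints small values correctly, but it is undefined behavior and worth fixing.)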

/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicc -O3 -std=c99  -DADAPTIVE_MPI -DRESTRICT_KEYWORD=0 -DVERBOSE=0  -DDOUBLE=1   -DRADIUS=2  -DSTAR=1 -DLOOPGEN=0  -DUSE_PUPER=1  -I../../include -c ../../common/MPI_bail_out.c

In file included from ../../common/MPI_bail_out.c:51:0:

../../include/par-res-kern_general.h: In function ‘prk_malloc’:

../../include/par-res-kern_general.h:136:11: warning: implicit declaration of function ‘posix_memalign’ [-Wimplicit-function-declaration]

     ret = posix_memalign(&ptr,alignment,bytes);

           ^

/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicc -O3 -std=c99  -DADAPTIVE_MPI -DRESTRICT_KEYWORD=0 -DVERBOSE=0  -DDOUBLE=1   -DRADIUS=2  -DSTAR=1 -DLOOPGEN=0  -DUSE_PUPER=1  -I../../include -c ../../common/wtime.c

/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicc -language ampi -o amr   -O3 -std=c99  -DADAPTIVE_MPI amr.o MPI_bail_out.o wtime.o  -lm -module CommonLBs

cc1plus: warning: command line option ‘-std=c99’ is valid for C/ObjC but not for C++

 

rfvander@klondike:~/esg-prk-devel/AMPI/AMR$ /opt/charm/charm-6.7.0/bin/charmrun ./amr 20 1000 500 3 10 5 1 FINE_GRAIN +p 8 +vp 16 +balancer RotateLB +LBDebug 1           

Running command: ./amr 20 1000 500 3 10 5 1 FINE_GRAIN +p 8 +vp 16 +balancer RotateLB +LBDebug 1

Charm++: standalone mode (not using charmrun)

Charm++> Running in Multicore mode:  8 threads

Converse/Charm++ Commit ID: v6.7.0-1-gca55e1d

Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'. 

CharmLB> Verbose level 1, load balancing period: 0.5 seconds

CharmLB> Load balancer assumes all CPUs are same.

Charm++> Running on 1 unique compute nodes (16-way SMP).

Charm++> cpu topology info is gathered in 0.001 seconds.

[0] RotateLB created

Parallel Research Kernels Version 2.17

MPI AMR stencil execution on 2D grid

Number of ranks                 = 16

Background grid size            = 1000

Radius of stencil               = 2

Tiles in x/y-direction on BG    = 4/4

Tiles in x/y-direction on ref 0 = 4/4

Tiles in x/y-direction on ref 1 = 4/4

Tiles in x/y-direction on ref 2 = 4/4

Tiles in x/y-direction on ref 3 = 4/4

Type of stencil                 = star

Data type                       = double precision

Compact representation of stencil loop body

Number of iterations            = 20

Load balancer                   = FINE_GRAIN

Refinement rank spread          = 16

Refinements:

   Background grid points       = 500

   Grid size                    = 3993

   Refinement level             = 3

   Period                       = 10

   Duration                     = 5

   Sub-iterations               = 1

Rank 12 about to call AMPI_Migrate in iter 0

Rank 12 entered dchunkpup

Rank 7 about to call AMPI_Migrate in iter 0

Rank 7 entered dchunkpup

Rank 8 about to call AMPI_Migrate in iter 0

Rank 8 entered dchunkpup

Rank 4 about to call AMPI_Migrate in iter 0

Rank 4 entered dchunkpup

Rank 15 about to call AMPI_Migrate in iter 0

Rank 15 entered dchunkpup

Rank 11 about to call AMPI_Migrate in iter 0

Rank 11 entered dchunkpup

Rank 3 about to call AMPI_Migrate in iter 0

Rank 1 about to call AMPI_Migrate in iter 0

Rank 1 entered dchunkpup

Rank 3 entered dchunkpup

Rank 13 about to call AMPI_Migrate in iter 0

Rank 13 entered dchunkpup

Rank 6 about to call AMPI_Migrate in iter 0

Rank 6 entered dchunkpup

Rank 0 about to call AMPI_Migrate in iter 0

Rank 0 entered dchunkpup

Rank 9 about to call AMPI_Migrate in iter 0

Rank 9 entered dchunkpup

Rank 5 about to call AMPI_Migrate in iter 0

Rank 5 entered dchunkpup

Rank 2 about to call AMPI_Migrate in iter 0

Rank 2 entered dchunkpup

Rank 10 about to call AMPI_Migrate in iter 0

Rank 10 entered dchunkpup

Rank 14 about to call AMPI_Migrate in iter 0

Rank 14 entered dchunkpup

CharmLB> RotateLB: PE [0] step 0 starting at 0.507547 Memory: 990.820312 MB

CharmLB> RotateLB: PE [0] strategy starting at 0.511685

CharmLB> RotateLB: PE [0] Memory: LBManager: 920 KB CentralLB: 19 KB

CharmLB> RotateLB: PE [0] #Objects migrating: 16, LBMigrateMsg size: 0.00 MB

CharmLB> RotateLB: PE [0] strategy finished at 0.511696 duration 0.000011 s

Segmentation fault (core dumped)

 



