charm - Re: [charm] Migration of bound arrays

charm AT lists.cs.illinois.edu

  • From: Jozsef Bakosi <jbakosi AT lanl.gov>
  • To: Ronak Buch <rabuch2 AT illinois.edu>
  • Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Migration of bound arrays
  • Date: Thu, 24 Jan 2019 11:03:49 -0700

Hi Ronak,

Thanks for confirming the correctness. I have done some more digging and
found a potential problem, though it seems to be unrelated to bound arrays.

Consider the attached code (a.ci, b.ci, main.ci, bound_migrate.C, and a
Makefile). If you fill in the path to charmc in the Makefile, the code should
compile and the problem should be reproducible.

If quiescence detection is on and I run the code with a load balancer,
quiescence is triggered:

=============================================================================
$ ./charmrun +p4 ./bound_migrate 10 +balancer RandCentLB +LBDebug 1 +cs

Running on 4 processors: ./bound_migrate 10 +balancer RandCentLB +LBDebug 1 +cs
charmrun> /usr/bin/setarch x86_64 -R mpirun -np 4 ./bound_migrate 10 +balancer RandCentLB +LBDebug 1 +cs
Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: MPI_THREAD_SINGLE (desired: MPI_THREAD_SINGLE)
Charm++> Running in non-SMP mode: 4 processes (PEs)
Converse/Charm++ Commit ID:
Charm++> Using STL-based msgQ:
Charm++> Using randomized msgQ. Priorities will not be respected!
CharmLB> Verbose level 1, load balancing period: 0.5 seconds
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (2 sockets x 18 cores x 1 PUs = 36-way SMP)
Charm++> cpu topology info is gathered in 0.001 seconds.
CharmLB> RandCentLB created.
Running on 4 processors using 10 elements
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Quiescence detected
...
=============================================================================

However, when LB is not used, it runs without error:

=============================================================================
$ ./charmrun +p4 ./bound_migrate 10

Running on 4 processors: ./bound_migrate 10
charmrun> /usr/bin/setarch x86_64 -R mpirun -np 4 ./bound_migrate 10
Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: MPI_THREAD_SINGLE (desired: MPI_THREAD_SINGLE)
Charm++> Running in non-SMP mode: 4 processes (PEs)
Converse/Charm++ Commit ID:
Charm++> Using STL-based msgQ:
Charm++> Using randomized msgQ. Priorities will not be respected!
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (2 sockets x 18 cores x 1 PUs = 36-way SMP)
Charm++> cpu topology info is gathered in 0.002 seconds.
Running on 4 processors using 10 elements
A 0 on 0
A 0 on 0
A 0 on 0
...
=============================================================================

Also, if one turns off quiescence detection in bound_migrate.C:

// CkStartQD(
// CkCallback( CkIndex_Main::quiescence(), thisProxy ) );

migration now works fine:

=============================================================================
$ ./charmrun +p4 ./bound_migrate 10 +balancer RandCentLB +LBDebug 1 +cs

Running on 4 processors: ./bound_migrate 10 +balancer RandCentLB +LBDebug 1 +cs
charmrun> /usr/bin/setarch x86_64 -R mpirun -np 4 ./bound_migrate 10 +balancer RandCentLB +LBDebug 1 +cs
Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: MPI_THREAD_SINGLE (desired: MPI_THREAD_SINGLE)
Charm++> Running in non-SMP mode: 4 processes (PEs)
Converse/Charm++ Commit ID:
Charm++> Using STL-based msgQ:
Charm++> Using randomized msgQ. Priorities will not be respected!
CharmLB> Verbose level 1, load balancing period: 0.5 seconds
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (2 sockets x 18 cores x 1 PUs = 36-way SMP)
Charm++> cpu topology info is gathered in 0.002 seconds.
CharmLB> RandCentLB created.
Running on 4 processors using 10 elements

CharmLB> RandCentLB: PE [0] step 0 starting at 0.503235 Memory: 1.640625 MB
CharmLB> RandCentLB: PE [0] strategy starting at 0.503726
Calling RandCentLB strategy
CharmLB> RandCentLB: PE [0] Memory: LBManager: 41 KB CentralLB: 1 KB
CharmLB> RandCentLB: PE [0] #Objects migrating: 9, LBMigrateMsg size: 0.00 MB
CharmLB> RandCentLB: PE [0] strategy finished at 0.503734 duration 0.000009 s
A 1 on 1
A 6 on 1
A 8 on 1
A 9 on 2
A 0 on 2
A 4 on 2
A 2 on 3
A 5 on 0
A 3 on 0
A 7 on 0
CharmLB> RandCentLB: PE [0] step 0 finished at 0.503937 duration 0.000702 s


CharmLB> RandCentLB: PE [0] step 1 starting at 1.003931 Memory: 1.640625 MB
CharmLB> RandCentLB: PE [0] strategy starting at 1.003990
Calling RandCentLB strategy
CharmLB> RandCentLB: PE [0] Memory: LBManager: 41 KB CentralLB: 1 KB
CharmLB> RandCentLB: PE [0] #Objects migrating: 8, LBMigrateMsg size: 0.00 MB
CharmLB> RandCentLB: PE [0] strategy finished at 1.003997 duration 0.000007 s
A 7 on 0
CharmLB> RandCentLB: PE [0] step 1 finished at 1.004095 duration 0.000164 s

...

All done
Charm Kernel Summary Statistics:
Proc 0: [8 created, 8 processed]
Proc 1: [0 created, 0 processed]
Proc 2: [0 created, 0 processed]
Proc 3: [0 created, 0 processed]
Total Chares: [8 created, 8 processed]
Charm Kernel Detailed Statistics (R=requested P=processed):

         Create    Mesgs     Create    Mesgs     Create    Mesgs
         Chare     for       Group     for       Nodegroup for
PE   R/P Mesgs     Chares    Mesgs     Groups    Mesgs     Nodegroups
---- --- --------- --------- --------- --------- --------- ----------
   0 R           8         0        10       191         0          0
     P           8         0        10       203         0          0
   1 R           0         0         0        80         0          0
     P           0         0        10        78         0          0
   2 R           0         0         0        76         0          0
     P           0         0        10        70         0          0
   3 R           0         0         0        75         0          0
     P           0         0        10        74         0          0
[Partition 0][Node 0] End of program
=============================================================================

I'm not sure if this matters, but note that I am running this with randomized
queues, on top of OpenMPI, and in non-SMP mode, with Charm++ built with:

$ ./build charm++ mpi-linux-x86_64 --enable-error-checking --with-prio-type=int --enable-randomized-msgq --suffix randq-debug --build-shared -j36 -w -g

I bumped into this while trying to switch from migration using non-blocking
AtSync to blocking AtSync (i.e., waiting for ResumeFromSync). It appears that
when quiescence detection is enabled and blocking AtSync migration is used,
quiescence is always detected (right within the first call to AtSync) and
ResumeFromSync is never called. Non-blocking AtSync migration appears to work
fine with quiescence detection. Is this a bug, or am I not using this
correctly?
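
For clarity, here is a minimal sketch of the two patterns I mean. The class and
method names are made up for illustration (they are not in the attached code),
and both constructors are assumed to set usesAtSync = true, as in the attached
code:

// "Blocking" AtSync: do nothing after AtSync() and continue only when the
// runtime calls ResumeFromSync() after load balancing (the pattern A uses in
// the attached bound_migrate.C).
class Blocking : public CBase_Blocking {       // hypothetical chare array
  void step() { AtSync(); }                    // wait for the load balancer
  void ResumeFromSync() override { step(); }   // resume iterating after LB
};

// "Non-blocking" AtSync: call AtSync() as a hint, but keep driving the
// iteration with ordinary entry-method messages instead of waiting for
// ResumeFromSync(). (next() would have to be declared as an entry method.)
class NonBlocking : public CBase_NonBlocking { // hypothetical chare array
  void step() {
    AtSync();
    thisProxy[ thisIndex ].next();             // continue immediately via a message
  }
  void next() { step(); }
};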

Eric Mikida: I suspect this is the reason I could not switch to using
ResumeFromSync in the branch
https://github.com/quinoacomputing/quinoa/tree/ResumeFromSync.

Thanks,
Jozsef

On 01.10.2019 13:42, Ronak Buch wrote:
> Your code seems to be correct; this should migrate A and B together. The
> only potential issue I can see is that the code you've provided doesn't use
> resumeFromSync, which should be used to guarantee that migrations have
> actually completed.
>
> Traditionally, m_a isn't pupped; it's set as a readonly instead and then
> just used globally. However, pupping it as you are doing should work.
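> For example, the readonly approach looks roughly like this (a sketch only;
> "aProxy" is just an example name, not something in your code):
>
>   /* in the .ci file */
>   readonly CProxy_A aProxy;
>
>   /* in the .C file, at global scope */
>   CProxy_A aProxy;
>
>   /* in Main's constructor, before the array starts working */
>   aProxy = CProxy_A::ckNew();
>
>   /* B then uses the global proxy instead of pupping its own m_a */
>   A* aptr() { return aProxy[ thisIndex ].ckLocal(); }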
>
> Waiting for resumeFromSync should make it safe. I believe resumeFromSync
> will only be called on A, so you can do some messaging to let B know.
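>
> Roughly (again just a sketch; "migrated" is a hypothetical entry method you
> would add to B's .ci file, and m_b is a CProxy_B that A would need to hold):
>
>   void A::ResumeFromSync() {
>     // A and B are bound, so they share thisIndex and sit on the same PE
>     m_b[ thisIndex ].migrated();  // tell B it is safe to call aptr() again
>   }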
>
>
> All of the above is what I think should be true based on how I understand
> your sample code and description. I'll try to create a simple proof of
> concept to replicate your issue and do some additional checking and testing.
> If you have compilable sample code that you can share that exhibits this
> problem, that would also be appreciated.
>
> Thanks,
> Ronak
>
> On Tue, Jan 8, 2019 at 5:21 PM Jozsef Bakosi
> <jbakosi AT lanl.gov>
> wrote:
>
> > Hi folks,
> >
> > Consider the following:
> >
> > =========================================
> > class A : public CBase_A {
> > public:
> >   A() { usesAtSync = true; }
> > };
> >
> > class B : public CBase_B {
> > public:
> >   B( CProxy_A& a ) : m_a( a ) {}
> >
> >   void pup( PUP::er& p ) { p | m_a; }
> >
> >   A* aptr() { return m_a[ thisIndex ].ckLocal(); }
> >
> >   void step() { if (time to migrate) aptr()->AtSync(); }
> >
> > private:
> >   CProxy_A m_a;
> > };
> >
> >
> > CProxy_A a = CProxy_A::ckNew();
> >
> > CkArrayOptions bound;
> > bound.bindTo( a );
> > CProxy_B b = CProxy_B::ckNew( bound );
> >
> > a[0].insert();
> > b[0].insert( a );
> >
> > a.doneInserting();
> > b.doneInserting();
> > =========================================
> >
> > In words: I have two bound chare arrays, A and B. In B I hold a proxy to
> > A and dereference it via aptr() to get a raw pointer to the bound object.
> >
> > B::step() is called as part of an iteration and, when it is time to
> > migrate, calls A::AtSync(). Calling aptr() has worked fine so far without
> > migration. When I migrate, however, I randomly get segfaults from aptr().
> >
> > Questions:
> >
> > 1. Is this a correct way of migrating A and B together?
> >
> > 2. Is this a correct way of using usesAtSync and AtSync?
> >
> > 3. Is it mandatory to pup m_a as written above?
> >
> > 4. How do I know that calling aptr() is safe after migration? In other
> > words, how do I know that when I call B::aptr() A has already arrived?
> >
> > Thanks,
> > Jozsef

a.ci:

module a {

  array [1D] A {
    entry A( int maxit );
  }

};

b.ci:

module b {

  array [1D] B {
    entry B();
  }

}

main.ci:

mainmodule main {

  extern module a;
  extern module b;

  readonly CProxy_Main mainProxy;

  mainchare Main {
    entry Main(CkArgMsg *m);
    entry [reductiontarget] void done();
    entry void quiescence();
  };

};

bound_migrate.C:

#include <stdio.h>

#include "main.decl.h"
#include "a.decl.h"
#include "b.decl.h"

CProxy_Main mainProxy;

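// 1D chare array element: each call to step() increments the iteration count
// and calls AtSync(); after load balancing the runtime calls ResumeFromSync(),
// which prints the element's (possibly new) PE and starts the next iteration,
// until m_maxit iterations have been performed.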
class A : public CBase_A {

  public:
    explicit A( int maxit ) : m_maxit(maxit), m_it(0) {
      usesAtSync = true;
      step();
    }

    explicit A( CkMigrateMessage* ) {}

    void pup( PUP::er &p ) override {
      p | m_maxit;
      p | m_it;
    }

    void ResumeFromSync() override {
      CkPrintf("A %d on %d\n",thisIndex,CkMyPe());
      step();
    }
  
  private:
    int m_maxit;
    int m_it;
    void step() {
      if (++m_it == m_maxit)
        contribute( CkCallback(CkReductionTarget(Main,done),
                    mainProxy) );
      else {
        AtSync();
      }
    }
};

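// Element of the B array that would be bound to A; it is unused in this
// reproducer (its creation is commented out in Main below), which suggests
// the problem is unrelated to bound arrays.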
class B : public CBase_B {
  public:
    explicit B() { }
    explicit B( CkMigrateMessage* ) {}
};

/*mainchare*/
class Main : public CBase_Main {
  public:
    Main(CkArgMsg* m) {
      int nelem=5;
      if (m->argc > 1) nelem=atoi(m->argv[1]);
      delete m;

      CkPrintf("Running on %d processors using %d elements\n",
               CkNumPes(),nelem);
      mainProxy = thisProxy;

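      // Quiescence detection: with this CkStartQD call enabled and a load
      // balancer in use, the run aborts with "Quiescence detected"; with it
      // commented out, migration proceeds normally.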
//       CkStartQD(
//         CkCallback( CkIndex_Main::quiescence(), thisProxy ) );

      CkArrayOptions bound;
      CProxy_A a = CProxy_A::ckNew();
      //CProxy_B b = CProxy_B::ckNew( bound );

      for (int i=0; i<nelem; ++i) {
        a[i].insert( 10 );
        //b[i].insert();
      }
      a.doneInserting();
      //b.doneInserting();
    };

    void quiescence() {
      CkAbort("Quiescence detected");
    }

    void done()
    {
      CkPrintf("All done\n");
      CkExit();
    };
};

#include "a.def.h"
#include "b.def.h"
#include "main.def.h"

Makefile:

LB=-module CommonLB

CHARMC=<path-to-charm-install>/bin/charmc $(OPTS)

OBJS = bound_migrate.o

all: bound_migrate

bound_migrate: $(OBJS)
	$(CHARMC) -language charm++ -module CommonLBs -o bound_migrate $(OBJS)

main.decl.h: main.ci
	$(CHARMC) main.ci

a.decl.h: a.ci
	$(CHARMC) a.ci

b.decl.h: b.ci
	$(CHARMC) b.ci

clean:
	rm -f *.decl.h *.def.h *.o bound_migrate charmrun

bound_migrate.o: bound_migrate.C main.decl.h a.decl.h b.decl.h
	$(CHARMC) -c bound_migrate.C


