
charm - Re: [charm] Restart from checkpoint files

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Jozsef Bakosi <jbakosi AT lanl.gov>
  • To: Eric Mikida <epmikida AT hpccharm.com>
  • Cc: "Mikida, Eric P" <mikida2 AT illinois.edu>, "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Restart from checkpoint files
  • Date: Wed, 26 Jun 2019 13:20:48 -0600

On 06.24.2019 11:53, Eric Mikida wrote:
> Hey Jozsef,
>
> Yes, deleting the message in the migration ctor of main does cause the error
> for me as well. I will look into that and make sure it gets added to
> the documentation.

Thanks, Eric, for confirming this.

> As for running the test yourself, I’m not sure why you are seeing that
> error. It looks like it may be using the wrong path to the testrun script.
> The test lives in tests/charm++/chkpt, so the path to testrun should just be
> ../../../bin/testrun. The path is set up in tests/common.mk.

I'm not sure either. I'm probably not running the test correctly. But
what you confirmed above is what I wanted to see, so this is all okay.

Now we can successfully do checkpoint/restart in both serial and
parallel: https://github.com/quinoacomputing/quinoa/pull/328.

Thanks for the help,
Jozsef
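
The ownership rule worked out in this thread (delete the CkArgMsg in the main chare constructor, but do not delete the CkMigrateMessage in the migration constructor) can be illustrated with a small standalone toy model. This is only a sketch and does not use Charm++ at all: ToyMsg, toyFree, toyInvoke, and the three constructor stand-ins are hypothetical names, and the idea that the runtime frees the migration message itself after the constructor returns is an assumption inferred from the "duplicate free" abort reported below, not a statement about the actual Charm++ implementation.

```cpp
#include <cassert>
#include <stdexcept>

// Toy stand-in for a message. NOT the Charm++ CkArgMsg/CkMigrateMessage.
struct ToyMsg {
    int refCount = 1;  // a freshly created message has one live reference
};

// Hypothetical reference-counted free: refuses to free an already-freed
// message, loosely mirroring the "reference count was zero" abort.
void toyFree(ToyMsg& m) {
    if (m.refCount == 0)
        throw std::logic_error("reference count was zero -- duplicate free?");
    --m.refCount;
}

// Stand-in for Main(CkArgMsg*): the constructor owns m and frees it.
void mainCtorWithDelete(ToyMsg& m) { toyFree(m); }

// Stand-in for Main(CkMigrateMessage*) with the problematic delete.
void migrateCtorWithDelete(ToyMsg& m) { toyFree(m); }

// Stand-in for Main(CkMigrateMessage*) that leaves m alone.
void migrateCtorWithoutDelete(ToyMsg&) {}

// Models the caller of a constructor. If runtimeOwnsMsg is true (assumed
// here for the migration case), the caller frees the message afterwards.
// Returns true when no duplicate free was detected.
bool toyInvoke(void (*ctor)(ToyMsg&), bool runtimeOwnsMsg) {
    ToyMsg m;
    try {
        ctor(m);
        if (runtimeOwnsMsg) toyFree(m);
    } catch (const std::logic_error&) {
        return false;
    }
    return true;
}

// Exercises the three cases discussed in the thread.
void demo() {
    assert(toyInvoke(mainCtorWithDelete, false));       // delete is fine
    assert(!toyInvoke(migrateCtorWithDelete, true));    // duplicate free
    assert(toyInvoke(migrateCtorWithoutDelete, true));  // no delete, no error
}
```

Calling demo() from a main function runs all three cases. Only the second one, where both the constructor and the assumed owner free the same message, trips the duplicate-free check.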

> > On Jun 24, 2019, at 10:36 AM, Jozsef Bakosi
> > <jbakosi AT lanl.gov>
> > wrote:
> >
> > Thanks, Eric,
> >
> > The problem was that I called delete on the migrate constructor argument
> > returning from the checkpoint.
> >
> > It appears that it is okay (advised?) to call delete in the main chare
> > constructor but NOT okay to do the same in the main chare migrate
> > constructor.
> >
> > In essence, my problem was equivalent to adding the delete in
> > charm/tests/charm++/chkpt/hello.C as follows:
> >
> > Main(CkArgMsg* m) {
> >   // ... use m ...
> >   delete m; // okay
> >   // ...
> > }
> >
> > Main(CkMigrateMessage* m) : CBase_Main(m) {
> >   // ... use m ...
> >   delete m; // this should not be here
> >   // ...
> > }
> >
> > I'm having trouble running that example though for a different reason:
> >
> > $ make test
> > rm -rf log/
> > rm -fr log
> > ../../../../bin/testrun ./hello +p4
> > make: ../../../../bin/testrun: Command not found
> > make: *** [Makefile:24: test] Error 127
> >
> > Does adding the delete in this example yield the same runtime error for
> > you?
> >
> > Thanks,
> > Jozsef
> >
> > On 06.21.2019 19:06, Mikida, Eric P wrote:
> >> Hey Jozsef,
> >>
> >>
> >> Are you also compiling your application with -g? It looks like the only
> >> debug symbols showing up in the trace are from within the runtime system
> >> itself.
> >>
> >>
> >> If this is an error coming from your application, it will come from a
> >> message being deleted twice. This could occur either when a message
> >> pointer is shared across multiple chares that both try to delete it,
> >> or when an entry method is marked [nokeep] but you still delete the
> >> message yourself.
> >>
> >>
> >> Eric
> >>
> >> ________________________________
> >> From: Jozsef Bakosi
> >> <jbakosi AT lanl.gov>
> >> Sent: Saturday, June 15, 2019 7:58:19 AM
> >> To:
> >> charm AT lists.cs.illinois.edu
> >> Subject: [charm] Restart from checkpoint files
> >>
> >> Hi folks,
> >>
> >> When restarting from checkpoint files, I'm getting:
> >>
> >> [0]CkRestartMain done. sending out callback.
> >> ------------- Processor 0 Exiting: Called CmiAbort ------------
> >> Reason: CmiFree reference count was zero-- is this a duplicate free?
> >> [0] Stack Traceback:
> >> [0:0] [0x76464e]
> >> [0:1] [0x761898]
> >> [0:2] [0x768b04]
> >> [0:3] CkFreeMsg+0x28 [0x60a398]
> >> [0:4] [0x5dab08]
> >> [0:5] [0x76430a]
> >> [0:6] [0x763c08]
> >> [0:7] [0x5dc45e]
> >> [0:8] [0x5d4a12]
> >> [0:9] __libc_start_main+0xeb [0x7fffec65309b]
> >> [0:10] [0x4d790a]
> >> --------------------------------------------------------------------------
> >> MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 DUP
> >> FROM 0 with errorcode 1
> >>
> >> I tried the link option -memory paranoid and the runtime option ++debug,
> >> and also built Charm++ in debug mode, with no luck getting more info.
> >>
> >> build charm++ mpi-linux-x86_64 --enable-error-checking
> >> --with-prio-type=int --enable-randomized-msgq --suffix randq-debug
> >> --build-shared -j36 -w -stdlib=libc++ -g
> >>
> >> How can I get more information on which chare/group runs into the error
> >> when returning from a checkpoint?
> >>
> >> Thanks,
> >> Jozsef


