
charm - Re: [charm] Restart from checkpoint files

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Jozsef Bakosi <jbakosi AT lanl.gov>
  • To: "Mikida, Eric P" <mikida2 AT illinois.edu>
  • Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Restart from checkpoint files
  • Date: Mon, 24 Jun 2019 08:36:51 -0600

Thanks, Eric,

The problem was that I called delete on the CkMigrateMessage argument of
the main chare's migrate constructor, which runs when restarting from the
checkpoint.

It appears that it is okay (advised?) to call delete on the message in the
regular main chare constructor, but NOT okay to do the same in the main
chare's migrate constructor.
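
To illustrate the ownership rule, here is a minimal plain C++ sketch. It is a toy model, not Charm++ code: ToyMsg, toyFree, constructFromArgs and restoreFromCheckpoint are hypothetical names, and it assumes (as the abort message suggests, but I have not confirmed in the runtime source) that the runtime frees the migrate message itself after the migrate constructor returns.

```cpp
#include <cassert>

// Toy stand-in for a runtime-managed message. NOT a Charm++ type.
struct ToyMsg {
    int refCount = 1;  // 1 while the message is live, 0 once it has been freed
};

// Toy stand-in for the runtime's free routine. Instead of aborting on a
// duplicate free, it reports the problem by returning false.
bool toyFree(ToyMsg& m) {
    if (m.refCount == 0) return false;  // reference count already zero
    --m.refCount;
    return true;
}

// Regular construction: the user code owns the message and frees it once.
bool constructFromArgs(ToyMsg& m) {
    return toyFree(m);  // models "delete m;" in Main(CkArgMsg*)
}

// Restart path (assumption: the runtime frees the message afterwards).
// If the user code also frees it, the runtime's free is a duplicate.
bool restoreFromCheckpoint(ToyMsg& m, bool userDeletes) {
    bool ok = true;
    if (userDeletes) ok = toyFree(m);  // models "delete m;" in the migrate ctor
    return toyFree(m) && ok;           // models the runtime's own free
}
```

In this model, restoreFromCheckpoint(msg, true) returns false because the second free sees a zero reference count, which mirrors the "CmiFree reference count was zero" abort quoted below.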

In essence, my problem was equivalent to adding the delete in
charm/tests/charm++/chkpt/hello.C as follows:

Main(CkArgMsg* m) {
  ... use m ...
  delete m; // okay
  ...
}

Main(CkMigrateMessage *m) : CBase_Main(m) {
  ... use m ...
  delete m; // NOT okay: this was the cause of my duplicate free
  ...
}

I'm having trouble running that example, though, for a different reason:

$ make test
rm -rf log/
rm -fr log
../../../../bin/testrun ./hello +p4
make: ../../../../bin/testrun: Command not found
make: *** [Makefile:24: test] Error 127

Does adding the delete in this example yield the same runtime error for
you?

Thanks,
Jozsef

On 06.21.2019 19:06, Mikida, Eric P wrote:
> Hey Jozsef,
>
>
> Are you also compiling your application with -g? It looks like the only
> debug symbols showing up in the trace are from within the runtime system
> itself.
>
>
> If this error is coming from your application, it will come from a message
> being deleted twice. This can happen either when a message pointer is shared
> across multiple chares that each try to delete it, or when an entry method
> is marked [nokeep] but you still delete the message yourself.
>
>
> Eric
>
> ________________________________
> From: Jozsef Bakosi <jbakosi AT lanl.gov>
> Sent: Saturday, June 15, 2019 7:58:19 AM
> To: charm AT lists.cs.illinois.edu
> Subject: [charm] Restart from checkpoint files
>
> Hi folks,
>
> Restarting from checkpoint files I'm getting:
>
> [0]CkRestartMain done. sending out callback.
> ------------- Processor 0 Exiting: Called CmiAbort ------------
> Reason: CmiFree reference count was zero-- is this a duplicate free?
> [0] Stack Traceback:
> [0:0] [0x76464e]
> [0:1] [0x761898]
> [0:2] [0x768b04]
> [0:3] CkFreeMsg+0x28 [0x60a398]
> [0:4] [0x5dab08]
> [0:5] [0x76430a]
> [0:6] [0x763c08]
> [0:7] [0x5dc45e]
> [0:8] [0x5d4a12]
> [0:9] __libc_start_main+0xeb [0x7fffec65309b]
> [0:10] [0x4d790a]
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 DUP
> FROM 0 with errorcode 1
>
> I tried the link option -memory paranoid and the runtime option ++debug,
> and also built Charm++ in debug mode, with no luck getting more info.
>
> build charm++ mpi-linux-x86_64 --enable-error-checking --with-prio-type=int
> --enable-randomized-msgq --suffix randq-debug --build-shared -j36 -w
> -stdlib=libc++ -g
>
> How can I get more information about which chare/group hits the error
> when restarting from a checkpoint?
>
> Thanks,
> Jozsef


