charm - Re: [charm] Restart from checkpoint files

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Eric Mikida <epmikida AT hpccharm.com>
  • To: Jozsef Bakosi <jbakosi AT lanl.gov>
  • Cc: "Mikida, Eric P" <mikida2 AT illinois.edu>, "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Restart from checkpoint files
  • Date: Mon, 24 Jun 2019 11:53:38 -0400

Hey Jozsef,

Yes, deleting the message in the migration constructor of Main does cause the
error for me as well. I will look into that and make sure it gets added to
the documentation.
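
To illustrate the ownership rule, here is a minimal standalone sketch. It is not Charm++ code: ToyMsg, toyFree, the two migrateCtor functions, and restartSucceeds are all hypothetical names. It assumes that the runtime releases the migration message itself after the constructor returns, which is what the CkFreeMsg frame in the traceback suggests.

```cpp
#include <stdexcept>

// Toy stand-in for a runtime-owned message; NOT a Charm++ type.
struct ToyMsg {
    int refCount = 1;  // a single reference, held by the (toy) runtime
};

// Toy analogue of a reference-counted free: it reports a duplicate free
// instead of silently corrupting memory.
void toyFree(ToyMsg& msg) {
    if (msg.refCount == 0)
        throw std::logic_error("reference count was zero: duplicate free?");
    --msg.refCount;
}

// Migration-constructor analogue that leaves the message alone.
void migrateCtorCorrect(ToyMsg&) {}

// Migration-constructor analogue that frees a message it does not own.
void migrateCtorWrong(ToyMsg& msg) { toyFree(msg); }

// Toy runtime (assumption): call the constructor, then release the message.
template <typename Ctor>
bool restartSucceeds(Ctor ctor) {
    ToyMsg msg;
    try {
        ctor(msg);
        toyFree(msg);  // the runtime's own release of the message
        return true;
    } catch (const std::logic_error&) {
        return false;  // the second free was detected
    }
}
```

Under these assumptions, restartSucceeds(migrateCtorCorrect) returns true, while restartSucceeds(migrateCtorWrong) returns false because the runtime's release is the second free.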

As for running the test yourself, I’m not sure why you are seeing that error.
It looks like it may be using the wrong path to the testrun script. The test
lives in tests/charm++/chkpt, so the path to testrun should just be
../../../bin/testrun. The path is set up in tests/common.mk.
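
As a quick sanity check on the directory depth, here is a small C++17 sketch that resolves both relative paths lexically, without touching the disk. The helper name resolveFrom is hypothetical, and the sketch assumes that make runs from tests/charm++/chkpt relative to the top of the charm tree.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Resolve `rel` against directory `dir` purely lexically (no disk access),
// returning the result with forward slashes.
std::string resolveFrom(const std::string& dir, const std::string& rel) {
    return (fs::path(dir) / rel).lexically_normal().generic_string();
}
```

With dir set to tests/charm++/chkpt, three levels up resolves to bin/testrun inside the tree, while four levels up resolves to ../bin/testrun, which is one directory above the tree.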

Eric

> On Jun 24, 2019, at 10:36 AM, Jozsef Bakosi
> <jbakosi AT lanl.gov>
> wrote:
>
> Thanks, Eric,
>
> The problem was that I called delete on the migrate constructor argument
> returning from the checkpoint.
>
> It appears that it is okay (advised?) to call delete in the main chare
> constructor but NOT okay to do the same in the main chare migrate
> constructor.
>
> In essence, my problem was equivalent to adding the delete in
> charm/tests/charm++/chkpt/hello.C as follows:
>
> Main(CkArgMsg* m) {
>   // ... use m ...
>   delete m; // okay
>   // ...
> }
>
> Main(CkMigrateMessage *m) : CBase_Main(m) {
>   // ... use m ...
>   delete m; // this should not be here
>   // ...
> }
>
> I'm having trouble running that example though for a different reason:
>
> $ make test
> rm -rf log/
> rm -fr log
> ../../../../bin/testrun ./hello +p4
> make: ../../../../bin/testrun: Command not found
> make: *** [Makefile:24: test] Error 127
>
> Does adding the delete in this example yield the same runtime error for
> you?
>
> Thanks,
> Jozsef
>
> On 06.21.2019 19:06, Mikida, Eric P wrote:
>> Hey Jozsef,
>>
>>
>> Are you also compiling your application with -g? It looks like the only
>> debug symbols showing up in the trace are from within the runtime system
>> itself.
>>
>>
>> If this is an error coming from your application, it will come from a
>> message being deleted twice. This could occur either when a message
>> pointer is shared across multiple chares that each try to delete it, or
>> when an entry method is marked [nokeep] but you still delete the message
>> yourself.
>>
>>
>> Eric
>>
>> ________________________________
>> From: Jozsef Bakosi
>> <jbakosi AT lanl.gov>
>> Sent: Saturday, June 15, 2019 7:58:19 AM
>> To:
>> charm AT lists.cs.illinois.edu
>> Subject: [charm] Restart from checkpoint files
>>
>> Hi folks,
>>
>> Restarting from checkpoint files I'm getting:
>>
>> [0]CkRestartMain done. sending out callback.
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: CmiFree reference count was zero-- is this a duplicate free?
>> [0] Stack Traceback:
>> [0:0] [0x76464e]
>> [0:1] [0x761898]
>> [0:2] [0x768b04]
>> [0:3] CkFreeMsg+0x28 [0x60a398]
>> [0:4] [0x5dab08]
>> [0:5] [0x76430a]
>> [0:6] [0x763c08]
>> [0:7] [0x5dc45e]
>> [0:8] [0x5d4a12]
>> [0:9] __libc_start_main+0xeb [0x7fffec65309b]
>> [0:10] [0x4d790a]
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 DUP
>> FROM 0 with errorcode 1
>>
>> I tried the link option -memory paranoid and the runtime option ++debug,
>> as well as built Charm++ in debug mode with no luck getting more info.
>>
>> build charm++ mpi-linux-x86_64 --enable-error-checking
>> --with-prio-type=int --enable-randomized-msgq --suffix randq-debug
>> --build-shared -j36 -w -stdlib=libc++ -g
>>
>> How can I get more information about which chare/group hits the error
>> when returning from checkpoint?
>>
>> Thanks,
>> Jozsef



