charm - Re: [charm] Restart from checkpoint files

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: "Mikida, Eric P" <mikida2 AT illinois.edu>
  • To: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>, Jozsef Bakosi <jbakosi AT lanl.gov>
  • Subject: Re: [charm] Restart from checkpoint files
  • Date: Fri, 21 Jun 2019 19:06:24 +0000

Hey Jozsef,


Are you also compiling your application with -g? It looks like the only debug symbols showing up in the trace are from within the runtime system itself.


If this error is coming from your application, the cause will be a message being deleted twice. That can happen when a message pointer is shared across multiple chares and more than one of them deletes it, or when an entry method is marked [nokeep] but you still delete the message yourself.
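
To make the two cases concrete, here is a minimal toy sketch in plain C++. It is not Charm++ code and does not reflect how the runtime stores or frees messages: ToyMsg, toyFree, and the helper functions are all hypothetical names, and the "runtime" is just an ordinary function call. It only models a reference-counted free that rejects a second free of the same message.

```cpp
#include <stdexcept>

// Toy model only: ToyMsg and toyFree are hypothetical names, not Charm++ APIs.
struct ToyMsg {
    int refCount = 1;  // one owner when the message is created
};

// Mimics a reference-counted free that rejects a second free.
// The storage is deliberately not released, so the check stays well-defined.
void toyFree(ToyMsg* m) {
    if (m->refCount <= 0) {
        throw std::runtime_error("reference count was zero: duplicate free?");
    }
    --m->refCount;
}

// Case 1: two receivers hold the same pointer and both free it.
bool sharedPointerDoubleFree() {
    ToyMsg msg;
    toyFree(&msg);  // first receiver frees: fine
    try {
        toyFree(&msg);  // second receiver frees the same message
    } catch (const std::runtime_error&) {
        return true;  // duplicate free detected
    }
    return false;
}

// Case 2: the runtime keeps ownership (as with [nokeep]) and frees the
// message after the handler returns, but the handler frees it as well.
void handlerThatWronglyFrees(ToyMsg* m) { toyFree(m); }

bool runtimeOwnedDoubleFree() {
    ToyMsg msg;
    handlerThatWronglyFrees(&msg);  // user code frees a message it does not own
    try {
        toyFree(&msg);  // the stand-in "runtime" then frees it again
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}

// For comparison: exactly one owner frees the message, so nothing is flagged.
bool singleOwnerFree() {
    ToyMsg msg;
    try {
        toyFree(&msg);
    } catch (const std::runtime_error&) {
        return false;
    }
    return true;
}
```

In this sketch both double-free cases trip the check and the single-owner case does not. In a real application the thing to look for is the same: every message should have exactly one place that deletes it.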


Eric


From: Jozsef Bakosi <jbakosi AT lanl.gov>
Sent: Saturday, June 15, 2019 7:58:19 AM
To: charm AT lists.cs.illinois.edu
Subject: [charm] Restart from checkpoint files
 
Hi folks,

Restarting from checkpoint files I'm getting:

[0]CkRestartMain done. sending out callback.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: CmiFree reference count was zero-- is this a duplicate free?
[0] Stack Traceback:
  [0:0]   [0x76464e]
  [0:1]   [0x761898]
  [0:2]   [0x768b04]
  [0:3] CkFreeMsg+0x28  [0x60a398]
  [0:4]   [0x5dab08]
  [0:5]   [0x76430a]
  [0:6]   [0x763c08]
  [0:7]   [0x5dc45e]
  [0:8]   [0x5d4a12]
  [0:9] __libc_start_main+0xeb  [0x7fffec65309b]
  [0:10]   [0x4d790a]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 DUP
FROM 0 with errorcode 1

I tried the link option -memory paranoid and the runtime option ++debug,
and also built Charm++ in debug mode, with no luck getting more info.

build charm++ mpi-linux-x86_64 --enable-error-checking --with-prio-type=int --enable-randomized-msgq --suffix randq-debug --build-shared -j36 -w -stdlib=libc++ -g

How can I get more information about which chare/group runs into the error
when returning from checkpoint?

Thanks,
Jozsef


