
charm - [charm] Restart from checkpoint files

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Jozsef Bakosi <jbakosi AT lanl.gov>
  • To: charm AT lists.cs.illinois.edu
  • Subject: [charm] Restart from checkpoint files
  • Date: Sat, 15 Jun 2019 05:58:19 -0600

Hi folks,

When restarting from checkpoint files, I get:

[0]CkRestartMain done. sending out callback.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: CmiFree reference count was zero-- is this a duplicate free?
[0] Stack Traceback:
[0:0] [0x76464e]
[0:1] [0x761898]
[0:2] [0x768b04]
[0:3] CkFreeMsg+0x28 [0x60a398]
[0:4] [0x5dab08]
[0:5] [0x76430a]
[0:6] [0x763c08]
[0:7] [0x5dc45e]
[0:8] [0x5d4a12]
[0:9] __libc_start_main+0xeb [0x7fffec65309b]
[0:10] [0x4d790a]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 DUP
FROM 0 with errorcode 1

I tried the link option -memory paranoid and the runtime option ++debug,
and I also built Charm++ in debug mode, but none of these gave me any
more information. This is how I built Charm++:

build charm++ mpi-linux-x86_64 --enable-error-checking --with-prio-type=int
--enable-randomized-msgq --suffix randq-debug --build-shared -j36 -w
-stdlib=libc++ -g

How can I get more information about which chare/group triggers the
error when restarting from a checkpoint?

Thanks,
Jozsef


