
charm - [charm] errors when running on multiple physical nodes

  • From: Jakub Homola <jakub.homola AT vsb.cz>
  • To: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: [charm] errors when running on multiple physical nodes
  • Date: Mon, 21 Oct 2019 20:18:33 +0200

Hello,

I am having trouble running a Charm++ program on multiple physical nodes in a network and would like some help with it.

 

When I run my Charm++ program on a single physical node (2 sockets, 12 cores per socket), Charm++ creates 2 processes, one per socket, and everything works without errors. However, when I run the program on multiple physical nodes (specifying the ++nodelist file), one or more of the following three errors sometimes occur:

 

1.

Charmrun> error on request socket to node 3 'r2i0n14.ib0.smc.salomon.it4i.cz'--

Socket closed before recv.

// The “wrong” node index always appears to be the last one: when running on 2 physical nodes (4 Charm++ nodes/processes) the error is on node 3, and with 4 physical nodes (8 Charm++ processes) it is on node 7.

 

2.

------------- Processor 56 Exiting: Caught Signal ------------
Reason: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
[56] Stack Traceback:
  [56:0]   [0x5d0ef2]
  [56:1] +0xf5f0  [0x2ab24fdd85f0]
  [56:2] _ZN5SaverC2EPdj+0x214  [0x4afe94]
  [56:3] _ZN13CkIndex_Saver21_call_Saver_marshall1EPvS0_+0xbf  [0x4cd94f]
  [56:4] CkDeliverMessageFree+0x21  [0x4f7801]
  [56:5]   [0x4eaf5d]
  [56:6] _Z15_processHandlerPvP11CkCoreState+0x58e  [0x4e922e]
  [56:7] CsdScheduleForever+0x7f  [0x5da22f]
  [56:8] CsdScheduler+0x1e  [0x5da1ae]
  [56:9]   [0x5ce487]
  [56:10]   [0x5cd410]
  [56:11] +0x7e65  [0x2ab24fdd0e65]
  [56:12] clone+0x6d  [0x2ab2519e788d]
Fatal error on PE 56> segmentation violation

// I tried running the program with ++debug, but it required other things to be set up, which I didn’t really understand.

 

3.

*** Error in `/path/to/program/PS_charm++.x': double free or corruption (fasttop): 0x0000000002336f20 ***
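As a reading aid for the traceback under error 2: the frames such as _ZN5SaverC2EPdj are mangled C++ symbol names, and decoding them shows which function the segmentation violation was raised in. The following is only a minimal sketch, assuming a GCC-compatible toolchain on Linux that provides <cxxabi.h>; the helper name demangle is hypothetical and is not part of Charm++.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Hypothetical helper: turn an Itanium C++ ABI mangled name into a readable
// signature. Returns the input unchanged if demangling fails.
std::string demangle(const char* mangled) {
    assert(mangled != nullptr);
    int status = 0;
    // __cxa_demangle allocates the result with malloc, so it must be freed
    // with std::free once it has been copied.
    char* readable = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || readable == nullptr) {
        return mangled;
    }
    std::string result(readable);
    std::free(readable);
    return result;
}
```

Applied to frame [56:2], demangle("_ZN5SaverC2EPdj") gives Saver::Saver(double*, unsigned int), so the crash happens inside the constructor of Saver that takes a double* and an unsigned int, called from the generated CkIndex_Saver marshalling stub in frame [56:3]. The same decoding can be done without writing code by piping the traceback through the c++filt tool.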

 

Mainly, I would like to ask where the issue could be: is it in Charm++ or in my code? And how can I fix it or work around it?

As I said, running the program on a single physical node (so 2 processes) works just fine, as expected.

The nodelist file should be OK, since when the errors don’t occur, the program runs on multiple nodes correctly.

When the errors occur appears to be random, although a failure is more likely when running on more compute nodes. Which of the three errors occurs also seems to be random.

Charm++ was built using the command “./build charm++ netlrts-linux-x86_64 icc  smp  -j  --with-production”, which was generated by the interactive option selector of ./build.

 

I would appreciate any help with this issue.

Thank you,

Jakub Homola



