
charm - [charm] NAMD segfaulting in "setJcontext "

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system



  • From: Ted Packwood <malice AT cray.com>
  • To: <charm AT lists.cs.illinois.edu>
  • Subject: [charm] NAMD segfaulting in "setJcontext "
  • Date: Wed, 31 Aug 2016 13:54:23 -0500

Hello-

I'm trying to determine what is causing a failure in NAMD when built with
the Cray compiler on a Cray XC30.

The failure is in "setJcontext", as you can see from the traceback below.

The charm++ build works fine with the included charm++ test "jacobi3d"
(4 ranks on 4 separate Broadwell chips, +ppn6).

I built charm++ with:
./build charm++ mpi-crayxc craycc smp

NAMD was built with:
./config CRAY-XC-cce --charm-arch mpi-crayxc.cce-smp-craycc --with-fftw3 --without-tcl --charm-opts -save

And was run with just one rank on a Broadwell chip.

The Intel compiler build of charm++ and NAMD works fine, so this appears to
be an issue with the Cray compiler.

I have a few questions:
1) Does anyone have an idea of what might cause this type of failure?
2) Any suggestions as to a possible solution, or build changes that might
solve the problem?
3) Is there a simple charm++ test that mimics the Jcontext usage NAMD
requires and might trigger a similar failure?  I'd prefer to try to reproduce
this with a smaller test than NAMD.  :)
4) If not, should I contact the NAMD folks instead?
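
Regarding question 3, the kind of small reproducer I have in mind would look
roughly like the sketch below. It uses the POSIX ucontext calls (getcontext,
makecontext, swapcontext) on Linux/glibc as a stand-in, on the assumption that
the uJcontext code in charm++ performs a comparable user-level context switch.
It does not call any charm++ code, and the name run_context_switch_test is made
up for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (256 * 1024)  /* arbitrary stack size for the worker */

static ucontext_t main_ctx;    /* plays the role of the scheduler */
static ucontext_t worker_ctx;  /* plays the role of a user-level thread */
static int rounds_left;
static int switches_done;

/* Worker body: do one unit of "work", then yield back to the scheduler. */
static void worker(void)
{
    while (rounds_left > 0) {
        rounds_left--;
        switches_done++;
        swapcontext(&worker_ctx, &main_ctx);  /* yield */
    }
    /* Returning here resumes uc_link, i.e. main_ctx. */
}

/* Hypothetical helper: switch into the worker until it has yielded
 * `rounds` times and then finished. Returns the number of yields
 * observed, or -1 on setup failure. */
int run_context_switch_test(int rounds)
{
    void *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    rounds_left = rounds;
    switches_done = 0;

    if (getcontext(&worker_ctx) != 0) {
        free(stack);
        return -1;
    }
    worker_ctx.uc_stack.ss_sp = stack;
    worker_ctx.uc_stack.ss_size = STACK_SIZE;
    worker_ctx.uc_link = &main_ctx;
    makecontext(&worker_ctx, worker, 0);

    /* rounds + 1 resumes: one per yield, plus a final one that lets the
     * worker fall out of its loop and return through uc_link. */
    for (int i = 0; i <= rounds; i++) {
        if (swapcontext(&main_ctx, &worker_ctx) != 0) {
            free(stack);
            return -1;
        }
    }

    assert(switches_done == rounds);
    free(stack);
    return switches_done;
}
```

Calling run_context_switch_test(3) from a small main() and building it with
each compiler at -O0 would at least show whether a plain context switch
behaves differently under the Cray compiler, though a pass here would not
rule out a problem specific to the charm++ implementation.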



Core was generated by `./namd2.XC30.IVB.kay.PE604.cce853-g-O0-flex_mp-strict.mpich743.libsci16091.fftw'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00000000212c43eb in raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
37      ../nptl/sysdeps/unix/sysv/linux/pt-raise.c: No such file or directory.
(gdb) where
#0  0x00000000212c43eb in raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x0000000021417dc5 in abort () at abort.c:99
#2  0x0000000021173c72 in MPID_Abort ()
#3  0x0000000021123a75 in PMPI_Abort ()
#4  0x0000000020e75cd5 in LrtsAbort () at machine.c:1656
#5  <signal handler called>
#6  0x0000000020e71089 in setJcontext () at uJcontext.c:131
#7  0x0000000020e71100 in swapJcontext () at uJcontext.c:176
#8  0x0000000020e713b5 in CthResume () at libthreads-default.c:1669
#9  0x0000000020e78388 in CsdScheduleForever () at convcore.c:1901
#10 0x0000000020e78299 in CsdScheduler () at convcore.c:1837
#11 0x00000000200dd891 in BackEnd::suspend () at src/BackEnd.C:285
#12 0x0000000020b6e650 in ScriptTcl::suspend (this=0x41bcb1b0)
    at src/ScriptTcl.C:72
#13 0x0000000020b6e6ff in ScriptTcl::initcheck (this=0x41bcb1b0)
    at src/ScriptTcl.C:104
#14 0x0000000020b6e577 in ScriptTcl::run (this=0x41bcb1b0,
    scriptFile=0x7fffffff765e) at src/ScriptTcl.C:2076
#15 0x00000000200d6d49 in after_backend_init (argc=2, argv=0x7fffffff6678)
    at src/mainfunc.C:158
#16 0x00000000200dcff9 in slave_init (argc=2, argv=0x7fffffff6678)
    at src/BackEnd.C:140
#17 0x0000000020e745ec in ConverseRunPE$$CFE_id_d7e6ac3e_9d711be8 ()
    at machine-common-core.c:1293
#18 0x0000000020e71b9a in call_startfn$$CFE_id_d7e6ac3e_9d711be8 ()
    at machine-smp.c:415
#19 0x0000000020eeda44 in start_thread (arg=0x2aaaaad45700)
    at pthread_create.c:309
#20 0x00000000214756f9 in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111



