Re: [charm] [ppl-local] Charm++ on a Cray XC30 - 'gni-crayxc' vs. 'mpi-crayxc' (with NAMD)


  • From: "Christian Tuma" <tuma AT zib.de>
  • To: "Yanhua Sun" <sun51 AT illinois.edu>
  • Cc: charm AT cs.uiuc.edu, "ppl-local AT cs.uiuc.edu" <ppl-local AT cs.uiuc.edu>
  • Subject: Re: [charm] [ppl-local] Charm++ on a Cray XC30 - 'gni-crayxc' vs. 'mpi-crayxc' (with NAMD)
  • Date: Thu, 12 Dec 2013 13:12:31 +0100
  • Importance: Normal
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Hi Yanhua,

Thank you for your reply. Here we go ...

On Mon, December 9, 2013 21:07, Yanhua Sun wrote:
> gni-crayxc and mpi-crayxc should both work on Cray XC. Gni is using cray's
> lowest communication library while mpi-crayxc is using mpi as
> communication library.

I see. Thank you for pointing this out.
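For comparison, the MPI-based target would presumably be built with the
analogous command (the same options as for the gni build shown further
below; I spell it out only so that the difference between the two builds
is explicit):

$ ./build charm++ mpi-crayxc -j8 --with-production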

> Generally, gni-crayxc should perform better than mpi-crayxc.

I can confirm that. In my runs, 'gni-crayxc' gave roughly a 5% performance
benefit over 'mpi-crayxc'.

> About the hanging problem you mentioned. can you give me more information,
> to reproduce this run?

Sure. But in the meantime I realized that the problem might not be related
to the NAMD input; I think it is related to Charm++ itself. I built Charm++
6.5.1 for 'gni-crayxc' using exactly the same Cray software environment as
for the successfully running 'mpi-crayxc' version. Then I did some testing,
as suggested, and encountered problems which I think should be solved
before dealing with NAMD.

$ module swap PrgEnv-cray PrgEnv-gnu
$ ./build charm++ gni-crayxc -j8 --with-production

(OK)

$ cd tests/charm++/simplearrayhello
$ make
$ make test

(OK)

$ cd ../megatest
$ make pgm
$ make test

(OK - but only up to 4 tasks were actually tested)

Now I tried the same test with more tasks (all within an interactive batch
job):

$ aprun -T -n 32 ./pgm

This usually runs fine too, but sometimes problems are reported:

...
test 44: completed (0.01 sec)
test 45: initiated [multi groupsectiontest (ebohm)]
Charm++> Network progress engine appears to have stalled, possibly because
registered memory limits have been exceeded or are too low. Try adjusting
environment variables CHARM_UGNI_MEMPOOL_MAX and CHARM_UGNI_SEND_MAX
(current limits are 35791394 and 17895697).
[0] CmiAbort: Fatal> Deadlock detected.
[0] Stack Traceback:
...

When going to even more tasks, other error messages (not exactly
reproducible from run to run) appear. For example:

$ aprun -T -n 48 ./pgm
...
test 43: completed (0.00 sec)
test 44: initiated [multi multisectiontest (ebohm)]
[0] CmiAbort: Converse zero handler executed-- was a message corrupted?

[0] Stack Traceback:
...

$ aprun -T -n 48 ./pgm
...
test 44: completed (0.01 sec)
test 45: initiated [multi groupsectiontest (ebohm)]
[23] CmiAbort: Converse zero handler executed-- was a message corrupted?

[23] Stack Traceback:
...

$ aprun -T -n 96 ./pgm
...
test 20: completed (0.00 sec)
test 21: initiated [multisectiontest (ebohm)]
Charm++> Network progress engine appears to have stalled, possibly because
registered memory limits have been exceeded or are too low. Try adjusting
environment variables CHARM_UGNI_MEMPOOL_MAX and CHARM_UGNI_SEND_MAX
(current limits are 35791394 and 17895697).
[0] CmiAbort: Fatal> Deadlock detected.
[0] Stack Traceback:
[0:0] [0x4f2f57]
...
Fatal> Deadlock detected.
...

$ aprun -T -n 96 ./pgm
...
test 20: completed (0.00 sec)
test 21: initiated [multisectiontest (ebohm)]
_pmiu_daemon(SIGCHLD): [NID 00736] [c3-0c2s8n0] [Thu Dec 12 13:07:55 2013]
PE RANK 5 exit signal Segmentation fault
[NID 00736] 2013-12-12 13:07:28 Apid 305730: initiated application
termination
...

These are just a few examples. They occur both with the official Charm++
6.5.1 release and with version 6.6.0-rc1.
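
The only concrete hint in the output itself is the suggestion to adjust
CHARM_UGNI_MEMPOOL_MAX and CHARM_UGNI_SEND_MAX. Assuming these are taken
as byte counts like the reported limits (I have not verified the expected
units), raising them before the run would presumably look something like
this, with placeholder values of roughly twice the current limits:

$ export CHARM_UGNI_MEMPOOL_MAX=71582788
$ export CHARM_UGNI_SEND_MAX=35791394
$ aprun -T -n 96 ./pgm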

Any idea how these problems could be solved?


Thanks ...

Christian


--
Dr. Christian Tuma
Consultant, Supercomputing
Zuse Institute Berlin, Takustr. 7, 14195 Berlin, Germany
+49 30 84185132 | tuma AT zib.de | www.zib.de



