
charm - Re: [charm] [ppl] Myrinet MX broken in Charm-6.3.2

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system




  • From: Thomas Albers <talbers AT binghamton.edu>
  • To: charm AT cs.uiuc.edu
  • Subject: Re: [charm] [ppl] Myrinet MX broken in Charm-6.3.2
  • Date: Thu, 16 Jun 2011 17:28:33 -0400
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Hello!

> There actually never was native x86_64 support for MX because InfiniBand
> and x86_64 took over the market at about the same time MX came out.

> Since it exists on other 64-bit platforms the port should be pretty
> easy, just copying conv-mach-mx.h and conv-mach-mx.sh to
> arch/net-linux_86_64 from another platform and figuring out what the
> right #defines are.

As I said, the idea is to use Open-MX instead of TCP/IP networking in
the hope that the lower latency translates into greater speed when
running NAMD.

I have:
#define CMK_USE_MX 1
#define CMK_NETPOLL 1
#define CMK_BARRIER_USE_COMMON_CODE 0

This fails immediately with both Open-MX 1.4.0 and 1.4.901:

ta@porsche ~/NAMD_2.8_Source/charm-6.3.2/tests/charm++/megatest $ ./charmrun +p2 ./pgm
Charmrun> error 93620 attaching to node:
Socket closed before recv.

Perhaps more interesting is the behavior when using OpenMPI (1.4.3). The
test suite finishes most of the time, but sometimes one sees this:
ta@porsche ~/NAMD_2.8_Source/charm-6.3.2/mpi-linux-x86_64-smp-mpicxx/tests/charm++ $
mpirun --mca btl mx,sm,self -H porsche,porsche,porsche,porsche,ferrari,ferrari,ferrari,ferrari,yamaha,yamaha,yamaha,yamaha,michelin,michelin,michelin,michelin,michelin,michelin ./pgm
...
test 36: initiated [multi marshall (olawlor)]
[16] Stack Traceback:
[16:0] CmiAbort+0x58 [0x58c26a]
[16:1] [0x58d8a2]
[16:2] CmiHandleMessage+0x2b [0x58d5c7]
[16:3] CsdScheduleForever+0x5c [0x5911d6]
[16:4] CsdScheduler+0xd [0x59124e]
[16:5] [0x58cc1c]
[16:6] [0x58d1c7]
[16:7] +0x6ac4 [0x7fa3a8942ac4]
[16:8] clone+0x6d [0x7fa3a6f793ed]
------------- Processor 16 Exiting: Called CmiAbort ------------
Reason: Converse zero handler executed-- was a message corrupted?

Does the behavior of Open-MX differ in some subtle way from Myrinet-MX?
How could I be helpful in tracking down the bug, and is it worth it?
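
Since the abort only shows up some of the time, one way to gather data would be to repeat the run and keep the log of the first failing attempt. The following is a minimal sketch, not a tested procedure: the helper names hostlist and run_until_fail and the run-N.log file names are hypothetical, every host argument must have the form host:count, and it assumes that mpirun exits with a non-zero status when a rank calls CmiAbort.

```shell
#!/bin/sh
# Sketch only: hostlist, run_until_fail and the run-N.log names are
# hypothetical helpers, not part of Charm++ or Open MPI.

# Expand "host:count" pairs into the comma-separated list passed to mpirun -H,
# e.g. "porsche:2 ferrari:1" becomes "porsche,porsche,ferrari".
hostlist() {
  out=""
  for spec in "$@"; do
    host=${spec%%:*}    # text before the first colon
    count=${spec##*:}   # text after the last colon
    i=0
    while [ "$i" -lt "$count" ]; do
      out="${out:+$out,}$host"
      i=$((i + 1))
    done
  done
  printf '%s\n' "$out"
}

# Run a command up to $1 times, writing each attempt's output to run-N.log.
# On the first failure, print the attempt number and return 1.
# Return 0 if every attempt succeeds.
run_until_fail() {
  max=$1
  shift
  n=1
  while [ "$n" -le "$max" ]; do
    if ! "$@" > "run-$n.log" 2>&1; then
      echo "$n"
      return 1
    fi
    n=$((n + 1))
  done
  return 0
}

# Example, commented out so the file can be sourced without starting a job:
# hosts=$(hostlist porsche:4 ferrari:4 yamaha:4 michelin:6)
# run_until_fail 50 mpirun --mca btl mx,sm,self -H "$hosts" ./pgm
```

The log of the failing attempt would then show which test was running and which processor aborted, which may help tell whether the failure is tied to one test or one node.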

Regards,
Thomas




