Re: [charm] Issues trying to run mpi-coexist example


  • From: Sam White <white67 AT illinois.edu>
  • To: Jozsef Bakosi <jbakosi AT gmail.com>
  • Cc: Steve Petruzza <spetruzza AT sci.utah.edu>, charm <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Issues trying to run mpi-coexist example
  • Date: Wed, 22 Jun 2016 09:14:38 -0500

MPI interoperation is currently only supported on MPI, PAMILRTS, and GNI builds of Charm++. We will look into those hangs, but at least the second one appears to be a known issue with running interop in SMP mode that we are working on (https://charm.cs.illinois.edu/redmine/issues/903).

For your Mac, I would recommend building mpi-darwin-x86_64, and on the XC40 I would recommend gni-crayxc. The PAMI_CLIENTS environment variable only applies to PAMILRTS builds; you shouldn't need any extra flags for a GNI build.
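
Concretely, from the charm source directory that would be something like the following (a sketch; the exact options depend on your compiler environment):

$ ./build charm++ mpi-darwin-x86_64
$ ./build charm++ gni-crayxc

Your XC40 output shows an SMP-mode run; until the issue above is resolved, a non-SMP gni-crayxc build might sidestep that particular hang, though I haven't verified that for your case.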

-Sam

On Wed, Jun 22, 2016 at 9:12 AM, Jozsef Bakosi <jbakosi AT gmail.com> wrote:
Hi Steve,

Charm++ developers, please correct me if I'm wrong...

To interoperate with MPI codes or libraries, you need to build Charm++ on top of the MPI backend. On a Mac I do that with:

$ build charm++ mpi-darwin-x86_64

On Linux, I do:

$ build charm++ mpi-linux-x86_64 mpicxx
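
With an MPI build in place, the driver in the mpi-coexist example follows roughly this pattern (a sketch from memory, so treat the details as approximate; HelloStart is just a placeholder for whatever entry point the Charm++ library actually exposes):

#include <mpi.h>
#include "mpi-interoperate.h"

// Placeholder for the Charm++ library's entry point; in mpi-coexist the
// libraries are separate Charm++ modules with their own headers.
void HelloStart(int nElements);

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  // Give Charm++ its own communicator rather than MPI_COMM_WORLD.
  MPI_Comm libComm;
  MPI_Comm_dup(MPI_COMM_WORLD, &libComm);

  // The MPI_Comm overload is only available when Charm++ is built on the
  // MPI backend (CMK_CONVERSE_MPI == 1).
  CharmLibInit(libComm, argc, argv);

  // Alternate between MPI work and the Charm++ library phase.
  MPI_Barrier(MPI_COMM_WORLD);
  HelloStart(10);
  MPI_Barrier(MPI_COMM_WORLD);

  // Shut Charm++ down before finalizing MPI.
  CharmLibExit();
  MPI_Comm_free(&libComm);
  MPI_Finalize();
  return 0;
}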

Jozsef

On Wed, Jun 22, 2016 at 5:13 AM, Steve Petruzza <spetruzza AT sci.utah.edu> wrote:
Hi,
I am trying to use MPI interoperability with Charm++, starting from the mpi-coexist example.

I tried to build on my Mac (OpenMPI plus either multicore-darwin-x86_64-clang or netlrts-darwin-x86_64-smp-clang), but I cannot use
CharmLibInit(MPI_Comm userComm, int argc, char **argv);
because CMK_CONVERSE_MPI is set to 0 in mpi-interoperate.h.
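
For context, the relevant part of mpi-interoperate.h is roughly of this shape (paraphrasing from memory, not quoting the header verbatim):

#if CMK_CONVERSE_MPI
void CharmLibInit(MPI_Comm userComm, int argc, char **argv);  // MPI-build variant
#else
void CharmLibInit(int userComm, int argc, char **argv);       // the "other" variant, takes 0
#endif
void CharmLibExit();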

So I just tried the other CharmLibInit, passing 0 as userComm, but the call crashes:
mpirun -np 4 ./multirun 
——————————————
[Steve:72383] *** Process received signal ***
[Steve:72383] Signal: Segmentation fault: 11 (11)
[Steve:72383] Signal code: Address not mapped (1)
[Steve:72383] Failing at address: 0x0
[Steve:72383] [ 0] 0   libsystem_platform.dylib            0x00007fff8b9b652a _sigtramp + 26
[Steve:72383] [ 1] 0   ???                                 0x0000000000000000 0x0 + 0
[Steve:72383] [ 2] [Steve:72384] *** Process received signal ***
[Steve:72384] Signal: Segmentation fault: 11 (11)
[Steve:72384] Signal code: Address not mapped (1)
[Steve:72384] Failing at address: 0x0
[Steve:72384] [ 0] 0   libsystem_platform.dylib            0x00007fff8b9b652a _sigtramp + 26
[Steve:72384] [ 1] 0   ???                                 0x0000000000000000 0x0 + 0
[Steve:72384] [ 2] [Steve:72385] *** Process received signal ***
[Steve:72385] Signal: Segmentation fault: 11 (11)
[Steve:72385] Signal code: Address not mapped (1)
[Steve:72385] Failing at address: 0x0
[Steve:72385] [ 0] 0   libsystem_platform.dylib            0x00007fff8b9b652a _sigtramp + 26
[Steve:72385] [ 1] 0   ???                                 0x0000000000000000 0x0 + 0
[Steve:72385] [ 2] [Steve:72382] *** Process received signal ***
[Steve:72382] Signal: Segmentation fault: 11 (11)
[Steve:72382] Signal code: Address not mapped (1)
[Steve:72382] Failing at address: 0x0
[Steve:72382] [ 0] 0   libsystem_platform.dylib            0x00007fff8b9b652a _sigtramp + 26
[Steve:72382] [ 1] 0   ???                                 0x0000000000000000 0x0 + 0
[Steve:72382] [ 2] 0   multirun                            0x0000000100118106 CmiAbortHelper + 38
[Steve:72382] [ 3] 0   multirun                            0x00000001001147a0 CmiSyncBroadcastAllFn + 0
[Steve:72382] [ 4] 0   multirun                            0x0000000100118106 CmiAbortHelper + 38
[Steve:72385] [ 3] 0   multirun                            0x00000001001147a0 CmiSyncBroadcastAllFn + 0
[Steve:72385] [ 4] 0   multirun                            0x0000000100111fd4 CharmLibInit + 36
[Steve:72385] [ 5] 0   multirun                            0x0000000100111fd4 CharmLibInit + 36
[Steve:72382] [ 5] 0   multirun                            0x0000000100118106 CmiAbortHelper + 38
[Steve:72383] [ 3] 0   multirun                            0x00000001001147a0 CmiSyncBroadcastAllFn + 0
[Steve:72383] [ 4] 0   multirun                            0x0000000100111fd4 CharmLibInit + 36
[Steve:72383] [ 5] 0   multirun                            0x0000000100001633 main + 147
[Steve:72383] [ 6] 0   multirun                            0x0000000100001574 start + 52
[Steve:72383] *** End of error message ***
0   multirun                            0x0000000100118106 CmiAbortHelper + 38
[Steve:72384] [ 3] 0   multirun                            0x00000001001147a0 CmiSyncBroadcastAllFn + 0
[Steve:72384] [ 4] 0   multirun                            0x0000000100111fd4 CharmLibInit + 36
[Steve:72384] [ 5] 0   multirun                            0x0000000100001633 main + 147
[Steve:72384] [ 6] 0   multirun                            0x0000000100001633 main + 147
[Steve:72385] [ 6] 0   multirun                            0x0000000100001574 start + 52
[Steve:72385] *** End of error message ***
0   multirun                            0x0000000100001633 main + 147
[Steve:72382] [ 6] 0   multirun                            0x0000000100001574 start + 52
[Steve:72382] *** End of error message ***
0   multirun                            0x0000000100001574 start + 52
[Steve:72384] *** End of error message ***
——————————————

I guess this has to do with the build. How can I build Charm++ and this example on a Mac so that the CharmLibInit that takes an MPI_Comm is available?

Anyway, I tried the same on a Cray XC40 node (where it built correctly and I could use the CharmLibInit with MPI_Comm), but:
If I run:
srun -N 1 -n 16 --hint=nomultithread --ntasks-per-socket=16 ./multirun 
——————————————
Charm++> Running on Gemini (GNI) with 16 processes
Charm++> static SMSG
Charm++> SMSG memory: 79.0KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 2048K
Charm++> Running in SMP mode: numNodes 16,  1 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.7.0-202-g95e5ac0
Warning> using Isomalloc in SMP mode, you may need to run with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (64-way SMP).
——————————————

Here it hangs forever.

Then if I run:
srun -N 1 -n 16 --hint=nomultithread --ntasks-per-socket=16 ./multirun_time
——————————————
Charm++> Running on Gemini (GNI) with 16 processes
Charm++> static SMSG
Charm++> SMSG memory: 79.0KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 2048K
Charm++> Running in SMP mode: numNodes 16,  1 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.7.0-202-g95e5ac0
Warning> using Isomalloc in SMP mode, you may need to run with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (64-way SMP).
Running Hi on 16 processors for 10 elements
Hi[1] from element 0
Hi[2] from element 1
Hi[3] from element 2
Hi[4] from element 3
Hi[5] from element 4
Hi[6] from element 5
Hi[7] from element 6
Hi[8] from element 7
Hi[9] from element 8
Hi[10] from element 9
——————————————

Here, too, it hangs forever.

Is there any parameter or flag I should add? (I already tried -envs PAMI_CLIENTS=MPI,Converse, without success.)

Thank you,
Steve




