[charm] Issues trying to run mpi-coexist example


  • From: Steve Petruzza <spetruzza AT sci.utah.edu>
  • To: charm AT lists.cs.illinois.edu
  • Subject: [charm] Issues trying to run mpi-coexist example
  • Date: Wed, 22 Jun 2016 14:13:57 +0300

Hi,
I am trying to use the MPI interoperability feature of Charm++, starting with the mpi-coexist example.

I tried to build on my Mac (openmpi + multicore-darwin-x86_64-clang or netlrts-darwin-x86_64-smp-clang), but I cannot use
CharmLibInit(MPI_Comm userComm, int argc, char **argv);
because CMK_CONVERSE_MPI is set to 0 in mpi-interoperate.h.
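If I read the header right, the declarations are guarded something like this (paraphrased from memory, not the exact header text):

#if CMK_CONVERSE_MPI
void CharmLibInit(MPI_Comm userComm, int argc, char **argv);  /* MPI-based builds */
#else
void CharmLibInit(int userComm, int argc, char **argv);       /* non-MPI builds: no MPI_Comm overload */
#endif
void CharmLibExit();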

So I just tried the other CharmLibInit overload, passing 0 as userComm, but it crashes right at the call:
mpirun -np 4 ./multirun 
——————————————
[Steve:72383] *** Process received signal ***
[Steve:72383] Signal: Segmentation fault: 11 (11)
[Steve:72383] Signal code: Address not mapped (1)
[Steve:72383] Failing at address: 0x0
[Steve:72383] [ 0] 0   libsystem_platform.dylib            0x00007fff8b9b652a _sigtramp + 26
[Steve:72383] [ 1] 0   ???                                 0x0000000000000000 0x0 + 0
[Steve:72383] [ 2] 0   multirun                            0x0000000100118106 CmiAbortHelper + 38
[Steve:72383] [ 3] 0   multirun                            0x00000001001147a0 CmiSyncBroadcastAllFn + 0
[Steve:72383] [ 4] 0   multirun                            0x0000000100111fd4 CharmLibInit + 36
[Steve:72383] [ 5] 0   multirun                            0x0000000100001633 main + 147
[Steve:72383] [ 6] 0   multirun                            0x0000000100001574 start + 52
[Steve:72383] *** End of error message ***
(processes 72382, 72384, and 72385 print the same trace; the raw output was interleaved)
——————————————
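For context, the driver in the example does roughly the following (my paraphrase of multirun.C; HiStart is the library entry point as I understand it, so names may not match exactly):

#include <mpi.h>
#include "mpi-interoperate.h"

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  // On a build with CMK_CONVERSE_MPI == 1 this takes an MPI_Comm;
  // on my Mac builds only the int overload exists, and passing 0
  // is where the segfault above occurs (CharmLibInit -> CmiAbortHelper).
  CharmLibInit(MPI_COMM_WORLD, argc, argv);

  HiStart(10);      // hand control to the Charm++ library code
  CharmLibExit();   // shut Charm++ down before finalizing MPI

  MPI_Finalize();
  return 0;
}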

I guess this has to do with the build. How can I build Charm++ and this example on a Mac so that the CharmLibInit that takes an MPI_Comm is available? (My guess is that I need an MPI-based machine layer, e.g. ./build charm++ mpi-darwin-x86_64, since CMK_CONVERSE_MPI seems to be set only for MPI builds, but I have not verified this.)

Anyway, I tried the same thing on a Cray XC40 node (built correctly, so CharmLibInit takes an MPI_Comm). But if I run:
srun -N 1 -n 16 --hint=nomultithread --ntasks-per-socket=16 ./multirun 
——————————————
Charm++> Running on Gemini (GNI) with 16 processes
Charm++> static SMSG
Charm++> SMSG memory: 79.0KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 2048K
Charm++> Running in SMP mode: numNodes 16,  1 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.7.0-202-g95e5ac0
Warning> using Isomalloc in SMP mode, you may need to run with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (64-way SMP).
——————————————

Here it hangs forever.

Then if I run:
srun -N 1 -n 16 --hint=nomultithread --ntasks-per-socket=16 ./multirun_time
——————————————
Charm++> Running on Gemini (GNI) with 16 processes
Charm++> static SMSG
Charm++> SMSG memory: 79.0KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 2048K
Charm++> Running in SMP mode: numNodes 16,  1 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.7.0-202-g95e5ac0
Warning> using Isomalloc in SMP mode, you may need to run with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (64-way SMP).
Running Hi on 16 processors for 10 elements
Hi[1] from element 0
Hi[2] from element 1
Hi[3] from element 2
Hi[4] from element 3
Hi[5] from element 4
Hi[6] from element 5
Hi[7] from element 6
Hi[8] from element 7
Hi[9] from element 8
Hi[10] from element 9
——————————————

Here, too, it hangs forever.

Is there any parameter or flag I should add? (I already tried -envs PAMI_CLIENTS=MPI,Converse, without success.)

Thank you,
Steve


