Re: [charm] Migration error with AMPI + ibverbs (+ SMP)


  • From: Eric Bohm <ebohm AT illinois.edu>
  • To: <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] Migration error with AMPI + ibverbs (+ SMP)
  • Date: Fri, 22 Mar 2013 13:50:08 -0500
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>


On 03/22/2013 08:16 AM, Rafael Keller Tesser wrote:
Hello,

I don't think my previous message reached the list (I wasn't
subscribed to it), so I am including a copy at the end of this
message.

I ported a geophysics application to AMPI in order to experiment with
its load-balancing features. I wanted to run some tests over
InfiniBand, but I was getting a migration error when using AMPI
compiled with the ibverbs option. This error didn't happen when
running without ibverbs.
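
For context, the port itself is ordinary MPI code built with AMPI's
compiler wrapper (ampicc) instead of mpicc, so each MPI rank becomes a
migratable user-level thread. A rough sketch (the file name and build
line below are illustrative, not taken from my application):

  /* ampi_hello.c -- hypothetical minimal AMPI program.
   * Build (illustrative): charm/bin/ampicc -o ampi_hello ampi_hello.c
   * Run with more virtual ranks than PEs: ./charmrun ./ampi_hello +p2 +vp8 */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      /* Writable globals must be privatized: under AMPI several ranks
       * share one process's address space and may migrate between PEs. */
      printf("virtual rank %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
  }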

By looking at the compilation options used in the nightly regression
tests (http://charm.cs.illinois.edu/autobuild/cur/), I noticed that the
ibverbs versions were being compiled with the option "-thread context".
So I built Charm++, AMPI, and my application with this option, and now
it seems to be working with InfiniBand.

However, I'd like to know what this option means. I couldn't find this
information anywhere.

On another topic, I am also interested in using the smp build option.
At first I was getting a migration error; then I found out I needed to
pass the option +CmiNoProcForComThread to the runtime.

So now I can execute my application with ibverbs OR smp, but not with
ibverbs AND smp together! When I try to run the application on a
Charm++/AMPI build with both ibverbs and smp, I get a segmentation
violation. When I compile the program with "-memory paranoid", the
error disappears.

Commands used to build Charm++ and AMPI:
 ./build charm++ net-linux-x86_64 ibverbs smp -j16 --with-production -thread context
 ./build AMPI net-linux-x86_64 ibverbs smp -j16 --with-production -thread context

I am passing the following options to charmrun (on 4 nodes x 8 cores per node):
 ./charmrun ondes3d +p32 +vp 128 +mapping BLOCK_MAP ++remote-shell ssh +setcpuaffinity +balancer GreedyLB +CmiNoProcForComThread

I also tested with the migration test program that comes with Charm++
(in the subdirectory tests/ampi/migration). It doesn't give a
segmentation violation, but sometimes it hangs during migration. I have
included the output below this message.
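
That test is essentially a compute/migrate loop. A minimal sketch of
the same pattern (this is not the actual test source, and the
migration call is spelled differently across AMPI versions: older ones
use MPI_Migrate(), newer ones AMPI_Migrate(AMPI_INFO_LB_SYNC)):

  /* migrate_loop.c -- sketch in the spirit of tests/ampi/migration. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, step;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      for (step = 0; step < 3; step++) {
          /* ... compute phase would go here ... */
          AMPI_Migrate(AMPI_INFO_LB_SYNC); /* collective; the runtime may
                                              move this rank to another PE */
          printf("Done with step %d\n", step);
      }
      MPI_Finalize();
      return 0;
  }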

Any idea what the problem may be?

--
Best regards,
   Rafael Keller Tesser

GPPD - Grupo de Processamento Paralelo e Distribuído
Instituto de Informática / UFRGS
Porto Alegre, RS - Brasil

-------------------
****Output of the migration test program (until it hangs):****
./charmrun ./pgm +p2 +vp4 +CmiNoProcForComThread
Charmrun> IBVERBS version of charmrun
Charmrun> started all node programs in 1.198 seconds.
Converse/Charm++ Commit ID:
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.002 seconds.

begin migrating

begin migrating

begin migrating

begin migrating
Trying to migrate partition 1 from pe 0 to 1
Entering TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 0,
migrate_test is 0
Leaving TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 1,
migrate_test is 0
Done with step 0
Done with step 0
Done with step 0
Done with step 0
Trying to migrate partition 1 from pe 1 to 0
Entering TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 1,
migrate_test is 1
Leaving TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 0,
migrate_test is 1
Done with step 1
Done with step 1
Done with step 1
Done with step 1
Trying to migrate partition 1 from pe 0 to 1
Entering TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 0,
migrate_test is 0
Leaving TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 1,
migrate_test is 0
Done with step 2
Done with step 2
Done with step 2
Done with step 2
done migrating
done migrating
done migrating
done migrating
All tests passed
./charmrun ./pgm +p2 +vp20 +CmiNoProcForComThread
Charmrun> IBVERBS version of charmrun
Charmrun> started all node programs in 1.174 seconds.
Converse/Charm++ Commit ID:
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.002 seconds.

begin migrating

begin migrating
Trying to migrate partition 1 from pe 0 to 1

begin migrating
Entering TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 0,
migrate_test is 0



--------------------------------------------------------------
****My previous message:****
From: Rafael Keller Tesser <rafael.tesser AT inf.ufrgs.br>
Date: Thu, Mar 21, 2013 at 10:38 AM
Subject: Migration error with AMPI + ibverbs
To: charm AT cs.uiuc.edu


Hello,

I ported a geophysics application to AMPI in order to experiment with
its load-balancing features.

Without load balancing, the application runs without any error on both
Gigabit Ethernet and InfiniBand. With load balancing, the application
runs fine on Gigabit Ethernet.
With the ibverbs version of Charm++, however, I am getting the
following error during the first load-balancing step:

--
...
CharmLB> GreedyLB: PE [0] Memory: LBManager: 921 KB CentralLB: 87 KB
CharmLB> GreedyLB: PE [0] #Objects migrating: 247, LBMigrateMsg size: 0.02 MB
CharmLB> GreedyLB: PE [0] strategy finished at 55.669918 duration 0.007592 s
[0] Starting ReceiveMigration step 0 at 55.672409
Charmrun: error on request socket--
Socket closed before recv.

--

I am sending the full output in a file attached to this message (output.txt).

The error also happens with the AMPI migration test program that comes
with Charm++ (located in tests/ampi/migration). The outputs are
attached to this message.

I get this error both with Charm-6.4.0 and with the development
version from the Git repository.

AMPI was built with:
./build charm++ net-linux-x86_64 ibverbs --with-production -j16
./build AMPI net-linux-x86_64 ibverbs --with-production -j16


Do you have any idea what may be causing this error?

--
Best regards,
   Rafael Keller Tesser

GPPD - Grupo de Processamento Paralelo e Distribuído
Instituto de Informática / UFRGS

