Re: [charm] Migration error with AMPI + ibverbs (+ SMP)


  • From: "Jain, Nikhil" <nikhil AT illinois.edu>
  • To: Rafael Keller Tesser <rktesser AT inf.ufrgs.br>, "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] Migration error with AMPI + ibverbs (+ SMP)
  • Date: Fri, 22 Mar 2013 20:59:54 +0000

Answers inline:


>However, I'd like to know what this option means. I couldn't find this
>information anywhere.

MPI ranks in AMPI are executed as user-level threads, typically more than
one per core. We provide different context-switching schemes for these
threads, since depending on the machine one mechanism or another may work
better. Some of the commonly used thread implementations and their
corresponding context-switching mechanisms are listed below (a build
sketch follows the list):

1. qt - the most commonly used; our own QuickThreads implementation.
2. context - uses the context-switching mechanism provided by the OS.
3. uJcontext - uses our own setjmp-based implementation (this is the
default with net-ibverbs).
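
The mechanism is selected at build time through the -thread option you
already used. A minimal sketch, assuming the build script accepts each of
the names above as a value of -thread (context is the one from your own
commands; qt and uJcontext here are only illustrative):

 ./build AMPI net-linux-x86_64 ibverbs smp -j16 --with-production -thread context
 ./build AMPI net-linux-x86_64 ibverbs smp -j16 --with-production -thread uJcontext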


As for the other things you mentioned, I am experimenting with them one by
one on net-ibverbs-smp, so that the answers I provide are well tested.
Meanwhile, can you send me the content of the following file on your
system:
/proc/sys/kernel/randomize_va_space

Is it 0, 1, or something else?
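
For reference, a quick way to check that setting, and to turn off
address-space randomization temporarily (as root) should we need to; the
values mean 0 = ASLR disabled, 1 = stack/mmap/VDSO randomized, 2 = heap
randomized as well:

 cat /proc/sys/kernel/randomize_va_space
 sysctl -w kernel.randomize_va_space=0    # as root; not persistent across reboots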

Thanks
Nikhil

>
>On another topic, I am also interested in using the smp module. At
>first I was getting a migration error; then I found out I needed to
>pass the option +CmiNoProcForComThread to the runtime.
>
>So, now I can execute my application with ibverbs OR smp, but not with
>ibverbs AND smp together! When I try to run the application on
>Charm++/AMPI built with ibverbs and smp, I get a segmentation
>violation error. When I compile the program with "-memory paranoid",
>the error disappears.
>
>Commands used to build Charm++ and AMPI:
> ./build charm++ net-linux-x86_64 ibverbs smp -j16 --with-production
>-thread context
> ./build AMPI net-linux-x86_64 ibverbs smp -j16 --with-production
>-thread context
>
>I am passing the following options to charmrun (on 4 nodes x 8 cores per
>node):
> ./charmrun ondes3d +p32 +vp 128 +mapping BLOCK_MAP ++remote-shell ssh
>+setcpuaffinity +balancer GreedyLB +CmiNoProcForComThread
>
>I also tested with the migration test program that came with Charm (in
>the subdirectory tests/ampi/migration). It doesn't give a segmentation
>violation, but sometimes it hangs during migration. I have included the
>output below this message.
>
>Any idea what the problem may be?
>
>--
>Best regards,
> Rafael Keller Tesser
>
>GPPD - Grupo de Processamento Paralelo e Distribuído
>Instituto de Informática / UFRGS
>Porto Alegre, RS - Brasil
>
>-------------------
>****Output of the migration test program (until it hangs):****
>./charmrun ./pgm +p2 +vp4 +CmiNoProcForComThread
>Charmrun> IBVERBS version of charmrun
>Charmrun> started all node programs in 1.198 seconds.
>Converse/Charm++ Commit ID:
>Charm++> scheduler running in netpoll mode.
>CharmLB> Load balancer assumes all CPUs are same.
>Charm++> Running on 1 unique compute nodes (8-way SMP).
>Charm++> cpu topology info is gathered in 0.002 seconds.
>
>begin migrating
>
>begin migrating
>
>begin migrating
>
>begin migrating
>Trying to migrate partition 1 from pe 0 to 1
>Entering TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 0,
>migrate_test is 0
>Leaving TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 1,
>migrate_test is 0
>Done with step 0
>Done with step 0
>Done with step 0
>Done with step 0
>Trying to migrate partition 1 from pe 1 to 0
>Entering TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 1,
>migrate_test is 1
>Leaving TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 0,
>migrate_test is 1
>Done with step 1
>Done with step 1
>Done with step 1
>Done with step 1
>Trying to migrate partition 1 from pe 0 to 1
>Entering TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 0,
>migrate_test is 0
>Leaving TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 1,
>migrate_test is 0
>Done with step 2
>Done with step 2
>Done with step 2
>Done with step 2
>done migrating
>done migrating
>done migrating
>done migrating
>All tests passed
>./charmrun ./pgm +p2 +vp20 +CmiNoProcForComThread
>Charmrun> IBVERBS version of charmrun
>Charmrun> started all node programs in 1.174 seconds.
>Converse/Charm++ Commit ID:
>Charm++> scheduler running in netpoll mode.
>CharmLB> Load balancer assumes all CPUs are same.
>Charm++> Running on 1 unique compute nodes (8-way SMP).
>Charm++> cpu topology info is gathered in 0.002 seconds.
>
>begin migrating
>
>begin migrating
>Trying to migrate partition 1 from pe 0 to 1
>
>begin migrating
>Entering TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 0,
>migrate_test is 0
>
>
>
>--------------------------------------------------------------
>***My previous message:****
>From: Rafael Keller Tesser
><rafael.tesser AT inf.ufrgs.br>
>Date: Thu, Mar 21, 2013 at 10:38 AM
>Subject: Migration error with AMPI + ibverbs
>To:
>charm AT cs.uiuc.edu
>
>
>Hello,
>
>I ported a geophysics application to AMPI, in order to experiment with
>its load balancing features.
>
>Without load balancing, the application runs without any error on both
>Gigabit Ethernet and InfiniBand. With load balancing, the application
>runs fine on Gigabit Ethernet.
>With the IBVERBS version of Charm++, however, I am getting the following
>error during the first load-balancing step:
>
>--
>...
>CharmLB> GreedyLB: PE [0] Memory: LBManager: 921 KB CentralLB: 87 KB
>CharmLB> GreedyLB: PE [0] #Objects migrating: 247, LBMigrateMsg size:
>0.02 MB
>CharmLB> GreedyLB: PE [0] strategy finished at 55.669918 duration
>0.007592 s
>[0] Starting ReceiveMigration step 0 at 55.672409
>Charmrun: error on request socket--
>Socket closed before recv.
>
>--
>
>I am sending the full output in a file attached to this message (output.txt).
>
>The error also happens with the AMPI migration test program that comes
>with Charm++ (located in tests/ampi/migration). The outputs are
>attached to this message.
>
>I get this error both with Charm-6.4.0 and with the development
>version from the Git repository.
>
>AMPI was built with:
>./build charm++ net-linux-x86_64 ibverbs --with-production -j16
>./build AMPI net-linux-x86_64 ibverbs --with-production -j16
>
>
>Do you have any ideas on what may be causing this error?
>
>--
>Best regards,
> Rafael Keller Tesser
>
>GPPD - Grupo de Processamento Paralelo e Distribuído
>Instituto de Informática / UFRGS





