
charm - Re: [charm] Migration error with AMPI + ibverbs (+ SMP)

  • From: "Jain, Nikhil" <nikhil AT illinois.edu>
  • To: Rafael Keller Tesser <rktesser AT inf.ufrgs.br>, "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] Migration error with AMPI + ibverbs (+ SMP)
  • Date: Fri, 22 Mar 2013 21:50:24 +0000

Answers to the remaining questions follow:

>On another topic, I am also interested in using the SMP mode. At
>first I was getting a migration error; then I found out I needed to
>pass the option +CmiNoProcForComThread to the runtime.

For net-verbs, SMP mode makes use of an extra communication thread in
addition to the worker threads that perform the real work. The syntax for
running a job is:

./charmrun +p<NUM_WORKERS> ++ppn<WORKERS_PER_COMM_THREAD> ./pgm <pgm_params> +vp<NUM_RANKS> +isomalloc_sync +setcpuaffinity

NUM_WORKERS is the total number of worker threads that will do real work.
WORKERS_PER_COMM_THREAD is the number of workers for which one comm thread
will be created. The sum of these two, the number of worker threads plus
the number of comm threads, determines the number of threads Charm++
creates. If this number is greater than the total number of cores
allocated for a job run, you will need to pass +CmiNoProcForComThread so
that threads share a core. A real example: given a job allocation of 32
cores, I use the following to run:

./charmrun +p30 ++ppn 15 ./pgm 10 +vp100 +isomalloc_sync +setcpuaffinity

This will create 2 processes, each of which will create 16 threads: 15
workers and 1 comm thread. +isomalloc_sync is needed if the stack address
is randomized, which otherwise interferes with AMPI working correctly.
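
For comparison, on an allocation of 4 nodes x 8 cores (32 cores total, as
in your runs below), one layout that exactly fills the cores is 7 workers
per comm thread; the exact process placement depends on your nodelist, so
treat this as a sketch:

./charmrun +p28 ++ppn 7 ./pgm <pgm_params> +vp128 +isomalloc_sync +setcpuaffinity

This creates 4 processes of 8 threads each (7 workers plus 1 comm thread),
for 32 threads on 32 cores, so +CmiNoProcForComThread is not needed.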


>So, now I can execute my application with ibverbs OR smp, but not with
>ibverbs AND smp together! When I try to run the application on
>Charm++/AMPI built with ibverbs and smp, I get a segmentation
>violation error. When I compile the program with "-memory paranoid",
>the error disappears.
>
>Commands used to build Charm++ and AMPI:
> ./build charm++ net-linux-x86_64 ibverbs smp -j16 --with-production -thread context
> ./build AMPI net-linux-x86_64 ibverbs smp -j16 --with-production -thread context
>
>I am passing the following options to charmrun (on 4 nodes x 8 cores per
>node):
> ./charmrun ondes3d +p32 +vp 128 +mapping BLOCK_MAP ++remote-shell ssh +setcpuaffinity +balancer GreedyLB +CmiNoProcForComThread
>
>I also tested with the migration test program that comes with Charm (in
>the subdirectory tests/ampi/migration). It doesn't give a segmentation
>violation, but sometimes it hangs during migration. I included the
>output below this message.

I tried these combinations, and they execute well for me with -thread
context compilation and +isomalloc_sync as a runtime parameter. Depending
on the content of your /proc/sys/kernel/randomize_va_space,
+isomalloc_sync may also be required for your run. Can you tell me the
content of that file, and also try with this flag?
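
For reference, you can check it with:

cat /proc/sys/kernel/randomize_va_space

A value of 0 means randomization is disabled; 1 and 2 mean conservative
and full randomization. If it reports 1 or 2, try appending the flag to
your earlier command, for example:

./charmrun ondes3d +p32 +vp 128 +mapping BLOCK_MAP ++remote-shell ssh +setcpuaffinity +balancer GreedyLB +CmiNoProcForComThread +isomalloc_sync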

--Nikhil

>Any idea on what the problem may be?
>
>--
>Best regards,
> Rafael Keller Tesser
>
>GPPD - Grupo de Processamento Paralelo e Distribuído
>Instituto de Informática / UFRGS
>Porto Alegre, RS - Brasil
>
>-------------------
>****Output of the migration test program (until it hangs):****
>./charmrun ./pgm +p2 +vp4 +CmiNoProcForComThread
>Charmrun> IBVERBS version of charmrun
>Charmrun> started all node programs in 1.198 seconds.
>Converse/Charm++ Commit ID:
>Charm++> scheduler running in netpoll mode.
>CharmLB> Load balancer assumes all CPUs are same.
>Charm++> Running on 1 unique compute nodes (8-way SMP).
>Charm++> cpu topology info is gathered in 0.002 seconds.
>
>begin migrating
>
>begin migrating
>
>begin migrating
>
>begin migrating
>Trying to migrate partition 1 from pe 0 to 1
>Entering TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 0,
>migrate_test is 0
>Leaving TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 1,
>migrate_test is 0
>Done with step 0
>Done with step 0
>Done with step 0
>Done with step 0
>Trying to migrate partition 1 from pe 1 to 0
>Entering TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 1,
>migrate_test is 1
>Leaving TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 0,
>migrate_test is 1
>Done with step 1
>Done with step 1
>Done with step 1
>Done with step 1
>Trying to migrate partition 1 from pe 0 to 1
>Entering TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 0,
>migrate_test is 0
>Leaving TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 1,
>migrate_test is 0
>Done with step 2
>Done with step 2
>Done with step 2
>Done with step 2
>done migrating
>done migrating
>done migrating
>done migrating
>All tests passed
>./charmrun ./pgm +p2 +vp20 +CmiNoProcForComThread
>Charmrun> IBVERBS version of charmrun
>Charmrun> started all node programs in 1.174 seconds.
>Converse/Charm++ Commit ID:
>Charm++> scheduler running in netpoll mode.
>CharmLB> Load balancer assumes all CPUs are same.
>Charm++> Running on 1 unique compute nodes (8-way SMP).
>Charm++> cpu topology info is gathered in 0.002 seconds.
>
>begin migrating
>
>begin migrating
>Trying to migrate partition 1 from pe 0 to 1
>
>begin migrating
>Entering TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 0,
>migrate_test is 0
>
>
>
>--------------------------------------------------------------
>****My previous message:****
>From: Rafael Keller Tesser
><rafael.tesser AT inf.ufrgs.br>
>Date: Thu, Mar 21, 2013 at 10:38 AM
>Subject: Migration error with AMPI + ibverbs
>To:
>charm AT cs.uiuc.edu
>
>
>Hello,
>
>I ported a geophysics application to AMPI, in order to experiment with
>its load balancing features.
>
>Without load balancing, the application runs without any error on both
>Gigabit Ethernet and InfiniBand. With load balancing, the application
>runs fine on Gigabit Ethernet. With the IBVERBS version of Charm,
>however, I get the following error during the first load-balancing
>step:
>
>--
>...
>CharmLB> GreedyLB: PE [0] Memory: LBManager: 921 KB CentralLB: 87 KB
>CharmLB> GreedyLB: PE [0] #Objects migrating: 247, LBMigrateMsg size: 0.02 MB
>CharmLB> GreedyLB: PE [0] strategy finished at 55.669918 duration 0.007592 s
>[0] Starting ReceiveMigration step 0 at 55.672409
>Charmrun: error on request socket--
>Socket closed before recv.
>
>--
>
>I sent the full output in a file attached to this message (output.txt).
>
>The error also happens with the AMPI migration test program that comes
>with Charm++ (located in tests/ampi/migration). The outputs are
>attached to this message.
>
>I get this error both with Charm-6.4.0 and with the development
>version from the Git repository.
>
>AMPI was built with:
>./build charm++ net-linux-x86_64 ibverbs --with-production -j16
>./build AMPI net-linux-x86_64 ibverbs --with-production -j16
>
>
>Do you have any ideas on what may be causing this error?
>
>--
>Best regards,
> Rafael Keller Tesser
>
>GPPD - Grupo de Processamento Paralelo e Distribuído
>Instituto de Informática / UFRGS





