Re: [charm] Using Charm AMPI


  • From: Sam White <samt.white AT gmail.com>
  • To: Leonardo Duarte <leo.duarte AT gmail.com>
  • Cc: Scott Field <sfield AT astro.cornell.edu>, Charm Mailing List <charm AT cs.illinois.edu>
  • Subject: Re: [charm] Using Charm AMPI
  • Date: Thu, 29 Oct 2015 12:31:56 -0500

Hi Leonardo,

This looks like a problem with your run commands rather than a compiler problem. AMPI will run with both CCE and GCC, and the performance should generally be similar at similar levels of optimization (as with any application).
Load the craype-hugepages8M and PrgEnv-X (either cray or gnu) modules, then build AMPI with:

./build AMPI gni-crayxe smp -j8 --with-production -g
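Spelled out as commands (taking the gnu option as the example; the module swap syntax follows Scott's message below, and cray works analogously):

module swap PrgEnv-cray PrgEnv-gnu
module load craype-hugepages8M
./build AMPI gni-crayxe smp -j8 --with-production -g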

Then, to run an AMPI program 'pgm' on 2 nodes in SMP mode with 1 thread per core (16 cores/node, ignoring the hyperthreads), you would do:

aprun -n 2 ./pgm +ppn 15 +pemap 1-15 +commap 0 +vp 120

The +ppn and related options are necessary to tell AMPI (and Charm++'s runtime system, on top of which AMPI runs) how many threads and processes to create; AMPI does accept these options, and in SMP mode you need them to specify the number of threads per OS process. The above command creates 2 OS processes, each with 15 worker threads and 1 dedicated communication thread. Each worker thread hosts 4 virtual processors (2 x 15 x 4 = 120 MPI ranks in total, matching +vp 120).
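To make the layout explicit, here is the same launch line annotated:

# aprun -n 2    -> 2 OS processes in total
#   +ppn 15     -> 15 worker threads per process
#   +pemap 1-15 -> pin the worker threads to cores 1-15
#   +commap 0   -> pin the communication thread to core 0
#   +vp 120     -> 120 virtual processors (MPI ranks) overall
aprun -n 2 ./pgm +ppn 15 +pemap 1-15 +commap 0 +vp 120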

Let me know if you have any other questions, and if this doesn't help, build Charm++ and your application with the '-g' option and include a stack trace of the failure if possible.
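If it would help to capture that trace from the batch job, one generic approach (not AMPI-specific; a sketch, building on the APRUN_XFER_LIMITS line already in your job script) is core dumps plus gdb:

export APRUN_XFER_LIMITS=1   # transfer shell limits, like the core size, to the compute nodes
ulimit -c unlimited          # allow core dumps
aprun -n 2 ./pgm ...         # a segfault should then leave core files in the working directory
gdb ./pgm core               # and 'bt' prints the stack trace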

- Sam


On Thu, Oct 29, 2015 at 12:14 PM, Leonardo Duarte <leo.duarte AT gmail.com> wrote:
Hello Scott, thanks for your help.

I also swapped to PrgEnv-gnu, loaded the hugepages8M and rca modules, and used the 'persistent' option to build Charm. It worked, but it was extremely slow.
A simple example runs in seconds on my laptop with 2 processors (simulating 2 nodes), but takes 10 minutes on 2 nodes of BW.
Of course I was expecting it to be slower, but not by this much.
That's why I decided to use the PrgEnv-cray environment; it's the native one.

AMPI did not accept +ppn 30 for me; it seems to take the counts from the aprun parameters instead.

My startup line with only 2 nodes and only 1 worker thread per process is not wrong.
Since I was having trouble running it, I simplified the example to better understand what was going on.

However, it's good to know that your application uses PrgEnv-gnu.
I was worried that mine was so slow because I was using it, or because I was missing something in the build.

I really want to make it work with PrgEnv-cray right now, but I don't know what I'm doing wrong.

Thanks for your answer!
Leonardo.

On Thu, Oct 29, 2015 at 11:00 AM, Scott Field <sfield AT astro.cornell.edu> wrote:
Hi Leonardo,

  I have a Charm++ application running on Blue Waters, and hopefully some of this will carry over to AMPI.

  In addition to the default blue waters environment, I use 

module swap PrgEnv-cray PrgEnv-gnu/5.2.40
module load craype-hugepages2M
module load rca

  and my Charm++ build includes the "persistent" option. To launch the application I do

>>> aprun -n 2 -r 1 -N 1 -d 31 ./ExecutableName +ppn 30 +pemap 1-30 +commap 0
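Roughly, those aprun flags mean the following (my reading of the aprun man page, worth double-checking):

-n 2  : 2 processing elements (OS processes) in total
-r 1  : 1 core per node reserved for system tasks (core specialization)
-N 1  : 1 process per node
-d 31 : 31 CPUs per process (the 30 worker threads plus the comm thread)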

On startup, my Charm++ output looks different from yours. In particular, I see 

"Charm++> Running in SMP mode: numNodes 2,  30 worker threads per process"

while yours reads 

"Charm++> Running in SMP mode: numNodes 2,  1 worker threads per process"

These differences may or may not explain the errors you see. Hopefully it helps. Good luck!

Scott


On Thu, Oct 29, 2015 at 1:58 AM, Leonardo Duarte <leo.duarte AT gmail.com> wrote:
Hello Everyone,

I'm a PhD student in the CEE department at UIUC, and I would really appreciate it if anyone could help me with Charm.

I'm trying to run my code on Blue Waters and I'm using a library that uses Charm++ AMPI.
I was able to build and run everything correctly, but extremely slowly, with PrgEnv-gnu.
Now I'm trying to use the native Cray environment.

I'm using this BW environment and modules:

PrgEnv-cray
module load craype-hugepages8M
module load rca

I built Charm with this command line:

./build LIBS gni-crayxe craycc  smp  -j16  --with-production --build-shared -O3
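For reference, the build command in Sam's reply at the top of this thread targets AMPI directly:

./build AMPI gni-crayxe smp -j8 --with-production -g

It may be worth checking whether the LIBS target produces everything the AMPI link step expects.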

My code is composed of many shared libraries that the application loads dynamically using dlopen, dlsym, etc.

I'm able to build my code using these command lines in my makefiles:

To compile code that does not use Charm:
CC -c -fPIC  -O2 -I../../core/include -I../../tecgraf/tops/include -o ../../obj/obj64/linear/Linux3/linear.o ../../plugins/behavior/linear/linear.cpp

To link code that does not use Charm:
CC -shared -Wl,-soname,liblinear.so.1 -o liblinear.so.1.0 ../../obj/obj64/linear/Linux3/linear.o -L../../tecgraf/tops/lib64/Linux3 -ltops -L../../bin/lib64/Linux3 -ltopsim

To compile code that uses Charm:
charmc -language model -c -fPIC -O2 -I../../core/include -I../../tecgraf/tops/include -I../../tecgraf/tops/include/vis -I../../../bin/charm/include -o ../../obj/obj64/parebepcg/Linux3/parebepcg.o ../../plugins/linearsystem/ebepcg/parebepcg.cpp

To link code that uses Charm:
charmc -shared -language ampi -Wl,-soname,libparebepcg.so.1 -o libparebepcg.so.1.0 ../../obj/obj64/parebepcg/Linux3/parebepcg.o -L../../tecgraf/tops/lib64/Linux3 -lpartops -ltopsrd -ltops -L../../bin/lib64/Linux3 -lpartopsim

To compile my app:
charmc -language model -c -fPIC  -O2 -I../../core/include -I../../tecgraf/tops/include -I../../tecgraf/tops/include/vis -I../../plugins -o ../../obj/obj64/partopsimapp/partopsimapp/Linux3/parmain.o ../../tests/app/parmain.cpp

To link my app:
charmc -language ampi -dynamic -o ../../bin/lib64/Linux3/partopsimapp ../../obj/obj64/partopsimapp/partopsimapp/Linux3/parmain.o -L../../tecgraf/tops/lib64/Linux3 -lpartops -ltopsrd -ltops -L../../bin/lib64/Linux3 -lpartopsim -lpartopsimlib -Wl,--no-as-needed -ldl
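To check the toolchain itself apart from the plugins, a minimal AMPI program can be built and launched the same way (a sketch; hello.c is just an illustrative name, not part of the project):

cat > hello.c <<'EOF'
#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);            /* AMPI implements the standard MPI API */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("hello from rank %d\n", rank);
    MPI_Finalize();
    return 0;
}
EOF
charmc -language ampi -o hello hello.c
aprun -n 2 -N 1 ./hello +vp 2

If that also crashed, the problem would be in the build environment rather than in the dlopen'd plugins.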

This is the error that I get:

_pmiu_daemon(SIGCHLD): [NID 16828] [c19-9c1s1n0] [Thu Oct 29 00:35:04 2015] PE RANK 0 exit signal Segmentation fault
[NID 16828] 2015-10-29 00:35:04 Apid 28607883: initiated application termination
_pmiu_daemon(SIGCHLD): [NID 16829] [c19-9c1s1n1] [Thu Oct 29 00:35:04 2015] PE RANK 1 exit signal Segmentation fault

I put some extra info at the end of the email in case you need it.
I've read a lot of things online and have been trying a lot, but now I think I need some help.
Am I missing something? Is this the correct way to handle it?
I really appreciate any suggestions.

Thank you.
Leonardo.

Extra info

These are my environment variables:

echo $PATH
.:/u/psp/duarte/bin/lua5:/u/psp/duarte/bin/tolua5:/u/psp/duarte/bin/charm/gni-crayxe-smp-craycc/bin:/u/psp/duarte/bin/charm/gni-crayxe-persistent-smp/bin:/sw/xe/darshan/2.3.0/darshan-2.3.0_cle52/bin:/sw/admin/scripts:/sw/user/scripts:/sw/xe/altd/bin:/usr/local/gsi-openssh-6.2p2-2/bin:/opt/java/jdk1.7.0_45/bin:/usr/local/globus-5.2.4/bin:/usr/local/globus-5.2.4/sbin:/opt/moab/8.1/bin:/opt/moab/8.1/sbin:/opt/torque/5.0.2-bwpatch/sbin:/opt/torque/5.0.2-bwpatch/bin:/opt/cray/mpt/7.2.0/gni/bin:/opt/cray/rca/1.0.0-2.0502.53711.3.125.gem/bin:/opt/cray/alps/5.2.1-2.0502.9041.11.6.gem/sbin:/opt/cray/alps/5.2.1-2.0502.9041.11.6.gem/bin:/opt/cray/dvs/2.5_0.9.0-1.0502.1873.1.142.gem/bin:/opt/cray/xpmem/0.1-2.0502.55507.3.2.gem/bin:/opt/cray/dmapp/7.0.1-1.0502.9501.5.211.gem/bin:/opt/cray/pmi/5.0.6-1.0000.10439.140.3.gem/bin:/opt/cray/ugni/5.0-1.0502.9685.4.24.gem/bin:/opt/cray/udreg/2.3.2-1.0502.9275.1.25.gem/bin:/opt/cray/cce/8.3.10/cray-binutils/x86_64-unknown-linux-gnu/bin:/opt/cray/cce/8.3.10/craylibs/x86-64/bin:/opt/cray/cce/8.3.10/cftn/bin:/opt/cray/cce/8.3.10/CC/bin:/opt/cray/craype/2.3.0/bin:/opt/cray/eslogin/eswrap/1.1.0-1.020200.1231.0/bin:/opt/modules/3.2.10.3/bin:/u/psp/duarte/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:/usr/lib/qt3/bin:/opt/cray/bin

echo $LD_LIBRARY_PATH
.:/u/psp/duarte/topsim/bin/lib64/Linux3:/u/psp/duarte/topsim/bin/libd64/Linux3:/u/psp/duarte/bin/charm/gni-crayxe-smp-craycc/lib_so:/u/psp/duarte/bin/charm/gni-crayxe-smp-craycc/lib:/u/psp/duarte/bin/charm/gni-crayxe-persistent-smp/lib:/u/psp/duarte/lib:/sw/xe/darshan/2.3.0/darshan-2.3.0_cle52/lib:/usr/local/globus-5.2.4/lib64:/usr/local/globus/lib64


My app output:

Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 8192K
Charm++> Running in SMP mode: numNodes 2,  1 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.6.1-0-g74a2cc5
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 2 unique compute nodes (32-way SMP).
*** Topsim 0.1.0 ***
[0] topParInit() registered
[0] TopParContext created: 0!
[0] topParInit() array created
[1] TopParContext created: 1!
[1] topParInit() registered
[1] topParInit() array created
[0] topParInit() done!
[1] topParInit() done!
[0] PARTOPS: Slave started at processor 0, node: 0, rank: 0.
[0] PARTOPS: MODEL CREATED! rank: 0
[1] PARTOPS: Slave started at processor 1, node: 1, rank: 0.
[1] PARTOPS: MODEL CREATED! rank: 0
Plugin loaded libparebepcg.so
Plugin loaded libpartreader.so
Plugin loaded libisotropic.so
Plugin loaded liblinear.so
Plugin loaded libparsimp.so
Plugin loaded libbrick.so
Plugin loaded libpartreader.so
Plugin loaded libparebepcg.so
Plugin loaded libparloadcontrol.so
Plugin loaded libparwriter.so
Plugin loaded libparsimp.so
Plugin loaded libparjacobi.so
Plugin loaded libbrick.so
Plugin loaded libparwriter.so
Plugin loaded liblinear.so
Plugin loaded libisotropic.so
Plugin loaded libparloadcontrol.so
Plugin loaded libparjacobi.so
Application 28607883 exit codes: 139
Application 28607883 resources: utime ~2s, stime ~2s, Rss ~15384, inblocks ~10927, outblocks ~18489
Thu Oct 29 00:35:04 CDT 2015

This is my PBS script

#!/bin/bash
### set the number of nodes
### set the number of PEs per node
#PBS -l nodes=2:ppn=1:xe
### set the wallclock time
#PBS -l walltime=00:20:00
### set the job name
#PBS -N topsim
### set the job stdout and stderr
#PBS -e topsim.err
#PBS -o topsim.out
### set email notification
#PBS -m bea
### In case of multiple allocations, select which one to charge
##PBS -A xyz

# NOTE: lines that begin with "#PBS" are not interpreted by the shell but ARE
# used by the batch system, whereas lines that begin with multiple # signs,
# like "##PBS" are considered "commented out" by the batch system
# and have no effect.

# If you launched the job in a directory prepared for the job to run within,
# you'll want to cd to that directory
# [uncomment the following line to enable this]
cd $PBS_O_WORKDIR

# Alternatively, the job script can create its own job-ID-unique directory
# to run within.  In that case you'll need to create and populate that
# directory with executables and perhaps inputs
# [uncomment and customize the following lines to enable this behavior]
# mkdir -p /scratch/sciteam/$USER/$PBS_JOBID
# cd /scratch/sciteam/$USER/$PBS_JOBID
# cp /scratch/job/setup/directory/* .

# To add certain modules that you do not have added via ~/.modules
. /opt/modules/default/init/bash # NEEDED to add module commands to shell
#module swap PrgEnv-cray PrgEnv-gnu
module add craype-hugepages8M
module add rca

#export CRAY_ROOTFS=DSL
echo $LD_LIBRARY_PATH

#export APRUN_XFER_LIMITS=1  # to transfer shell limits to the executable

### launch the application
### redirecting stdin and stdout if needed
### NOTE: (the "in" file must exist for input)

# used for timing
date

aprun -n2 -N1 ./partopsimapp ../../../tests/data/input/config/plugins_simp_parebepcg_jacobi_brick.lua ../../../tests/data/input/examples/CantSymm/CantSymm12_2.pos ../../../tests/data/output/CantSymm12_2_result.pos

# used for timing
date
### For more information see the man page for aprun
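One thing worth noting: the aprun line above passes none of the +ppn/+pemap/+commap/+vp options discussed at the top of the thread, which matches the "1 worker threads per process" line in the startup output. Following Scott's pattern, an SMP launch on 2 of these 32-way nodes might look like the following (a sketch; <args> stands for the three file arguments above, and the +vp count is just an example):

aprun -n 2 -r 1 -N 1 -d 31 ./partopsimapp <args> +ppn 30 +pemap 1-30 +commap 0 +vp 60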





