
charm - RE: [charm] CrayPat with Charm++

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system


RE: [charm] CrayPat with Charm++


  • From: "Choi, Jaemin" <jchoi157 AT illinois.edu>
  • To: "Papatheodore, Thomas L." <papatheodore AT ornl.gov>, Ted Packwood <malice AT cray.com>
  • Cc: "charm AT cs.illinois.edu" <charm AT cs.illinois.edu>
  • Subject: RE: [charm] CrayPat with Charm++
  • Date: Thu, 13 Jul 2017 20:03:35 +0000
  • Accept-language: en-US
  • Authentication-results: illinois.edu; spf=pass smtp.mailfrom=jchoi157 AT illinois.edu

Looking at the output, it seems like the memory allocation failures could be
related to the memory pool allocation in the GNI machine layer.
Forwarding this to the charm mailing list for input from others.



Jaemin Choi
Ph.D. Student, Research Assistant
Parallel Programming Laboratory
University of Illinois Urbana-Champaign

________________________________________
From: Papatheodore, Thomas L.
[papatheodore AT ornl.gov]
Sent: Thursday, July 13, 2017 12:37 PM
To: Choi, Jaemin; Miller, Philip B; Ted Packwood
Subject: Re: [charm] CrayPat with Charm++

Hey Jaemin-

I followed your suggestion of trying CrayPat with a non-CUDA build of
Charm++, but I get the same results. I grabbed a new clone of the 6.8.0 beta
and followed the same steps as outlined below, with only two differences: (1) I
did not load the cudatoolkit module, and (2) I built with the command “./build
charm++ gni-crayxe perftools --with-production -j8”.
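
For reference, the two build commands used in this thread differ only in the
build target (this just restates commands that appear below):

# Non-CUDA build used for this attempt:
./build charm++ gni-crayxe perftools --with-production -j8
# CUDA build used in the earlier attempt (see further down in this thread):
./build charm++ gni-crayxe-cuda perftools --with-production -j8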

Just like the previous attempt, the non-instrumented version works while the
instrumented version gives memory-related errors:

NON-INSTRUMENTED:

[tpapathe@titan-login5:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm_v6.8.0_cpuOnly/charm/gni-crayxe-perftools/tests/charm++/simplearrayhello]$
aprun -n1 ./hello
Charm++> Running on Gemini (GNI) with 1 processes
Charm++> static SMSG
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0
means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 8192K
Charm++> Running in non-SMP mode: numPes 1
Converse/Charm++ Commit ID: v6.8.0-beta1-305-g2f4f0be
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (16-way SMP).
Running Hello on 1 processors for 5 elements
[0] Hello 0 created
[0] Hello 1 created
[0] Hello 2 created
[0] Hello 3 created
[0] Hello 4 created
[0] Hi[17] from element 0
[0] Hi[18] from element 1
[0] Hi[19] from element 2
[0] Hi[20] from element 3
[0] Hi[21] from element 4
All done
[Partition 0][Node 0] End of program
Application 15105644 resources: utime ~0s, stime ~0s, Rss ~8312, inblocks
~17746, outblocks ~51912

INSTRUMENTED:

[tpapathe@titan-login5:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm_v6.8.0_cpuOnly/charm/gni-crayxe-perftools/tests/charm++/simplearrayhello]$
aprun -n1 ./hello+pat
CrayPat/X: Version 6.4.5 Revision 87dd5b8 01/23/17 15:37:24
Charm++> Running on Gemini (GNI) with 1 processes
Charm++> static SMSG
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0
means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 8192K
Charm++> Running in non-SMP mode: numPes 1
Converse/Charm++ Commit ID: v6.8.0-beta1-305-g2f4f0be
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (16-way SMP).
libhugetlbfs [nid02349:6739]: WARNING: New heap segment map at 0x10700000000
failed: Cannot allocate memory
libhugetlbfs [nid02349:6739]: WARNING: New heap segment map at 0x10700800000
failed: Cannot allocate memory
libhugetlbfs [nid02349:6739]: WARNING: New heap segment map at 0x10700800000
failed: Cannot allocate memory
libhugetlbfs [nid02349:6739]: WARNING: New heap segment map at 0x10700800000
failed: Cannot allocate memory
libhugetlbfs [nid02349:6739]: WARNING: New heap segment map at 0x10700800000
failed: Cannot allocate memory
libhugetlbfs [nid02349:6739]: WARNING: New heap segment map at 0x10700800000
failed: Cannot allocate memory
libhugetlbfs [nid02349:6739]: WARNING: New heap segment map at 0x10700800000
failed: Cannot allocate memory
libhugetlbfs [nid02349:6739]: WARNING: New heap segment map at 0x10700800000
failed: Cannot allocate memory
libhugetlbfs [nid02349:6739]: WARNING: New heap segment map at 0x10700800000
failed: Cannot allocate memory
libhugetlbfs [nid02349:6739]: WARNING: New heap segment map at 0x10700800000
failed: Cannot allocate memory
libhugetlbfs [nid02349:6739]: WARNING: New heap segment map at 0x10700800000
failed: Cannot allocate memory


Should CrayPat work with Charm++ (v6.8.0-beta) on Titan? Thanks again for
helping with this.

-Tom


From: "Choi, Jaemin"
<jchoi157 AT illinois.edu>
Date: Thursday, July 13, 2017 at 12:12 PM
To: "Papatheodore, Thomas L."
<papatheodore AT ornl.gov>,
"Miller, Philip B"
<mille121 AT illinois.edu>,
Ted Packwood
<malice AT cray.com>
Subject: RE: [charm] CrayPat with Charm++

Hi Tom,

If Charm++ was built with the cuda option, there is a module in the Charm++
runtime that executes some CUDA calls at program startup, regardless of
whether the application itself uses CUDA. So I believe the warnings from
running the instrumented program (aprun -n1 ./hello+pat) could have been
caused by this. The module also allocates some page-locked host memory
through cudaMallocHost() for use as a memory pool, so the libhugetlbfs
warnings could be a result of that. Also, the warnings should be unrelated to
building with or without SMP.

My suggestion is to get CrayPat working with a non-CUDA (and probably
non-SMP) build of Charm++ first, preferably the development version (6.8.0
beta), since it is almost identical to the soon-to-be-released 6.8.0. Our
team is currently making substantial changes to the Charm++ module that
supports GPUs (GPU Manager), so we could work on getting CrayPat to work
with GPU support after 6.8.0 is officially released and our changes to the
module are merged into the mainline.
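
As a rough sketch of that suggestion (the exact commands are the ones Tom
lists later in this thread; the perftools pseudo-module file and paths are
taken from there):

# Non-CUDA, non-SMP build of the 6.8.0 beta with the perftools option:
git clone http://charm.cs.illinois.edu/gerrit/charm.git
cd charm
touch src/arch/common/conv-mach-perftools.h
export PE_PKGCONFIG_LIBS=cray-pmi:cray-ugni:$PE_PKGCONFIG_LIBS
./build charm++ gni-crayxe perftools --with-production -j8

# Build, instrument, and run the simple hello test:
cd gni-crayxe-perftools/tests/charm++/simplearrayhello
make OPTS="-save"
pat_build -O apa ./hello
aprun -n1 ./hello        # un-instrumented
aprun -n1 ./hello+pat    # instrumented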

Thank you again for your time and effort in this matter.

Jaemin Choi
Ph.D. Student, Research Assistant
Parallel Programming Laboratory
University of Illinois Urbana-Champaign
________________________________
From: Papatheodore, Thomas L.
[papatheodore AT ornl.gov]
Sent: Wednesday, July 12, 2017 10:09 AM
To: Choi, Jaemin; Miller, Philip B; Ted Packwood
Subject: Re: [charm] CrayPat with Charm++

Hey all-



As far as I can tell, CrayPat does not appear to work properly with Charm++
version 6.8.0 on Titan (Cray XK7). That being said, my only experience with
Charm++ is trying to build it and test that it’s working properly, so please
let me know if it looks like I’ve done something incorrectly. This is what I
have done:



GRAB A COPY OF THE LATEST DEVELOPMENT VERSION OF CHARM++:

[tpapathe@titan-ext4:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm_v6.8.0]$ git clone
http://charm.cs.illinois.edu/gerrit/charm.git



MOVE INTO CHARM DIRECTORY:

[tpapathe@titan-ext4:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm_v6.8.0]$ cd charm/



CREATE conv-mach-perftools.h FILE:

[tpapathe@titan-ext4:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm_v6.8.0/charm]$ touch
src/arch/common/conv-mach-perftools.h



PREPEND THE PE_PKGCONFIG_LIBS ENVIRONMENT VARIABLE (as suggested):

[tpapathe@titan-ext4:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm_v6.8.0/charm]$ export
PE_PKGCONFIG_LIBS=cray-pmi:cray-ugni:$PE_PKGCONFIG_LIBS



BUILD CHARM++ (output of successful build attached in output_from_build):

[tpapathe@titan-ext4:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm_v6.8.0/charm]$ ./build
charm++ gni-crayxe-cuda perftools --with-production -j8



THIS BUILD WAS PERFORMED WITH THE FOLLOWING MODULE LIST:

[tpapathe@titan-ext4:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm_v6.8.0/charm]$ module list

Currently Loaded Modulefiles:

 1) eswrap/1.3.3-1.020200.1278.0
 2) craype-network-gemini
 3) gcc/4.9.3
 4) craype/2.5.9
 5) cray-mpich/7.5.2
 6) craype-interlagos
 7) lustredu/1.4
 8) xalt/0.7.5
 9) module_msg/0.1
10) modulator/1.2.0
11) hsi/5.0.2.p1
12) DefApps
13) cray-libsci/16.11.1
14) udreg/2.3.2-1.0502.10518.2.17.gem
15) ugni/6.0-1.0502.10863.8.28.gem
16) pmi/5.0.11
17) dmapp/7.0.1-1.0502.11080.8.74.gem
18) gni-headers/4.0-1.0502.10859.7.8.gem
19) xpmem/0.1-2.0502.64982.5.3.gem
20) dvs/2.5_0.9.0-1.0502.2188.1.113.gem
21) alps/5.2.4-2.0502.9774.31.12.gem
22) rca/1.0.0-2.0502.60530.1.63.gem
23) atp/2.0.5
24) PrgEnv-gnu/5.2.82
25) perftools-base/6.4.5
26) craype-hugepages8M
27) cudatoolkit/7.5.18-1.0502.10743.2.1
28) perftools



MOVE INTO TEST DIRECTORY:

[tpapathe@titan-ext4:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm_v6.8.0/charm]$ cd
gni-crayxe-cuda-perftools/tests/charm++/simplearrayhello



COMPILE TEST PROGRAM:

[tpapathe@titan-ext4:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm_v6.8.0/charm/gni-crayxe-cuda-perftools/tests/charm++/simplearrayhello]$
make OPTS="-save"



INSTRUMENT TEST PROGRAM:

[tpapathe@titan-ext4:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm_v6.8.0/charm/gni-crayxe-cuda-perftools/tests/charm++/simplearrayhello]$
pat_build -O apa ./hello

INFO: A maximum of 712 functions from group 'mpi' will be traced.

INFO: A maximum of 56 functions from group 'realtime' will be traced.

INFO: A maximum of 199 functions from group 'syscall' will be traced.
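
For context on the intended workflow: if the instrumented run completes,
CrayPat writes an experiment data file (an .xf file, as seen later in this
thread for overlapTest), which is then post-processed with pat_report. A
sketch of that step, assuming a run that actually finishes (the file name
below is a placeholder following the pattern shown later):

# Generate a text report from the experiment data file written by the run:
pat_report hello+pat+<apid>-<node>t.xf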



FROM AN INTERACTIVE NODE (with the same module list loaded), RUN THE
UN-INSTRUMENTED EXECUTABLE:

[tpapathe@titan-batch3:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm_v6.8.0/charm/gni-crayxe-cuda-perftools/tests/charm++/simplearrayhello]$
aprun -n1 ./hello

Charm++> Running on Gemini (GNI) with 1 processes

Charm++> static SMSG

Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0
means no limit)

Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB

Charm++> Cray TLB page size: 8192K

Charm++> Running in non-SMP mode: numPes 1

Converse/Charm++ Commit ID: v6.8.0-beta1-299-gcbf50e5

CharmLB> Load balancer assumes all CPUs are same.

Charm++> Running on 1 unique compute nodes (16-way SMP).

Running Hello on 1 processors for 5 elements

[0] Hello 0 created

[0] Hello 1 created

[0] Hello 2 created

[0] Hello 3 created

[0] Hello 4 created

[0] Hi[17] from element 0

[0] Hi[18] from element 1

[0] Hi[19] from element 2

[0] Hi[20] from element 3

[0] Hi[21] from element 4

All done

EXIT HYBRID API

[Partition 0][Node 0] End of program

Application 15066895 resources: utime ~0s, stime ~2s, Rss ~1132800, inblocks
~10482, outblocks ~31188



FROM AN INTERACTIVE NODE (with the same module list loaded), RUN THE
INSTRUMENTED EXECUTABLE:

[tpapathe@titan-batch3:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm_v6.8.0/charm/gni-crayxe-cuda-perftools/tests/charm++/simplearrayhello]$
aprun -n1 ./hello+pat

CrayPat/X: Version 6.4.5 Revision 87dd5b8 01/23/17 15:37:24

pat[WARNING][0]: Collection of accelerator performance data for sampling
experiments is not supported. To collect accelerator performance data
perform a trace experiment. See the intro_craypat(1) man page on how to
perform a trace experiment.

Charm++> Running on Gemini (GNI) with 1 processes

Charm++> static SMSG

Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0
means no limit)

Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB

Charm++> Cray TLB page size: 8192K

Charm++> Running in non-SMP mode: numPes 1

Converse/Charm++ Commit ID: v6.8.0-beta1-299-gcbf50e5

CharmLB> Load balancer assumes all CPUs are same.

Charm++> Running on 1 unique compute nodes (16-way SMP).

libhugetlbfs [nid03790:16907]: WARNING: New heap segment map at 0x10761800000
failed: Cannot allocate memory

libhugetlbfs [nid03790:16907]: WARNING: New heap segment map at 0x10761800000
failed: Cannot allocate memory

libhugetlbfs [nid03790:16907]: WARNING: New heap segment map at 0x10761800000
failed: Cannot allocate memory







THE INSTRUMENTED VERSION FAILS TO RUN SUCCESSFULLY (keeps printing the same
libhugetlbfs warning) AND I AM NOT SURE WHY, SO I HAVE THE FOLLOWING
QUESTIONS:



1. Why do I receive the WARNING about “Collection of accelerator
performance data” when I am not running a CUDA program? Can this version of
Charm++ not run serial codes because it was built with “cuda”?
2. Does this possibly have something to do with NOT building with “smp”?
3. Is the libhugetlbfs warning related to trying to allocate GPU memory?



IF I RUN THE SAME PROGRAM WITHOUT THE craype-hugepages8M MODULE LOADED, I GET
A DIFFERENT RESULT:

[tpapathe@titan-batch8:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm_v6.8.0/charm/gni-crayxe-cuda-perftools/tests/charm++/simplearrayhello]$
aprun -n1 ./hello+pat

CrayPat/X: Version 6.4.5 Revision 87dd5b8 01/23/17 15:37:24

pat[WARNING][0]: Collection of accelerator performance data for sampling
experiments is not supported. To collect accelerator performance data
perform a trace experiment. See the intro_craypat(1) man page on how to
perform a trace experiment.

Charm++> Running on Gemini (GNI) with 1 processes

Charm++> static SMSG

Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0
means no limit)

Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB

Charm++> Cray TLB page size: 2048K

Charm++> Running in non-SMP mode: numPes 1

Converse/Charm++ Commit ID: v6.8.0-beta1-299-gcbf50e5

CharmLB> Load balancer assumes all CPUs are same.

Charm++> Running on 1 unique compute nodes (16-way SMP).

[NID 03790] 2017-07-12 10:53:43 Apid 15067006: OOM killer terminated this
process.

Application 15067006 exit signals: Killed

Application 15067006 resources: utime ~0s, stime ~0s, Rss ~10056, inblocks
~10886, outblocks ~33557
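
One thing worth double-checking here (an assumption on my part, prompted by
the different behavior with and without craype-hugepages8M): the hugepages
module loaded at run time should match the one used when Charm++ was built.
For example:

# Confirm the hugepages module from the build is loaded in the batch session:
module list 2>&1 | grep -i hugepages    # expect craype-hugepages8M
module load craype-hugepages8M          # load it if it is missing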





HAVE I DONE SOMETHING INCORRECTLY IN THE ABOVE STEPS?



SHOULD CRAYPAT BE ABLE TO WORK WITH CHARM++ VERSION 6.8.0?



SHOULD CRAYPAT BE ABLE TO WORK WITH CHARM++ VERSION 6.7.1?



SHOULD CRAYPAT BE ABLE TO WORK WITH EITHER VERSION OF CHARM++ WITH GPU
SUPPORT?



@Choi,
Jaemin<mailto:jchoi157 AT illinois.edu>:
DO YOU NEED TO RUN WITH VERSION 6.8.0?





FYI: I am trying to understand what functionality/compatibility exists to
inform users of Titan at OLCF. Please let me know what you think. Thank you
again for all of your help.



-Tom







On 7/10/17, 3:00 PM, "Choi, Jaemin"
<jchoi157 AT illinois.edu>
wrote:



Running CrayPat on Charm++/GPU codes would be ideal, but it's also okay
if it only works on plain (non-GPU) Charm++ codes.

Thank you.

Jaemin Choi
Ph.D. Student, Research Assistant
Parallel Programming Laboratory
University of Illinois Urbana-Champaign



________________________________________
From: Papatheodore, Thomas L. [papatheodore AT ornl.gov]
Sent: Monday, July 10, 2017 1:52 PM
To: Choi, Jaemin; Miller, Philip B; Ted Packwood
Subject: Re: [charm] CrayPat with Charm++

Ok, the uninstrumented overlapTestStream executable appears to run fine:

[tpapathe@titan-batch3:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm/gni-crayxe-cuda-perftools/examples/charm++/cuda/overlapTestStream]$
aprun -n1 ./overlapTest 4 32

Charm++> Running on Gemini (GNI) with 1 processes
Charm++> static SMSG
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 2048K
Charm++> Running in non-SMP mode: numPes 1
Converse/Charm++ Commit ID: v6.8.0-beta1-287-gd57c83d
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (16-way SMP).
[0] A
[0] 0.39 0.78 0.80 0.91 0.20 0.34 0.77 0.28 0.55 0.48 0.63 0.36 0.51 0.95
0.92 0.64 0.72 0.14 0.61 0.02 0.24 0.14 0.80 0.16 0.40 0.13 0.11 1.00 0.22
0.51 0.84 0.61
[0] 0.30 0.64 0.52 0.49 0.97 0.29 0.77 0.53 0.77 0.40 0.89 0.28 0.35 0.81
0.92 0.07 0.95 0.53 0.09 0.19 0.66 0.89 0.35 0.06 0.02 0.46 0.06 0.24 0.97
0.90 0.85 0.27
[0] 0.54 0.38 0.76 0.51 0.67 0.53 0.04 0.44 0.93 0.93 0.72 0.28 0.74 0.64
0.35 0.69 0.17 0.44 0.88 0.83 0.33 0.23 0.89 0.35 0.69 0.96 0.59 0.66 0.86
0.44 0.92 0.40

[3] 6.74 7.05 7.37 7.52 6.62 6.07 5.54 5.54 7.64 5.72 6.13 7.64 6.26 5.14
6.12 7.84 5.20 7.47 7.48 5.06 5.89 7.30 6.13 4.86 6.60 8.02 8.36 7.72 6.28
6.29 8.15 6.11
[3] 8.37 9.08 8.76 8.03 8.46 8.41 7.22 8.37 8.45 7.55 8.02 9.34 6.94 6.41
6.72 10.32 7.48 9.17 8.81 7.70 7.94 9.09 7.24 6.95 7.30 9.96 9.26 9.55 8.30
8.47 9.12 8.30
Elapsed time: 0.066500 s
EXIT HYBRID API
[Partition 0][Node 0] End of program
Application 15032622 resources: utime ~0s, stime ~2s, Rss ~1143352,
inblocks ~10499, outblocks ~31391

But the instrumented version still does not work correctly:

[tpapathe@titan-ext3:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm/gni-crayxe-cuda-perftools/examples/charm++/cuda/overlapTestStream]$
pat_build -u -Dtrace-text-size=800 ./overlapTest

WARNING: Tracing non-group functions was limited to those 803 - 9318
bytes in size.

INFO: A total of 131 selected non-group functions were traced.

[tpapathe@titan-batch3:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm/gni-crayxe-cuda-perftools/examples/charm++/cuda/overlapTestStream]$
aprun -n1 ./overlapTest+pat 4 32

CrayPat/X: Version 6.4.5 Revision 87dd5b8 01/23/17 15:37:24
Charm++> Running on Gemini (GNI) with 1 processes
Charm++> static SMSG
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 2048K
Charm++> Running in non-SMP mode: numPes 1
Converse/Charm++ Commit ID: v6.8.0-beta1-287-gd57c83d
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (16-way SMP).
pat[WARNING][0]: abort process 12857 because of signal 11
Experiment data file written:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm/gni-crayxe-cuda-perftools/examples/charm++/cuda/overlapTestStream/overlapTest+pat+12857-2343t.xf
_pmiu_daemon(SIGCHLD): [NID 02343] [c6-1c0s3n3] [Mon Jul 10 14:37:14
2017] PE RANK 0 exit signal Segmentation fault
Application 15032635 exit codes: 139
Application 15032635 resources: utime ~0s, stime ~1s, Rss ~137324,
inblocks ~11369, outblocks ~32714

@Choi, Jaemin<mailto:jchoi157 AT illinois.edu>:
Just to be clear, you want to run CrayPat on Charm++/GPU codes, correct?

@Ted Packwood<mailto:malice AT cray.com>:
Should CrayPat be able to run Charm++/GPU codes on a Cray XK7 (e.g. Titan)?


On 7/10/17, 2:21 PM, "Choi, Jaemin" <jchoi157 AT illinois.edu> wrote:

Looks like overlapTestStream needs the number of chares and the matrix size
as command line inputs.

Something like

aprun -n1 ./overlapTest 4 1024

should work.

Also, it might be better to comment out the #define DEBUG in overlapTest.C
if you do not want to see all the matrix values.

Thanks,

Jaemin Choi
Ph.D. Student, Research Assistant
Parallel Programming Laboratory
University of Illinois Urbana-Champaign

________________________________________
From: Papatheodore, Thomas L. [papatheodore AT ornl.gov]
Sent: Monday, July 10, 2017 12:43 PM
To: Choi, Jaemin; Miller, Philip B
Cc: Ted Packwood
Subject: Re: [charm] CrayPat with Charm++

Here is the attempt at compiling and running overlapTestStream:

COMPILE:

[tpapathe@titan-ext3:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm/gni-crayxe-cuda-perftools/examples/charm++/cuda/overlapTestStream]$
make OPTS="-save"

/opt/nvidia/cudatoolkit7.5/7.5.18-1.0502.10743.2.1/bin/nvcc -O3 -c
-use_fast_math -I/usr/local/cuda/include -I../../../../include -o
overlapTestCU.o overlapTest.cu

../../../../bin/charmc -save -language charm++ -o overlapTest
overlapTest.o overlapTestCU.o -L/usr/local/cuda/lib64 -lcuda -lcudart

RUN:

[tpapathe@titan-batch3:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm/gni-crayxe-cuda-perftools/examples/charm++/cuda/overlapTestStream]$
aprun -n1 ./overlapTest

Charm++> Running on Gemini (GNI) with 1 processes
Charm++> static SMSG
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 2048K
Charm++> Running in non-SMP mode: numPes 1
Converse/Charm++ Commit ID: v6.8.0-beta1-287-gd57c83d
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (16-way SMP).

The program just hangs at this point and never recovers…

On 7/10/17, 12:21 PM, "Choi, Jaemin" <jchoi157 AT illinois.edu> wrote:

Looks like there is something wrong with the GPU Manager module in the
runtime.

Could you try the overlapTestStream example program instead while I take a
look at the issue? That uses normal CUDA code instead of the GPU Manager
version.

Jaemin Choi
Ph.D. Student, Research Assistant
Parallel Programming Laboratory
University of Illinois Urbana-Champaign

________________________________________
From: Papatheodore, Thomas L. [papatheodore AT ornl.gov]
Sent: Monday, July 10, 2017 9:39 AM
To: Miller, Philip B
Cc: Ted Packwood; Choi, Jaemin
Subject: Re: [charm] CrayPat with Charm++

Yes, sorry about that. I first ran with instrumentation, then realized I was
crashing without instrumentation. Here is the output I intended to include
for the un-instrumented version:

[tpapathe@titan-login5:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm/gni-crayxe-cuda-perftools/examples/charm++/cuda/vectorAdd]$
aprun -n1 ./vectorAdd

Charm++> Running on Gemini (GNI) with 1 processes
Charm++> static SMSG
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 2048K
Charm++> Running in non-SMP mode: numPes 1
Converse/Charm++ Commit ID: v6.8.0-beta1-287-gd57c83d
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (16-way SMP).
Fatal CUDA Error invalid argument at cuda-hybrid-api.cu:458.
Return value 11 from
'cudaMemcpyAsync(CsvAccess(gpuManager).hostBuffers[index],
CsvAccess(gpuManager).devBuffers[index], size, cudaMemcpyDeviceToHost,
CsvAccess(gpuManager).data_out_stream)'.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Exiting!
VECTOR KERNELVECTOR KERNEL[0] Stack Traceback:
[0:0] [0x200b3295]
[0:1] [0x200ea967]
[0:2] [0x200e95a9]
[0:3] [0x200e9d47]
[0:4] [0x200e9937]
[0:5] [0x200beb63]
[0:6] [0x200bee0d]
[0:7] [0x200bcdba]
[0:8] [0x200066b9]
[0:9] __libc_start_main+0xe6 [0x2aaaaf197c36]
aborting job:
Exiting!
[0:10] [0x20006e3d]
Application 15026215 exit codes: 255
Application 15026215 resources: utime ~0s, stime ~2s, Rss ~1144904,
inblocks ~10503, outblocks ~31257

From: <unmobile AT gmail.com> on behalf of Phil Miller <mille121 AT illinois.edu>
Date: Monday, July 10, 2017 at 10:33 AM
To: "Papatheodore, Thomas L." <papatheodore AT ornl.gov>
Cc: Ted Packwood <malice AT cray.com>, "Choi, Jaemin" <jchoi157 AT illinois.edu>
Subject: Re: [charm] CrayPat with Charm++

On Mon, Jul 10, 2017 at 9:26 AM, Papatheodore, Thomas L.
<papatheodore AT ornl.gov<mailto:papatheodore AT ornl.gov>> wrote:

RUN UN-INSTRUMENTED PROGRAM FROM INTERACTIVE COMPUTE NODE:

[tpapathe@titan-login5:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm/gni-crayxe-cuda-perftools/examples/charm++/cuda/vectorAdd]$
aprun -n1 ./vectorAdd+pat

CrayPat/X: Version 6.4.5 Revision 87dd5b8 01/23/17 15:37:24
Charm++> Running on Gemini (GNI) with 1 processes
Charm++> static SMSG
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 2048K
Charm++> Running in non-SMP mode: numPes 1
Converse/Charm++ Commit ID: v6.8.0-beta1-287-gd57c83d
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (16-way SMP).
pat[WARNING][0]: abort process 11610 because of signal 11
Experiment data file written:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm/gni-crayxe-cuda-perftools/examples/charm++/cuda/vectorAdd/vectorAdd+pat+11610-2351t.xf
_pmiu_daemon(SIGCHLD): [NID 02351] [c6-1c0s7n3] [Mon Jul 10
10:02:29 2017] PE RANK 0 exit signal Segmentation fault
Application 15025560 exit codes: 139
Application 15025560 resources: utime ~0s, stime ~1s, Rss
~138756, inblocks ~11390, outblocks ~32755

[tpapathe@titan-login5:
/lustre/atlas2/csc198/proj-shared/tpapathe/charm/gni-crayxe-cuda-perftools/examples/charm++/cuda/vectorAdd]$

Is there something obvious here that I am doing incorrectly? The hello
example program appears to work correctly, but the CrayPat profiling on it
does not. The vectorAdd example program does not appear to work correctly
even without profiling. If you have any further advice, I would greatly
appreciate it. Thank you for your help.

I can't comment on the CrayPat issues at the moment, but it looks
like the crash in vectorAdd was in the instrumented version.
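
That is, the step above is labeled "RUN UN-INSTRUMENTED PROGRAM" but the
command shown is ./vectorAdd+pat; the truly un-instrumented run would be:

aprun -n1 ./vectorAdd    # no +pat suffix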





































