[charm] performance data


  • From: Noel Keen <ndkeen AT lbl.gov>
  • To: Charm Mailing List <charm AT cs.illinois.edu>
  • Subject: [charm] performance data
  • Date: Tue, 12 Mar 2013 10:41:47 -0700
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Hello,

I'm running the UTS benchmark implemented in Charm++, and I'm trying to get
some performance data on a Cray XE6 (Hopper at NERSC).
I'd like to get a handle on the expense of chare creation and chare
migration.  Across nodes is obviously the more interesting question, but
performance data within a node is still useful, and aggregate numbers are
fine.  I would also need to know the size of the chares, since the cost of
sending 100 small ones would surely be smaller than sending 100 large ones.
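
(As an aside on the size question: assuming the chares already have working
pup() routines -- which they must, for migration to work at all -- I imagine
their packed size could be measured with a PUP::sizer, along the lines of
the sketch below.  The helper name and the idea of calling it from inside an
existing chare method are my own invention, not anything taken from the UTS
code.)

#include "charm++.h"     // I believe this pulls in pup.h
#include "pup_stl.h"     // operator| for STL containers, if the chare uses them

// Hypothetical helper: report the packed (migration) size of anything that
// has a pup() routine, e.g. CkPrintf("%zu bytes\n", packedSizeInBytes(*this))
// from inside a chare method.
template <class T>
size_t packedSizeInBytes(T &obj) {
  PUP::sizer sp;      // counting-only PUP::er: tallies bytes, copies nothing
  obj.pup(sp);
  return sp.size();
}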

I've taken the following paths:

a) I tried using TAU with Charm++.  I was able to get function-level profile
data, but that isn't exactly what I was looking for.

b) I tried using Projections.  From what I've read, I'm not positive that
the data it records is what I need, but it seemed worth a try.  For a 24-way
job it wrote 24 large files.  I was unable to get the tool to display
anything useful, and it failed on several occasions -- even when I copied
the data back to a local Mac (with plenty of memory) and ran Projections
there.
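
(One thing I've been considering, if I can get Projections to behave: as I
read the tracing API, user events can be registered and bracketed so that
only the regions I care about -- e.g. the spots where new chares are spawned
-- stand out in the timeline.  Roughly like the sketch below; the function
names are invented, this is only my reading of the manual, and if I
understand correctly the binary still has to be linked with
-tracemode projections for anything to be recorded.)

#include "charm++.h"

static int creationEvent;          // Projections user-event id

// Register once, e.g. from the mainchare constructor.
void registerTraceEvents() {
  creationEvent = traceRegisterUserEvent("chare creation", -1);
}

// Bracket the region of interest wherever new chares get spawned.
void spawnChildren(/* ...spawn arguments... */) {
  double t0 = CkWallTimer();
  // ... the ckNew()/insert() calls that create new chares go here ...
  double t1 = CkWallTimer();
  traceUserBracketEvent(creationEvent, t0, t1);
}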

c) I tried looking at the source (the UTS benchmark itself).  If I can find
the functions where chare creation and migration happen, then I can measure
the wall-clock time of those functions.  I've had a hard time figuring out
where things are happening -- due largely, I'm sure, to my newness to
Charm++.
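
(My working assumption -- please correct me if it's wrong -- is that
creation shows up in the source as ckNew() calls on a proxy, or as
proxy[idx].insert() for dynamic insertion into a chare array, while
migration goes through each element's pup() routine.  For the migration
side specifically, I was thinking of instrumenting the per-element hooks I
found in the manual, roughly as below.  The class is a made-up stand-in for
a UTS chare, I'm not certain whether the base-class versions of the hooks
also need to be called, and the two timestamps are taken on different
hosts, so this would only be a rough latency.)

#include "charm++.h"

// Hypothetical array element standing in for a UTS chare.
class TreeNode : public CBase_TreeNode {
  double migStart;                       // carried along in pup()
 public:
  TreeNode() : migStart(0.0) {}
  TreeNode(CkMigrateMessage *m) : migStart(0.0) {}

  void pup(PUP::er &p) {
    CBase_TreeNode::pup(p);              // base-class state
    p | migStart;
    /* ... plus the real state ... */
  }

  void ckAboutToMigrate() { migStart = CkWallTimer(); }   // on the source PE
  void ckJustMigrated() {                                 // on the destination PE
    CkPrintf("[%d] migration latency ~ %g s\n",
             CkMyPe(), CkWallTimer() - migStart);
  }
};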

d) In the help message displayed with -help, I discovered the +cs option,
which printed some very useful information.

For example, a 24-way run (one 24-way SMP node) resulted in:

Charm Kernel Summary Statistics:
Proc 0: [784059 created, 788045 processed]
Proc 1: [790225 created, 788017 processed]
Proc 2: [791345 created, 787951 processed]
Proc 3: [784477 created, 788034 processed]
Proc 4: [783329 created, 787978 processed]
Proc 5: [790449 created, 787968 processed]
Proc 6: [792147 created, 787997 processed]
Proc 7: [783082 created, 788036 processed]
Proc 8: [790157 created, 787981 processed]
Proc 9: [790956 created, 788058 processed]
Proc 10: [785116 created, 788028 processed]
Proc 11: [781910 created, 788038 processed]
Proc 12: [781778 created, 787898 processed]
Proc 13: [799731 created, 787939 processed]
Proc 14: [787941 created, 787940 processed]
Proc 15: [791902 created, 787972 processed]
Proc 16: [789399 created, 788106 processed]
Proc 17: [791508 created, 787987 processed]
Proc 18: [788229 created, 787943 processed]
Proc 19: [776874 created, 787929 processed]
Proc 20: [787656 created, 788026 processed]
Proc 21: [794203 created, 787929 processed]
Proc 22: [791115 created, 788010 processed]
Proc 23: [784265 created, 788043 processed]
Total Chares: [18911853 created, 18911853 processed]

I take this to mean that processor 0 created 784059 chares and processed 788045.
Obviously there is some chare migration happening as chares
are being created on some procs and processed on others.  


I also found the +stats option, but this appears to be available only on
Ethernet networks(?).  For a small 2-way problem on my local Mac, I see the
following type of data for each "node" (processor):

***********************************
Net Statistics For Node 0
Interrupts: 0   Processed: 0
Total Msgs Sent: 3464   Total Bytes Sent: 553360
Total Msgs Recv: 3364   Total Bytes Recv: 538060
***********************************
[Num]   SENDTO  RESEND  RECV    ACKSTO  ACKSFRM PKTACK
=====   ======  ======  ====    ======  ======= ======
[0]     0       0       0       0       0       0
[1]     3464    1       3366    673     693     3462
[TOTAL] 3464    1       3366    673     693     3462
***********************************

Here we have the bytes sent/received, so for a large enough (and
well-understood) problem we could derive the average size of a chare --
in this tiny run, for example, 553360 bytes / 3464 messages works out to
roughly 160 bytes per message.

I don't see this option on the Cray version, but I do see +print_stats, which produces a text file for every processor.
An example of data written to this file for processor 0:

hopper01% mo counters/statistics.24.0 
Node[0] SMSG time in buffer     [total:291.920375       max:0.113420    Average:0.000388](milisecond)
Node[0] Smsg  Msgs      [Total:751635    Data:751635     Lmsg_Init:0     ACK:0   BIG_MSG_ACK:0 Direct_put_done:0         Persistent_put_done:0]
Node[0] SmsgSendCalls   [Total:751635    Data:751635     Lmsg_Init:0     ACK:0   BIG_MSG_ACK:0 Direct_put_done:0         Persistent_put_done:0]

Node[0] Rdma Transaction [count (GET/PUT):0 0    calls (GET/PUT):0 0]
Node[0] Rdma time from control arrives to rdma init [Total:0.000000     MAX:0.000000     Average:-nan](milisecond)
Node[0] Rdma time from init to rdma done [Total:0.000000        MAX:0.000000     Average:-nan](milisecond)

                             count      total(s)        max(s)  average(us)
PumpNetworkSmsg:              1470537   0.649746        0.000287        0.441843
PumpRemoteTransactions:       1470537   0.280214        0.000005        0.190552
PumpLocalTransactions(RDMA):  0 0.000000        0.000000        -nan
SendBufferMsg (SMSG):         1470537   0.276000        0.000011        0.187686
SendRdmaMsg:                  1470537   0.273049        0.000005        0.185680


I'm still trying to make sense of what this is telling me.

Can you offer any suggestions?  Am I missing something obvious, or is this
considered a research question that depends on the problem, its size, the
machine, and so on?


much thanks

Noel


