[charm] performance data


  • From: Noel Keen <ndkeen AT lbl.gov>
  • To: Charm Mailing List <charm AT cs.illinois.edu>
  • Subject: [charm] performance data
  • Date: Tue, 12 Mar 2013 10:41:47 -0700
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Hello,

I'm running the UTS benchmark implemented in Charm++, and I'm trying to get
some performance data on a Cray XE6 (Hopper at NERSC).
I'd like to get a handle on the expense of chare creation and chare
migration.  Across nodes is obviously the more interesting question, but
performance data within a node is still useful, and aggregate numbers are
fine.  I would also need to know the size of the chares, since the cost of
sending 100 small ones would surely be smaller than sending 100 large ones.
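
(As an aside on the size question: assuming the chares already have working
pup() routines -- which they must, for migration to work at all -- I imagine
their packed size could be measured with a PUP::sizer, along the lines of
the sketch below.  The helper name and the idea of calling it from inside an
existing chare method are my own invention, not anything taken from the UTS
code.)

#include "charm++.h"     // I believe this pulls in pup.h
#include "pup_stl.h"     // operator| for STL containers, if the chare uses them

// Hypothetical helper: report the packed (migration) size of anything that
// has a pup() routine, e.g. CkPrintf("%zu bytes\n", packedSizeInBytes(*this))
// from inside a chare method.
template <class T>
size_t packedSizeInBytes(T &obj) {
  PUP::sizer sp;      // counting-only PUP::er: tallies bytes, copies nothing
  obj.pup(sp);
  return sp.size();
}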

I've taken the following paths:

a) I tried using TAU with Charm++.  I was able to get function-level profile
data, but that isn't exactly what I was looking for.

b) I tried using Projections.  From what I've read, I'm not positive that
the data it records is what I need, but it seemed worth a try.  For a 24-way
job it wrote 24 large files.  I was unable to get the tool to display
anything useful, and it failed on several occasions -- even when I copied
the data back to a local Mac (with plenty of memory) and ran Projections
there.
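
(One thing I've been considering, if I can get Projections to behave: as I
read the tracing API, user events can be registered and bracketed so that
only the regions I care about -- e.g. the spots where new chares are spawned
-- stand out in the timeline.  Roughly like the sketch below; the function
names are invented, this is only my reading of the manual, and if I
understand correctly the binary still has to be linked with
-tracemode projections for anything to be recorded.)

#include "charm++.h"

static int creationEvent;          // Projections user-event id

// Register once, e.g. from the mainchare constructor.
void registerTraceEvents() {
  creationEvent = traceRegisterUserEvent("chare creation", -1);
}

// Bracket the region of interest wherever new chares get spawned.
void spawnChildren(/* ...spawn arguments... */) {
  double t0 = CkWallTimer();
  // ... the ckNew()/insert() calls that create new chares go here ...
  double t1 = CkWallTimer();
  traceUserBracketEvent(creationEvent, t0, t1);
}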

c) I tried looking at the source (the UTS benchmark itself).  If I can find
the functions where chare creation and migration happen, then I can measure
the wall-clock time of those functions.  I've had a hard time figuring out
where things are happening -- due largely, I'm sure, to my newness to
Charm++.
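
(My working assumption -- please correct me if it's wrong -- is that
creation shows up in the source as ckNew() calls on a proxy, or as
proxy[idx].insert() for dynamic insertion into a chare array, while
migration goes through each element's pup() routine.  For the migration
side specifically, I was thinking of instrumenting the per-element hooks I
found in the manual, roughly as below.  The class is a made-up stand-in for
a UTS chare, I'm not certain whether the base-class versions of the hooks
also need to be called, and the two timestamps are taken on different
hosts, so this would only be a rough latency.)

#include "charm++.h"

// Hypothetical array element standing in for a UTS chare.
class TreeNode : public CBase_TreeNode {
  double migStart;                       // carried along in pup()
 public:
  TreeNode() : migStart(0.0) {}
  TreeNode(CkMigrateMessage *m) : migStart(0.0) {}

  void pup(PUP::er &p) {
    CBase_TreeNode::pup(p);              // base-class state
    p | migStart;
    /* ... plus the real state ... */
  }

  void ckAboutToMigrate() { migStart = CkWallTimer(); }   // on the source PE
  void ckJustMigrated() {                                 // on the destination PE
    CkPrintf("[%d] migration latency ~ %g s\n",
             CkMyPe(), CkWallTimer() - migStart);
  }
};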

d) In the help message displayed with -help, I discovered the +cs option,
which printed some very useful information.

For example, a 24-way run (one 24-way SMP node) resulted in:

Charm Kernel Summary Statistics:
Proc 0: [784059 created, 788045 processed]
Proc 1: [790225 created, 788017 processed]
Proc 2: [791345 created, 787951 processed]
Proc 3: [784477 created, 788034 processed]
Proc 4: [783329 created, 787978 processed]
Proc 5: [790449 created, 787968 processed]
Proc 6: [792147 created, 787997 processed]
Proc 7: [783082 created, 788036 processed]
Proc 8: [790157 created, 787981 processed]
Proc 9: [790956 created, 788058 processed]
Proc 10: [785116 created, 788028 processed]
Proc 11: [781910 created, 788038 processed]
Proc 12: [781778 created, 787898 processed]
Proc 13: [799731 created, 787939 processed]
Proc 14: [787941 created, 787940 processed]
Proc 15: [791902 created, 787972 processed]
Proc 16: [789399 created, 788106 processed]
Proc 17: [791508 created, 787987 processed]
Proc 18: [788229 created, 787943 processed]
Proc 19: [776874 created, 787929 processed]
Proc 20: [787656 created, 788026 processed]
Proc 21: [794203 created, 787929 processed]
Proc 22: [791115 created, 788010 processed]
Proc 23: [784265 created, 788043 processed]
Total Chares: [18911853 created, 18911853 processed]

I take this to mean that processor 0 created 784059 chares and processed 788045.
Obviously there is some chare migration happening as chares
are being created on some procs and processed on others.  


I also found the +stats option, but this appears to be available only on
Ethernet networks(?).  For a small 2-way problem on my local Mac, I see the
following type of data for each "node" (processor):

***********************************
Net Statistics For Node 0
Interrupts: 0   Processed: 0
Total Msgs Sent: 3464   Total Bytes Sent: 553360
Total Msgs Recv: 3364   Total Bytes Recv: 538060
***********************************
[Num]   SENDTO  RESEND  RECV    ACKSTO  ACKSFRM PKTACK
=====   ======  ======  ====    ======  ======= ======
[0]     0       0       0       0       0       0
[1]     3464    1       3366    673     693     3462
[TOTAL] 3464    1       3366    673     693     3462
***********************************

Here we have the bytes sent/received, so for a large enough (and
well-understood) problem we could derive the average size of a chare --
in this tiny run, for example, 553360 bytes / 3464 messages works out to
roughly 160 bytes per message.

I don't see this option on the Cray version, but I do see +print_stats, which produces a text file for every processor.
An example of data written to this file for processor 0:

hopper01% mo counters/statistics.24.0 
Node[0] SMSG time in buffer     [total:291.920375       max:0.113420    Average:0.000388](milisecond)
Node[0] Smsg  Msgs      [Total:751635    Data:751635     Lmsg_Init:0     ACK:0   BIG_MSG_ACK:0 Direct_put_done:0         Persistent_put_done:0]
Node[0] SmsgSendCalls   [Total:751635    Data:751635     Lmsg_Init:0     ACK:0   BIG_MSG_ACK:0 Direct_put_done:0         Persistent_put_done:0]

Node[0] Rdma Transaction [count (GET/PUT):0 0    calls (GET/PUT):0 0]
Node[0] Rdma time from control arrives to rdma init [Total:0.000000     MAX:0.000000     Average:-nan](milisecond)
Node[0] Rdma time from init to rdma done [Total:0.000000        MAX:0.000000     Average:-nan](milisecond)

                             count      total(s)        max(s)  average(us)
PumpNetworkSmsg:              1470537   0.649746        0.000287        0.441843
PumpRemoteTransactions:       1470537   0.280214        0.000005        0.190552
PumpLocalTransactions(RDMA):  0 0.000000        0.000000        -nan
SendBufferMsg (SMSG):         1470537   0.276000        0.000011        0.187686
SendRdmaMsg:                  1470537   0.273049        0.000005        0.185680


I'm still trying to make sense of what this is telling me.

Can you offer any suggestions?  Am I missing something obvious, or is this
considered a research question that depends on the problem, its size, the
machine, and so on?


much thanks

Noel


