
charm - Re: [charm] Scalability issues using large chare array



  • From: Steve Petruzza <spetruzza AT sci.utah.edu>
  • To: Phil Miller <mille121 AT illinois.edu>
  • Cc: charm <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Scalability issues using large chare array
  • Date: Tue, 2 Aug 2016 16:09:58 +0300

Thanks Phil,
the thread/process allocation parameters for Cray machines are clear now.

Unfortunately, the different combinations, unsurprisingly, did not solve the crash I am getting, but I can say that more processes and fewer threads perform better in my application.

About the application, here is a simplified version of the .ci:

mainmodule charm_dataflow {
  readonly CProxy_Main mainProxy;
  
  mainchare Main {
    entry Main(CkArgMsg *m);
    entry void done(int id);
  };

  array [1D] GenericTask {
    entry GenericTask(void);
    entry void initialize(TaskInfo info);
    entry void addInput(TaskId source, buffer input);
  };          
};

Here TaskId is a uint32_t, buffer is a vector<char>, and TaskInfo is a small structure containing some metadata.

Basically the Main chare, in its constructor, creates an array of GenericTask as follows:
CProxy_GenericTask  all_tasks = CProxy_GenericTask::ckNew(N);

where N could vary from 6K to 1M, using a proportional number of cores for the execution.
These tasks are originally defined in a graph that contains all the information necessary for their execution (e.g. how to process the input and to which task to eventually send the output).

After declaring them with ckNew (which I assume is lightweight), I call the initialization function on all of them to set some metadata (passing a TaskInfo struct), so that every chare object knows what to do when it receives its inputs (via addInput()).
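
For reference, a rough sketch of this setup on the Main side (task_graph and infoFor are just placeholders for my actual graph lookup):

Main::Main(CkArgMsg *m) {
  mainProxy = thisProxy;
  all_tasks = CProxy_GenericTask::ckNew(N);    // create the N elements
  for (int i = 0; i < N; i++) {
    TaskInfo info = task_graph.infoFor(i);     // placeholder: metadata from the graph
    all_tasks[i].initialize(info);             // one message per element
  }
}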

On a subset of these chare objects I add the initial inputs by calling addInput:
all_tasks[task_index].addInput(main_id, buf);

After this point every chare will be able to perform some operations and produce a result for another chare, based on the TaskInfo and the input received. The task that receives the output will be, of course, in the same global proxy (thisProxy[output_task_id].addInput(outgoing_id, out_buf);).
At some point there will be some leaf tasks that will not call any other tasks, and the application will terminate (from the done function of the Main chare).
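
In rough pseudocode, each element behaves like the following (the TaskInfo fields num_inputs, is_leaf and output_task_id, and the process() call, are simplified stand-ins for my actual metadata and processing):

void GenericTask::addInput(TaskId source, buffer input) {
  inputs[source] = input;                          // collect incoming buffers
  if (inputs.size() < info.num_inputs) return;     // still waiting for inputs

  buffer out_buf = process(inputs);                // do this task's work
  if (info.is_leaf)
    mainProxy.done(thisIndex);                     // leaf task: report to Main
  else
    thisProxy[info.output_task_id].addInput(thisIndex, out_buf);
}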

So far this approach scales well up to 8K cores and an array of 60K chare tasks, but when I double these two numbers I get a segfault after some time of execution.
One potential problem is that most of these chare tasks will not be active simultaneously, but only when they receive their input. I expect that less than 30% of the tasks will run at the same time. Can all these idle tasks produce overhead for the RTS?

A completely opposite approach would be to create the tasks on the fly when necessary (i.e. as non-array chare objects), but I have to ensure that any task can call a function of any other created chare object (knowing its ID). Is there any way to do such a thing? That is, to query a global proxy of the existing chare objects, to which I can dynamically add and remove chare objects and on which I can call functions, instead of creating all of them at the beginning on PE 0 (as I am doing now).
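
Ideally I imagine a usage like the following (only a sketch of what I have in mind; I do not know whether dynamic insertion into an array proxy like this is actually supported):

// create an empty array up front, insert elements only when they are needed
CProxy_GenericTask all_tasks = CProxy_GenericTask::ckNew();

// later, from any chare that knows the index of the consumer task:
all_tasks[output_task_id].insert(/* constructor args */);   // create on demand
all_tasks[output_task_id].addInput(outgoing_id, out_buf);
all_tasks.doneInserting();            // once a round of insertions is complete

// and an element could remove itself when its work is done:
// thisProxy[thisIndex].ckDestroy();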

I tried to profile a working run at 1K cores to look at the memory usage, but unfortunately Projections gives me an empty list for “Maximum Memory Usage”. How can I check this? (Should I use the system/machine profiling?) Do I need to enable any flag? The rest of the Projections tools work fine for me.
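
As a workaround I was thinking of printing the memory usage manually from the application, something like the sketch below (assuming CmiMemoryUsage() is available in my build and reports the bytes currently allocated on the calling PE; reportMemory is a hypothetical helper):

void GenericTask::reportMemory() {
  CkPrintf("task %d on PE %d: %.1f MB in use\n",
           thisIndex, CkMyPe(),
           CmiMemoryUsage() / (1024.0 * 1024.0));
}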

Thank you,
Steve




On 01 Aug 2016, at 20:21, Phil Miller <mille121 AT illinois.edu> wrote:

And here's the response to the other part of your message

On Mon, Aug 1, 2016 at 7:44 AM, Steve Petruzza <spetruzza AT sci.utah.edu> wrote:
In my application the main chare creates a single chare array containing thousands of chare tasks that will eventually execute some work and communicate with each other (not all simultaneously).
 
By the way if I run some stats I see something like the following:

Charm Kernel Summary Statistics:
Proc 0: [11 created, 11 processed]
Proc 1: [0 created, 0 processed]
Proc 2: [0 created, 0 processed]
Proc 3: [0 created, 0 processed]

… all the others 0,0

Charm Kernel Detailed Statistics (R=requested P=processed):

         Create    Mesgs     Create    Mesgs     Create    Mesgs
         Chare     for       Group     for       Nodegroup for
PE   R/P Mesgs     Chares    Mesgs     Groups    Mesgs     Nodegroups
---- --- --------- --------- --------- --------- --------- ----------
   0  R         11         0        14         1         8      1024
      P         11      7732        14         2         8         0
   1  R          0         0         0         1         0         0
      P          0         0        14         2         0         1
   2  R          0         0         0         2         0         0
      P          0         0        14         3         0         0
   3  R          0         0         0         2         0         0
      P          0         0        14         3         0         0

… all the others like PE 1,2,3…

Is PE 0 processing all the messages? Why? This does not look scalable.
In fact, when I go over 120K chares it crashes with a segfault (_pmiu_daemon(SIGCHLD): [NID 16939] [c5-0c2s5n1] [Mon Aug  1 03:12:58 2016] PE RANK 975 exit signal Segmentation fault).

Am I building or running improperly?
How can I make sure that the chares are spread across more nodes and procs, in order to avoid excessive memory allocation on a few nodes?
Is there any strong coupling between the chare that creates a chare array and the elements' actual execution nodes/procs? If I create several smaller chare arrays in the main chare at different execution times, instead of one large array at the beginning, could that change anything?

To really understand what's going on, it would be helpful to see a bit of your code (in particular, the outline of the .ci file(s) and the methods that are calling ckNew to instantiate additional objects). From what you've written already, here's what I'm comfortable saying.

A chare array is an enumerated collection of chare objects. Instances of a chare array inherently have their elements distributed over the many PEs in a parallel job. The baseline distribution of these objects can be changed from its default by providing an 'array map' from element index to home PE. The dynamic distribution of these objects is adjusted by the periodic dynamic load balancing strategies that users can request.
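
For illustration, a custom map is a group that inherits from CkArrayMap and overrides procNum(); a minimal sketch (BlockMap and BLOCK are only example names, and the group must also be declared in your .ci file):

class BlockMap : public CkArrayMap {
public:
  BlockMap() {}
  int procNum(int /*arrayHdl*/, const CkArrayIndex &idx) {
    const int BLOCK = 16;                  // example blocking factor
    int elem = *(int *)idx.data();         // 1D element index
    return (elem / BLOCK) % CkNumPes();    // place elements block-wise
  }
};

// when creating the array:
CkArrayOptions opts(N);
opts.setMap(CProxy_BlockMap::ckNew());
CProxy_GenericTask tasks = CProxy_GenericTask::ckNew(opts);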

Non-array chare objects are created as 'seed' messages. The recipient of that message is either specified in the ckNew call, or defaults to the PE running the code making the call. These seed messages can be redistributed by 'seed balancing' strategies that users can choose to link. By default, no such strategy is enabled, because it would create overhead for the many applications that don't use it.
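
For example, for a plain (non-array) chare the generated ckNew accepts an optional target PE as its last argument (Worker here is just a placeholder chare name):

CProxy_Worker::ckNew();         // seed handled on the calling PE by default
CProxy_Worker::ckNew(somePe);   // seed explicitly sent to PE 'somePe'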

It's not clear whether the issue here is one of application design, or system usage. Perhaps the chare array elements should be called upon to work in parallel, without spawning ancillary chare objects beyond them. Or perhaps the chare objects are a sensible part of the design, and having them all created on PE 0 is a serial bottleneck that needs to be spread out.

Please follow up with more details, so we can help you further.

Phil



