
charm - Re: [charm] Scalability issues using large chare array



  • From: Steve Petruzza <spetruzza AT sci.utah.edu>
  • To: Phil Miller <mille121 AT illinois.edu>
  • Cc: charm <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Scalability issues using large chare array
  • Date: Thu, 4 Aug 2016 19:37:49 +0300

Thanks Phil for this very useful and interesting answer. 

I removed the loops in the main chare that were first calling the initialization and then setting the input. I put everything into some global variables that are small enough, and now all the chare objects in the array access this global data and do their initialization in the constructor. Does this sound better? 
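For reference, a minimal sketch of what this looks like with a Charm++ readonly (which is how I understood your suggestion); TaskGraphInfo, buildTaskGraph(), and infoFor() are just placeholder names, not the real ones:

// .ci additions (sketch):
//   readonly TaskGraphInfo globalTaskInfo;

#include "charm_dataflow.decl.h"

TaskGraphInfo globalTaskInfo;              // broadcast to every PE by the RTS

class Main : public CBase_Main {
public:
  Main(CkArgMsg *m) {
    globalTaskInfo = buildTaskGraph();     // small, PUPable metadata, set once on PE 0
    const int N = 60000;                   // example element count
    CProxy_GenericTask all_tasks = CProxy_GenericTask::ckNew(N);
    // no per-element initialize() loop in the main chare any more
  }
};

class GenericTask : public CBase_GenericTask {
  TaskInfo myInfo;
public:
  GenericTask() {
    // each element pulls its own metadata out of the replicated global data
    myInfo = globalTaskInfo.infoFor(thisIndex);
  }
  GenericTask(CkMigrateMessage *m) {}
};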

I agree with those suggestions, they make total sense to me, but unfortunately the end result did not change and I still get a segfault after the “initialization” phase. I tried to debug from a core dump produced using:
export MPICH_NEMESIS_NETMOD=tcp
ulimit -c unlimited

but when I open it with gdb the dump seems to contain info for only one thread, which is very fishy.
Is there a better way to investigate the issue? 

Just to add one more piece of info: during the initialization phase each task loads a chunk of data (~400KB at the scale where I get the error) that it then exchanges with other tasks at some point.

The application is sensitive to corrupted data, so basically if a message is corrupted it will most probably segfault. Could it happen that the RTS truncates or corrupts some data at this scale?
Otherwise, if it is a memory problem, how can I check the memory usage? I tried with Projections but it gives me an empty list for “Maximum Memory Usage”. Should I set any flag to profile the memory?
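As a stopgap outside Projections, one rough way to spot-check per-PE heap usage is the Converse call CmiMemoryUsage(), e.g. from a small group like the sketch below (MemReporter is a made-up name, and the reported number is only an approximation):

// .ci (sketch):
//   group MemReporter {
//     entry MemReporter();
//     entry void report(int phase);
//   };

#include "memreporter.decl.h"    // generated from the sketch above

class MemReporter : public CBase_MemReporter {
public:
  MemReporter() {}
  void report(int phase) {
    // CmiMemoryUsage() returns this process's current heap usage in bytes
    CkPrintf("[PE %d] phase %d: %.1f MB in use\n",
             CkMyPe(), phase, CmiMemoryUsage() / (1024.0 * 1024.0));
  }
};

#include "memreporter.def.h"

// Broadcast memProxy.report(phase) at the phase boundaries of interest.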

As for your curiosity: the application is based on a dataflow already defined in terms of a graph and routines, so I am trying to scale it up with Charm++ without rewriting it “from scratch”. So far it works well up to 8K cores.

Thanks again,
Steve




On 04 Aug 2016, at 05:53, Phil Miller <mille121 AT illinois.edu> wrote:

Hi Steve,

Thanks for the detailed follow-up. I'm glad the process launch issue is straightened out. We've got some new code coming down the pipeline that should eventually handle that automatically.

Before digging into the deeper issues, the answer to your inquiry "After the declaration of those with the ckNew (that I assume is lightweight)": yes, bulk construction of chare array elements is an efficient distributed operation.

Your description of the application design gives me a pretty strong intuition about what might be going wrong. Please confirm or refute my assumption: All of the calls to GenericTask::initialize are being made in a loop over the N objects from your main chare.

If that assumption is true, then the serial bottleneck on PE 0 can easily enough lead to resource exhaustion in the runtime's network buffers for sends and receipt acknowledgements, causing the crashes you've seen. Even if we were to re-engineer the network layer to be more robust, this would still be a serious scalability pitfall in the application.

Generically, the solution to this bottleneck is to distribute the work so that many processors contribute to it in parallel. There are at least a few different ways to accomplish that:
- Make the data structure from which the mainchare determines what to pass to GenericTask::initialize available to all of the chare array elements. You could do this by assigning it to a 'readonly' variable that the runtime system will distribute to every PE before it starts constructing array elements. The readonly approach is nice if the graph is useful to keep available after initialization. You could also do this by broadcasting the graph structure to the chare array.
- If the structure is expected to be large enough that its full memory footprint being replicated on each node might be troublesome, you could instead create a chare group in which the instance object on each PE would read a corresponding section of the input task graph, and then make the corresponding initialize calls. This would be a fully scalable distributed approach (a rough sketch follows below).
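A rough sketch of this group-based approach, assuming the graph can be read in slices (GraphLoader, TaskGraphSlice, readMySlice(), and infoFor() are all hypothetical names):

// .ci (sketch):
//   group GraphLoader {
//     entry GraphLoader(CProxy_GenericTask tasks, int totalTasks);
//   };

#include <algorithm>

class GraphLoader : public CBase_GraphLoader {
public:
  GraphLoader(CProxy_GenericTask tasks, int totalTasks) {
    // Each PE handles a contiguous slice of the task graph.
    int perPE = (totalTasks + CkNumPes() - 1) / CkNumPes();
    int begin = CkMyPe() * perPE;
    int end   = std::min(begin + perPE, totalTasks);

    TaskGraphSlice slice = readMySlice(begin, end);  // local read of this PE's slice
    for (int t = begin; t < end; ++t)
      tasks[t].initialize(slice.infoFor(t));         // initialize() sends now originate on every PE
  }
};

// Created once from the main chare:
//   CProxy_GraphLoader::ckNew(all_tasks, N);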

Separately, your remark "where N could vary from 6K to 1M using a proportional number of cores for the execution" runs slightly against what we've found to be best practice for obtaining optimal performance in Charm++ programs. Generally, the principle that seems to work is that task grain size should be selected to obtain high sequential efficiency (e.g. small enough to be cache friendly, large enough to cover overheads, and ideally at some measured sweet spot), and then the count of objects should consequently result from dividing the problem as a whole into pieces of that size. Until approaching strong scaling limits, the number of processors in use needn't play a significant role. That said, if your current design ends up working well for you, don't let me steer you toward extraneous work just for the sake of 'purity' or whatever.
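As a toy illustration of that sizing rule (every number below is invented for the example), the chare count falls out of the measured grain size rather than the core count:

#include <cstdio>

int main() {
  const double totalWorkMB = 24000.0;   // whole-problem data footprint (made up)
  const double grainMB     = 0.4;       // ~400 KB per task, the measured sweet spot
  const int numChares      = (int)(totalWorkMB / grainMB);   // 60000 chares
  // The same 60000 chares would be used on 8K or 16K cores; only near the
  // strong-scaling limit does the core count force the grain size down.
  std::printf("create %d chares\n", numChares);
  return 0;
}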

As to your "complete opposite approach" of creating tasks on the fly: yes, chare arrays can work that way, too, using the 'demand creation' feature. Basically, certain entry methods can be marked as signalling to the runtime that calls to them should spur default construction of any recipient elements that don't already exist. I'm not sure it would be that helpful for the application design you've described. Regardless, more details can be found in the Charm++ manual in the chapter on 'More Chare Array Features' (http://charm.cs.illinois.edu/manuals/html/charm++/13.html).
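In .ci terms, the marking looks roughly like this (written from memory, so check the linked manual chapter for the authoritative syntax):

array [1D] GenericTask {
  entry GenericTask(void);
  entry void initialize(TaskInfo info);
  // [createhome] default-constructs a missing recipient element on its home PE
  // before delivering the message; [createhere] constructs it on the sending PE.
  entry [createhome] void addInput(TaskId source, buffer input);
};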

Finally, stepping back from all of that, I'm somewhat curious - where does the task graph in this story come from? Explicitly described and represented task graphs seem to appear in many other emerging dynamic parallel programming models, but they've never played a substantial role in any Charm++ application I'm aware of. Rather, objects have been designed to represent elements of the problem decomposition, and individual tasks they need to perform over their lifetime get encoded into their compiled control flow, driven by messages as they arrive. This isn't at all meant to criticize your design. I genuinely want to learn what makes this structure so compelling, versus "the way we've always done things".

Phil

On Tue, Aug 2, 2016 at 8:09 AM, Steve Petruzza <spetruzza AT sci.utah.edu> wrote:
Thanks Phil,
The threads/procs allocation parameters for Cray machines are clear now. 

Unfortunately the different combinations, unsurprisingly, did not solve the crash that I am getting, but I can say that more processes and fewer threads perform better in my application.

About the application, here is a simplified version of the .ci:

mainmodule charm_dataflow {
  readonly CProxy_Main mainProxy;
  
  mainchare Main {
    entry Main(CkArgMsg *m);
    entry void done(int id);
  };

  array [1D] GenericTask {
    entry GenericTask(void);
    entry void initialize(TaskInfo info);
    entry void addInput(TaskId source, buffer input);
  };          
};

Here TaskId is an unsigned 32-bit integer, buffer is a vector<char>, and TaskInfo is a small structure that contains some metadata.

Basically the Main chare, in its Main constructor, creates an array of GenericTask as follows:
CProxy_GenericTask  all_tasks = CProxy_GenericTask::ckNew(N);

where N could vary from 6K to 1M using a proportional number of cores for the execution.
These tasks are originally defined in a graph that contains all the information necessary for their execution (e.g. how to process the input and to which task to eventually send the output).

After declaring them with ckNew (which I assume is lightweight), I call the initialization function on all of them to set some metadata (passing a TaskInfo struct), so that all the chare objects know what to do when they receive their inputs (via addInput()). 
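In rough form that initialization step looks like this (taskInfoFor() just stands in for the actual graph lookup):

for (int i = 0; i < N; i++)
  all_tasks[i].initialize(taskInfoFor(i));   // every one of these sends originates on PE 0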

On a subset of these chare objects I add the initial inputs by calling addInput:
all_tasks[task_index].addInput(main_id, buf);

After this point every chare will be able to perform some operations and produce a result for another chare based on the TaskInfo and the input received. The task that receives the output will be, of course, in the same global proxy (thisProxy[output_task_id].addInput(outgoing_id, out_buf);).
At some point there will be some leaf tasks that do not call any other tasks, and the application will terminate (from the done function of the Main chare).

So far this approach scales well up to 8K cores and an array of 60K chare tasks, but when I double these two numbers I get a segfault after some time of execution.
One potential issue is that most of these chare tasks will not be active simultaneously, only when they receive their input; I expect that less than 30% of the tasks will run at the same time. Can all these idle tasks produce overhead for the RTS?

A completely opposite approach could be to create the tasks on the fly when necessary (i.e., non-array chare objects), but I would have to ensure that any task can call a function of any other created chare object (knowing its ID). Is there any way to do such a thing, i.e., to query a global proxy of the existing chare objects where I can dynamically add and remove chare objects and call a function on them, instead of creating all of them at the beginning on PE 0 (as I am doing now)?

I tried to profile a working run at 1K cores to look at the memory usage, but unfortunately Projections gives me an empty list for “Maximum Memory Usage”. How can I check on this (using the system/machine profiling?), and should I enable any flag? The rest of the Projections tools work fine for me.

Thank you,
Steve




On 01 Aug 2016, at 20:21, Phil Miller <mille121 AT illinois.edu> wrote:

And here's the response to the other part of your message

On Mon, Aug 1, 2016 at 7:44 AM, Steve Petruzza <spetruzza AT sci.utah.edu> wrote:
In my application the main chare creates a single chare array with thousands of chare tasks, which will eventually execute some work and communicate among themselves (not all simultaneously).
 
By the way if I run some stats I see something like the following:

Charm Kernel Summary Statistics:
Proc 0: [11 created, 11 processed]
Proc 1: [0 created, 0 processed]
Proc 2: [0 created, 0 processed]
Proc 3: [0 created, 0 processed]

… all the others 0,0

Charm Kernel Detailed Statistics (R=requested P=processed):

         Create    Mesgs     Create    Mesgs     Create    Mesgs
         Chare     for       Group     for       Nodegroup for
PE   R/P Mesgs     Chares    Mesgs     Groups    Mesgs     Nodegroups
---- --- --------- --------- --------- --------- --------- ----------
   0  R         11         0        14         1         8      1024
      P         11      7732        14         2         8         0
   1  R          0         0         0         1         0         0
      P          0         0        14         2         0         1
   2  R          0         0         0         2         0         0
      P          0         0        14         3         0         0
   3  R          0         0         0         2         0         0
      P          0         0        14         3         0         0

… all the others like PE 1,2,3…

Is processor 0 processing all the messages? Why? This does not look scalable. 
In fact, when I go over 120K chares it crashes with a segfault (_pmiu_daemon(SIGCHLD): [NID 16939] [c5-0c2s5n1] [Mon Aug  1 03:12:58 2016] PE RANK 975 exit signal Segmentation fault).

Am I building or running improperly?
How can I make sure that the chares are spread over more nodes and procs, to avoid excessive memory allocation on a few nodes? 
Is there any strong coupling between the chare that creates a chare array and the elements' actual execution nodes/procs? If I create several smaller chare arrays in the main chare at different points in the execution, instead of one large one at the beginning, could that change anything?

To really understand what's going on, it would be helpful to see a bit of your code (in particular, the outline of the .ci file(s) and the methods that are calling ckNew to instantiate additional objects). From what you've written already, here's what I'm comfortable saying.

A chare array is an enumerated collection of chare objects. Instances of a chare array inherently have their elements distributed over the many PEs in a parallel job. The baseline distribution of these objects can be changed from its default by providing an 'array map' from element index to home PE. The dynamic distribution of these objects is adjusted by the periodic dynamic load balancing strategies that users can request.
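For instance, a custom map might look roughly like the following, adapted from memory of the manual's BlockMap example, so the exact signatures should be checked against the manual:

// .ci (sketch):
//   group BlockMap : CkArrayMap {
//     entry BlockMap(void);
//   };

class BlockMap : public CkArrayMap {
public:
  BlockMap(void) {}
  BlockMap(CkMigrateMessage *m) {}
  int registerArray(CkArrayIndex &numElements, CkArrayID aid) {
    return 0;   // bookkeeping hook; nothing special needed here
  }
  // Place elements in fixed-size contiguous blocks across the PEs.
  int procNum(int /*arrayHdl*/, const CkArrayIndex &idx) {
    int elem = *(int *)idx.data();
    const int blockSize = 64;                 // elements per block, arbitrary
    return (elem / blockSize) % CkNumPes();
  }
};

// At array creation time:
//   CkArrayOptions opts(N);
//   opts.setMap(CProxy_BlockMap::ckNew());
//   CProxy_GenericTask all_tasks = CProxy_GenericTask::ckNew(opts);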

Non-array chare objects are created as 'seed' messages. The recipient of that message is either specified in the ckNew call, or defaults to the PE running the code making the call. These seed messages can be redistributed by 'seed balancing' strategies that users can choose to link. By default, no such strategy is enabled, because it would create overhead for the many applications that don't use it.

It's not clear whether the issue here is one of application design, or system usage. Perhaps the chare array elements should be called upon to work in parallel, without spawning ancillary chare objects beyond them. Or perhaps the chare objects are a sensible part of the design, and having them all created on PE 0 is a serial bottleneck that needs to be spread out.

Please follow up with more details, so we can help you further.

Phil





