
charm - Re: [charm] [ppl] Many individual chares vs chare array

  • From: Jozsef Bakosi <jbakosi AT gmail.com>
  • To: Jonathan Lifflander <jliffl2 AT illinois.edu>
  • Cc: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] [ppl] Many individual chares vs chare array
  • Date: Fri, 10 Jul 2015 14:24:47 -0600
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

OK, I'm not sure I was very clear in my question, but I think you are implicitly answering it. Let me rephrase it to see if I understand:

Only those calls from array elements to a single host that originate via a contribute call, i.e., an explicit reduction to a reduction target, get aggregated. Entry method calls that are simply direct calls from the workers to the host are not aggregated.
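
For concreteness, here are the two call styles I mean, as a rough sketch (using the hostfn/hostfn_noreduct/hostinstance names from my earlier email quoted below):

// (1) explicit reduction to a reduction target: contributions are combined
// per PE/process and travel up a tree to Host::hostfn
contribute( CkCallback( CkReductionTarget( Host, hostfn ), hostinstance ) );

// (2) plain entry method call: one point-to-point message per worker, no combining
hostinstance.hostfn_noreduct();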

Considering Phil's answer about TRAM (in the other email), this blurs the picture a bit, because it is possible to use TRAM to aggregate non-reduction calls from array elements (and that can even take the network topology into account). Another (future) way will be via the [aggregate] keyword.

Correct?

If so, it seems like I will now have to write a new reduction type with a custom reduction function that can aggregate std::vectors. Based on that, I have another question: in the manual, as well as in the barnes-charm example, I only see POD data types passed via CkReductionMsg to a custom reduction function. What I need to do, though, is aggregate vectors of different sizes: I'm basically estimating a histogram to which each worker chare contributes a different number of counters, and these are collected on a host into a final histogram. I believe this is similar to the histogram_group example. Actually, now that I take a closer look at that example, it is pretty much what I need, since there the HistogramMerger is a chare group (one instance per PE) collecting contributions from the Histogram chare array (potentially many more than one per PE).
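
To make that concrete, this is roughly the kind of custom reducer I have in mind (an untested sketch; it assumes the counters are ints that line up by bin index across contributions, so shorter contributions are just padded with zeros, and the mergeCounts/registerMergeCounts/collectHistogram names are mine, not from the manual or the examples):

#include <algorithm>
#include <vector>
#include "charm++.h"

// Set on every logical node by the initnode routine below.
CkReduction::reducerType mergeCountsType;

// Element-wise sum of int counter arrays of possibly different lengths.
static CkReductionMsg* mergeCounts( int nMsg, CkReductionMsg** msgs ) {
  std::size_t maxlen = 0;
  for (int i = 0; i < nMsg; ++i)
    maxlen = std::max( maxlen, msgs[i]->getSize() / sizeof(int) );
  std::vector<int> result( maxlen, 0 );
  for (int i = 0; i < nMsg; ++i) {
    const int* d = static_cast<const int*>( msgs[i]->getData() );
    const std::size_t len = msgs[i]->getSize() / sizeof(int);
    for (std::size_t j = 0; j < len; ++j) result[j] += d[j];
  }
  return CkReductionMsg::buildNew( static_cast<int>( maxlen * sizeof(int) ),
                                   result.data() );
}

// Declared as "initnode void registerMergeCounts();" in the .ci file.
void registerMergeCounts() {
  mergeCountsType = CkReduction::addReducer( mergeCounts );
}

Each worker would then contribute its counters with something like

contribute( counts.size() * sizeof(int), counts.data(), mergeCountsType,
            CkCallback( CkReductionTarget( Host, collectHistogram ), hostinstance ) );

where collectHistogram stands for whatever [reductiontarget] entry on Host receives the merged counters.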

Well, thanks for the help. You guys are awesome!

J

On Fri, Jul 10, 2015 at 1:45 PM, Jonathan Lifflander <jliffl2 AT illinois.edu> wrote:
The "reductiontarget" keyword enables the annotated entry method to be
the target of a reduction (the method that is called when the
reduction is finished). It does not change how aggregation happens
under the hood.

In order to get a reduction tree + aggregation, you need to use a chare array:

array [1D] { ... }

When you call contribute from the elements of the chare array, a
reduction tree is used along with local aggregation of individual
contributions inside the node.

Calling contribute from an array requires that all of its elements
contribute. If that is not the case, you will need to create an array section.
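
As a rough sketch (the Workers/Host/hostfn/hostProxy names are just
illustrative, and hostfn is assumed here to take the reduced int):

// .ci
array [1D] Workers {
  entry Workers();
  entry void compute();
};

// In Workers::compute(): each element contributes one int; the contributions
// are combined locally on each PE/process and a single combined message goes
// up the tree to the [reductiontarget] entry Host::hostfn.
int local = 42;  // placeholder for the element's local result
contribute(sizeof(int), &local, CkReduction::sum_int,
           CkCallback(CkReductionTarget(Host, hostfn), hostProxy));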

Thanks,

Jonathan

On 10 July 2015 at 09:51, Jozsef Bakosi <jbakosi AT gmail.com> wrote:
> Follow-up question:
>
> Does the aggregation happen only with reductions, via, e.g., contribute(
> CkCallback( CkReductionTarget( Host, hostfn ), hostinstance )), or also with
> simply calling the non-reductiontarget member function, hostfn_noreduct(),
> from the workers? The host .ci file in that case would be
>
> chare Host {
>       entry [reductiontarget] void hostfn();
>       entry void hostfn_noreduct();
> };
>
> Again, I suspect the non-reduction call will not aggregate, but I might be wrong.
> Can you clarify?
>
> On Thu, Jul 9, 2015 at 1:01 PM, Jozsef Bakosi <jbakosi AT gmail.com> wrote:
>>
>> Thanks Phil, that's interesting. I guess that (at least partially)
>> explains (I hope) the pretty unsatisfactory weak scaling behavior I'm
>> getting with a simple particle (i.e., Monte Carlo) code.
>>
>> Thanks for the clarification,
>> J
>>
>> On Thu, Jul 9, 2015 at 12:35 PM, Phil Miller <mille121 AT illinois.edu>
>> wrote:
>>>
>>> You're exactly right - reductions locally combine the contributions of
>>> all chare array elements on each PE, and then in each process, and transmit
>>> a single message up a process tree to the root. At large machine scales,
>>> this isn't just faster; it's the difference between the code running at all
>>> and crashing due to message overload.
>>>
>>> On Thu, Jul 9, 2015 at 1:10 PM, Jozsef Bakosi <jbakosi AT gmail.com> wrote:
>>>>
>>>> Hi folks,
>>>>
>>>> I suspect I know the answer to this question but I'd like some
>>>> clarification on it.
>>>>
>>>> What is the main difference between creating (a potentially large number
>>>> of) individual chares that call back to a single host proxy, versus
>>>> creating the workers as a chare array and using a reduction? I assume
>>>> the latter will do some kind of message aggregation under the hood (i.e.,
>>>> using a tree), collecting messages (in the form of entry method
>>>> arguments) from individual array elements and sending only aggregated
>>>> messages to the single host. Is this correct? If so, I guess I should get
>>>> better performance...
>>>>
>>>> Thanks,
>>>> Jozsef
>>>>



