Re: [charm] [EXTERNAL] Re: All-to-all or redn+bcast


  • From: Jozsef Bakosi <jbakosi AT lanl.gov>
  • To: Eric Mikida <epmikida AT hpccharm.com>
  • Cc: charm AT lists.cs.illinois.edu
  • Subject: Re: [charm] [EXTERNAL] Re: All-to-all or redn+bcast
  • Date: Mon, 17 May 2021 15:18:35 -0600

Thanks, Eric, for clarifying this. Indeed, this is consistent with our
tests: we implemented the change, and while at relatively small scales
it wasn't much worse, it wasn't any better either and did not yield
more overlap.

Jozsef

On 05.17.2021 17:15, Eric Mikida wrote:
> Hi Jozsef,
>
> I would be shocked if individual redn+p2p would ever outperform a single
> redn+broadcast to all chares in the array. A couple of points:
>
> With over-decomposition, it's even worse than going from log(n) to n in terms
> of messages. For the broadcast, it's log(p), where p is the number of PEs,
> and for the individual scheme, it's n, where n is the number of array elements. On
> top of that, it is also n reductions as compared to one reduction.
>
> Lastly, I don't even think there is going to be much, if any, benefit from
> overlap. First, you are serializing the sends, so one PE is initiating N
> reductions, which could be a very large number as you scale up. That
> already adds an extra bottleneck. Second, there is already overlap
> present in the single redn+broadcast scheme. Once the reduction completes,
> the broadcast is initiated, but broadcasts in Charm++ are not synchronized.
> So the array elements will be able to act upon the `solve` entry method
> invocation as soon as they receive it, even if other members of the array
> are still waiting for it.
>
> Eric
>
> > On May 14, 2021, at 11:51 AM, Jozsef Bakosi
> > <jbakosi AT lanl.gov>
> > wrote:
> >
> > Hi folks,
> >
> > I wanted to know your expert opinion on the following.
> >
> > We have an all-to-all, computing a min of single scalar real value,
> > among many chares intended to be running at large scales. This amounts
> > to our single synchronization point within a time step.
> >
> > I wonder if replacing the single all-to-all with a reduction + broadcast
> > targeting each chare might allow for more overlap. I believe a single
> > all-to-all is implemented as a redn+bcast to/from a single chare, and
> > the complexity of what I'm suggesting is probably worse; nevertheless,
> > it seems worth asking.
> >
> > In code, with DG being a chare array, I'm suggesting to replace
> >
> > contribute( sizeof(double), &mindt, CkReduction::min_double,
> > CkCallback(CkReductionTarget(DG,solve), thisProxy) );
> >
> > with
> >
> > for all DG chares i
> > contribute( sizeof(double), &mindt, CkReduction::min_double,
> > CkCallback(CkReductionTarget(DG,solve), thisProxy[i]) );
> > end
> >
> > Would this allow for more overlap by removing the global sync, or would I
> > be throwing the baby out with the bathwater by replacing the log(n)
> > algorithmic/parallel complexity with n due to the for loop?
> >
> > Thanks,
> > Jozsef
> > --
> > Jozsef Bakosi, PhD, LANL CCS-2, o:505-665-0950, c:505-695-4523


