charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

List archive

Re: [charm] All-to-all or redn+bcast

From: Eric Mikida <epmikida AT hpccharm.com>
To: Jozsef Bakosi <jbakosi AT lanl.gov>
Cc: charm AT lists.cs.illinois.edu
Subject: Re: [charm] All-to-all or redn+bcast
Date: Mon, 17 May 2021 17:15:02 -0400
Authentication-results: ppops.net; spf=none smtp.mailfrom=epmikida AT hpccharm.com; dkim=pass header.d=hpccharm-com.20150623.gappssmtp.com header.s=20150623

Hi Jozsef,

I would be shocked if indiviual redn+p2p would ever outperform a single
redn+broadcast to all chares in the array. A couple points:

In over decomposition, it’s even worse than going from log(n) to n in terms
of messages. For the broadcast, its log(p) where p is the number of PEs, and
for the individual, it is n where n is the number of array elements. On top
of that, it is also n reductions as compared to one reduction.

Lastly, I don’t even think there is going to be much if any benefit from
overlap. You are serializing the sends, so one PE is initiating N reductions,
which could be a very large number as you scale up. So that already adds one
extra bottleneck. And secondly, there is already overlap present in the
single redn+broadcast scheme. Once the reduction completes, the broadcast is
initiated, but broadcasts in charm are not synchronized. So the array
elements will be able to act upon the `solve` entry method invocation as soon
as they receive it, even if other members of the array are still waiting for
it.

Eric

> On May 14, 2021, at 11:51 AM, Jozsef Bakosi
> <jbakosi AT lanl.gov>
> wrote:
>
> Hi folks,
>
> I wanted to know your expert opinion on the following.
>
> We have an all-to-all, computing a min of single scalar real value,
> among many chares intended to be running at large scales. This amounts
> to our single synchronization point within a time step.
>
> I wonder if replacing the single all-to-all with a reduction + broadcast
> targeting each chare may allow for more overlap. I believe a single
> all-to-all is implemented as a redn+bcast to/from a single chare, and
> the complexity of what I'm suggesting is probably worse, nevertheless
> worth asking.
>
> In code, with DG being a chare array, I'm suggesting to replace
>
> contribute( sizeof(double), &mindt, CkReduction::min_double,
> CkCallback(CkReductionTarget(DG,solve), thisProxy) );
>
> with
>
> for all DG chares i
> contribute( sizeof(double), &mindt, CkReduction::min_double,
> CkCallback(CkReductionTarget(DG,solve), thisProxy[i]) );
> end
>
> Would this allow for more overlap by removing the global sync or I would
> throw the baby out with the bathwater because I am replacing the log(n)
> algorithmic/parallel complexity with n due to the for loop?
>
> Thanks,
> Jozsef
> --
> Jozsef Bakosi, PhD, LANL CCS-2, o:505-665-0950, c:505-695-4523

[charm] All-to-all or redn+bcast, Jozsef Bakosi, 05/14/2021
- Re: [charm] All-to-all or redn+bcast, Eric Mikida, 05/17/2021
  - Re: [charm] [EXTERNAL] Re: All-to-all or redn+bcast, Jozsef Bakosi, 05/17/2021