charm - Re: [charm] Timing execution at scale


  • From: Steve Petruzza <spetruzza AT sci.utah.edu>
  • To: Phil Miller <phil AT hpccharm.com>
  • Cc: charm <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Timing execution at scale
  • Date: Thu, 18 Aug 2016 17:18:52 +0300

Thanks Phil,

when you say “Message passing between those threads by pointers rather than copies”, does that also mean the runtime avoids marshalling (PUP operations) between threads?

Anyway, I got some speedup on the XC40 with SMP using fewer processes and more threads.

On an XK7 (Titan), however, built with:
./build charm++ gni-crayxe   smp  -j16  --with-production
I also tried:
./build charm++ gemini_gni-crayxe-persistent-smp -optimize

The SMP version still has worse performance, for example trying with 4096 cores over 256 nodes:

a) aprun -n 256 -N 1 ./charm_application +ppn15 (121 sec)
or
b) aprun -n 512 -N 2 ./charm_application +ppn7 (81 sec)
or
c) aprun -n 1024 -N 4 ./charm_application +ppn3 (77 sec), where I go over the number of physical cores
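As a sanity check on these layouts, the per-node thread counts they imply can be tallied like this (illustrative only, assuming one comm thread per SMP process, as the startup output indicates):

```python
# Per-node thread count for each aprun layout, assuming each SMP process
# runs +ppn worker threads plus one communication thread.
def threads_per_node(procs_per_node, ppn):
    return procs_per_node * (ppn + 1)

layouts = {"a": (1, 15), "b": (2, 7), "c": (4, 3)}
for label, (procs, ppn) in layouts.items():
    print(label, threads_per_node(procs, ppn))
```

All three layouts fill the node's 16 cores (the nodes report as 16-way SMP) once comm threads are counted; in (c), 4 of those 16 cores are doing communication rather than application work, which may be part of the difference.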

The output header for case (b) looks like this:
Charm++> Running on Gemini (GNI) with 512 processes
Charm++> static SMSG
Charm++> SMSG memory: 2528.0KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 2048K
Charm++> Running in SMP mode: numNodes 512,  7 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.7.0-281-g8d5cdd9
Warning> using Isomalloc in SMP mode, you may need to run with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 256 unique compute nodes (16-way SMP).

I also tried adding +isomalloc_sync but get similar performance.
On the XC40 the same application in SMP (1 proc per socket) gets a ~10% speedup.
Here (XK7), instead, it is 2 or 3 times slower than non-SMP.

Thank you,
Steve


On 16 Aug 2016, at 19:08, Phil Miller <phil AT hpccharm.com> wrote:

Hi Steve,

The expected benefits of the SMP mode generally arise from having multiple worker threads grouped in fewer processes:
- Message passing between those threads by pointers rather than copies
- Less pressure on available pinned/registered memory used to pass data between the network and the application processes
- Less contention for the interface to the network hardware, since there will be fewer threads acting as comm threads
- Better work specialization of cores, since worker threads never run code for network communication (reducing instruction cache pressure)

Our recommended baseline configuration for SMP runs is one process per socket, with a core set aside for the communication thread, and a worker for every remaining core. For systems and applications that exhibit substantial noise sensitivity, we recommend leaving another core idle for OS processes to run on.

In your case, I believe that configuration would look like:

#SBATCH --ntasks-per-node=2

srun ./charm_application +ppn 15 <app_args>

Or if noise is an issue, drop the 15 to 14.

The mention of gemini vs aries is an artifact of gemini being built first. The code is the same for both. Thanks for pointing that out, though.

Phil

On Tue, Aug 16, 2016 at 12:55 AM, Steve Petruzza <spetruzza AT sci.utah.edu> wrote:
Hi Phil,

the srun command is very simple (just change the number of cores and nodes):
#SBATCH --nodes=256                                                                                              
#SBATCH --ntasks-per-node=32  

srun ./charm_application  <app_args>

Here is the non-SMP:

Charm++> Running on Gemini (GNI) with 32768 processes
Charm++> static SMSG
Charm++> SMSG memory: 63488.0KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 2048K
Charm++> Running in non-SMP mode: numPes 32768
Converse/Charm++ Commit ID: v6.7.1-0-gbdf6a1b
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1024 unique compute nodes (64-way SMP).

and the SMP:
Charm++> Running on Gemini (GNI) with 32768 processes
Charm++> static SMSG
Charm++> SMSG memory: 63488.0KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 8192K
Charm++> Running in SMP mode: numNodes 32768,  1 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.7.1-0-gbdf6a1b
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1024 unique compute nodes (64-way SMP).

In SMP mode I don’t add any parameter (+ppn) because 1 worker thread per process should map to all the physical processors available in the run.

One weird thing for me is that it prints out “gemini” even though it should be an “aries” network for this machine. Is that correct?
I built Charm++ like this:
./build charm++ gni-crayxc   smp  -j16  --with-production
(or without smp)


Thank you,
Steve





On 15 Aug 2016, at 23:53, Phil Miller <phil AT hpccharm.com> wrote:

On Wed, Aug 10, 2016 at 8:33 AM, Steve Petruzza <spetruzza AT sci.utah.edu> wrote:
Thank you guys,

After the last modification (split and reduce) I got a little speedup. The 10% calls were not really a huge bottleneck, probably because they are not simultaneous. The actual issue at scale was instead related to my code and a particular dataset that I was using. In different conditions the scaling continued well (over 1M chares :D).

I'm glad to hear that issue has been worked out.
 
I actually noticed something interesting. My application has better performance in non-SMP mode rather than SMP. At scale the performance gap actually decreases and the two seem to converge at some point.

How would you explain that? Does it have to do with CPU affinity?
For example, in SMP mode I know that I have (in my case) 1 worker thread per process (default), which means as many as the cores (--ntasks-per-node) per node, right? Plus the communication threads, I guess.
What is happening instead in non-SMP mode? How many threads are around per node?

Could you share the following with us, to help us understand the arrangement of threads and processes in your current runs:
- The aprun command lines you're using to launch
- The initial few lines of output from Charm++ startup that describe the hardware and thread configuration
- A link to the specifications of the particular machine you're running on

We're working on code to make getting this right much easier and more automatic, but we're not there yet.

Phil
 

Thank you,
Steve




On 08 Aug 2016, at 19:07, Phil Miller <phil AT hpccharm.com> wrote:

[re-adding the mailing list, since there is nothing private in this discussion, and it should be in the public archive so that others can see it later]

By three global chare arrays and the questions around them, Steve refers to a few distinct but related things:
1. The collections of distributed objects instantiated by calls to CProxy_Foo1::ckNew(N1), CProxy_Foo2::ckNew(N2), etc.
2. The proxy objects returned by each of the above ckNew calls
3. The readonly proxy object variables, that get initialized with the ckNew calls, and subsequently broadcast to all PEs

The difference between a collection itself and its proxy is exactly analogous to that between a dynamically sized object or array of objects in memory, and a pointer to that object/array. The object may be arbitrarily large, and there may be arbitrarily many instances in the collection. The proxy to them is always just a few integers that represent a globally valid handle used to look up objects in the collection, just like a pointer can be dereferenced and subscripted to access the objects it points to.

Knowing that the proxies are small fixed-size objects, sticking them in globally-broadcast readonly variables becomes much less worrisome. The difference between having just 1, vs 3 or 10 or 20, is a difference of a few dozen bytes broadcast - a negligible cost.

As for the design shift from a single chare array to three arrays handling input, computation, and output tasks, that sounds like an excellent choice. Especially so if the 'done' calls are only coming from the output tasks, and so they could just do a reduction over their entire array. Timing each of the other phases also then just becomes a reduction from the elements of the corresponding arrays. In that arrangement, they'll present no interference at all with ongoing execution, either.
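A sketch of what such a reduction from the output array might look like (names such as OutputTask, Main, mainProxy, and donePhase are hypothetical; this depends on the Charm++ runtime and .ci-generated code, so it is not a standalone runnable example):

```cpp
// Hypothetical .ci declarations:
//   array [1D] OutputTask { entry OutputTask(); entry void finish(); };
//   mainchare Main { entry [reductiontarget] void donePhase(double t); };

void OutputTask::finish() {
  // Each element contributes its completion time; the runtime combines
  // contributions in a spanning tree, so the main chare's PE receives a
  // single message instead of one per element.
  double t = CkWallTimer();
  CkCallback cb(CkReductionTarget(Main, donePhase), mainProxy);
  contribute(sizeof(double), &t, CkReduction::max_double, cb);
}

void Main::donePhase(double t) {
  // Invoked exactly once, after all output elements have contributed.
  CkPrintf("Output phase finished at t = %f s\n", t);
}
```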

Regarding the GNI error messages, that may be the result of hitting PE0 with all of the done messages as point-to-point sends from the individual chares (as Sanjay also suggests). Trying the pure reduction-based designs should probably come before investigating that too intensely, since again, the 10% of objects all sending to one place is a huge scalability pitfall anyway.

Phil

On Mon, Aug 8, 2016 at 10:27 AM, Kale, Laxmikant V <kale AT illinois.edu> wrote:

No extra overhead for 3 chare arrays.

 

(I don’t understand what you mean by “read only”… a chare array is not read only. But do you mean their content doesn’t change? That’s probably irrelevant.)

 

Yes, but with this scheme you can collect data from those chare arrays separately.

What do you do with those ints coming back? If it's any commutative/associative op, you should use a reduction. Even if it is not, you can do a reduction with the reduction type CkReduction::concat.

 

Would the input array need to send done messages too? Do they still have an int payload? Anyway, you can use an appropriate type of reduction for it.

 

Laxmikant (Sanjay) Kale         http://charm.cs.uiuc.edu

Professor, Computer Science     kale AT illinois.edu

201 N. Goodwin Avenue           Ph:  (217) 244-0094

Urbana, IL  61801-2302 

 

 

From: Steve Petruzza <spetruzza AT sci.utah.edu>
Date: Monday, August 8, 2016 at 9:31 AM
To: Laxmikant Kale <kale AT illinois.edu>
Cc: Phil Miller <phil AT hpccharm.com>
Subject: Re: [charm] Timing execution at scale

 

The done message has only an “int” payload.

 

I am considering splitting the main chare array into 3 (global) arrays: input tasks, computing tasks, and output tasks.

This way I can easily manage the timing of the initialization, computing, and finalisation phases.

 

Would it add any overhead to have 3 global chare arrays (/*read only*/) instead of a single one? All three will be created in the main chare.

Of course tasks from the computation phase, for example, will eventually call tasks from the finalisation phase. Initialization and finalisation phase tasks are generally about 10% of the total amount of tasks.

 

Does it make sense?

 

Thank you,

Steve

 

 

On 08 Aug 2016, at 17:05, Kale, Laxmikant V <kale AT illinois.edu> wrote:

 

That looks like a GNI layer memory error (running out of pinned memory?). It could be due to the barrage of messages to 0 (how big are the done messages?) or due to another leak in the Charm++ system (but less likely, because we have applications that run for days). It may be worthwhile running on another layer (Charm++ on MPI on Cray, or Charm++ on another non-Cray machine of your choice).

 

Quiescence detection may be worth using if your “done” messages have no payload (or even if they do have a small payload, it could be collected via some other group reduction after quiescence).
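For reference, a sketch of the quiescence-detection variant (hypothetical names like tasksProxy and allQuiet; depends on the Charm++ runtime, not standalone):

```cpp
// After launching all the work, ask the runtime for a callback once no
// messages remain in flight anywhere (quiescence detection).
void Main::startWork() {
  startTime = CkWallTimer();  // hypothetical member variable
  tasksProxy.run();           // hypothetical entry method starting the tasks
  CkStartQD(CkCallback(CkIndex_Main::allQuiet(), mainProxy));
}

void Main::allQuiet() {
  CkPrintf("All work finished in %f s\n", CkWallTimer() - startTime);
  CkExit();
}
```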

 

In any case, we shouldn’t have that error, and we need to find out what resource is running out and why.

 

I think it's best to have Phil continue to advise you (and maybe another person, if Phil asks someone else).

 

 

Laxmikant (Sanjay) Kale         http://charm.cs.uiuc.edu

Professor, Computer Science     kale AT illinois.edu

201 N. Goodwin Avenue           Ph:  (217) 244-0094

Urbana, IL  61801-2302 

 

 

From: Steve Petruzza <spetruzza AT sci.utah.edu>
Date: Monday, August 8, 2016 at 8:45 AM
To: Laxmikant Kale <kale AT illinois.edu>
Cc: Phil Miller <phil AT hpccharm.com>
Subject: Re: [charm] Timing execution at scale

 

Thank you Kale, 

 

I briefly described the application in the other thread (“Scalability issues using large chare array”).

No, the N tasks are not independent; they call a few other methods (all in the same global proxy). The only large-range call I make is for the timing.

 

I can tell you that it reaches 64K cores using almost 600K tasks (in the same chare array), strong-scaling well (taking the timing with this 10% global “done” call).

 

When I go further, up to 128K cores with over 1M tasks, I get:

[0] GNI_GNI_MemRegister mem buffer; err=GNI_RC_ERROR_RESOURCE

or (non SMP)

[0] GNI_GNI_MemRegister mem buffer; err=GNI_RC_ERROR_RESOURCE
------- Partition 0 Processor 0 Exiting: Called CmiAbort ------
Reason: GNI_RC_CHECK

 

I wonder if the two things could be related.

 

On the side, I will investigate whether there is some task indexing/prefixing issue in my code that could be related to having over 1 million tasks…

But surely there is still some resource distribution/requesting issue.

 

For the timing “issue”, do you think that dumping the time value to a file (1 file per proc) from these 10% of procs and then post-processing would be better?

 

Thank you,

Steve

 

On 08 Aug 2016, at 16:10, Kale, Laxmikant V <kale AT illinois.edu> wrote:

 

If you are running this on 100+ cores, yes, the 10% of tasks sending done messages will be a bottleneck at the processor running the main chare. Basically, even assuming you are not doing any significant computation for each done message, it will take a microsecond or so to process each one.
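A back-of-envelope check of that serialization cost, using the figures from this thread (10% of tasks, roughly 1 microsecond per message):

```python
# Time one PE spends serially processing "done" messages, assuming 10% of
# all tasks each send one message and each takes ~1 microsecond to handle.
def done_message_cost_s(total_tasks, fraction=0.10, us_per_msg=1.0):
    messages = int(total_tasks * fraction)
    return messages * us_per_msg / 1e6

print(done_message_cost_s(600_000))    # 600K tasks -> 60K messages
print(done_message_cost_s(1_000_000))  # 1M tasks  -> 100K messages
```

The serialized time itself is modest, but those tens of thousands of point-to-point messages all arrive as a burst at one PE, which is the real pitfall.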

 

Creating a section for one-time use is not a good solution (at least for now; we have a distributed section creation feature in the pipeline).

 

If all 100% of the chares can participate in the done function (some with no data) and the data being collected from them is reducible (a sum of numbers, or even collecting a set of solutions), it may be worthwhile using a reduction over the entire array.

 

As a tangential thought:

Do you have N tasks that are independent of each other (no lateral communication from task i to task j)? For such master-slave or search applications, you should use singleton chares and seed balancers.

 

I am also wondering if we should take this conversation off the charm mailing list for now, and keep it between a few of us. We can summarize to the mailing list later. (I dropped the mailing list.)

 

It will be helpful to see some sort of skeleton of your application.

 

Laxmikant (Sanjay) Kale         http://charm.cs.uiuc.edu

Professor, Computer Science     kale AT illinois.edu

201 N. Goodwin Avenue           Ph:  (217) 244-0094

Urbana, IL  61801-2302 

 

 

From: Steve Petruzza <spetruzza AT sci.utah.edu>
Reply-To: Steve Petruzza <spetruzza AT sci.utah.edu>
Date: Monday, August 8, 2016 at 4:18 AM
To: charm <charm AT lists.cs.illinois.edu>
Subject: [charm] Timing execution at scale

 

Hi,

In my application I have a global chare array with N tasks, where N can vary from 1K to 1M.

 

At the moment I am timing the execution from the main chare, where a subset of tasks (10%) will call the “done” function of the main chare at the end.

Do you think that these 10% of calls could create a considerable overhead in the timing?

 

The alternative, I think, could be using a reduction operation on a ProxySection, but wouldn't creating such a proxy section (it has to be global, created by the main chare) with 10% of the tasks, plus the reduction operation, create an even bigger overhead anyway (at least in memory)?

 

If necessary, is there any other alternative to precisely get the timing of a specific part of the execution? Could Projections help (at scale)?

 

Steve

 

 

 

 

 

 

 

 









