
charm - Re: [charm] Profiling and tuning charm++ applications

  • From: "Kale, Laxmikant V" <kale AT illinois.edu>
  • To: Alexander Frolov <alexndr.frolov AT gmail.com>, "Bohm, Eric J" <ebohm AT illinois.edu>, "Buch, Ronak Akshay" <rabuch2 AT illinois.edu>, "Kale, Laxmikant V" <kale AT illinois.edu>
  • Cc: Core RTS Group <core-ppl AT cs.illinois.edu>, "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] Profiling and tuning charm++ applications
  • Date: Thu, 23 Jul 2015 22:42:24 +0000
  • Accept-language: en-US
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

You can do a “communication profile” with time as the x-axis, to see how many messages per unit time the system (or each core) is processing. If that works out to a message every few tens of microseconds, you have a grain-size issue (which I suspect, in this case).
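
For reference, a minimal sketch of how such Projections traces are typically collected (the binary name is hypothetical; the launch line follows the mpirun usage discussed below):

    # link the application with Projections tracing enabled
    charmc myapp.o -o myapp -tracemode projections
    # run as usual; +logsize enlarges the per-PE trace buffer (log entries) to avoid flushes
    mpirun -np 16 ./myapp +logsize 10000000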

The white stacked bar indicates idle time, and the black above that is overhead (do a time profile to see how much of this is initialization). You can see that core 0 has very little idle time. You may want to reduce the load on core 0.

I think at this time it may be better to drop the rest of the charm mailing list out of the conversation. Continue the discussion with a few of us (maybe keep core-ppl in for now).
 
-- 
--- 
Laxmikant (Sanjay) Kale         http://charm.cs.uiuc.edu
Professor, Computer Science     kale AT illinois.edu
201 N. Goodwin Avenue           Ph:  (217) 244-0094
Urbana, IL  61801-2302          

From: Alexander Frolov <alexndr.frolov AT gmail.com>
Date: Thursday, July 23, 2015 at 5:14 PM
To: Eric Bohm <ebohm AT illinois.edu>
Cc: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
Subject: Re: [charm] Profiling and tuning charm++ applications

Eric,

Thank you very much for your answers!
I will try the verbs-linux-x86_64-smp build of the charm++ runtime.

I attached an example of the profile I am observing for my application. It was collected on a single node with 16 MPI processes running (the problem size is small, just sufficient to collect the trace without flushes). As you can see, the average utilization is about 30%. This looks terribly low to me.


On Thu, Jul 23, 2015 at 9:34 PM, Eric Bohm <ebohm AT illinois.edu> wrote:
Hi Alex,

On 07/23/2015 05:29 AM, Alexander Frolov wrote:
Hello Eric,

On Wed, Jul 22, 2015 at 7:50 PM, Eric Bohm <ebohm AT illinois.edu> wrote:
Hello Alex,

Charm++ applications can easily reach peak utilization.  However, there are a number of factors which may be affecting your performance.   The MPI target for Charm++ is one of the simplest to build, but it is unlikely to be the one that gives the best performance.   For single node scalability you will probably experience better performance using a different target.  Try multicore-linux64.

I am targeting scaling on an InfiniBand cluster; single-node SMP performance is not that interesting, which is why I am building with mpicc. By the way, does the charm++ runtime support a combination of multicore and MPI?
 

Yes.  If you add smp to the build line, you will get a version which allows multiple worker threads in a process along with a distinct communication thread.  Typical usage is to indicate the number of worker threads via the +ppn parameter.  That number should be chosen to be one fewer than the number of cores (hardware threads) available to the process, so that the communication thread can use the remaining core.

FYI: Best performance on infiniband is usually in the verbs-linux-x86_64-smp build.
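
A rough sketch of what that could look like on a 16-core node (the binary name and launch line are hypothetical; exact charmrun options depend on your system, see the manual):

    # build the SMP verbs target (add --with-production for an optimized runtime)
    ./build charm++ verbs-linux-x86_64 smp --with-production
    # one process with 15 worker PEs, leaving one core for the communication thread
    ./charmrun +p15 ./myapp +ppn 15 +setcpuaffinity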


It is difficult to diagnose your specific problem in the abstract, however the most common cause for poor single core utilization is overly fine granularity in simulation decomposition.  Experiencing a substantial drop from 1 to 2 cores suggests a load imbalance issue may also be present, however I recommend you examine compute granularity first.  A modest increase in work per chare is likely to help.  The Projections tool can be used to evaluate the current situation.
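
As a rough illustration of such coarsening (names and the batching scheme are hypothetical, not from the thread): instead of one entry-method invocation per element, a chare can receive a block of elements and process them in one invocation, amortizing the per-message overhead.

    // .ci sketch: one marshalled message carries a whole batch of updates
    array [1D] Worker {
      entry Worker();
      entry void updateBatch(int n, double values[n]);
    };

    // C++ sketch: the per-message cost is now spread over n elements
    void Worker::updateBatch(int n, double* values) {
      for (int i = 0; i < n; ++i) {
        applyUpdate(values[i]);   // hypothetical per-element work
      }
    }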

Thank you for your suggestion. It is true that my application is very fine-grained. Unfortunately, even a modest increase in granularity requires reimplementing, and even rethinking, the algorithm. But I will try it anyway.

What I do not understand is the low utilization of the CPU cores, which, as I see it, should not be connected to the charm++ application (or even the runtime), but should depend only on the time the MPI processes have been running on the CPUs.

If you examine the time profile graph of your performance it will distinguish between time allocated by your entry methods (various colors) and time spent handling message packing/unpacking (black at top).

Do you mean the black that is in the background ("on top"?) or the black portions inside the bars? In my case I have insignificant packing/unpacking overhead (according to the usage profile).
 
The sum of the two is the overall utilization.  If the messaging-overhead portion is a substantial fraction of the total, then you have a granularity problem.

If refactoring for coarser granularity is very difficult, you may wish to look into using the TRAM library (see the manual appendices), as it will aggregate messages in a way that helps reduce the overhead of processing many tiny messages with tiny execution granularity.
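
For orientation, the idea behind such aggregation (whether TRAM does it for you or it is done by hand) is roughly the following; this is only an illustrative sketch with hypothetical names, not the TRAM API itself:

    #include <vector>
    // Requires charm++.h and the generated decl.h for a hypothetical WorkerGroup
    // whose .ci file declares: entry void processBatch(int n, double data[n]);
    // Buffer small items per destination PE and ship them as one bulk
    // entry-method call once a threshold is reached.
    class Aggregator {
      CProxy_WorkerGroup workers;                 // hypothetical group proxy (one member per PE)
      std::vector<std::vector<double> > buffers;  // one buffer per destination PE
      static const size_t THRESHOLD = 1024;       // hypothetical flush threshold
    public:
      explicit Aggregator(CProxy_WorkerGroup g) : workers(g), buffers(CkNumPes()) {}
      void submit(int destPe, double item) {
        buffers[destPe].push_back(item);
        if (buffers[destPe].size() >= THRESHOLD) flush(destPe);
      }
      void flush(int destPe) {
        std::vector<double>& buf = buffers[destPe];
        if (buf.empty()) return;
        // one bulk call instead of THRESHOLD tiny messages
        workers[destPe].processBatch((int)buf.size(), buf.data());
        buf.clear();
      }
    };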

Thank you. I will look into TRAM library. 


 

Regarding process switching, you can force affinity by appending the +setcpuaffinity flag, and specifically choose bindings by using the +pemap L[-U[:S[.R]+O]] argument.  See section C.2.2 of the manual (http://charm.cs.illinois.edu/manuals/html/charm++/manual.html) for details.

I am using mpirun (which is actually a script provided by the task manager). The custom task manager on the system I use does not support other ways of launching applications.
Thank you!

The + arguments are parsed by your application itself (because it is built against the charm library), not by mpirun.  CPU affinity can be set that way.
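
So the Charm++ flags can simply be appended to the usual launch line; a sketch, assuming a 16-core node and a hypothetical binary name (the second line assumes an MPI-SMP build):

    # non-SMP MPI build: pin each of the 16 ranks to a core
    mpirun -np 16 ./myapp +setcpuaffinity
    # MPI-SMP build: 15 workers on cores 0-14, communication thread on core 15
    mpirun -np 1 ./myapp +ppn 15 +setcpuaffinity +pemap 0-14 +commap 15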

Thanks. 



 


On 07/22/2015 11:24 AM, Alexander Frolov wrote:
Hi!

I am profiling my application with Projections and found out that the usage profile is terribly low (~45%) when 2 or more cores are used (at the moment I am investigating scalability within a single SMP node). For a single PE the usage profile is about 65% (which does not look good either).

I would suppose that something is wrong with the MPI environment (e.g. MPI processes being continuously switched between cores). But maybe the problem is in the charm++ configuration?

Has anybody seen similar behavior with charm++ applications, that is, no scalability where it is expected?
Any suggestions would be very much appreciated! :-)

Hardware:
2x Intel(R) Xeon(R) CPU E5-2690 with 65868940 kB of memory

System software:
icpc version 14.0.1, impi/4.1.0.030

Charm++ runtime:
./build charm++ mpi-linux-x86_64 mpicxx -verbose 2>&1

PS: I checked the --with-production option but it did not improve the situation significantly.

Best,
   Alex


_______________________________________________
charm mailing list
charm AT cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/charm






