
charm - Re: [charm] MPI Build of Charm++ on Linux Cluster

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: "Kale, Laxmikant V" <kale AT illinois.edu>
  • To: "Wang, Felix Y." <wang65 AT llnl.gov>, "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] MPI Build of Charm++ on Linux Cluster
  • Date: Wed, 25 Jul 2012 12:22:42 +0000
  • Accept-language: en-US
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Clusters like cab are very much supported by Charm++ (and are in use for apps such as NAMD). We will look at the issues you describe.

BTW: is the net-smp build OK? Running Charm++ on top of MPI is not the best option in terms of performance, especially when an implementation on top of a lower layer (UDP, ibverbs) is available.
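(For context, the machine layer is chosen when Charm++ itself is built. A minimal sketch of the relevant invocations, run from the top of a Charm++ source tree, assuming a 64-bit Linux target; exact target and option names vary by Charm++ version, so check the output of `./build --help` on your installation:)

```shell
# MPI-based machine layer (the build discussed in this thread):
./build charm++ mpi-linux-x86_64 --with-production

# UDP-based "net" layer, which bypasses the MPI runtime entirely;
# add the "smp" option for the net-smp variant:
./build charm++ net-linux-x86_64 --with-production
./build charm++ net-linux-x86_64 smp --with-production

# InfiniBand verbs layer, if the cluster has IB hardware:
./build charm++ net-linux-x86_64 ibverbs --with-production
```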

-- 
Laxmikant (Sanjay) Kale         http://charm.cs.uiuc.edu
Professor, Computer Science     kale AT illinois.edu
201 N. Goodwin Avenue           Ph:  (217) 244-0094
Urbana, IL  61801-2302          FAX: (217) 265-6582

On 7/24/12 5:49 PM, "Wang, Felix Y." <wang65 AT llnl.gov> wrote:

Hello PPL,

I was wondering if you could give me some insight into why the MPI build of Charm++ on a Linux cluster ('cab' at LLNL) experiences large gaps in the communication phases, whereas on a Blue Gene/P architecture ('udawn' at LLNL) there are no such gaps and the computation proceeds as expected. See the attached images for an illustration of what I mean by these gaps. The code being run in both instances is identical: a port of the shock hydrodynamics proxy application LULESH, which is mostly computation-heavy, with a few instances where communication across domains is necessary.

I have tested both the net build and the mpi-smp build on the same cluster, cab, and have found that the effects seen with the mpi build are nonexistent, although there are other problems, with much more idle time than preferred. In the case of the mpi-smp build, every once in a while one of the PEs gets 'stuck', essentially blocking all ongoing communication for a reason I have yet to understand.

Essentially, I'm wondering if anyone has noticed this kind of behavior on other clusters before, and what is needed to get rid of it (e.g., some sort of configuration parameter). So far I have tested the 'mpi' version of my code port with different compilers (both icc and gcc) as well as different versions of MVAPICH (1 and 2). Could it be that these implementations simply are not well supported by Charm++? What configuration does the PPL group typically use to test the MPI build of Charm++ during development?

Thanks,

--- Felix


