
Re: [charm] Question about I/O characteristics of Charm++ ported applications


  • From: Qian Sun <sqianfly AT gmail.com>
  • To: Phil Miller <mille121 AT illinois.edu>
  • Cc: Charm Mailing List <charm AT cs.illinois.edu>, Akhil langer <akhilanger AT gmail.com>
  • Subject: Re: [charm] Question about I/O characteristics of Charm++ ported applications
  • Date: Thu, 10 Mar 2016 09:07:37 -0500

Thank you, Phil.

Actually, I thought AMR was imbalanced because I was referring to the paper "A Distributed Dynamic Load Balancer for Iterative Applications". That paper uses the Charm++ mini-apps AMR and LeanMD as load-imbalanced applications in its experiments; specifically, Figure 2 and Figure 7 show the imbalanced load distribution/time per step for a run of AMR. As a result, I thought the computation time on each processor would be different, which might lead to variation in the output times.

I was wondering how I can get that imbalanced scenario, and how I can tell whether the computation is imbalanced or not?

Also, I observed some variance in the write times for the data blocks. I was wondering whether this is caused by the blocks having different data sizes rather than by imbalance in the computation?

Thank you!

Regards,
Qian

On Wed, Mar 9, 2016 at 10:59 PM, Phil Miller <mille121 AT illinois.edu> wrote:

This is quite reasonable, since the computation is tightly coupled. Blocks are dependent on each other to make progress from one iteration to the next. Blocks in one part of the problem domain could get a few steps ahead of blocks far away, but this tends not to happen naturally because execution is message driven and opportunistic.

On Mar 9, 2016 9:11 PM, "Qian Sun" <sqianfly AT gmail.com> wrote:
Thank you, Akhil.

Which timing information can tell whether it is imbalanced or not?

Actually, I added some timing before the LOGGER output you added, to figure out when a particular block's data starts being written to file. I was expecting that if the computation is quite imbalanced, the times at which output starts would vary a lot. But the timings turn out to be quite close (at iteration 50, the first write starts at 36.8 s and the last at 38.5 s), which means all the data blocks are written at almost the same time. Is that a reasonable result?
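Roughly, the kind of instrumentation I mean is sketched below (just a sketch; the variable name iteration and the exact placement are illustrative, while CkWallTimer() and CkPrintf() are the standard Charm++ calls):

double t0 = CkWallTimer();   // wall-clock time just before this block's write
// ... existing LOGGER code that writes this block's mesh data ...
double t1 = CkWallTimer();   // wall-clock time just after the write
CkPrintf("iter %d: write started at %.3f s, took %.3f s\n",
         iteration, t0, t1 - t0);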

Thank you!

Regards,
Qian

On Wed, Mar 9, 2016 at 9:42 PM, Akhil langer <akhilanger AT gmail.com> wrote:
Qian,

With 256 total cores, you can run with #processes 256, mesh size 256, block size 16, and try different depths: {5, 6, 7, 8}. Let me know if that works.

Regards,
Akhil

On Wed, Mar 9, 2016 at 8:22 PM, Qian Sun <sqianfly AT gmail.com> wrote:
Hi Akhil,

Thank you so much for your help in the code changes and the job script!!

I tried that on BG/Q and got it working! Since I wanted an imbalanced scenario, I commented out the balancer flags.

I had a question about the parameters for the imbalanced scenario: do you have any suggestions on the parameters I should use to reach an imbalanced scenario faster, such as #nodes, #processes, #iterations, #depth, etc.? For example, if I use 64 nodes and 4 cores per node, then how many iterations, what depth, or how large a data domain should I use in order to get a significantly imbalanced scenario?

Thank you!

Regards,
Qian

On Mon, Mar 7, 2016 at 10:52 PM, Akhil langer <akhilanger AT gmail.com> wrote:
Qian,

I have updated the git repo to output mesh data. Compile with the -DLOGGER flag to get the output. Output files will be dumped into a folder named out (the folder must exist before the program is run).

Below is a sample BG/Q script, but be careful not to run a large mesh (large grid_size or depth) with -DLOGGER on, as it will create a lot of files: one file per mesh block (chare) per iteration.

#!/bin/sh
PPN=64        # processes (ranks) per node
np=16384      # total number of processes
niter=100     # number of iterations
ldbfreq=6     # load-balancing frequency
gsize=256     # mesh (grid) size
bsize=16      # block size
lbqual=50
for d in 7 8  # mesh depths to try
do
  runjob -p $PPN -n $np --block $COBALT_PARTNAME --verbose=INFO --envs BG_SHAREDMEMSIZE=32MB --envs PAMID_VERBOSE=1 : ../advection $d $bsize $niter $ldbfreq $gsize +balancer DistributedLB +LBSyncResume +LBCommOff
done
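Note that the "'write_rectilinear_mesh' was not declared" error comes from compiling without that flag: most likely the output routine only exists when LOGGER is defined. Below is a minimal standalone illustration of that kind of conditional-compilation guard (the file name and function signature are made up for illustration, not taken from Advection.C):

#include <cstdio>

#ifdef LOGGER
// Present only when the translation unit is compiled with -DLOGGER;
// without the flag, any call to this function fails with
// "'write_rectilinear_mesh' was not declared in this scope".
static void write_rectilinear_mesh(const char *filename) {
    std::printf("would write one mesh block to %s\n", filename);
}
#endif

int main() {
#ifdef LOGGER
    write_rectilinear_mesh("out/block0_iter0.vtk");  // made-up file name
#endif
    return 0;
}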



On Sun, Mar 6, 2016 at 7:43 PM, Qian Sun <sqianfly AT gmail.com> wrote:
Thank you, Akhil.

I tried to uncomment the AMR output, which I found in the function Advection::compute(). However, when I recompiled it, I got the following error:

Advection.C: In member function 'void Advection::compute()':
Advection.C:933: error: 'write_rectilinear_mesh' was not declared in this scope.

I was wondering if you have any ideas or suggestions about this error?

Also, if I want to run the AMR application on a large-scale machine, such as BG/Q or Titan (uGNI interconnect), is there a template job script I can refer to?

Thanks!

Regards,
Qian

On Sat, Mar 5, 2016 at 3:45 PM, Akhil langer <akhilanger AT gmail.com> wrote:
Qian, 

You can try AMR with load balancing turned off/on to see if it has the desired imbalance in I/O. The benchmark only writes and does no reads. The code that outputs data every iteration uses regular file I/O (not CkIO) and is currently commented out in the code.


Regards,
Akhil

On Fri, Mar 4, 2016 at 4:37 PM, Qian Sun <sqianfly AT gmail.com> wrote:
Hi Phil,

Thank you for your detailed reply!

I think load balancing can be a factor that contributes a lot to the situation you described -- each task making similar progress.

I was wondering whether AMR, the Charm++ mini-app, can exhibit imbalanced computation among its tasks, and thus exhibit particular I/O patterns?

Thank you! 

Regards,
Qian

On Thu, Mar 3, 2016 at 9:25 PM, Phil Miller <unmobile AT gmail.com> wrote:

Hi Qian,

This is an interesting question you've posed. I don't think there are any current Charm++ applications that have this characteristic. The reason for this is that while the parallel decomposition of these applications involves fine-grained tasks, the overall computation they perform is very tightly coupled. Each of those tasks does its own little piece of the work, but none of them has computed anything meaningful on its own. Thus, data to write is formed as a result of all of the tasks making similar progress (e.g., finishing the same number of steps).

Similarly, for input, each task might read independently, but they're going to interact and rearrange that data with input read by other tasks before they compute anything interesting.

I expect that many sorts of data analytics applications would be more likely to have the characteristics you're looking for. I'm thinking of anything built using a framework based on MapReduce, graph processing, NLP applications with independent queries, etc. Large parallel computations, with substantial chunks that may use common data, but don't interact with each other. There isn't much practical experience with these sorts of applications on Charm++.

I hope that helps answer your question, and maybe points in a useful direction.

Phil

On Mar 3, 2016 8:11 PM, "Qian Sun" <sqianfly AT gmail.com> wrote:
Hi Ronak,

Thank you so much for your reply!

The I/O behavior I am looking for concerns the temporal characteristics of both writes and reads.

As far as I know, most current scientific simulations/applications use SPMD parallelism, so their write behavior is typically similar -- all of the processes write a portion of data concurrently at the end of each iteration. But for a Charm++-ported application, such as the EpiSimdemics you mentioned, is it possible that within a single iteration some of the tasks complete first and generate a subset of the output files earlier than others? Theoretically, I thought that is reasonable behavior, because those parallel tasks are independent of each other, as are their completion times. But I am not sure whether this happens in real Charm++-ported applications.

Similarly, for reads, a task-based application might not need to wait for all of the input files to become available. For example, if 10 tasks are scheduled to run at the beginning but only 4 of them have their input files available, then those 4 tasks can start to run regardless of the remaining 6. Is there any Charm++-ported application that behaves this way?
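To make the pattern concrete, here is a small generic sketch (plain C++, nothing Charm++-specific; the file names are made up) of what I mean by starting a task as soon as its own input is available:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// A task's input is considered available once its file can be opened.
static bool inputReady(const std::string &path) {
    std::ifstream f(path);
    return f.good();
}

int main() {
    std::vector<std::string> inputs = {"task0.in", "task1.in", "task2.in", "task3.in"};
    // Launch each task as soon as its own input file exists, without
    // waiting for the input files of the other tasks.
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        if (inputReady(inputs[i])) {
            std::cout << "task " << i << " can start now\n";
            // ... run task i here ...
        } else {
            std::cout << "task " << i << " still waiting for " << inputs[i] << "\n";
        }
    }
    return 0;
}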

Thanks a lot.

Regards,
Qian

On Thu, Mar 3, 2016 at 7:15 PM, Ronak Buch <rabuch2 AT illinois.edu> wrote:
Hi Qian,

There are many Charm++ applications that read or write "large" amounts of data, including OpenAtom, ChaNGa, NAMD, and EpiSimdemics, among others. The amounts often depend on the size of the simulation being run; depending on the size and output frequency of the simulation, they can be on the scale of tens of gigabytes.

Regarding I/O behaviors, can you be a bit more specific about what exactly you're looking for?

It is certainly possible for tasks in the same application to generate output data at different times (for example, as far as we can recall, EpiSimdemics does this). There is also a Charm++ library for parallel I/O, CkIO, that can facilitate large-scale asynchronous I/O patterns.

Additionally, a few members of the PPL group published a paper regarding asynchronous I/O a few years ago that you may find relevant: http://charm.cs.illinois.edu/newPapers/11-32/paper.pdf

Thanks,
Ronak

On Thu, Mar 3, 2016 at 11:16 AM, Qian Sun <sqianfly AT gmail.com> wrote:
Hi,

I am looking for task-based scientific applications. I noticed that some applications have been ported to Charm++, such as NAMD, as well as some mini-apps, such as AMR.

Actually, I am more interested in the I/O behavior of those Charm++-ported applications. Is there any Charm++-ported application that generates a large amount of data as final output, or takes a large amount of data as input?

If so, which applications have these characteristics, and what are their I/O behaviors? I was wondering whether it is possible that tasks in the same application generate output data at different times. (Just as in a typical DAG-based workflow, some tasks may finish and output a portion of their data first, while others may take longer.)

Any suggestions or comments would be appreciated! Thank you so much!

Regards,
Qian
