Re: [charm] Scalable creation of chare array elements, passing a different portion of a potentially large array to their constructors


  • From: Jozsef Bakosi <jbakosi AT gmail.com>
  • To: Xiang Ni <xiangni2 AT illinois.edu>
  • Cc: "kale AT illinois.edu" <kale AT illinois.edu>, "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] Scalable creation of chare array elements, passing a different portion of a potentially large array to their constructors
  • Date: Fri, 13 Nov 2015 06:19:51 -0700

Hi Xiang,

Yes. Ultimately, I wanted to do very much the same as what you describe, but I wanted to take a smaller step first and do the partitioning in serial, since that part was simpler that way. Now I realize that reading the mesh in a distributed fashion and partitioning it in parallel avoids having to deal with the element partition IDs in one large chunk, so I'm now working on doing it in parallel.

Thanks for the piece of code. It helped confirm that the path I was thinking of going down is workable.

Jozsef

On Thu, Nov 5, 2015 at 2:08 PM, Xiang Ni <xiangni2 AT illinois.edu> wrote:
Hi Jozsef,

I have encountered a problem similar to the one you describe while parallelizing a cloth simulation code. The input is a triangular mesh. Originally, in order to give the right partition to each chare array element, the main chare performed the mesh partitioning and then sent an individual message to each chare array element. We found this scheme very time consuming. So recently we changed it: each processor (via a meshReader group) reads part of the mesh and calls ParMETIS to partition the mesh in parallel, and then each meshReader chare sends the data to the destination chare array objects (a many-to-many operation). The new scheme is much faster.

The skeleton of the code works like this:

// Assumed to be set up earlier (not shown in this skeleton): f (the open
// mesh file), startLine and chunk (this reader's range of elements),
// elemdist (the ParMETIS element-distribution array), eptr/eind (CSR-style
// element connectivity), and the usual ParMETIS parameters (wgtflag,
// numflag, n_conn, n_commnodes, nparts, tpwgts, ubvec, options, edgecut,
// part).

// Start reading this reader's chunk of the file: skip the leading int
// (presumably a header) plus the elements owned by earlier readers, then
// read 3 vertex indices per triangle.
fseek(f, sizeof(int) + sizeof(int)*3*startLine, SEEK_SET);
eptr[0] = 0;
for (int i = 0; i < chunk; i++)
{
  fread(&eind[i*3], sizeof(int), 3, f);
  eptr[i+1] = (i+1)*3;
}

// Call ParMETIS to partition the distributed mesh in parallel.
MPI_Comm comm = MPI_COMM_WORLD;
ParMETIS_V3_PartMeshKway(elemdist, eptr, eind, NULL, &wgtflag, &numflag,
                         &n_conn, &n_commnodes, &nparts, tpwgts, &ubvec,
                         options, &edgecut, part, &comm);

// Group the elements by destination: part[i] is the partition (i.e. the
// chare array index) that element i was assigned to.
std::map<int, std::vector<int> > elemToSend;
for (int i = 0; i < chunk; i++)
{
  int id = part[i];
  assert(id < numChares);
  elemToSend[id].push_back(i);           // element index local to this reader
  elemToSend[id].push_back(eind[i*3]);   // the triangle's three vertices
  elemToSend[id].push_back(eind[i*3+1]);
  elemToSend[id].push_back(eind[i*3+2]);
}

// Send each destination chare its elements in a single varsize message
// (a many-to-many operation across all meshReader chares).
int numToSend = elemToSend.size();  // number of destination chares
for (auto it = elemToSend.begin(); it != elemToSend.end(); it++)
{
  std::vector<int> & info = it->second;
  int nElems = info.size()/4;
  int idx = it->first;
  RecvElemMsg * msg = new (nElems, nElems*3) RecvElemMsg(nElems);
  for (int i = 0; i < nElems; i++)
  {
    msg->elemID[i] = info[i*4];
    msg->vertices[i*3]   = info[i*4+1];
    msg->vertices[i*3+1] = info[i*4+2];
    msg->vertices[i*3+2] = info[i*4+3];
  }

  clothArray[idx].recvElemInfo(msg);
}
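
For completeness, the declarations behind this would look roughly like the following simplified sketch (the names and signatures here are inferred from the skeleton above, not copied from our actual interface file):

// Hypothetical .ci declarations inferred from the skeleton above.
// RecvElemMsg is a varsize message: its two array lengths match the
// (nElems, nElems*3) arguments of the placement new.
message RecvElemMsg {
  int elemID[];
  int vertices[];
};

group meshReader {
  entry meshReader();
};

array [1D] Cloth {
  entry Cloth();
  entry void recvElemInfo(RecvElemMsg *msg);
};

And the matching C++ message class:

class RecvElemMsg : public CMessage_RecvElemMsg {
public:
  int nElems;
  int * elemID;    // filled in by the varsize allocation
  int * vertices;  // 3 vertex indices per element
  RecvElemMsg(int n) : nElems(n) {}
};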

Hope this helps. Potentially, this read-partition-send process for mesh elements could be turned into a Charm++ library that is easy to reuse.

Thanks,
Xiang

On Mon, Nov 2, 2015 at 1:01 PM, Jozsef Bakosi <jbakosi AT gmail.com> wrote:
Hi folks,

After reading the papers at http://charm.cs.illinois.edu/research/msa and the thesis at https://charm.cs.illinois.edu/newPapers/05-05/paper.pdf, it looks like what I need could also be done using multiphase shared arrays (MSA), even though MSA may not have been designed exactly for my use case.

In particular, the thesis above says: "MSA also allows us to set a cap on the memory used per processor. Thus it can be used to read in a huge amount of data on one processor but store it simply and efficiently across other processors." This seems like exactly what I need. Even if it is not very efficient, since I only need to do this once during setup, I think it should work, and it would be better than saving many small files and re-reading them from chares. In fact, I would already be happy if creating many chares, each initialized with a chunk of the large data in alldata, survived without blowing out communication buffers on, say, 100K+ CPUs.
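
To make this concrete, here is a rough sketch of the pattern I have in mind, modeled on the 1D examples in the MSA paper and thesis; the template parameters and accessor names below are my reading of those documents, so the current API may well differ:

// Sketch only: follows the MSA paper/thesis API, which may have changed.
typedef MSA1D<double, DefaultEntry<double>,
              MSA_DEFAULT_ENTRIES_PER_PAGE> Data;

// Created once (e.g. by the main chare); maxBytes caps the memory the
// MSA page cache may use on each processor.
Data alldata(totalSize, numWorkers, maxBytes);

// Every worker chare enrolls before the first phase.
alldata.enroll(numWorkers);

// Write phase: a single reader fills the array; pages spill over to
// other processors as the per-processor cap is reached.
if (thisIndex == 0)
  for (int i = 0; i < totalSize; ++i)
    alldata.set(i) = readNextItem();   // readNextItem(): hypothetical I/O helper

alldata.sync();   // collective phase change: write -> read

// Read phase: each chare pulls only its own chunk.
for (int i = myBegin; i < myEnd; ++i)
  mychunk[i - myBegin] = alldata.get(i);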

Does anyone think MSA is NOT okay for this purpose?

Jozsef