charm - Re: [charm] MPI_Barrier(MPI_COMM_WORLD) equivalent

  • From: Evghenii Gaburov <e-gaburov AT northwestern.edu>
  • To: "Kale, Laxmikant V" <kale AT illinois.edu>
  • Cc: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] MPI_Barrier(MPI_COMM_WORLD) equivalent
  • Date: Fri, 23 Sep 2011 03:33:24 +0000
  • Accept-language: en-US
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Hi,

Thanks for the prompt reply!

> I think, in your context, quiescence detection is the most elegant
> solution. I am curious: Why do you not like it?
Mostly because I have not seen it used in the examples that ship with
Charm++, or in production codes such as NAMD or ChaNGa.
I suppose there is a reason not to use it in massively parallel programs.

> If you have multiple modules active at the time this exchange is
> happening, and messaging from the other modules should continue across
> this "import-export" activity, that would be one reason why QD is not a
> good solution. But that's not the case for you.
Indeed, I only have one active Charm++ module, which recursively calls
another function in the same class. It seems that CkWaitQD() works fine.
My main worry is that it may incur a huge latency when I go to a large
number of processors, where only a small fraction of the chares will need
to be active.

> Maybe the non-threaded version (CkStartQD with a callback) is better
> suited?
I have not tried CkStartQD with a callback yet. What is the difference
between the two, i.e. CkStartQD(callback) and CkWaitQD()?
In other words, what is the difference between the following two pieces of
code:

CkWaitQD();
myCharesArray.doStuff();

and

CkStartQD(CkCallback(CkIndex_myCharesArray::doStuff(), myCharesArray_Proxy));

If I've understood the documentation right, the former requires a threaded
entry method, while the latter does not seem to. Is there anything more to
it? I also noticed that NAMD does use CkStartQD.
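
For the record, here is how I picture the two variants side by side (just a
sketch with placeholder names; the .ci declarations and the rest of the class
are omitted):

--------
// Variant 1: blocking wait; only legal inside a [threaded] entry method.
void Main::runThreaded()               // declared [threaded] in the .ci file
{
  myCharesArray_Proxy.exchangeData();  // kick off the recursive request/reply traffic
  CkWaitQD();                          // suspend this user-level thread until quiescence
  myCharesArray_Proxy.doStuff();       // all previously sent messages have been delivered
}

// Variant 2: non-blocking; callable from an ordinary entry method.
void Main::runNonThreaded()
{
  myCharesArray_Proxy.exchangeData();
  // doStuff() is broadcast to the array only once quiescence is detected;
  // runNonThreaded() returns immediately and the scheduler keeps running.
  CkStartQD(CkCallback(CkIndex_myCharesArray::doStuff(), myCharesArray_Proxy));
}
--------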


> Incidentally, how can you use MPI_Barrier in your MPI code? You wouldn't
> know when to call it, since you don't know that another process is not
> about to send a request your way.
Good point. Now that I have looked at my code again, I see that I do not
actually issue MPI_Barrier(..); instead, I issue an MPI_Alltoall to establish
the point-to-point communication pattern between the MPI tasks before
actually sending the data. After this MPI_Alltoall (see
http://www.open-mpi.org/community/lists/users/2011/09/17322.php) every MPI
task knows how much data is going to arrive from each remote task. If it is
zero, no MPI_Send/MPI_Recv is issued, in order to reduce latency.
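
Schematically, the MPI pattern is the following (a simplified sketch; the
real code sends particle structs rather than doubles, and the names here are
just placeholders):

--------
#include <mpi.h>
#include <vector>

// Exchange variable amounts of data between all ranks: one MPI_Alltoall of
// the per-destination counts, then point-to-point transfers only for the
// pairs whose count is non-zero.
void exchange(const std::vector< std::vector<double> > &sendbuf,
              std::vector< std::vector<double> > &recvbuf,
              MPI_Comm comm)
{
  int nproc;
  MPI_Comm_size(comm, &nproc);

  std::vector<int> nsend(nproc), nrecv(nproc);
  for (int p = 0; p < nproc; p++)
    nsend[p] = (int)sendbuf[p].size();

  // after this call, nrecv[p] is the number of elements rank p will send to me
  MPI_Alltoall(&nsend[0], 1, MPI_INT, &nrecv[0], 1, MPI_INT, comm);

  recvbuf.resize(nproc);
  std::vector<MPI_Request> req;
  req.reserve(2*nproc);

  for (int p = 0; p < nproc; p++)
  {
    if (nrecv[p] > 0)
    {
      recvbuf[p].resize(nrecv[p]);
      req.push_back(MPI_REQUEST_NULL);
      MPI_Irecv(&recvbuf[p][0], nrecv[p], MPI_DOUBLE, p, 0, comm, &req.back());
    }
    if (nsend[p] > 0)
    {
      req.push_back(MPI_REQUEST_NULL);
      MPI_Isend(const_cast<double*>(&sendbuf[p][0]), nsend[p], MPI_DOUBLE, p, 0, comm, &req.back());
    }
  }

  if (!req.empty())
    MPI_Waitall((int)req.size(), &req[0], MPI_STATUSES_IGNORE);
}
--------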

Below is the complete code that describes what I am doing (if it is of any
help). From the MainChare (which is [threaded]) it is called in the
following way:

--------
systemProxy.localMesh_tesselate(-2, CkCallbackResumeThread());
CkWaitQD();
systemProxy.localMesh_build(CkCallbackResumeThread((void*&)msg));
----------
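
For completeness, the relevant .ci declarations look roughly as follows (an
abridged sketch: the module name is a guess, and the remaining entry methods
and their parameters are left out):

--------
mainmodule fvmhd3d {
  mainchare MainChare {
    entry MainChare(CkArgMsg *m);
    entry [threaded] void run();  // may block in CkWaitQD()/CkCallbackResumeThread()
  };

  array [1D] System {
    entry System();
    entry void localMesh_tesselate(int ngbPass, const CkCallback &cb);
    entry void localMesh_build(const CkCallback &cb);
    // ... the other localMesh_* entry methods
  };
};
--------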

Thanks!

Cheers,
Evghenii




#include "fvmhd3d.h"

namespace fvmhd3d
{
struct export_list_sort
{
bool operator() (const pair<int, int> &lhs, const pair<int, int> &rhs)
{
return lhs.first < rhs.first;
}
};

void System::localMesh_tesselate(const int ngbPass, CkCallback &cb)
{
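// entry point of the mesh import/export phase: store the completion callback
// and start the recursive exchange on the locally active sites; the negated
// ngbPass marks this as the first (local) call in localMesh_tesselateII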
localMesh_completeCb = cb;
assert(ngbPass > 0);
localMesh_tesselateII(active_list, std::make_pair(thisIndex, -ngbPass));
}

void System::localMesh_complete()
{
contribute(0, 0, CkReduction::concat, localMesh_completeCb);
}


void System::localMesh_tesselateII(const CkVec<int> &site_input, const pair<int,int> recvData)
{
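// serve a request from chare recvIndex: ship it the particles in site_input
// that have not been exported to it yet and, while ngbPass > 0, forward the
// request to the chares owning the neighbours of the accepted sites;
// recvData.second < 0 marks the first (local) call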
const int nrecv_inp = site_input.size();
const int recvIndex = recvData.first;
const int ngbPass = std::abs(recvData.second);
const bool firstCall = recvData.second < 0;

assert(recvIndex >= 0);
assert(recvIndex < numElements);

CkVec<Particle> return_list;
CkVec< int > site_in;
return_list.reserve(nrecv_inp);
site_in .reserve(nrecv_inp);
for (int irecv = 0; irecv < nrecv_inp; irecv++)
{
const int iId = site_input[irecv];
assert(iId >= 0);
assert(iId < local_n);
if (ngb_list_hash[iId].use(recvIndex))
{
ngb_list_hash_used.push_back(iId);
return_list.push_back(ptcl_list[iId]);
site_in .push_back( iId );
}
}

const int nrecv = site_in.size();

if (nrecv > 0)
{
if (recvIndex != thisIndex)
systemProxy[recvIndex].localMesh_insertPtcl(return_list, std::make_pair(ngbPass, thisIndex));
else if (!firstCall)
localMesh_insertPtcl(return_list, std::make_pair(ngbPass, thisIndex));
}


if (ngbPass > 0 && nrecv > 0)
{
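// one more neighbour pass: collect the neighbours of the accepted sites that
// have not yet been exported to recvIndex, group them by owning chare and
// forward the request (with ngbPass decremented) on behalf of recvIndex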
std::vector< pair<int, int> > export_list;
export_list.reserve(nrecv);

std::map< std::pair<int, int>, bool> remote_map;

for (int irecv = 0; irecv < nrecv; irecv++)
{
const int i = site_in[irecv];
assert(i >= 0);
assert(i < local_n);
const Neighbours< pair<int, int> > &ngb = ngb_list[i];
const int nj = ngb.size();
for (int j = 0; j < nj; j++)
{
const int jIndex = ngb[j].first;
const int jId = ngb[j].second;
if (jIndex == thisIndex)
{
assert(jId >= 0);
assert(jId < local_n);
if (!ngb_list_hash[jId].is_used(recvIndex))
export_list.push_back(ngb[j]);
}
else
{
const int size0 = remote_map.size();
remote_map[ngb[j].make_pair()] = true;
if ((int)remote_map.size() == size0+1)
export_list.push_back(ngb[j]);
else
assert((int)remote_map.size() == size0);
}
}
}

std::sort(export_list.begin(), export_list.end(), export_list_sort());

const int nexport = export_list.size();
CkVec<int> sites2export;
sites2export.reserve(nexport);
int iElement = nexport > 0 ? export_list[0].first : -1;  // guard against an empty export_list
for (int i = 0; i < nexport; i++)
{
if (iElement != export_list[i].first)
{
assert(iElement >= 0);
assert(iElement < numElements);
if (sites2export.size() > 0)
{
if (thisIndex != iElement)
systemProxy[iElement].localMesh_tesselateII(sites2export, std::make_pair(recvIndex, ngbPass-1));
else /* systemProxy[iElement] */
localMesh_tesselateII(sites2export, std::make_pair(recvIndex, ngbPass-1));
sites2export.clear();
}
iElement = export_list[i].first;
}
sites2export.push_back(export_list[i].second);
}
if (sites2export.size() > 0)
{
if (thisIndex != iElement)
systemProxy[iElement].localMesh_tesselateII(sites2export, std::make_pair(recvIndex, ngbPass-1));
else /* systemProxy[iElement] */
localMesh_tesselateII(sites2export, std::make_pair(recvIndex, ngbPass-1));
}
}

if (firstCall)
contribute(0, 0, CkReduction::concat,
CkCallback(CkIndex_System::localMesh_complete(), systemProxy));
} /* end System::localMesh_tesselateII(..) */

void System::localMesh_insertPtcl(const CkVec<Particle> &ptcl_in, const pair<int,int> recvData)
{
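// receive particles exported by chare recvData.second: remote particles are
// periodically wrapped towards this chare's domain box where needed and
// appended to ptcl_list, while locally returned particles go into ptcl_active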
const int recvIndex = recvData.second;

if (recvIndex != thisIndex)
{
const vec3 box_centre = proc_domains[thisIndex].centre();
const vec3 box_hsize = proc_domains[thisIndex].hsize();

const int nrecv = ptcl_in.size();
for (int i = 0; i < nrecv; i++)
{
Particle p = ptcl_in[i];
vec3 dr = p.pos - box_centre;
vec3 dr_min = dr.abseach() - box_hsize;
dr_min += dr_min.abseach();
dr_min *= 0.5;

if (dr_min.x > box_hsize.x)
{
if (dr.x > 0) p.pos.x -= global_domain_size.x;
else p.pos.x += global_domain_size.x;
}
if (dr_min.y > box_hsize.y)
{
if (dr.y > 0) p.pos.y -= global_domain_size.y;
else p.pos.y += global_domain_size.y;
}
if (dr_min.z > box_hsize.z)
{
if (dr.z > 0) p.pos.z -= global_domain_size.z;
else p.pos.z += global_domain_size.z;
}
ptcl_list.push_back(p);
chare_import_list.push_back(recvIndex);
}
}
else
{
const int nrecv = ptcl_in.size();
for (int i = 0; i < nrecv; i++)
ptcl_active.push_back(ptcl_in[i]);
}
} /* end System::localMesh_insertPtcl(..) */

void System::localMesh_build(CkCallback &cb)
{
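// build the Delaunay triangulation of the active + imported particles (CGAL,
// with a Hilbert spatial sort for fast hinted insertion), extract the dual
// Voronoi cells and faces, and reduce the total local cell volume into cb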
assert(cell_list.size() == active_list.size());
if (active_list.size() == 0)
assert((int)ptcl_list.size() == local_n);

for (int i = local_n; i < (const int)ptcl_list.size(); i++)
ptcl_active.push_back(ptcl_list[i]);

ptcl_list.resize(local_n);


std::vector<TPoint> Tpoints;
std::vector< int > Tpoints_idx;
Tpoints .resize(ptcl_active.size());
Tpoints_idx.resize(ptcl_active.size());

n_in_DT = 0;
T.clear();
face_list.clear();

for (int i = 0; i < (const int)ptcl_active.size(); i++)
{
const vec3 &pos = ptcl_active[i].pos;
Tpoints [i] = TPoint(pos.x, pos.y, pos.z);
Tpoints_idx[i] = i;
if (i < (int)active_list.size())
cell_list[i].ngb.clear();
}


std::vector<std::ptrdiff_t> indices;
indices.reserve(Tpoints.size());
std::copy(
boost::counting_iterator<std::ptrdiff_t>(0),
boost::counting_iterator<std::ptrdiff_t>(Tpoints.size()),
std::back_inserter(indices));
CGAL::spatial_sort(indices.begin(), indices.end(),
Search_traits_3(&(Tpoints[0])), CGAL::Hilbert_sort_median_policy());

Tvtx_list.resize(ptcl_active.size());

DT::Vertex_handle hint1;
for (std::vector<std::ptrdiff_t>::iterator it = indices.begin(); it != indices.end(); it++)
{
hint1 = T.insert(Tpoints[*it], hint1);
assert(hint1 != TVertex_handle());
hint1->info() = std::make_pair(n_in_DT, Tpoints_idx[*it]);
assert(hint1->info().second >= 0);
assert(hint1->info().second < (int)ptcl_active.size());
Tvtx_list[hint1->info().second] = hint1;
n_in_DT++;
}



std::vector<TEdge> Tedges;
std::vector<TCell_handle> Tcells;
std::vector<bool> sites_ngb_used(n_in_DT, false);
std::vector< std::pair<int, int> > site_ngb_list;

const int nactive = active_list.size();
for (int i = 0; i < nactive; i++)
{
const TVertex_handle &vi = Tvtx_list[i];
assert(vi != TVertex_handle());
assert(vi->info().second == i);

const vec3 &ipos = ptcl_active[i].pos;

Tedges.clear();
Tcells.clear();
site_ngb_list.clear();
T.incident_cells(vi, std::back_inserter(Tcells));

const int ncells = Tcells.size();
for (int icell = 0; icell < ncells; icell++)
{
const TCell_handle &ci = Tcells[icell];

int idx = -1;
for (int iv = 0; iv < 4; iv++)
if (ci->vertex(iv) == vi)
idx = iv;

int iadd = 0;
for (int iv = 0; iv < 4; iv++)
{
if (iv == idx) continue;

const TVertex_handle &v = ci->vertex(iv);
assert(!T.is_infinite(v));
assert (v != TVertex_handle());

const int id = v->info().first;
assert(id >= 0);
assert(id < n_in_DT);
if (sites_ngb_used[id]) continue;

iadd++;
sites_ngb_used[id] = true;
site_ngb_list.push_back(v->info());
Tedges.push_back(TEdge(ci, idx, iv));
}
assert(iadd < 4);
}

const int nngb = site_ngb_list.size();
for (int j = 0; j < nngb; j++)
sites_ngb_used[site_ngb_list[j].first] = false;

assert(!Tedges.empty());
real r2max = 0.0;

const int NMAXEDGE = 1024;
static std::vector<vec3> vertex_list[NMAXEDGE];

int nedge = 0;
// for each Delaunay edge, trace its dual Voronoi face
for (std::vector<TEdge>::iterator edge_it = Tedges.begin(); edge_it != Tedges.end(); edge_it++)
{
const TCell_circulator cc_end = T.incident_cells(*edge_it);
TCell_circulator cc(cc_end);
vertex_list[nedge].clear();

// compute face vertices
do
{
if (T.is_infinite(cc))
{
assert(false);
}
else
{
const TPoint c = T.dual(cc);
const vec3 centre(c.x(), c.y(), c.z());
r2max = std::max(r2max, (centre - ipos).norm2());

vertex_list[nedge].push_back(centre);
}

cc++;
} while (cc != cc_end);
nedge++;
assert(nedge < NMAXEDGE);
} // for edge < nedge

real rmax = 2.01*std::sqrt(r2max);

ptcl_active[i].rmax = rmax;

real cell_volume = 0.0;
vec3 cell_centroid = 0.0;
int edge = 0;
for (std::vector<TEdge>::iterator edge_it = Tedges.begin(); edge_it != Tedges.end(); edge_it++, edge++)
{
assert(edge < nedge);
const int nvtx = vertex_list[edge].size();

const int i1 = edge_it->get<0>()->vertex(edge_it->get<1>())->info().second;
const int i2 = edge_it->get<0>()->vertex(edge_it->get<2>())->info().second;

Face face;
face.s1 = (i1 == i) ? i1 : i2;
face.s2 = (i1 == i) ? i2 : i1;
assert(face.s1 == i);
assert(face.s2 != i);

int face_id = -1;
const int nj = cell_list[i].ngb.size();
for (int j = 0; j < nj; j++)
if (face_list[cell_list[i].ngb[j]].ngb<false>(i) == face.s2)
{
face_id = cell_list[i].ngb[j];
break;
}

if (face_id != -1)
{
assert(face_id >= 0);
assert(face_id < (const int)face_list.size());
face = face_list[face_id];
}
else
{
vec3 c = 0.0;
for (int j = 0; j < nvtx; j++)
c += vertex_list[edge][j];
c *= 1.0/(real)nvtx;

face.n = 0.0;
face.centroid = 0.0;
real area1 = 0.0;
const real third = 1.0/3.0;
vec3 v1 = vertex_list[edge].back() - c;
for (int j = 0; j < nvtx; j++)
{
const vec3 v2 = vertex_list[edge][j] - c;

const vec3 norm3 = v1.cross(v2);
const real area3 = norm3.abs();
const vec3 c3 = c + (v1 + v2) * third;

face.n += norm3;
face.centroid += c3 * area3;
area1 += area3;

v1 = v2;
}

const real SMALLDIFF1 = 1.0e-10;
const real area0 = area1;
const real L1 = std::sqrt(area0);
const real L2 = (face.centroid - ipos).abs();
const real area = (L1 < SMALLDIFF1*L2) ? 0.0 : area0;

const real nabs = face.n.abs();
if (area > 0.0 && nabs > 0.0)
{
const real iarea = 1.0/area;
face.centroid *= iarea;
face.n *= 0.5;

if ((face.centroid - ipos)*face.n < 0.0)
face.n *= -1.0;

const int jid = face.s2;
assert(jid >= 0);
assert(jid < (const int)ptcl_active.size());
}
else
{
face.n = 0.0;
}
}

if (face.area() == 0.0) continue;

if (face_id == -1)
{
face_list.push_back(face);
cell_list[face.s1].ngb.push_back(face_list.size() - 1);
if (face.s2 < local_n)
cell_list[face.s2].ngb.push_back(face_list.size() - 1);
}
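// accumulate the cell volume and centroid from the tetrahedra spanned by the
// face centroid, two consecutive face vertices and the cell generator ipos
// (the overall factor 1/6 is applied after the loop over faces)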

const vec3 cv = ipos - face.centroid;
vec3 v1 = vertex_list[edge].back() - face.centroid;
const real fourth = 1.0/4.0;
for (int j = 0; j < nvtx; j++)
{
const vec3 v2 = vertex_list[edge][j] - face.centroid;
const vec3 c4 = face.centroid + (v1 + v2 + cv) * fourth;
const real vol4 = std::abs(v1.cross(v2) * cv);
cell_volume += vol4;
cell_centroid += c4 * vol4;
v1 = v2;
}

} // for edge < nedge
cell_centroid *= 1.0/cell_volume;
cell_volume *= 1.0/6.0;

cell_list[i].centroid = cell_centroid;
cell_list[i].Volume = cell_volume;

}

double volume_loc = 0;
for (int i = 0; i < (const int)active_list.size(); i++)
volume_loc += cell_list[i].Volume;

contribute(sizeof(double), &volume_loc, CkReduction::sum_double, cb);
} /* end System::localMesh_build(..) */
}





>
> --
> Laxmikant (Sanjay) Kale                  http://charm.cs.uiuc.edu
> Professor, Computer Science              kale AT illinois.edu
> 201 N. Goodwin Avenue                    Ph: (217) 244-0094
> Urbana, IL 61801-2302                    FAX: (217) 265-6582
>
> On 9/22/11 9:15 PM, "Evghenii Gaburov" <e-gaburov AT northwestern.edu> wrote:
>
>> Dear All,
>>
>> As a new user, who is porting his MPI code to Charm++, I have the
>> following question:
>>
>> I have a snippet of the code that requests data from remote chares, and
>> these chares need to send data to the requesting chare. There is no way
>> to know how many messages a given chare receives from remote chares with a
>> request to export data. In other words, a given chare may need to export
>> (different) data to many remote chares that request this, and this chare
>> does not know how many remote chares request the data.
>>
>> For logical consistency it is not possible to proceed with further
>> computations unless all the requested data has been imported/exported.
>> This leads me to the issue of a global barrier, the equivalent of which,
>> MPI_Barrier(MPI_COMM_WORLD), I use in my MPI code (there does not seem to
>> be a way around such a global barrier, since this step establishes the
>> communication graph between MPI tasks, or, for Charm++, between chares,
>> which later use point-to-point communication).
>>
>> Regretfully, I have failed to find an optimal way to issue such a barrier.
>> Using CkCallbackResumeThread() won't work, because the calling code (from
>> the MainChare, which is a [threaded] entry) also sends messages to other
>> remote chares with a request to import/export data, and those themselves
>> recursively send messages to further remote chares to export data until a
>> closure condition is satisfied (the depth of the recursion is 2 or 3 calls
>> to the same function).
>>
>> Now I use CkWaitQD() in the MainChare as a global synchronization point.
>> I was wondering if there is a more elegant way to issue a barrier, so that
>> all previously issued messages complete before proceeding further.
>>
>> Thanks!
>>
>> Cheers,
>> Evghenii
>>
>>
>>





