
charm - Re: [charm] messages not being received

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system


Re: [charm] messages not being received


  • From: Robert Steinke <rsteinke AT uwyo.edu>
  • To: Lukasz Wesolowski <wesolwsk AT illinois.edu>
  • Cc: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] messages not being received
  • Date: Tue, 7 Oct 2014 11:42:41 -0600
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

I figured out the problem. There were duplicates in the neighbor lists, so when I received an initialization message from a duplicated neighbor I marked only the first matching entry as initialized. The chare was therefore never getting past the initialize-neighbor SDAG code. Thanks for the suggestions. They helped me think about where it could be getting hung up.

Bob
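
A minimal plain C++ sketch of the failure mode described above. It is not the original code: the NeighborTable type and its member names are hypothetical, and it assumes each chare keeps one "initialized" flag per neighbor-list entry. With a duplicated neighbor, marking only the first match leaves a flag unset forever, while marking every match lets initialization complete.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical per-chare bookkeeping (names are illustrative only).
struct NeighborTable {
    std::vector<int>  neighbor;     // neighbor element indices; may contain duplicates
    std::vector<bool> initialized;  // one flag per entry in `neighbor`

    explicit NeighborTable(std::vector<int> n)
        : neighbor(std::move(n)), initialized(neighbor.size(), false) {}

    // Buggy variant: stops at the first matching entry, so a duplicate
    // entry for the same sender is never marked.
    void markFirst(int sender) {
        for (std::size_t i = 0; i < neighbor.size(); ++i) {
            if (neighbor[i] == sender) { initialized[i] = true; return; }
        }
    }

    // Fixed variant: marks every entry that refers to the sender.
    void markAll(int sender) {
        for (std::size_t i = 0; i < neighbor.size(); ++i) {
            if (neighbor[i] == sender) initialized[i] = true;
        }
    }

    // True once every neighbor-list entry has been marked.
    bool allInitialized() const {
        for (bool flag : initialized) {
            if (!flag) return false;
        }
        return true;
    }
};
```

For a neighbor list {3, 7, 3}, calling markFirst(3) and markFirst(7) leaves allInitialized() false, whereas markAll(3) and markAll(7) make it true. Removing duplicates when the neighbor list is read would be another way to avoid the hang.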

On 10/06/2014 06:11 PM, Lukasz Wesolowski wrote:
Hi Bob,

Please tell us for which entry method in your code the message is not
being received.

Also, if you could answer the following questions, I think they would
help in tracking down the problem:

1. Have you verified that the missing messages are actually getting
sent (i.e. that all the elements that should receive a message are
actually being sent one)?
2. If the target entry method corresponds to an SDAG when clause, is
it possible that the message has been received at the level of the
runtime system, but the corresponding when statement is not being
reached (e.g. due to unsatisfied when clauses earlier in the code)?

While I would not rule out the possibility of messages getting lost, I
do not think it is very likely.

Lukasz
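
One way to answer question 1 above is to tally sends and receives per element and compare them. The sketch below is plain C++ under stated assumptions: DeliveryAudit and its methods are hypothetical names, not part of Charm++, and the counters would have to be updated at the send loop and inside the target entry method, which is simplest on a single processing element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tally of messages per array element (not a Charm++ API).
struct DeliveryAudit {
    std::vector<int> sent;      // messages sent to each element
    std::vector<int> received;  // messages whose entry method actually ran

    explicit DeliveryAudit(std::size_t n) : sent(n, 0), received(n, 0) {}

    void recordSend(std::size_t i)    { assert(i < sent.size());     ++sent[i]; }
    void recordReceive(std::size_t i) { assert(i < received.size()); ++received[i]; }

    // Elements that were never sent a message (a bug on the sending side).
    std::vector<std::size_t> neverSent() const {
        std::vector<std::size_t> out;
        for (std::size_t i = 0; i < sent.size(); ++i) {
            if (sent[i] == 0) out.push_back(i);
        }
        return out;
    }

    // Elements sent more messages than their entry method processed.
    std::vector<std::size_t> sentButNotReceived() const {
        std::vector<std::size_t> out;
        for (std::size_t i = 0; i < sent.size(); ++i) {
            if (received[i] < sent[i]) out.push_back(i);
        }
        return out;
    }
};
```

If neverSent() is empty but sentButNotReceived() is not, the messages were sent and the receiving side did not process them, which points toward the situation in question 2.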

On Mon, Oct 6, 2014 at 6:08 PM, Robert Steinke <rsteinke AT uwyo.edu> wrote:
I've been working on my problem where I send messages to an entire chare
array and some messages don't arrive.

I've been trying to create a minimal example that exhibits the problem.
I've gotten down to about 2000 lines of code. I can't see any bugs in my
code. Would anyone be willing to take a look at it or try to debug it on
your system?

I am running on CentOS 6 with the newly released Charm++ 6.6.0. The build is
mpi-linux-x86_64, and the MPI is mpich-3.0.1. The problem shows up when I
run on only one processing element. I haven't tried it on more.

The problem depends on a neighbor graph that is read in from a file. At the
start, each chare initializes itself and then sends an initialization
message to its neighbors. These messages all arrive, but when I try to send
a subsequent message to all elements of the chare array some elements don't
receive it. If I use hardcoded neighbor relationships, such as each element
being connected to the ones numerically before and after it, the problem
doesn't occur. But when I use the neighbor graph that I want to use from the
file, the problem occurs. The problem is not caused by reading from the file:
I can read the file and then overwrite the neighbor values with hardcoded
ones, and the problem doesn't occur.

I've attached the code, but the file with the neighbor relationships is a
6GB netCDF file. I can send it to whoever is willing to work on the
problem. You will need to have the NetCDF library to link against my code.

Thanks,
Bob Steinke




On 10/03/2014 03:46 PM, Robert Steinke wrote:
I'm having a problem with my charm application.

Before I get into the problem, I tried to use the ccs_tools charm
debugger, but haven't been able to yet. I read in the manual that it only
works for net-* versions of charm, and I am running on an mpi-* version.
The process of getting my code to run on a net-* version started to turn
into a real mess. For example I'm using the parallel version of the NetCDF
library that requires MPI. I could probably get it running on a net-*
version, but my first question is whether that's the right road to be going
down. Is it likely the ccs_tools debugger will be useful for solving this
problem, or is there something else I can do?

Here's the problem:

In an entry method of one object I have a loop that sends out messages to
every element of a chare array. I'm sending an individual message to each
object in a loop, not a broadcast through the array proxy, because I need to
send different parameters to each object. Like this:

for (ii = 0; ii < proxySize; ii++)
{
    proxy[ii].message(parameters[ii]);
}

When proxySize is large and I send a lot of messages (about 37,000) a
couple percent of them never arrive. The missing messages are scattered
around the array. When I send a small number of messages they all arrive.

Has anyone experienced something like this before?

I was hoping that the ccs_tools debugger would be able to show me the queued
messages, so that I could see messages being sent and received and tell
whether this is really a problem with charm not delivering messages or
whether I'm doing something wrong. Is this something that ccs_tools could
show me?

Thanks for the help,

Bob Steinke

_______________________________________________
charm mailing list
charm AT cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/charm

