
charm - Re: [charm] Debugging Race Conditions

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Robert Bird <r.bird AT warwick.ac.uk>
  • To: Eric Bohm <ebohm AT illinois.edu>
  • Cc: charm AT cs.uiuc.edu
  • Subject: Re: [charm] Debugging Race Conditions
  • Date: Tue, 19 Aug 2014 14:47:47 -0600
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Further to this, if I run with +record (with the aim of recompiling my code with debug prints and then running with +replay), it seems to create twice the expected number of trace files.

Upon termination, it only populates the first half (the ones with the expected PE numbers). The program then seems to deadlock, perhaps waiting for the remaining files to be written and closed? If I wait and then kill the program, the replay is unable to work, as it notices that the traces are damaged.

Is there a known explanation for this?


On Tue, Aug 19, 2014 at 9:53 AM, Robert Bird <r.bird AT warwick.ac.uk> wrote:
Hey

I'm starting to wonder if my issue is Mac/network related, as LiveViz does not work either. When trying to use the Charm debugger itself I get messages such as:

ccs: Server IP = 130.55.11.166, Server port = 62496 $
CcsServer FATAL ERROR: Error connecting to host:java.net.SocketTimeoutException

And then it dies without too much grace. 

If I try to use 127.0.0.1 instead of localhost I get:

+DebugDisplay 130.55.11.166:0.0 ++server +DebugSuspend
ServThread started
ServThread terminated
Finished running parallel program

If I run it from our Linux cluster, the program starts to run (giving application prints), but dies with:

CcsServer FATAL ERROR: Error connecting to host:java.io.EOFException

or

java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:225)
    at sun.nio.ch.IOUtil.read(IOUtil.java:198)

Hopefully this is something obvious that I'm not spotting, but I've tried everything I can think of. 

Thanks


On Tue, Aug 19, 2014 at 8:46 AM, Eric Bohm <ebohm AT illinois.edu> wrote:
Race conditions are one of the most difficult kinds of bug to unravel.

The record/replay feature is designed to help under these conditions. A run with +record will create files which record the exact order of events. A run with +replay will replay execution from the event log created by the +record run. That way one can issue multiple runs with +record until the sought-after condition occurs, and then +replay that run within a debugger.
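
As a toy illustration of the record/replay idea described above, here is a minimal Python sketch. It is not Charm++ code and makes no claim about how +record/+replay are implemented; every name in it (`apply_in_order`, `deliver`, `record_until_failure`, the message tuples) is hypothetical. It delivers three messages in a random order, records the delivery order of the first run that triggers a double insertion, and then replays exactly that order.

```python
import random


def apply_in_order(messages, order):
    """Deliver messages in the given order and report double insertions."""
    inserted = set()
    errors = []
    for idx in order:
        kind, key = messages[idx]
        if kind == "insert":
            if key in inserted:
                errors.append(f"element {key} inserted twice")
            inserted.add(key)
        elif kind == "remove":
            inserted.discard(key)
    return errors


def deliver(messages, rng):
    """One 'record' run: shuffle the delivery order and log it."""
    order = list(range(len(messages)))
    rng.shuffle(order)
    return order, apply_in_order(messages, order)


def record_until_failure(messages, rng, max_runs=1000):
    """Repeat recorded runs until the bug shows up; return its event log."""
    for _ in range(max_runs):
        order, errors = deliver(messages, rng)
        if errors:
            return order, errors
    return None, []


# Hypothetical workload: the bug only appears for some delivery orders.
messages = [("insert", 7), ("remove", 7), ("insert", 7)]

recorded_order, errors = record_until_failure(messages, random.Random(0))

# 'Replay' run: the recorded order reproduces the failure deterministically,
# so it can be stepped through in a debugger as often as needed.
replayed_errors = apply_in_order(messages, recorded_order)
print(f"recorded order: {recorded_order}; replayed errors: {replayed_errors}")
```

The point of the sketch is only that once the failing order has been logged, the failure no longer depends on timing or on the random queue.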

Can you get ++debug to work for any Charm++ program? Where applicable, it relies on ssh and X forwarding working correctly, so this issue can sometimes be resolved by adding this line to your .ssh/config:
ForwardX11 yes

or by adding -X to the ssh command line in the nodelist file.


On 08/18/2014 02:51 PM, Robert Bird wrote:
Hey all

I've got a (rare) race condition, whereby a Charm element is inserted twice (according to the error in the stack trace when Charm aborts).

I can only get this to happen in parallel, with random message queues, so I'm having a hard time tracking it down.

Is there an obvious way to debug race conditions such as this? 

I've tried to use ++debug in order to get reliable access to the trace in gdb, but it doesn't seem to launch quite as expected. I get debug prints about the threads at the start and the program runs, but no xterm window appears, nor does it wait for my input to start. As far as I can tell I meet all the requirements: I can spawn X windows, $DISPLAY is set, and xterm is in my path.

Any obvious pointers/hints? Especially about a general method for tracking down race conditions?

Thanks
Bob
NB: During a quick chat with Phil Miller he mentioned +record. Does this allow me to record a run in parallel, then replay it in a serial gdb?

--
Robert Bird
http://go.warwick.ac.uk/robertbird

+44 (0)24 7652 2863
CS202, High Performance Lab
Department of Computer Science
University of Warwick


_______________________________________________
charm mailing list
charm AT cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/charm

