charm - Re: [charm] Process not consuming messages

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Phil Miller <mille121 AT illinois.edu>
  • To: Robert Bird <r.bird AT warwick.ac.uk>
  • Cc: Charm Mailing List <charm AT cs.illinois.edu>
  • Subject: Re: [charm] Process not consuming messages
  • Date: Thu, 21 Aug 2014 15:27:30 -0500
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Another technique you could try, to see whether you have a PE spinning or whether the system has gone dead, is to set an extra QD (quiescence detection) handler that prints a message when it is called and then re-registers itself. If a PE is spinning, you'll see these prints only while the program makes progress, and then nothing once it hangs. If the system has gone dead for some reason, you'll start seeing that message many times in quick succession.


On Wed, Aug 20, 2014 at 7:02 PM, Robert Bird <r.bird AT warwick.ac.uk> wrote:
Hi all,

I have an iterative code that can deadlock during parallel operation.

It seems that all chares associated with a "node" (CkMyNode) stop getting scheduled.

In the log below, @<number> denotes the chare array ID and (<number>) denotes the time-step. We can see that 440 is not consuming the messages it is sent.

CharmPatch.cpp:1473 @440     (21)    distributeGhostCells    >> done loop ghost send
CharmPatch.def.h:3295 @440   (21)    _atomic_7   >> finished sending ghosts, waiting for 128 ghosts with tag (rg = 43)

CharmPatch.cpp:1255 @441     (21)    distributeGhostCells    >> Sending to 440 in direction 2 (d=43)
CharmPatch.cpp:1457 @107     (21)    distributeGhostCells    >> Sending to 440 in direction 3 (d=43)
CharmPatch.cpp:1255 @442     (21)    distributeGhostCells    >> Sending to 440 in direction 8 (d=43)
CharmPatch.cpp:1255 @434     (21)    distributeGhostCells    >> Sending to 440 in direction 12 (d=43)

I have grepped out all the chares that exhibit this behaviour, and on a per-run basis they all map to the same CkMyNode().

The code that is waiting to consume those messages is the following SDAG:

while (receivedGhostsCount < totalGhosts)
{
    when SDAGreceiveGhostCells[(step*2)+1](int direction, CharmGhostBuffer ghost, int sender_id)
    {
        // consume
    }
}

Does anyone have any ideas what may cause this?

The only thing I can think of is that another scheduled "node group" that shares the same physical mapping has stalled (perhaps in an infinite loop), stopping this one from getting scheduled. That said, the above is the only odd behaviour I have been able to find so far.

Best Regards,
Bob

-- 
Robert Bird
http://go.warwick.ac.uk/robertbird

+44 (0)24 7652 2863
CS202, High Performance Lab
Department of Computer Science
University of Warwick

_______________________________________________
charm mailing list
charm AT cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/charm
