charm - Re: [charm] Program hang when using load balancing and lots of PEs

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Phil Miller <mille121 AT illinois.edu>
  • To: Robert Steinke <rsteinke AT uwyo.edu>, Charm Mailing List <charm AT cs.illinois.edu>
  • Subject: Re: [charm] Program hang when using load balancing and lots of PEs
  • Date: Tue, 27 Jan 2015 17:35:33 -0600
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Thanks for the output.

How often are your objects calling AtSync()?

Could you double-check the correctness and completeness of your objects' pup() routines? In particular, check that each one handles the object's own members, calls CBase_foo::pup(p), and (if applicable) calls __sdag_pup(p). If any of these is missing, that can easily cause a hang.

Incidentally, we have some patches under review now that eliminate the need to call the boilerplate methods in PUP routines. Users will still have to get all of their objects' own members in there, though.
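
The pup() completeness check described above can be illustrated with a small stand-alone sketch. It does not use Charm++ itself: ToyPupper, MeshElement, and migrate are hypothetical names invented for this illustration, and ToyPupper is only a toy stand-in for the real PUP::er that a Charm++ pup(PUP::er &p) routine receives. The member names are borrowed from the log line quoted further down the thread. The idea is that one pup() routine drives both packing and unpacking, so any member left out of it quietly reverts to its default value once the object arrives on another PE.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical toy serializer. This is NOT Charm++'s PUP::er; it only mimics
// the idea that the same pup() routine is used to pack and to unpack.
class ToyPupper {
public:
    enum class Mode { Pack, Unpack };

    explicit ToyPupper(Mode mode) : mode_(mode) {}
    ToyPupper(Mode mode, std::vector<unsigned char> bytes)
        : mode_(mode), buffer_(std::move(bytes)) {}

    bool unpacking() const { return mode_ == Mode::Unpack; }
    const std::vector<unsigned char>& bytes() const { return buffer_; }

    // Copy a trivially copyable value into the buffer (Pack) or out of it (Unpack).
    template <typename T>
    void operator|(T& value) {
        static_assert(std::is_trivially_copyable<T>::value,
                      "this toy only handles trivially copyable types");
        if (unpacking()) {
            assert(cursor_ + sizeof(T) <= buffer_.size());
            std::memcpy(&value, buffer_.data() + cursor_, sizeof(T));
            cursor_ += sizeof(T);
        } else {
            const unsigned char* raw =
                reinterpret_cast<const unsigned char*>(&value);
            buffer_.insert(buffer_.end(), raw, raw + sizeof(T));
        }
    }

private:
    Mode mode_;
    std::vector<unsigned char> buffer_;
    std::size_t cursor_ = 0;
};

// Hypothetical migratable object; the member names are illustrative only.
struct MeshElement {
    int iteration = 0;
    double currentTime = 0.0;
    double dt = 0.0;

    // Every member that must survive migration has to appear here.
    void pup(ToyPupper& p) {
        p | iteration;
        p | currentTime;
        p | dt;  // if this line were missing, dt would arrive as 0.0
    }
};

// Simulate a migration: pack on the "source PE", unpack into a fresh object.
MeshElement migrate(MeshElement& source) {
    ToyPupper packer(ToyPupper::Mode::Pack);
    source.pup(packer);

    ToyPupper unpacker(ToyPupper::Mode::Unpack, packer.bytes());
    MeshElement arrived;
    arrived.pup(unpacker);
    return arrived;
}
```

Calling migrate() on an element and comparing every member of the result against the original is a cheap way to spot an omission. In real Charm++ code the same pup() routine would also call CBase_foo::pup(p) and, if applicable, __sdag_pup(p), as noted above; this toy leaves those out because it has no base class or SDAG state.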

On Tue, Jan 27, 2015 at 5:29 PM, Robert Steinke <rsteinke AT uwyo.edu> wrote:
I'm attaching a file from a run with +LBDebug 3.  Within a few minutes it had gotten to the line:

currentTime = 52.970963, dt = 18.112661, iteration = 7.

Then it didn't do anything more.  30 minutes later I killed the job, and all of the output after that came after I killed the job.

There are two chare arrays.  One has 36452 elements, and the other has 11832 elements.


On 01/27/2015 03:14 PM, Phil Miller wrote:
The first thing to try would be running with the option "+LBDebug 3" to get some visibility into what's happening in the LB infrastructure. Could you send us output from such a run?

Also, how many objects are you running with across the whole job?

On Tue, Jan 27, 2015 at 3:51 PM, Robert Steinke <rsteinke AT uwyo.edu> wrote:
I have a program that hangs when I run on lots of PEs and use the load balancer (I'm using MetisLB).  If I run on 512 or fewer processors, it is fine.  If I try to run on 1024 processors, it hangs shortly after I call CkStartLB() (I'm using TurnManualLBOn).  Also, if I don't call CkStartLB(), it runs fine on 1024 processors.

Is this a problem that someone else has encountered before?

Is this something that I should try to dig into myself? Or is there someone more familiar with the load balancer than I am who is willing to look into it, in which case I would put my effort into creating a minimal test case that reproduces the problem?

Thanks
Bob Steinke
