
charm - Re: [charm] megatest hangs on Cray XC

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: "Buch, Ronak Akshay" <rabuch2 AT illinois.edu>
  • To: Ted Packwood <malice AT cray.com>
  • Cc: Nitin Bhat <nitin.bhat.k AT gmail.com>, "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] megatest hangs on Cray XC
  • Date: Wed, 9 Oct 2019 19:42:32 +0000

Hi Ted,

Have you gotten a chance to try running NAMD with the fix that we merged a few days ago?

Thanks,
Ronak

On Fri, Oct 4, 2019 at 2:01 PM Ted Packwood <malice AT cray.com> wrote:
Hello,
This is great news!  I did send a traceback for the NAMD hang to Ronak already,
but from the description of what the issue was, it seems very likely this will
fix the NAMD hang as well.

It looks like the fix was already merged into master, so I'll update my local git
branch and try again with NAMD.

Thanks!
Ted


On 10/4/2019 1:15 PM, Nitin Bhat wrote:
Hi Ted, 

I opened this issue related to the hangs that you mentioned: https://github.com/UIUC-PPL/charm/issues/2508

After a bit of debugging, folks from our dev team were able to determine the root cause of the bug and create a fix for it. It’s in this PR: https://github.com/UIUC-PPL/charm/pull/2551

The bug was caused by an incorrectly handled condition that led the migration message to be executed repeatedly in an infinite loop. With the fix, we buffer the migration message when that unexpected condition occurs and execute its message handler once it is safe to do so. The linked PR should fix both hangs that we encountered: the multi-migration test hang and the completion detection hang (which was internally caused by the migration test itself). I tested it on Cori with 500 runs and didn’t see the bug. It would be great if you could test it on your machine too and let us know whether the issue is fixed. We will be adding this bug fix to the 6.10 release.
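
The buffering idea described above can be illustrated with a minimal sketch. This is not the Charm++ code from the PR: the class, the method names, and the `ready` flag (standing in for the unspecified "safe to do so" condition) are all hypothetical and only model the pattern of deferring a message instead of re-executing it.

```python
from collections import deque


class MigrationReceiver:
    """Toy model of a receiver that defers messages it cannot handle yet."""

    def __init__(self):
        self.ready = False      # hypothetical stand-in for "safe to handle"
        self.pending = deque()  # messages buffered while not ready
        self.handled = []       # messages whose handler has run, in order

    def on_migration_message(self, msg):
        if not self.ready:
            # Buffer the message rather than re-enqueueing it for immediate
            # re-execution, which is what would spin forever.
            self.pending.append(msg)
            return
        self._handle(msg)

    def mark_ready(self):
        # Once the blocking condition clears, drain the buffer in arrival order.
        self.ready = True
        while self.pending:
            self._handle(self.pending.popleft())

    def _handle(self, msg):
        self.handled.append(msg)
```

Messages that arrive early sit in `pending` and are handled exactly once, in arrival order, when `mark_ready()` is called; later messages are handled immediately.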

Additionally, Ronak Buch (cc’ed) from our team mentioned that you encountered a bug while running NAMD on Cray machines. Could you let us know the steps for reproducing it? If that bug was a hang, this fix might solve it as well, but it could also be a separate bug. If so, it would be good to start investigating it.

Thanks,
Nitin

On Sep 4, 2019, at 4:05 PM, Ted Packwood <malice AT cray.com> wrote:

Hi Nitin,

That's good news!  I had great trouble isolating where in the charm++
code the test was failing.  And I also hit the second hang on occasion,
late in the run as you say.

Thanks a bunch!
Ted

On 9/3/2019 1:50 PM, Nitin Bhat wrote:
Hi Ted, 

Thanks for getting back with the details. 

I was able to reproduce the bug on 6.9. 

However, I am not seeing the bug in the current master, so it’s likely that it was fixed since the release. I did run into another hang while running megatest, this time during completion detection (well after the multi-migration test).
I saw this on both the master branch and the 6.9 version. I’ve created this GitHub issue for it and will try to debug it. 

Additionally, I’ll try to find the cause for the multi-migration bug that you’re facing on 6.9 and see if there was a fix merged for it. 

Thanks,
Nitin

On Aug 28, 2019, at 11:48 AM, Ted Packwood <malice AT cray.com> wrote:

Hello Nitin,

Yes, here is my run command.  I have tried ppn=2 a couple of times with
no hangs, but not extensively.  The hang occurs with 6.8.2 and 6.9.0.

/usr/bin/time -p aprun -cc none -n4 -N2 -d16 -j1 -S1 ./pgm +ppn15

Please keep in mind the hang is intermittent.  You'll need to run the job
multiple times to see a failure.
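
Because the hang is intermittent, one way to look for it is to repeat the run under a timeout and record which runs never finish. The following is a minimal sketch under assumptions: the function name, the run count, and the timeout value are hypothetical, and on a real system killing the launcher may not clean up every process of a hung job.

```python
import subprocess
import sys


def count_hangs(cmd, runs, timeout_s):
    """Run cmd `runs` times; return the indices of runs that exceeded timeout_s."""
    hung = []
    for i in range(runs):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # if it has not exited within timeout_s seconds.
            subprocess.run(
                cmd,
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                check=False,
            )
        except subprocess.TimeoutExpired:
            print(f"run {i}: no exit after {timeout_s}s, treating as a hang",
                  file=sys.stderr)
            hung.append(i)
    return hung


# Example call with the run command from this message (timeout is a guess):
# cmd = ["aprun", "-cc", "none", "-n4", "-N2", "-d16", "-j1", "-S1",
#        "./pgm", "+ppn15"]
# print(count_hangs(cmd, runs=50, timeout_s=600))
```

The timeout should be comfortably longer than a normal megatest run so that slow runs are not counted as hangs.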

Thanks!
Ted

On 8/28/2019 9:36 AM, Nitin Bhat wrote:
Hi Ted, 

Thanks for letting us know about the issue. We run megatest with "++ppn 2" during our nightly build and, as far as I know, we haven’t run into this hang on gni builds (or any other build).
I’ll try reproducing it and then debugging it on Cori. 

Could you send your final run command? Are you running make test TESTOPTS="++ppn 2"? If so, does the hang occur when you run with "+p2" or "+p4"? 

Thanks,
Nitin Bhat
Software Engineer
Charmworks, Inc. 


On Aug 27, 2019, at 12:17 PM, Ted Packwood <malice AT cray.com> wrote:

Hello-
I am trying to resolve a hang I'm seeing on the Cray XC with the
gni-crayxc build.  I'm building with the gcc compiler, and my build
command is:
./build charm++ gni-crayxc persistent smp --with-production

Running megatest with +ppn2 or higher results in an intermittent
megatest hang.  The higher the ppn value, the more likely the hang.
Most of the hangs occur here:
test 43: initiated [multi migration (jackie)]

Could someone from Charm get back to me? 
Thanks much
Ted Packwood
Cray Inc.








