charm - Re: [charm] FW: Projections

  • From: Ronak Buch <rabuch2 AT illinois.edu>
  • To: "Ortega, Bob" <bobo AT mail.smu.edu>
  • Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] FW: Projections
  • Date: Mon, 14 Dec 2020 16:39:13 -0500

Hi Bob,

I did in fact receive your message; I'm glad to see that things are working properly.

You seem to be running NAMD in parallel already. I've emboldened the lines in the startup output below that indicate your parallel execution (you're running on two physical nodes, 36 cores in total). One place you can look to get an overview of how Charm++ parallel execution works and the various flags and parameters you can use to customize execution is our manual: https://charm.readthedocs.io/en/latest/charm++/manual.html.
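
(For illustration only, and not something you need to change: on a non-MPI build of Charm++/NAMD, the PE count is selected with the +p flag described in the manual, roughly like the line below. Your MPI build instead takes its process count from the srun/mpirun launch, which is why -n 36 gives you 36 PEs.)

./charmrun +p36 ./namd2 stmv/stmv.namd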

(One thing I should also note is that you are running an MPI build of Charm++, which is generally not recommended unless your platform doesn't support any of the native Charm++ machine layers, as those usually provide higher performance. Assuming you're using SMU's ManeFrame II, you'll probably want to use the UCX machine layer for Charm++.)
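
(If you do decide to rebuild, the UCX target from the Charm++ source tree looks roughly like the line below; treat it as a sketch, since the exact options, in particular the process-manager/PMI settings, depend on your cluster, and the manual covers the available variants.)

./build charm++ ucx-linux-x86_64 --with-production -j8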

Charm++> Running on MPI version: 3.1

Charm++> level of thread support used: MPI_THREAD_SINGLE (desired: MPI_THREAD_SINGLE)

Charm++> Running in non-SMP mode: 36 processes (PEs)

Charm++> Using recursive bisection (scheme 3) for topology aware partitions

Converse/Charm++ Commit ID: v6.10.2-0-g7bf00fa-namd-charm-6.10.2-build-2020-Aug-05-556

Trace: logsize: 10000000

Charm++: Tracemode Projections enabled.

Trace: traceroot: /users/bobo/NAMD/NAMD_2.14_Source/Linux-x86_64-icc/./namd2.prj

CharmLB> Load balancer assumes all CPUs are same.

Charm++> Running on 2 hosts (2 sockets x 9 cores x 1 PUs = 18-way SMP)

Charm++> cpu topology info is gathered in 0.024 seconds.

Info: NAMD 2.14 for Linux-x86_64-MPI

Info:

Info: Please visit http://www.ks.uiuc.edu/Research/namd/

Info: for updates, documentation, and support information.

Info:

Info: Please cite Phillips et al., J. Chem. Phys. 153:044130 (2020) doi:10.1063/5.0014475

Info: in all publications reporting results obtained with NAMD.

Info:

Info: Based on Charm++/Converse 61002 for mpi-linux-x86_64-icc

Info: Built Wed Dec 9 22:01:36 CST 2020 by bobo on login04

Info: 1 NAMD  2.14  Linux-x86_64-MPI  36    v001  bobo

Info: Running on 36 processors, 36 nodes, 2 physical nodes.

Info: CPU topology information available.

Info: Charm++/Converse parallel runtime startup completed at 0.769882 s

Info: 2118.93 MB of memory in use based on /proc/self/stat

Info: Configuration file is stmv/stmv.namd

Info: Changed directory to stmv

TCL: Suspending until startup complete.

Info: SIMULATION PARAMETERS:

Info: TIMESTEP               1

Info: NUMBER OF STEPS        500

Info: STEPS PER CYCLE        20

Info: PERIODIC CELL BASIS 1  216.832 0 0

Info: PERIODIC CELL BASIS 2  0 216.832 0


Thanks,
Ronak

On Mon, Dec 14, 2020 at 8:56 AM Ortega, Bob <bobo AT mail.smu.edu> wrote:

Ronak,

 

I apologize. I've decided not to include or attach a screenshot from the Projections run, because SYMPA keeps telling me it cannot distribute the message with the screenshot.

 

Sorry if you've received this message multiple times. Usually I get a confirmation from the mailing list server, but this time I realized I had not copied the list server, so I just wanted to make sure you received this. Sometimes when I include a graphic it's too large, so I resized it: it was above 400 KB before, and I've now resized it again to below 400 KB.

 

***********************************************************************************************

That worked!  No warning messages.

                     

We are attempting to confirm that NAMD/Charm runs in parallel, as this has been a long-standing issue. There is a serial version running, but since we do have the ability to run applications in parallel, that is what this latest round of testing is about. This is why I have been seeking tools/resources to better understand what is going on during these runs.

 

As I mentioned to Nitin, I would really like to better understand the output from NAMD (and where it indicates that things are running in parallel), which starts off like this:

 

Charm++> Running on MPI version: 3.1

Charm++> level of thread support used: MPI_THREAD_SINGLE (desired: MPI_THREAD_SINGLE)

Charm++> Running in non-SMP mode: 36 processes (PEs)

Charm++> Using recursive bisection (scheme 3) for topology aware partitions

Converse/Charm++ Commit ID: v6.10.2-0-g7bf00fa-namd-charm-6.10.2-build-2020-Aug-05-556

Trace: logsize: 10000000

Charm++: Tracemode Projections enabled.

Trace: traceroot: /users/bobo/NAMD/NAMD_2.14_Source/Linux-x86_64-icc/./namd2.prj

CharmLB> Load balancer assumes all CPUs are same.

Charm++> Running on 2 hosts (2 sockets x 9 cores x 1 PUs = 18-way SMP)

Charm++> cpu topology info is gathered in 0.024 seconds.

Info: NAMD 2.14 for Linux-x86_64-MPI

Info:

Info: Please visit http://www.ks.uiuc.edu/Research/namd/

Info: for updates, documentation, and support information.

Info:

Info: Please cite Phillips et al., J. Chem. Phys. 153:044130 (2020) doi:10.1063/5.0014475

Info: in all publications reporting results obtained with NAMD.

Info:

Info: Based on Charm++/Converse 61002 for mpi-linux-x86_64-icc

Info: Built Wed Dec 9 22:01:36 CST 2020 by bobo on login04

Info: 1 NAMD  2.14  Linux-x86_64-MPI  36    v001  bobo

Info: Running on 36 processors, 36 nodes, 2 physical nodes.

Info: CPU topology information available.

Info: Charm++/Converse parallel runtime startup completed at 0.769882 s

Info: 2118.93 MB of memory in use based on /proc/self/stat

Info: Configuration file is stmv/stmv.namd

Info: Changed directory to stmv

TCL: Suspending until startup complete.

Info: SIMULATION PARAMETERS:

Info: TIMESTEP               1

Info: NUMBER OF STEPS        500

Info: STEPS PER CYCLE        20

Info: PERIODIC CELL BASIS 1  216.832 0 0

Info: PERIODIC CELL BASIS 2  0 216.832 0

 

 

So, in addition to learning more about Projections, what other tools/apps/resources would you recommend that might help in monitoring/analyzing our attempts at parallelization?

 

Thanks!

Bob

From: Ronak Buch <rabuch2 AT illinois.edu>
Date: Friday, December 11, 2020 at 2:02 PM
To: "Ortega, Bob" <bobo AT mail.smu.edu>
Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>, Nitin Bhat <nitin AT hpccharm.com>
Subject: Re: [charm] FW: Projections

 

Hi Bob,

 

Your run command should look something like:

 

date;time srun -n 36 -N 2 -p fp-gpgpu-3 --mem=36GB ./namd2.prj stmv/stmv.namd +logsize 10000000 >namd2.prj.fp-gpgpu-3.6.log;date

 

Thanks,

Ronak

 

On Thu, Dec 10, 2020 at 3:31 PM Ortega, Bob <bobo AT mail.smu.edu> wrote:

Ronak,

 

Thank you for the quick reply.

 

Well, I’m using srun to run NAMD.  Here’s the command,

 

date;time srun -n 36 -N 2 -p fp-gpgpu-3 --mem=36GB ./namd2.prj stmv/stmv.namd >namd2.prj.fp-gpgpu-3.6.log;date

 

How can I submit a similar charmrun command targeting 36 processors, 2 nodes, the fp-gpgpu-3 queue partition, 36 GB of memory, and a +logsize of 10000000?

 

Oh, I’m not getting the exception anymore and unfortunately, during that run, I didn’t log the results to a file.

 

If it occurs again, I’ll forward the log file.

 

Thanks,

Bob

 

From: Ronak Buch <rabuch2 AT illinois.edu>
Date: Thursday, December 10, 2020 at 2:09 PM
To: "Ortega, Bob" <bobo AT mail.smu.edu>
Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>, Nitin Bhat <nitin AT hpccharm.com>
Subject: Re: [charm] FW: Projections

 

Hi Bob,

 

Regarding the +logsize parameter: it is a runtime parameter, not a compile-time parameter, so you shouldn't add it to the Makefile; you should add it to your run command (e.g. ./charmrun +p2 ./namd <namd input file name> +logsize 10000000).
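
(If you are launching through srun or mpirun instead of charmrun, the flag is appended to the application's arguments in the same way; as a rough sketch, something like:

srun -n 36 ./namd2 <namd input file name> +logsize 10000000

Keep whatever partition/memory options you normally pass to srun.)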

 

Regarding the exception you're seeing, I'm not sure why that's happening; it's likely due to some issue in initialization. Would it be possible for you to share the generated logs for debugging?

 

Thanks,

Ronak

 

On Thu, Dec 10, 2020 at 12:36 PM Ortega, Bob <bobo AT mail.smu.edu> wrote:

 

 

From: "Ortega, Bob" <bobo AT mail.smu.edu>
Date: Thursday, December 10, 2020 at 11:24 AM
To: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
Cc: Nitin Bhat <nitin AT hpccharm.com>
Subject: FW: Projections

 

Nitin Bhat was kind enough to review my questions about some errors and messages I am receiving while running NAMD/Charm with Projections enabled. I am including some email messages I sent to Nitin about these issues. Let me know how I might resolve them, along with any references that may help clarify the proper use of Projections so I can take further advantage of its capabilities.

 

Thanks,

Bob

 

 

 

From: Nitin Bhat <nitin AT hpccharm.com>
Date: Thursday, December 10, 2020 at 10:55 AM
To: "Ortega, Bob" <bobo AT mail.smu.edu>
Subject: Re: Projections

 

Hi Bob, 

 

I am just reading your latest emails about the issues that you're seeing with Projections.

 

Can you reach out to the Charm mailing list (charm AT lists.cs.illinois.edu) with both of the issues that you're seeing (this one and the previous Java exception that you saw when you launched Projections)? The folks who work with (and develop) Projections will be able to better address those issues.

 

Thanks,

Nitin

 

On Dec 10, 2020, at 8:52 AM, Ortega, Bob <bobo AT mail.smu.edu> wrote:

 

Nitin,

 

Thanks again for your support.  I’m now trying to find out how to use the following runtime option,

 

+logsize NUM

 

This is because, when I run the NAMD.PRJ binary, I get this message at the end of the output:

 

*************************************************************

Warning: Projections log flushed to disk 101 times on 36 cores: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35.

Warning: The performance data is likely invalid, unless the flushes have been explicitly synchronized by your program.

Warning: This may be fixed by specifying a larger +logsize (current value 1000000).

 

I thought that perhaps this was entered into the Makefile under the projections section, so I put it there with this line,

 

+logsize 10000000

 

But I am still getting the Warning message.

 

Thanks,

Bob

 

Nitin,

 

As noted in an earlier email, I was able to run Projections successfully on traces generated by a run with 18 processors and 1 node. But when I tried with 180 processors and 10 nodes, I got the following error when trying to run Projections:

 

 

Do you know what could be the problem here?

 

Thanks,

Bob

[Attachment: PNG image]



