


Re: [charm] [ppl] NAMD Charmrun error on Ranger


  • From: Aditya Devarakonda <aditya08 AT cac.rutgers.edu>
  • To: Phil Miller <mille121 AT illinois.edu>
  • Cc: Eric Bohm <ebohm AT illinois.edu>, Charm Mailing List <charm AT cs.illinois.edu>, Abhishek Gupta <gupta59 AT illinois.edu>, Jim Phillips <jim AT ks.uiuc.edu>
  • Subject: Re: [charm] [ppl] NAMD Charmrun error on Ranger
  • Date: Wed, 16 May 2012 16:48:04 -0400
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

For the time being, it might be better for our group to give the SMP
build a try.

If I understand charmrun's nodelist file correctly, we would need some
support from TACC so that the machinefile generated implicitly for
ibrun is made available to charmrun. Is that accurate?
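
If so, one stopgap we could try on our side (just a sketch - I'm
assuming Ranger's SGE environment exposes the allocated hosts via
$PE_HOSTFILE, and the nodelist name plus the $NSLOTS/$CONFFILE
variables are placeholders I haven't verified) would be to build the
nodelist inside the job script ourselves:

    # turn the SGE machinefile into a charmrun nodelist: a "group main"
    # header followed by one "host <name>" line per allocated node
    echo "group main" > nodelist.$JOB_ID
    awk '{print "host", $1}' $PE_HOSTFILE >> nodelist.$JOB_ID

    charmrun ++nodelist nodelist.$JOB_ID +p $NSLOTS namd2 $CONFFILE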

On Wed, 2012-05-16 at 15:23 -0500, Phil Miller wrote:
> There's also the option of using charmrun's own process-launching
> mechanism instead of the system's mpiexec, in order to get its more
> scalable tree structure with ++hierarchical-start. The downside is that this
> requires a nodelist file for charmrun to work with. Given that there
> is a common NAMD launching script that many users can reference, I
> don't think that's a big deal, since the logic only needs to be
> implemented once.
>
> It also looks like that option never got documented in the usage manual:
> http://charm.cs.illinois.edu/manuals/html/install/4_1.html
> For that matter, neither was ++scalable-start (one SSH connection per node).
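>
> To make that concrete, a launch along these lines is roughly what I
> have in mind (a sketch only - the nodelist path, process count, and
> remote shell below are placeholders, not a tested recipe):
>
>   charmrun ++nodelist $SCRATCH/nodelist ++hierarchical-start \
>     ++remote-shell ssh +p 1024 namd2 $CONFFILE
>
> ++scalable-start would drop in the same way when one SSH connection
> per node is sufficient.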
>
> On Wed, May 16, 2012 at 3:11 PM, Eric Bohm <ebohm AT illinois.edu> wrote:
> > There is a P^2 startup and memory issue with the reliable channel
> > implementation on IBVERBS.
> >
> > A simple way to reduce its impact is to use the SMP build. One can then
> > reduce the number of necessary processes to one per node by running +p
> > numnodes +ppn 15 to have 15 worker threads per node multiplex across one
> > communication thread per node. You then have (P/16)^2 connections, which
> > will scale much farther.
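> >
> > As a rough worked example (my arithmetic, assuming 16 cores per Ranger
> > node): at 1024 cores the non-SMP ibverbs build sets up on the order of
> > 1024^2 = ~1,000,000 connection pairs at startup, while the SMP build
> > with one process per node needs only (1024/16)^2 = 64^2 = 4096.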
> >
> > On 05/16/2012 09:46 AM, Jim Phillips wrote:
> >> I think the mpiexec calls the ibrun script, which calls the real mpiexec.
> >>
> >> -Jim
> >>
> >>
> >> On Wed, 16 May 2012, Aditya Devarakonda wrote:
> >>
> >>> Thanks Jim,
> >>>
> >>> So, the pre-loaded NAMD batch scripts on Ranger seem to use Charm with
> >>> the mpiexec option. Now, is there a better way of doing this (through
> >>> ibrun perhaps)?
> >>>
> >>> Maybe I'm wrong, but my concern with just adjusting the timeout is that
> >>> the problem could creep back as we increase the number of nodes.
> >>>
> >>> Do you guys typically use mpiexec to start the NAMD processes on Ranger?
> >>>
> >>> Regards,
> >>> Aditya
> >>>
> >>> On Mon, 2012-05-14 at 09:58 -0500, Jim Phillips wrote:
> >>>
> >>>> Charmrun should have some options for adjusting the timeout. One goal
> >>>> of using mpiexec was to make this process more similar to other jobs
> >>>> on the machine, so the timeout may just need to be extended (I'm not
> >>>> sure what the default is - that should probably be printed).
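> >>>>
> >>>> Something along these lines should stretch the connection timeout (the
> >>>> 300 seconds and <numprocs> are only example placeholders, and I'd
> >>>> confirm the flag name against charmrun's ++help output on Ranger):
> >>>>
> >>>>   charmrun +p <numprocs> ++timeout 300 ++mpiexec ++remote-shell mympiexec \
> >>>>     ++runscript tacc_affinity namd2 $CONFFILE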
> >>>>
> >>>> -Jim
> >>>>
> >>>> On Sat, 12 May 2012, Aditya Devarakonda wrote:
> >>>>
> >>>>> Hi everyone,
> >>>>>
> >>>>> Hope you guys are doing well. Our group has been working with NAMD
> >>>>> for the past couple of months and recently started running jobs on
> >>>>> Ranger.
> >>>>>
> >>>>> We have been seeing some issues while running at 1K or more
> >>>>> processors. It seems to be an issue with launching NAMD on remote
> >>>>> nodes - we get the following error:
> >>>>>
> >>>>> Charmrun> error 64 attaching to node:
> >>>>> Timeout waiting for node-program to connect
> >>>>>
> >>>>> We're using the NAMD_2.8_Linux-x86_64-ibverbs-Ranger build available
> >>>>> on Ranger and launching via mpiexec:
> >>>>>
> >>>>> charmrun +p ++mpiexec ++remote-shell mympiexec ++runscript tacc_affinity namd2 $CONFFILE
> >>>>>
> >>>>> We were able to scale successfully up to 512 processors but not
> >>>>> beyond. Any ideas?
> >>>>>
> >>>>> Thanks,
> >>>>> Aditya
> >>>>>
> >>>
> >>>





