Re: [charm] [ppl] NAMD Charmrun error on Ranger


  • From: Phil Miller <mille121 AT illinois.edu>
  • To: Eric Bohm <ebohm AT illinois.edu>
  • Cc: Charm Mailing List <charm AT cs.illinois.edu>, Abhishek Gupta <gupta59 AT illinois.edu>, Aditya Devarakonda <aditya08 AT cac.rutgers.edu>, Jim Phillips <jim AT ks.uiuc.edu>
  • Subject: Re: [charm] [ppl] NAMD Charmrun error on Ranger
  • Date: Wed, 16 May 2012 15:23:02 -0500
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

There's also the option of using charmrun's own process-launching
mechanism instead of the system's mpiexec, in order to get its more
scalable tree-based startup with ++hierarchical-start. The downside is
that this requires a nodelist file for charmrun to work with. Given that
there is a common NAMD launching script that many users can reference, I
don't think that's a big deal, since the nodelist logic only needs to be
implemented once.
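
A minimal sketch of that setup, in case it helps (the awk conversion and
the $NPROCS placeholder are mine and assume Ranger's SGE hostfile, so
treat this as illustrative rather than a tested recipe):

  # one "host" line per node; charmrun cycles over them for +p processes
  echo "group main" > mynodelist
  awk '{print "host " $1}' $PE_HOSTFILE >> mynodelist
  charmrun ++nodelist mynodelist ++hierarchical-start +p $NPROCS namd2 $CONFFILE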

It also looks like that option never got documented in the usage manual:
http://charm.cs.illinois.edu/manuals/html/install/4_1.html
For that matter, neither did ++scalable-start (one SSH connection per node).

On Wed, May 16, 2012 at 3:11 PM, Eric Bohm <ebohm AT illinois.edu> wrote:
> There is a P^2 startup and memory issue with the reliable channel
> implementation on IBVERBS.
>
> A simple way to reduce its impact is to use the SMP build; one can then
> reduce the number of processes to one per node by running with +p
> numnodes +ppn 15, which gives 15 worker threads per node multiplexed
> across one communication thread per node.  You then have (P/16)^2
> connections, which will scale much further.
>
> On 05/16/2012 09:46 AM, Jim Phillips wrote:
>> I think the mpiexec calls the ibrun script, which calls the real mpiexec.
>>
>> -Jim
>>
>>
>> On Wed, 16 May 2012, Aditya Devarakonda wrote:
>>
>>> Thanks Jim,
>>>
>>> So, the pre-loaded NAMD batch scripts on Ranger seem to use Charm with
>>> the mpiexec option. Now, is there a better way of doing this (through
>>> ibrun, perhaps)?
>>>
>>> Maybe I'm wrong, but my thinking on adjusting the timeout is that the
>>> problem could always creep back as we increase the number of nodes.
>>>
>>> Do you guys typically use mpiexec to start the NAMD processes on Ranger?
>>>
>>> Regards,
>>> Aditya
>>>
>>> On Mon, 2012-05-14 at 09:58 -0500, Jim Phillips wrote:
>>>
>>>> Charmrun should have some options for adjusting the timeout.  One goal of
>>>> using mpiexec was to make this process more similar to other jobs on the
>>>> machine, so the timeout may just need to be extended (I'm not sure what
>>>> the default is - that should probably be printed).
>>>>
>>>> -Jim
>>>>
>>>> On Sat, 12 May 2012, Aditya Devarakonda wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> Hope you guys are doing well. Our group has been working with NAMD for
>>>>> the past couple of months and recently started running jobs on Ranger.
>>>>>
>>>>> We have been seeing some issues while running at 1K or more processors.
>>>>> It seems to be an issue with launching NAMD on remote nodes - we get the
>>>>> following error:
>>>>>
>>>>> Charmrun>  error 64 attaching to node:
>>>>> Timeout waiting for node-program to connect
>>>>>
>>>>> We're using the NAMD_2.8_Linux-x86_64-ibverbs-Ranger build available on
>>>>> Ranger and launching with mpiexec:
>>>>>
>>>>> charmrun +p ++mpiexec ++remote-shell mympiexec ++runscript tacc_affinity
>>>>> namd2 $CONFFILE
>>>>>
>>>>> We were able to scale successfully up to 512 processors, but not beyond.
>>>>> Any ideas?
>>>>>
>>>>> Thanks,
>>>>> Aditya
>>>>>
>>>
>>>
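
To put concrete numbers on Eric's suggestion above: Ranger nodes have 16
cores, so a 1024-core non-SMP ibverbs run starts 1024 node-programs and
sets up on the order of 1024^2 (about a million) reliable-connection
endpoints at startup, whereas the SMP build with one process per node
needs only (1024/16)^2 = 64^2 = 4096. The 16-cores-per-node figure is
Ranger-specific, but the quadratic savings from running fewer processes
is the general point.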
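
On the ++mpiexec path the quoted command uses: as I recall the NAMD notes
for the TACC machines, mympiexec is just a tiny wrapper that drops the
leading processor-count arguments charmrun passes and hands the rest to
ibrun, which is more or less the chain Jim describes above. Roughly (from
memory; please check the notes.txt shipped with the ibverbs build before
relying on it):

  #!/bin/csh
  shift; shift; exec ibrun $*

For the timeout itself, charmrun does take a ++timeout <seconds> flag (60
seconds by default, if I remember right), though as Aditya says raising it
only postpones the underlying startup-scaling problem.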




