Re: [charm] [ppl] Unable to run charm++ on infiniband interface


  • From: Michel Espinoza-Fonseca <mef AT ddt.biochem.umn.edu>
  • To: Phil Miller <mille121 AT illinois.edu>
  • Cc: "charm AT cs.illinois.edu" <charm AT cs.illinois.edu>, "gupta59 AT illinois.edu" <gupta59 AT illinois.edu>
  • Subject: Re: [charm] [ppl] Unable to run charm++ on infiniband interface
  • Date: Wed, 8 Aug 2012 15:46:27 -0500

Phil, Abhishek,

Thank you very much for your comments. It seems that the ++batch <N> option
solved the problem, and NAMD is now being launched properly. I have one more
question, though: do you have any suggestion for the optimal value of <N>?
I'm mostly curious because I now see a decrease in performance and not very
good scalability. I'm currently experimenting with N = the number of nodes
used for each job.
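
For reference, a sketch of what I'm trying now (the value 32 below is just a
placeholder for the per-job node count, not a recommendation):

  # ++batch value is a placeholder; I set it to the number of nodes in the job
  charmrun ++remote-shell ssh ++batch 32 ++p 1400 ++verbose \
      ++nodelist namd.hostfile namd2 my_job.in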

Thanks,
Michel

________________________________________
From: unmobile AT gmail.com [unmobile AT gmail.com] On Behalf Of Phil Miller [mille121 AT illinois.edu]
Sent: Wednesday, August 08, 2012 1:20 PM
To: Michel Espinoza-Fonseca
Cc: charm AT cs.illinois.edu
Subject: Re: [ppl] [charm] Unable to run charm++ on infiniband interface

This is a problem of the SSH server on each compute node limiting the
number of simultaneous inbound connections it will allow, as a
protection against denial of service situations.

As Abhishek mentioned, the ++scalable-start flag to charmrun will
likely help, as it makes only a single connection to each node and
forks the processes from there. That may still run into resource
limitations elsewhere in the system, such as authentication and
directory servers. If you still encounter trouble, you can try adding
the flag ++batch N, where N is the number of SSH connections charmrun
will open at a time.
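
Adapting the command line from your original message, that would look roughly
like this (the batch size of 8 is only illustrative, not a tuned value):

  # single connection per node, plus a cap on concurrent SSH sessions
  charmrun ++remote-shell ssh ++scalable-start ++batch 8 ++p 1400 ++verbose \
      ++nodelist namd.hostfile namd2 my_job.in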

As an alternative approach, if the cluster is commonly used to run MPI
jobs, you can just use its MPI process launcher by passing charmrun
the ++mpiexec flag in place of ++remote-shell ssh.
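
In that case the invocation would be something like the line below; presumably
the MPI launcher picks up the node allocation from your batch system, but I
haven't checked the details of your site's setup:

  charmrun ++mpiexec ++p 1400 ++verbose namd2 my_job.in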

I hope that helps.

Phil

On Wed, Aug 8, 2012 at 12:04 PM, Michel Espinoza-Fonseca <mef AT ddt.biochem.umn.edu> wrote:
> Hi --
>
> Recently I tried to run NAMD using charm++ (charmrun) with infiniband
> support (ibverbs) on our HP Linux cluster running CentOS. I tested both
> precompiled and my own compiled versions of charmrun. I normally submit the
> jobs using the following command line:
>
> charmrun ++remote-shell ssh ++p 1400 ++verbose ++nodelist \
>     namd.hostfile namd2 my_job.in
>
> The problem appears shortly after the job starts and normally ends with
> charmrun terminating (i.e., NAMD does not even start). Most of the time I
> get the following error:
>
> Charmrun> charmrun started...
> Charmrun> using namd.hostfile as nodesfile
> Charmrun> remote shell (node0004:0) started
> Charmrun> remote shell (node0010:1) started
> Charmrun> remote shell (node0020:2) started
> ...
> ERROR> starting rsh: Resource temporarily unavailable
> ssh_keysign: fork: Resource temporarily unavailable
> ssh_keysign: fork: Resource temporarily unavailable
> key_sign failed
> ssh_keysign: fork: Resource temporarily unavailable
> key_sign failed
> ssh_keysign: fork: Resource temporarily unavailable
> ...
> key_sign failed
> Permission denied (publickey,keyboard-interactive,hostbased)
>
> This is a recurring error that still appears after adding "CONV_RSH=ssh" to
> the PBS file or raising the user limits (i.e., ulimit -u). I think I got it
> running only once or twice (out of tens of attempts). Interestingly, I also
> tried the SMP build, which seems to work fine when "++ppn" is added to the
> command line, although NAMD scales poorly compared to the ibverbs build.
>
> My question is whether the problem could be related to the configuration of
> our system, or whether I'm missing something that prevents charmrun from
> starting properly.
>
> Thanks,
> Michel
>
>
>
>



