charm - Re: [charm] [ppl] Unable to run charm++ on infiniband interface

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Phil Miller <mille121 AT illinois.edu>
  • To: Michel Espinoza-Fonseca <mef AT ddt.biochem.umn.edu>
  • Cc: "charm AT cs.illinois.edu" <charm AT cs.illinois.edu>
  • Subject: Re: [charm] [ppl] Unable to run charm++ on infiniband interface
  • Date: Wed, 8 Aug 2012 13:20:41 -0500
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

The problem is that the SSH server on each compute node limits the
number of simultaneous inbound connections it will accept, as a
protection against denial-of-service attacks.

As Abhishek mentioned, the ++scalable-start flag to charmrun will
likely help, as it makes only a single connection to each node and
forks the remaining processes from that. That approach may still run
into resource limits elsewhere in the system, such as authentication
and directory servers. If you still encounter trouble, you can try
adding the flag
++batch N
where N is the number of SSH connections charmrun will open at a time.
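
As a rough sketch, the command line from the original message with both
flags added might look like the following. The batch size of 10 is an
arbitrary illustrative value, not a recommendation, and the helper name
charmrun_cmd is hypothetical. The script only prints the command so it
can be inspected before it is run.

```shell
#!/bin/sh
# Sketch: add ++scalable-start and ++batch to the original charmrun command.
# BATCH=10 is an assumed, illustrative value; tune it for your cluster.
BATCH="${BATCH:-10}"

# Hypothetical helper: print the full command line on a single line.
charmrun_cmd() {
    echo "charmrun ++remote-shell ssh ++scalable-start ++batch $BATCH" \
         "++p 1400 ++verbose ++nodelist namd.hostfile namd2 my_job.in"
}

charmrun_cmd
# To launch for real instead of printing: eval "$(charmrun_cmd)"
```

If startup still fails, lowering the batch size reduces how many SSH
connections are opened at once, at the cost of a slower launch.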

As an alternative approach, if the cluster is commonly used to run MPI
jobs, you can just use its MPI process launcher by passing charmrun
the ++mpiexec flag in place of ++remote-shell ssh.
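
A minimal sketch of that substitution, applied to the command line from
the original message, is shown below. Keeping ++nodelist unchanged is an
assumption; depending on how the cluster's MPI launcher obtains its host
list, it may be unnecessary. The helper name mpiexec_cmd is hypothetical,
and the script only prints the command.

```shell
#!/bin/sh
# Sketch: same job, launched through the cluster's MPI process launcher.
# "++remote-shell ssh" from the original command is replaced by "++mpiexec".
# Assumption: the remaining arguments are kept exactly as in the original.

# Hypothetical helper: print the full command line on a single line.
mpiexec_cmd() {
    echo "charmrun ++mpiexec ++p 1400 ++verbose" \
         "++nodelist namd.hostfile namd2 my_job.in"
}

mpiexec_cmd
# To launch for real instead of printing: eval "$(mpiexec_cmd)"
```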

I hope that helps.

Phil

On Wed, Aug 8, 2012 at 12:04 PM, Michel Espinoza-Fonseca
<mef AT ddt.biochem.umn.edu> wrote:
> Hi --
>
> Recently I tried to run NAMD using charm++ (charmrun) with infiniband
> support (ibverbs) on our HP Linux cluster running CentOS. I tested both
> precompiled and my own compiled versions of charmrun. I normally submit the
> jobs using the following command line:
>
>
> charmrun ++remote-shell ssh ++p 1400 ++verbose ++nodelist namd.hostfile \
>     namd2 my_job.in
>
>
> The problem appears shortly after the job starts and normally ends with
> charmrun terminating (i.e., NAMD does not even start). Most of the time I
> get the following error:
>
>
> Charmrun> charmrun started...
> Charmrun> using namd.hostfile as nodesfile
> Charmrun> remote shell (node0004:0) started
> Charmrun> remote shell (node0010:1) started
> Charmrun> remote shell (node0020:2) started
> ...
> ERROR> starting rsh: Resource temporarily unavailable
> ssh_keysign: fork: Resource temporarily unavailable
> ssh_keysign: fork: Resource temporarily unavailable
> key_sign failed
> ssh_keysign: fork: Resource temporarily unavailable
> key_sign failed
> ssh_keysign: fork: Resource temporarily unavailable
> ...
> key_sign failed
> Permission denied (publickey,keyboard-interactive,hostbased)
>
>
>
> This is a recurring error which still appears after adding "CONV_RSH=ssh" to
> the PBS file or changing user limits (i.e., ulimit -u). I probably got it
> running only once or twice (out of tens of attempts). Interestingly, I also
> tried the SMP build which seems to work fine when "++ppn" is added to the
> command line, although NAMD scales poorly compared to the ibverbs build.
>
>
>
> My question is whether the problem could be related to the configuration of
> our system, or whether I'm missing something that prevents charmrun from
> starting properly.
>
> Thanks,
> Michel
>
>
>
>



