  • From: "Kale, Laxmikant V" <kale AT illinois.edu>
  • To: Stephen Cousins <steve.cousins AT maine.edu>, "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE
  • Date: Wed, 25 Jul 2012 13:44:56 +0000

Looks like this is an old email that was delivered late by the mail server. This was answered/solved around 6/14.
FYI.
-- 
Laxmikant (Sanjay) Kale         http://charm.cs.uiuc.edu
Professor, Computer Science     kale AT illinois.edu
201 N. Goodwin Avenue           Ph:  (217) 244-0094
Urbana, IL  61801-2302          FAX: (217) 265-6582

On 6/4/12 12:24 PM, "Stephen Cousins" <steve.cousins AT maine.edu> wrote:

Hi,

We have 16 nodes with both IB and 10 GbE interfaces (both interfaces are Mellanox), and another 16 nodes with IB only. I can run NAMD on the IB-only nodes just fine; however, if the job is allocated a node that has both IB and 10 GbE, it does not work.

charmrun output is:

Charmrun> IBVERBS version of charmrun
[0] Stack Traceback:
  [0:0] CmiAbort+0x5c  [0xcbef1e]
  [0:1] initInfiOtherNodeData+0x14a  [0xcbe488]
  [0:2] /opt/scyld/NAMD_2.9_Linux-x86_64-ibverbs/namd2 [0xcbcf8b]
  [0:3] /opt/scyld/NAMD_2.9_Linux-x86_64-ibverbs/namd2 [0xcbd9e5]
  [0:4] ConverseInit+0x1cd  [0xcbe001]
  [0:5] _ZN7BackEnd4initEiPPc+0x6f  [0x58ad13]
  [0:6] main+0x2f  [0x585fd7]
  [0:7] __libc_start_main+0xf4  [0x3633c1d9b4]
  [0:8] _ZNSt8ios_base4InitD1Ev+0x4a  [0x54105a]


And STDERR for the job is:

Charmrun> started all node programs in 1.544 seconds.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
Fatal error on PE 0> failed to change qp state to RTR
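
For context on the error text: "RTR" is the ready-to-receive state of an InfiniBand queue pair. During connection setup each QP is stepped through RESET -> INIT -> RTR -> RTS with ibv_modify_qp(), and the INIT -> RTR transition is the first point where the remote LID and path parameters are applied, so it is the call that fails when those parameters refer to a port on the wrong fabric (for example an Ethernet-link-layer port, which has no LID). Here is a minimal sketch of that transition against the plain libibverbs API, not Charm++'s actual machine layer; remote_lid, remote_qpn, remote_psn, and port are placeholders for values exchanged out of band:

    #include <infiniband/verbs.h>
    #include <string.h>

    /* Move an RC queue pair from INIT to RTR (ready to receive).
     * On a port whose link layer is Ethernet, the LID-based address
     * vector below is meaningless and this call fails, which surfaces
     * as "failed to change qp state to RTR". */
    int qp_to_rtr(struct ibv_qp *qp, uint16_t remote_lid,
                  uint32_t remote_qpn, uint32_t remote_psn, uint8_t port)
    {
        struct ibv_qp_attr attr;
        memset(&attr, 0, sizeof attr);

        attr.qp_state           = IBV_QPS_RTR;   /* target state */
        attr.path_mtu           = IBV_MTU_2048;
        attr.dest_qp_num        = remote_qpn;
        attr.rq_psn             = remote_psn;
        attr.max_dest_rd_atomic = 1;
        attr.min_rnr_timer      = 12;
        attr.ah_attr.dlid       = remote_lid;    /* IB LID of the peer */
        attr.ah_attr.port_num   = port;

        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                             IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                             IBV_QP_MAX_DEST_RD_ATOMIC |
                             IBV_QP_MIN_RNR_TIMER);
    }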


We are using Moab and Torque as the scheduler and resource manager. The version of NAMD is:

    NAMD_2.9_Linux-x86_64-ibverbs

Do I need to specify that the link layer should use IB as opposed to Ethernet?

ibstat for the nodes with both interconnects:

CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 2
        Firmware version: 2.10.0
        Hardware version: 0
        Node GUID: 0xffffffffffffffff
        System image GUID: 0xffffffffffffffff
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x0202c9fffe34e8f0
                Link layer: Ethernet
        Port 2:
                State: Down
                Physical state: Disabled
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x0202c9fffe34e8f1
                Link layer: Ethernet
CA 'mlx4_1'
        CA type: MT26428
        Number of ports: 1
        Firmware version: 2.9.1000
        Hardware version: b0
        Node GUID: 0x002590ffff16b658
        System image GUID: 0x002590ffff16b65b
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 21
                LMC: 0
                SM lid: 3
                Capability mask: 0x02510868
                Port GUID: 0x002590ffff16b659
                Link layer: InfiniBand


ibstat for a node with just IB:

CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 1
        Firmware version: 2.9.1000
        Hardware version: b0
        Node GUID: 0x002590ffff16bbe8
        System image GUID: 0x002590ffff16bbeb
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 14
                LMC: 0
                SM lid: 3
                Capability mask: 0x02510868
                Port GUID: 0x002590ffff16bbe9
                Link layer: InfiniBand
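
For what it's worth, the two listings above also show why device order matters: on the dual-interconnect nodes, mlx4_0 (link layer Ethernet) enumerates before mlx4_1 (link layer InfiniBand), so anything that simply opens the first device returned by ibv_get_device_list() ends up on the Ethernet HCA. Below is a small standalone probe, written against the plain libibverbs API rather than Charm++'s internals, that reports each port's link layer and flags the first active InfiniBand port (compile with -libverbs):

    #include <infiniband/verbs.h>
    #include <stdio.h>

    /* Enumerate all HCAs and ports; flag the first ACTIVE port whose
     * link layer is InfiniBand (older libibverbs may report the link
     * layer as unspecified on IB, hence the != ETHERNET test). */
    int main(void)
    {
        int n, i;
        struct ibv_device **list = ibv_get_device_list(&n);
        if (!list) { perror("ibv_get_device_list"); return 1; }

        for (i = 0; i < n; i++) {
            struct ibv_context *ctx = ibv_open_device(list[i]);
            struct ibv_device_attr dev;
            uint8_t p;
            if (!ctx) continue;
            if (ibv_query_device(ctx, &dev)) { ibv_close_device(ctx); continue; }

            for (p = 1; p <= dev.phys_port_cnt; p++) {
                struct ibv_port_attr pa;
                if (ibv_query_port(ctx, p, &pa)) continue;
                printf("%s port %u: %s, %s\n",
                       ibv_get_device_name(list[i]), p,
                       pa.state == IBV_PORT_ACTIVE ? "active" : "not active",
                       pa.link_layer == IBV_LINK_LAYER_ETHERNET
                           ? "Ethernet" : "InfiniBand");
                if (pa.state == IBV_PORT_ACTIVE &&
                    pa.link_layer != IBV_LINK_LAYER_ETHERNET)
                    printf("  -> this is the device/port to use\n");
            }
            ibv_close_device(ctx);
        }
        ibv_free_device_list(list);
        return 0;
    }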




Thanks for your help.

Steve

--
______________________________________________________________________
Steve Cousins - Supercomputer Engineer/Administrator - Univ of Maine
Marine Sciences, 452 Aubert Hall       Target Tech, 20 Godfrey Drive
Orono, ME 04469    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~     Orono, ME 04473
(207) 581-4302     ~ steve.cousins at maine.edu ~     (207) 866-6552


