
charm - Re: [charm] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system




  • From: Stephen Cousins <steve.cousins AT maine.edu>
  • To: charm AT cs.uiuc.edu
  • Subject: Re: [charm] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE
  • Date: Wed, 6 Jun 2012 17:31:24 -0400
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

I sent this a couple of days ago, but I'm not sure it was approved since I wasn't a member. I am now, so hopefully it will show up on the list.

Thanks for any help.

Steve

On Mon, Jun 4, 2012 at 1:24 PM, Stephen Cousins <steve.cousins AT maine.edu> wrote:
Hi,

We have 16 nodes with both IB and 10 GbE interfaces (both Mellanox), and another 16 nodes with IB only. NAMD runs fine on the IB-only nodes; however, if the job is allocated a node that has both IB and 10 GbE, it does not work.

charmrun output is:

Charmrun> IBVERBS version of charmrun
[0] Stack Traceback:
  [0:0] CmiAbort+0x5c  [0xcbef1e]
  [0:1] initInfiOtherNodeData+0x14a  [0xcbe488]
  [0:2] /opt/scyld/NAMD_2.9_Linux-x86_64-ibverbs/namd2 [0xcbcf8b]
  [0:3] /opt/scyld/NAMD_2.9_Linux-x86_64-ibverbs/namd2 [0xcbd9e5]
  [0:4] ConverseInit+0x1cd  [0xcbe001]
  [0:5] _ZN7BackEnd4initEiPPc+0x6f  [0x58ad13]
  [0:6] main+0x2f  [0x585fd7]
  [0:7] __libc_start_main+0xf4  [0x3633c1d9b4]
  [0:8] _ZNSt8ios_base4InitD1Ev+0x4a  [0x54105a]


And STDERR for the job is:

Charmrun> started all node programs in 1.544 seconds.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
Fatal error on PE 0> failed to change qp state to RTR


We are running Moab and Torque as the scheduler and resource manager. The version of NAMD is:

    NAMD_2.9_Linux-x86_64-ibverbs

Do I need to specify that the link layer should use IB as opposed to Ethernet?

ibstat for the nodes with both interconnects:

CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 2
        Firmware version: 2.10.0
        Hardware version: 0
        Node GUID: 0xffffffffffffffff
        System image GUID: 0xffffffffffffffff
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x0202c9fffe34e8f0
                Link layer: Ethernet
        Port 2:
                State: Down
                Physical state: Disabled
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x0202c9fffe34e8f1
                Link layer: Ethernet
CA 'mlx4_1'
        CA type: MT26428
        Number of ports: 1
        Firmware version: 2.9.1000
        Hardware version: b0
        Node GUID: 0x002590ffff16b658
        System image GUID: 0x002590ffff16b65b
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 21
                LMC: 0
                SM lid: 3
                Capability mask: 0x02510868
                Port GUID: 0x002590ffff16b659
                Link layer: InfiniBand
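
For reference, the same link-layer information can be pulled straight from sysfs, in the order the kernel enumerates the devices, which may help show whether an Ethernet-mode port (mlx4_0 here) comes before the InfiniBand one. This is just a sketch assuming the standard /sys/class/infiniband layout; the optional root argument is only there so the function can be exercised against a fake tree:

```shell
# list_link_layers: print "<device> port <n>: <link layer>" for every
# RDMA port under the given sysfs root (default /sys/class/infiniband).
list_link_layers() {
  root=${1:-/sys/class/infiniband}
  for dev in "$root"/*; do
    [ -d "$dev" ] || continue        # skip if no RDMA devices present
    for port in "$dev"/ports/*; do
      printf '%s port %s: %s\n' \
        "$(basename "$dev")" "$(basename "$port")" "$(cat "$port/link_layer")"
    done
  done
}

list_link_layers
```

On the dual-interconnect nodes above this should list mlx4_0's ports as Ethernet before mlx4_1's InfiniBand port, which is consistent with the qp-state failure if the verbs layer is opening the first device it finds.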


ibstat for a node with just IB:

CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 1
        Firmware version: 2.9.1000
        Hardware version: b0
        Node GUID: 0x002590ffff16bbe8
        System image GUID: 0x002590ffff16bbeb
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 14
                LMC: 0
                SM lid: 3
                Capability mask: 0x02510868
                Port GUID: 0x002590ffff16bbe9
                Link layer: InfiniBand




Thanks for your help.

Steve

--
______________________________________________________________________
Steve Cousins - Supercomputer Engineer/Administrator - Univ of Maine
Marine Sciences, 452 Aubert Hall       Target Tech, 20 Godfrey Drive
Orono, ME 04469    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~     Orono, ME 04473
(207) 581-4302     ~ steve.cousins at maine.edu ~     (207) 866-6552








Archive powered by MHonArc 2.6.16.
