charm - Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Gengbin Zheng <gzheng AT illinois.edu>
  • To: Stephen Cousins <steve.cousins AT maine.edu>
  • Cc: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
  • Subject: Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE
  • Date: Mon, 11 Jun 2012 14:00:28 -0500
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Hi Steve,

The Charm ib layer always assumes that ibport is 1, which is why it didn't
work when you have multiple interfaces.
I checked in a fix that tests the ib ports. It is in the latest main branch.
Can you give it a try?
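
To illustrate the idea of testing the ports instead of assuming port 1, here
is a minimal Python sketch. It is not the Charm++ fix itself; it only parses
ibstat text like the output quoted below and reports which CA/port pairs are
Active with an InfiniBand link layer. The names find_active_ib_ports and
SAMPLE are made up for this example, and SAMPLE is a trimmed copy of the
ibstat output from the dual-interface node.

```python
import re

# Trimmed ibstat output for a node with both 10 GbE and IB (from the report below).
SAMPLE = """\
CA 'mlx4_0'
        Number of ports: 2
        Port 1:
                State: Active
                Physical state: LinkUp
                Port GUID: 0x0202c9fffe34e8f0
                Link layer: Ethernet
        Port 2:
                State: Down
                Physical state: Disabled
                Port GUID: 0x0202c9fffe34e8f1
                Link layer: Ethernet
CA 'mlx4_1'
        Number of ports: 1
        Port 1:
                State: Active
                Physical state: LinkUp
                Port GUID: 0x002590ffff16b659
                Link layer: InfiniBand
"""


def find_active_ib_ports(ibstat_text):
    """Return (ca_name, port_number) pairs that are Active and use InfiniBand."""
    results = []
    ca = None      # name of the CA section currently being read
    port = None    # port number currently being read, if any
    state = None   # "State:" value seen for the current port
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '([^']+)'", line)
        if m:
            ca, port, state = m.group(1), None, None
            continue
        # Matches "Port 1:" but not "Port GUID: ..." because digits are required.
        m = re.match(r"Port (\d+):", line)
        if m:
            port, state = int(m.group(1)), None
            continue
        if port is None:
            continue  # still in the CA-level fields
        if line.startswith("State:"):
            state = line.split(":", 1)[1].strip()
        elif line.startswith("Link layer:"):
            layer = line.split(":", 1)[1].strip()
            if state == "Active" and layer == "InfiniBand":
                results.append((ca, port))
    return results


if __name__ == "__main__":
    print(find_active_ib_ports(SAMPLE))
```

On the sample above this selects only mlx4_1 port 1 and skips the Ethernet
ports on mlx4_0, which is the port that a "first device, port 1" assumption
would pick on such a node.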

Gengbin

On Mon, Jun 11, 2012 at 11:12 AM, Stephen Cousins <steve.cousins AT maine.edu> wrote:
> Hi,
>
> Is this list active? Does anyone have any ideas about how charmrun can
> specify specific types of interfaces when using ibverbs?
>
> Thanks,
>
> Steve
>
> On Mon, Jun 4, 2012 at 1:24 PM, Stephen Cousins <steve.cousins AT maine.edu> wrote:
>>
>> Hi,
>>
>> We have 16 nodes with both IB and 10GbE interfaces (both interfaces are
>> Mellanox). We also have 16 nodes that have just IB. I can run NAMD on the
>> IB-only nodes just fine, however if the job is allocated a node that has
>> both IB and 10 GbE then it does not work.
>>
>> charmrun output is:
>>
>> Charmrun> IBVERBS version of charmrun
>> [0] Stack Traceback:
>>   [0:0] CmiAbort+0x5c  [0xcbef1e]
>>   [0:1] initInfiOtherNodeData+0x14a  [0xcbe488]
>>   [0:2] /opt/scyld/NAMD_2.9_Linux-x86_64-ibverbs/namd2 [0xcbcf8b]
>>   [0:3] /opt/scyld/NAMD_2.9_Linux-x86_64-ibverbs/namd2 [0xcbd9e5]
>>   [0:4] ConverseInit+0x1cd  [0xcbe001]
>>   [0:5] _ZN7BackEnd4initEiPPc+0x6f  [0x58ad13]
>>   [0:6] main+0x2f  [0x585fd7]
>>   [0:7] __libc_start_main+0xf4  [0x3633c1d9b4]
>>   [0:8] _ZNSt8ios_base4InitD1Ev+0x4a  [0x54105a]
>>
>>
>> And STDERR for the job is:
>>
>> Charmrun> started all node programs in 1.544 seconds.
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: failed to change qp state to RTR
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: failed to change qp state to RTR
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: failed to change qp state to RTR
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: failed to change qp state to RTR
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: failed to change qp state to RTR
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: failed to change qp state to RTR
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: failed to change qp state to RTR
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: failed to change qp state to RTR
>> Fatal error on PE 0> failed to change qp state to RTR
>>
>>
>> We are running Moab and Torque for the scheduler and resource manager. The
>> version of NAMD is:
>>
>>     NAMD_2.9_Linux-x86_64-ibverbs
>>
>> Do I need to specify that the link layer should use IB as opposed to
>> Ethernet?
>>
>> ibstat for the nodes with both interconnects:
>>
>> CA 'mlx4_0'
>>         CA type: MT4099
>>         Number of ports: 2
>>         Firmware version: 2.10.0
>>         Hardware version: 0
>>         Node GUID: 0xffffffffffffffff
>>         System image GUID: 0xffffffffffffffff
>>         Port 1:
>>                 State: Active
>>                 Physical state: LinkUp
>>                 Rate: 10
>>                 Base lid: 0
>>                 LMC: 0
>>                 SM lid: 0
>>                 Capability mask: 0x00010000
>>                 Port GUID: 0x0202c9fffe34e8f0
>>                 Link layer: Ethernet
>>         Port 2:
>>                 State: Down
>>                 Physical state: Disabled
>>                 Rate: 10
>>                 Base lid: 0
>>                 LMC: 0
>>                 SM lid: 0
>>                 Capability mask: 0x00010000
>>                 Port GUID: 0x0202c9fffe34e8f1
>>                 Link layer: Ethernet
>> CA 'mlx4_1'
>>         CA type: MT26428
>>         Number of ports: 1
>>         Firmware version: 2.9.1000
>>         Hardware version: b0
>>         Node GUID: 0x002590ffff16b658
>>         System image GUID: 0x002590ffff16b65b
>>         Port 1:
>>                 State: Active
>>                 Physical state: LinkUp
>>                 Rate: 40
>>                 Base lid: 21
>>                 LMC: 0
>>                 SM lid: 3
>>                 Capability mask: 0x02510868
>>                 Port GUID: 0x002590ffff16b659
>>                 Link layer: InfiniBand
>>
>>
>> ibstat for a node with just IB:
>>
>> CA 'mlx4_0'
>>         CA type: MT26428
>>         Number of ports: 1
>>         Firmware version: 2.9.1000
>>         Hardware version: b0
>>         Node GUID: 0x002590ffff16bbe8
>>         System image GUID: 0x002590ffff16bbeb
>>         Port 1:
>>                 State: Active
>>                 Physical state: LinkUp
>>                 Rate: 40
>>                 Base lid: 14
>>                 LMC: 0
>>                 SM lid: 3
>>                 Capability mask: 0x02510868
>>                 Port GUID: 0x002590ffff16bbe9
>>                 Link layer: InfiniBand
>>
>>
>>
>>
>> Thanks for your help.
>>
>> Steve
>>
>> --
>> ______________________________________________________________________
>> Steve Cousins - Supercomputer Engineer/Administrator - Univ of Maine
>> Marine Sciences, 452 Aubert Hall       Target Tech, 20 Godfrey Drive
>> Orono, ME 04469    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~     Orono, ME 04473
>> (207) 581-4302     ~ steve.cousins at maine.edu ~     (207) 866-6552
>>




