Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE


  • From: Gengbin Zheng <gzheng AT illinois.edu>
  • To: Jim Phillips <jim AT ks.uiuc.edu>
  • Cc: Charm Mailing List <charm AT cs.illinois.edu>
  • Subject: Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE
  • Date: Tue, 12 Jun 2012 11:41:29 -0500
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Yes: I first test whether a port is active, and then test whether its link layer is IB.
I have not actually tested these scenarios, though. I hope my fix works.

Gengbin

On Tue, Jun 12, 2012 at 11:37 AM, Jim Phillips <jim AT ks.uiuc.edu> wrote:
>
> Is there also a way to test if the port is active?  People have reported
> that their two-port IB cards only work if the first port is used.
>
> -Jim
>
> On Tue, 12 Jun 2012, Gengbin Zheng wrote:
>
>> I added one more line of code to check whether the link layer is IB or
>> Ethernet from the ibv_port_attr; hope this helps.
>> The change is in the current git branch.
>>
>> Gengbin
>>
>>
>> On Tue, Jun 12, 2012 at 8:15 AM, Stephen Cousins <steve.cousins AT maine.edu> wrote:
>>> Hi Gengbin,
>>>
>>> That is where I started, but it doesn't appear to be possible to reorder
>>> the devices on this cluster; there is no udev on the nodes. For now, what
>>> you have done should be fine. I'll try to test it today.
>>>
>>> Thanks very much.
>>>
>>> Steve
>>>
>>> On Tue, Jun 12, 2012 at 1:37 AM, Gengbin Zheng <gzheng AT illinois.edu> wrote:
>>>>
>>>> I am not very sure about this. It may not be able to check whether the
>>>> port is IB or Ethernet.
>>>> As a workaround, if you make port 1 IB and port 2 Ethernet, Charm should
>>>> work.
>>>>
>>>> Gengbin
>>>>
>>>> On Mon, Jun 11, 2012 at 5:28 PM, Stephen Cousins <steve.cousins AT maine.edu> wrote:
>>>>> So if it is a 10 GbE port that uses the same driver and is up and
>>>>> running (doing Ethernet things), will it still get in the way? If so,
>>>>> can your solution check whether it is IB vs. Ethernet? Right now I just
>>>>> want something to work, so even if we have to disable Ethernet on that
>>>>> device that would be fine. A longer-term goal, though, would be to be
>>>>> able to use both IB and 10 GbE on these nodes.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Steve
>>>>>
>>>>>
>>>>> On Mon, Jun 11, 2012 at 4:10 PM, Gengbin Zheng <gzheng AT illinois.edu> wrote:
>>>>>>
>>>>>> I added a call to ibv_query_port to test whether a port is valid,
>>>>>> starting from port number 1.
>>>>>>
>>>>>> Gengbin
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 11, 2012 at 2:57 PM, Stephen Cousins <steve.cousins AT maine.edu> wrote:
>>>>>>> Hi Gengbin,
>>>>>>>
>>>>>>> Thanks a lot. I'll give it a try.
>>>>>>>
>>>>>>> How is the new test done? Do you check the link layer to make sure it
>>>>>>> actually is an IB device as opposed to Ethernet?
>>>>>>>
>>>>>>> Steve
>>>>>>>
>>>>>>> On Mon, Jun 11, 2012 at 3:00 PM, Gengbin Zheng <gzheng AT illinois.edu> wrote:
>>>>>>>>
>>>>>>>> Hi Steve,
>>>>>>>>
>>>>>>>>  The Charm ib layer always assumes ibport is 1; that is why it
>>>>>>>> didn't work when you have multiple interfaces.
>>>>>>>>  I checked in a fix to test the ib ports. It is in the latest main
>>>>>>>> branch.
>>>>>>>>  Can you give it a try?
>>>>>>>>
>>>>>>>> Gengbin
>>>>>>>>
>>>>>>>> On Mon, Jun 11, 2012 at 11:12 AM, Stephen Cousins <steve.cousins AT maine.edu> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Is this list active? Does anyone have any ideas about how charmrun
>>>>>>>>> can specify specific types of interfaces when using ibverbs?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Steve
>>>>>>>>>
>>>>>>>>> On Mon, Jun 4, 2012 at 1:24 PM, Stephen Cousins <steve.cousins AT maine.edu> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> We have 16 nodes with both IB and 10 GbE interfaces (both
>>>>>>>>>> interfaces are Mellanox). We also have 16 nodes that have just IB.
>>>>>>>>>> I can run NAMD on the IB-only nodes just fine; however, if the job
>>>>>>>>>> is allocated a node that has both IB and 10 GbE, it does not work.
>>>>>>>>>>
>>>>>>>>>> charmrun output is:
>>>>>>>>>>
>>>>>>>>>> Charmrun> IBVERBS version of charmrun
>>>>>>>>>> [0] Stack Traceback:
>>>>>>>>>>   [0:0] CmiAbort+0x5c  [0xcbef1e]
>>>>>>>>>>   [0:1] initInfiOtherNodeData+0x14a  [0xcbe488]
>>>>>>>>>>   [0:2] /opt/scyld/NAMD_2.9_Linux-x86_64-ibverbs/namd2 [0xcbcf8b]
>>>>>>>>>>   [0:3] /opt/scyld/NAMD_2.9_Linux-x86_64-ibverbs/namd2 [0xcbd9e5]
>>>>>>>>>>   [0:4] ConverseInit+0x1cd  [0xcbe001]
>>>>>>>>>>   [0:5] _ZN7BackEnd4initEiPPc+0x6f  [0x58ad13]
>>>>>>>>>>   [0:6] main+0x2f  [0x585fd7]
>>>>>>>>>>   [0:7] __libc_start_main+0xf4  [0x3633c1d9b4]
>>>>>>>>>>   [0:8] _ZNSt8ios_base4InitD1Ev+0x4a  [0x54105a]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> And STDERR for the job is:
>>>>>>>>>>
>>>>>>>>>> Charmrun> started all node programs in 1.544 seconds.
>>>>>>>>>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>>>>>>>>>> Reason: failed to change qp state to RTR
>>>>>>>>>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>>>>>>>>>> Reason: failed to change qp state to RTR
>>>>>>>>>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>>>>>>>>>> Reason: failed to change qp state to RTR
>>>>>>>>>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>>>>>>>>>> Reason: failed to change qp state to RTR
>>>>>>>>>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>>>>>>>>>> Reason: failed to change qp state to RTR
>>>>>>>>>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>>>>>>>>>> Reason: failed to change qp state to RTR
>>>>>>>>>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>>>>>>>>>> Reason: failed to change qp state to RTR
>>>>>>>>>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>>>>>>>>>> Reason: failed to change qp state to RTR
>>>>>>>>>> Fatal error on PE 0> failed to change qp state to RTR
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> We are running Moab and Torque for the scheduler and resource
>>>>>>>>>> manager. The version of NAMD is:
>>>>>>>>>>
>>>>>>>>>>     NAMD_2.9_Linux-x86_64-ibverbs
>>>>>>>>>>
>>>>>>>>>> Do I need to specify that the link layer should use IB as opposed
>>>>>>>>>> to Ethernet?
>>>>>>>>>>
>>>>>>>>>> ibstat for the nodes with both interconnects:
>>>>>>>>>>
>>>>>>>>>> CA 'mlx4_0'
>>>>>>>>>>         CA type: MT4099
>>>>>>>>>>         Number of ports: 2
>>>>>>>>>>         Firmware version: 2.10.0
>>>>>>>>>>         Hardware version: 0
>>>>>>>>>>         Node GUID: 0xffffffffffffffff
>>>>>>>>>>         System image GUID: 0xffffffffffffffff
>>>>>>>>>>         Port 1:
>>>>>>>>>>                 State: Active
>>>>>>>>>>                 Physical state: LinkUp
>>>>>>>>>>                 Rate: 10
>>>>>>>>>>                 Base lid: 0
>>>>>>>>>>                 LMC: 0
>>>>>>>>>>                 SM lid: 0
>>>>>>>>>>                 Capability mask: 0x00010000
>>>>>>>>>>                 Port GUID: 0x0202c9fffe34e8f0
>>>>>>>>>>                 Link layer: Ethernet
>>>>>>>>>>         Port 2:
>>>>>>>>>>                 State: Down
>>>>>>>>>>                 Physical state: Disabled
>>>>>>>>>>                 Rate: 10
>>>>>>>>>>                 Base lid: 0
>>>>>>>>>>                 LMC: 0
>>>>>>>>>>                 SM lid: 0
>>>>>>>>>>                 Capability mask: 0x00010000
>>>>>>>>>>                 Port GUID: 0x0202c9fffe34e8f1
>>>>>>>>>>                 Link layer: Ethernet
>>>>>>>>>> CA 'mlx4_1'
>>>>>>>>>>         CA type: MT26428
>>>>>>>>>>         Number of ports: 1
>>>>>>>>>>         Firmware version: 2.9.1000
>>>>>>>>>>         Hardware version: b0
>>>>>>>>>>         Node GUID: 0x002590ffff16b658
>>>>>>>>>>         System image GUID: 0x002590ffff16b65b
>>>>>>>>>>         Port 1:
>>>>>>>>>>                 State: Active
>>>>>>>>>>                 Physical state: LinkUp
>>>>>>>>>>                 Rate: 40
>>>>>>>>>>                 Base lid: 21
>>>>>>>>>>                 LMC: 0
>>>>>>>>>>                 SM lid: 3
>>>>>>>>>>                 Capability mask: 0x02510868
>>>>>>>>>>                 Port GUID: 0x002590ffff16b659
>>>>>>>>>>                 Link layer: InfiniBand
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ibstat for a node with just IB:
>>>>>>>>>>
>>>>>>>>>> CA 'mlx4_0'
>>>>>>>>>>         CA type: MT26428
>>>>>>>>>>         Number of ports: 1
>>>>>>>>>>         Firmware version: 2.9.1000
>>>>>>>>>>         Hardware version: b0
>>>>>>>>>>         Node GUID: 0x002590ffff16bbe8
>>>>>>>>>>         System image GUID: 0x002590ffff16bbeb
>>>>>>>>>>         Port 1:
>>>>>>>>>>                 State: Active
>>>>>>>>>>                 Physical state: LinkUp
>>>>>>>>>>                 Rate: 40
>>>>>>>>>>                 Base lid: 14
>>>>>>>>>>                 LMC: 0
>>>>>>>>>>                 SM lid: 3
>>>>>>>>>>                 Capability mask: 0x02510868
>>>>>>>>>>                 Port GUID: 0x002590ffff16bbe9
>>>>>>>>>>                 Link layer: InfiniBand
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks for your help.
>>>>>>>>>>
>>>>>>>>>> Steve
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ______________________________________________________________________
>>>>>>>>>> Steve Cousins - Supercomputer Engineer/Administrator - Univ of Maine
>>>>>>>>>> Marine Sciences, 452 Aubert Hall       Target Tech, 20 Godfrey Drive
>>>>>>>>>> Orono, ME 04469    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~     Orono, ME 04473
>>>>>>>>>> (207) 581-4302     ~ steve.cousins at maine.edu ~     (207) 866-6552
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>>>
>>
>> _______________________________________________
>> charm mailing list
>> charm AT cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/charm
>>
>> _______________________________________________
>> ppl mailing list
>> ppl AT cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/ppl
>>

Archive powered by MHonArc 2.6.16.