
charm - Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Stephen Cousins <steve.cousins AT maine.edu>
  • To: Gengbin Zheng <gzheng AT illinois.edu>
  • Cc: Charm Mailing List <charm AT cs.illinois.edu>, Jim Phillips <jim AT ks.uiuc.edu>
  • Subject: Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE
  • Date: Thu, 14 Jun 2012 18:40:31 -0400
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Hi Gengbin,

I just pulled the latest and recompiled. It is working like a... well, charm. (Sorry.)

Thanks for working on this for us. I hope it will benefit others.

Best Regards,

Steve

On Thu, Jun 14, 2012 at 5:13 PM, Gengbin Zheng <gzheng AT illinois.edu> wrote:
I just checked in one more fix for this. Besides charm enumerating all
the ports, all devices are now enumerated too, to find an IB interface.
We used to assume that device 0 is always IB.
I tested it on Steve's cluster at the University of Maine myself, and it
seems to work now, capable of handling nodes with different IB interface setups.
The changes are checked in to the charm git repository.

Gengbin

On Tue, Jun 12, 2012 at 11:41 AM, Zheng, Gengbin <gzheng AT illinois.edu> wrote:
> Yes, I first test whether a port is active, and then test whether it is IB.
> I have not really tested these scenarios, though. I hope my fix works.
>
> Gengbin
>
> On Tue, Jun 12, 2012 at 11:37 AM, Jim Phillips <jim AT ks.uiuc.edu> wrote:
>>
>> Is there also a way to test if the port is active?  People have reported
>> that their two-port IB cards only work if the first port is used.
>>
>> -Jim
>>
>> On Tue, 12 Jun 2012, Gengbin Zheng wrote:
>>
>>> I added one more line of code to check whether the link layer is IB or
>>> Ethernet, using the ibv_port_attr; hope this helps.
>>> The change is in the git current branch.
>>>
>>> Gengbin
>>>
>>>
>>> On Tue, Jun 12, 2012 at 8:15 AM, Stephen Cousins
>>> <steve.cousins AT maine.edu> wrote:
>>>> Hi Gengbin,
>>>>
>>>> That is where I started, but it doesn't appear to be possible to reorder
>>>> the devices on this cluster: there is no udev on the nodes. For now, what
>>>> you have done should be fine. I'll try to test it today.
>>>>
>>>> Thanks very much.
>>>>
>>>> Steve
>>>>
>>>> On Tue, Jun 12, 2012 at 1:37 AM, Gengbin Zheng <gzheng AT illinois.edu> wrote:
>>>>>
>>>>> I am not very sure about this. Maybe it cannot check whether it is IB
>>>>> or Ethernet.
>>>>> As a workaround, if you make port 1 IB and port 2 Ethernet, Charm
>>>>> should work.
>>>>>
>>>>> Gengbin
>>>>>
>>>>> On Mon, Jun 11, 2012 at 5:28 PM, Stephen Cousins
>>>>> <steve.cousins AT maine.edu> wrote:
>>>>>> So if it is a 10 GbE port that uses the same driver and is up and
>>>>>> running (doing Ethernet things), will it still get in the way? If so,
>>>>>> can your solution check whether it is IB vs. Ethernet? Right now I just
>>>>>> want something to work, so even if we have to disable Ethernet on that
>>>>>> device, that would be fine. A longer-term goal, though, would be to be
>>>>>> able to use both IB and 10 GbE on these nodes.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Steve
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 11, 2012 at 4:10 PM, Gengbin Zheng <gzheng AT illinois.edu>
>>>>>> wrote:
>>>>>>>
>>>>>>> I added a call to ib_query_port to test whether a port is valid,
>>>>>>> starting from port number 1.
>>>>>>>
>>>>>>> Gengbin
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 11, 2012 at 2:57 PM, Stephen Cousins
>>>>>>> <steve.cousins AT maine.edu> wrote:
>>>>>>>> Hi Gengbin,
>>>>>>>>
>>>>>>>> Thanks a lot. I'll give it a try.
>>>>>>>>
>>>>>>>> How is the new test done? Do you check the Link Layer to make sure it
>>>>>>>> actually is an IB device as opposed to Ethernet?
>>>>>>>>
>>>>>>>> Steve
>>>>>>>>
>>>>>>>> On Mon, Jun 11, 2012 at 3:00 PM, Gengbin Zheng <gzheng AT illinois.edu>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Steve,
>>>>>>>>>
>>>>>>>>>  The Charm ib layer always assumes ibport is 1; that is why it
>>>>>>>>> didn't work when you have multiple interfaces.
>>>>>>>>>  I checked in a fix to test the ib ports. It is in the latest main
>>>>>>>>> branch.
>>>>>>>>>  Can you give it a try?
>>>>>>>>>
>>>>>>>>> Gengbin
>>>>>>>>>
>>>>>>>>> On Mon, Jun 11, 2012 at 11:12 AM, Stephen Cousins
>>>>>>>>> <steve.cousins AT maine.edu> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Is this list active? Does anyone have any ideas about how charmrun
>>>>>>>>>> can specify specific types of interfaces when using ibverbs?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Steve
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 4, 2012 at 1:24 PM, Stephen Cousins
>>>>>>>>>> <steve.cousins AT maine.edu>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> We have 16 nodes with both IB and 10 GbE interfaces (both
>>>>>>>>>>> interfaces are Mellanox). We also have 16 nodes that have just IB.
>>>>>>>>>>> I can run NAMD on the IB-only nodes just fine; however, if the job
>>>>>>>>>>> is allocated a node that has both IB and 10 GbE, then it does not
>>>>>>>>>>> work.
>>>>>>>>>>>
>>>>>>>>>>> charmrun output is:
>>>>>>>>>>>
>>>>>>>>>>> Charmrun> IBVERBS version of charmrun
>>>>>>>>>>> [0] Stack Traceback:
>>>>>>>>>>>   [0:0] CmiAbort+0x5c  [0xcbef1e]
>>>>>>>>>>>   [0:1] initInfiOtherNodeData+0x14a  [0xcbe488]
>>>>>>>>>>>   [0:2] /opt/scyld/NAMD_2.9_Linux-x86_64-ibverbs/namd2 [0xcbcf8b]
>>>>>>>>>>>   [0:3] /opt/scyld/NAMD_2.9_Linux-x86_64-ibverbs/namd2 [0xcbd9e5]
>>>>>>>>>>>   [0:4] ConverseInit+0x1cd  [0xcbe001]
>>>>>>>>>>>   [0:5] _ZN7BackEnd4initEiPPc+0x6f  [0x58ad13]
>>>>>>>>>>>   [0:6] main+0x2f  [0x585fd7]
>>>>>>>>>>>   [0:7] __libc_start_main+0xf4  [0x3633c1d9b4]
>>>>>>>>>>>   [0:8] _ZNSt8ios_base4InitD1Ev+0x4a  [0x54105a]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> And STDERR for the job is:
>>>>>>>>>>>
>>>>>>>>>>> Charmrun> started all node programs in 1.544 seconds.
>>>>>>>>>>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>>>>>>>>>>> Reason: failed to change qp state to RTR
>>>>>>>>>>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>>>>>>>>>>> Reason: failed to change qp state to RTR
>>>>>>>>>>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>>>>>>>>>>> Reason: failed to change qp state to RTR
>>>>>>>>>>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>>>>>>>>>>> Reason: failed to change qp state to RTR
>>>>>>>>>>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>>>>>>>>>>> Reason: failed to change qp state to RTR
>>>>>>>>>>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>>>>>>>>>>> Reason: failed to change qp state to RTR
>>>>>>>>>>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>>>>>>>>>>> Reason: failed to change qp state to RTR
>>>>>>>>>>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>>>>>>>>>>> Reason: failed to change qp state to RTR
>>>>>>>>>>> Fatal error on PE 0> failed to change qp state to RTR
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> We are running Moab and Torque for the scheduler and resource
>>>>>>>>>>> manager. The version of NAMD is:
>>>>>>>>>>>
>>>>>>>>>>>     NAMD_2.9_Linux-x86_64-ibverbs
>>>>>>>>>>>
>>>>>>>>>>> Do I need to specify that the link layer should use IB as opposed
>>>>>>>>>>> to Ethernet?
>>>>>>>>>>>
>>>>>>>>>>> ibstat for the nodes with both interconnects:
>>>>>>>>>>>
>>>>>>>>>>> CA 'mlx4_0'
>>>>>>>>>>>         CA type: MT4099
>>>>>>>>>>>         Number of ports: 2
>>>>>>>>>>>         Firmware version: 2.10.0
>>>>>>>>>>>         Hardware version: 0
>>>>>>>>>>>         Node GUID: 0xffffffffffffffff
>>>>>>>>>>>         System image GUID: 0xffffffffffffffff
>>>>>>>>>>>         Port 1:
>>>>>>>>>>>                 State: Active
>>>>>>>>>>>                 Physical state: LinkUp
>>>>>>>>>>>                 Rate: 10
>>>>>>>>>>>                 Base lid: 0
>>>>>>>>>>>                 LMC: 0
>>>>>>>>>>>                 SM lid: 0
>>>>>>>>>>>                 Capability mask: 0x00010000
>>>>>>>>>>>                 Port GUID: 0x0202c9fffe34e8f0
>>>>>>>>>>>                 Link layer: Ethernet
>>>>>>>>>>>         Port 2:
>>>>>>>>>>>                 State: Down
>>>>>>>>>>>                 Physical state: Disabled
>>>>>>>>>>>                 Rate: 10
>>>>>>>>>>>                 Base lid: 0
>>>>>>>>>>>                 LMC: 0
>>>>>>>>>>>                 SM lid: 0
>>>>>>>>>>>                 Capability mask: 0x00010000
>>>>>>>>>>>                 Port GUID: 0x0202c9fffe34e8f1
>>>>>>>>>>>                 Link layer: Ethernet
>>>>>>>>>>> CA 'mlx4_1'
>>>>>>>>>>>         CA type: MT26428
>>>>>>>>>>>         Number of ports: 1
>>>>>>>>>>>         Firmware version: 2.9.1000
>>>>>>>>>>>         Hardware version: b0
>>>>>>>>>>>         Node GUID: 0x002590ffff16b658
>>>>>>>>>>>         System image GUID: 0x002590ffff16b65b
>>>>>>>>>>>         Port 1:
>>>>>>>>>>>                 State: Active
>>>>>>>>>>>                 Physical state: LinkUp
>>>>>>>>>>>                 Rate: 40
>>>>>>>>>>>                 Base lid: 21
>>>>>>>>>>>                 LMC: 0
>>>>>>>>>>>                 SM lid: 3
>>>>>>>>>>>                 Capability mask: 0x02510868
>>>>>>>>>>>                 Port GUID: 0x002590ffff16b659
>>>>>>>>>>>                 Link layer: InfiniBand
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ibstat for a node with just IB:
>>>>>>>>>>>
>>>>>>>>>>> CA 'mlx4_0'
>>>>>>>>>>>         CA type: MT26428
>>>>>>>>>>>         Number of ports: 1
>>>>>>>>>>>         Firmware version: 2.9.1000
>>>>>>>>>>>         Hardware version: b0
>>>>>>>>>>>         Node GUID: 0x002590ffff16bbe8
>>>>>>>>>>>         System image GUID: 0x002590ffff16bbeb
>>>>>>>>>>>         Port 1:
>>>>>>>>>>>                 State: Active
>>>>>>>>>>>                 Physical state: LinkUp
>>>>>>>>>>>                 Rate: 40
>>>>>>>>>>>                 Base lid: 14
>>>>>>>>>>>                 LMC: 0
>>>>>>>>>>>                 SM lid: 3
>>>>>>>>>>>                 Capability mask: 0x02510868
>>>>>>>>>>>                 Port GUID: 0x002590ffff16bbe9
>>>>>>>>>>>                 Link layer: InfiniBand
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your help.
>>>>>>>>>>>
>>>>>>>>>>> Steve
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> ______________________________________________________________________
>>>>>>>>>>> Steve Cousins - Supercomputer Engineer/Administrator - Univ of Maine
>>>>>>>>>>> Marine Sciences, 452 Aubert Hall       Target Tech, 20 Godfrey Drive
>>>>>>>>>>> Orono, ME 04469    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~     Orono, ME 04473
>>>>>>>>>>> (207) 581-4302     ~ steve.cousins at maine.edu ~     (207) 866-6552
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>> _______________________________________________
>>> charm mailing list
>>> charm AT cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/charm
>>>
>>> _______________________________________________
>>> ppl mailing list
>>> ppl AT cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/ppl
>>>
>







