charm - Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE

  • From: Jim Phillips <jim AT ks.uiuc.edu>
  • To: Gengbin Zheng <gzheng AT illinois.edu>
  • Cc: Charm Mailing List <charm AT cs.illinois.edu>
  • Subject: Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE
  • Date: Tue, 12 Jun 2012 11:37:53 -0500 (CDT)
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>


Is there also a way to test if the port is active? People have reported that their two-port IB cards only work if the first port is used.

-Jim
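
A minimal sketch of such an active-port test, using only standard
libibverbs calls (an illustration, not the actual Charm++ code):

#include <stdint.h>
#include <infiniband/verbs.h>

/* Return 1 if the given HCA port reports IBV_PORT_ACTIVE, 0 otherwise. */
static int port_is_active(struct ibv_context *ctx, uint8_t port)
{
    struct ibv_port_attr attr;

    if (ibv_query_port(ctx, port, &attr) != 0)
        return 0;              /* query failed: treat the port as unusable */
    return attr.state == IBV_PORT_ACTIVE;
}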

On Tue, 12 Jun 2012, Gengbin Zheng wrote:

I added one more line of code to check whether the link layer is IB or
Ethernet from the ibv_port_attr; hope this helps.
The change is in the current git branch.

Gengbin
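
A minimal sketch of what a link-layer check via ibv_port_attr can look
like (an illustration assuming a libibverbs new enough to define
IBV_LINK_LAYER_*, not the committed change itself):

#include <stdint.h>
#include <infiniband/verbs.h>

/* Return 1 if the port's link layer is InfiniBand (as opposed to
 * Ethernet/RoCE), 0 otherwise. */
static int port_is_infiniband(struct ibv_context *ctx, uint8_t port)
{
    struct ibv_port_attr attr;

    if (ibv_query_port(ctx, port, &attr) != 0)
        return 0;
    return attr.link_layer == IBV_LINK_LAYER_INFINIBAND;
}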


On Tue, Jun 12, 2012 at 8:15 AM, Stephen Cousins
<steve.cousins AT maine.edu>
wrote:
Hi Gengbin,

That is where I started, but it doesn't appear to be possible to reorder the
devices on this cluster: there is no udev on the nodes. For now, what you have
done should be fine. I'll try to test it today.

Thanks very much.

Steve

On Tue, Jun 12, 2012 at 1:37 AM, Gengbin Zheng
<gzheng AT illinois.edu>
wrote:

I am not very sure about this. Maybe it cannot check whether it is IB or
Ethernet.
As a workaround, if you make port 1 IB and port 2 Ethernet, Charm should
work.

Gengbin

On Mon, Jun 11, 2012 at 5:28 PM, Stephen Cousins
<steve.cousins AT maine.edu>
wrote:
So if it is a 10 GbE port that uses the same driver and is up and running
(doing Ethernet things), will it still get in the way? If so, can your
solution check whether it is IB vs. Ethernet? Right now I just want something
to work, so even if we have to disable Ethernet on that device that would be
fine. A longer-term goal, though, would be to be able to use both IB and
10 GbE on these nodes.

Thanks,

Steve


On Mon, Jun 11, 2012 at 4:10 PM, Gengbin Zheng
<gzheng AT illinois.edu>
wrote:

I added a call to ibv_query_port to test whether a port is valid or not,
starting from port number 1.

Gengbin
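
A sketch of such a scan, walking every device and every port from port
number 1 and combining the state and link-layer checks discussed in this
thread (illustration only; the actual fix lives in the Charm++ ibverbs
machine layer):

#include <stdint.h>
#include <stdio.h>
#include <infiniband/verbs.h>

/* List every verbs port that is ACTIVE and whose link layer is InfiniBand,
 * skipping Ethernet/RoCE ports such as those on mlx4_0 in the ibstat
 * output further down. */
int main(void)
{
    int ndev = 0;
    struct ibv_device **devs = ibv_get_device_list(&ndev);

    if (!devs)
        return 1;
    for (int d = 0; d < ndev; d++) {
        struct ibv_context *ctx = ibv_open_device(devs[d]);
        struct ibv_device_attr dattr;

        if (!ctx)
            continue;
        if (ibv_query_device(ctx, &dattr) == 0) {
            for (int p = 1; p <= dattr.phys_port_cnt; p++) {
                struct ibv_port_attr pattr;
                if (ibv_query_port(ctx, p, &pattr) == 0 &&
                    pattr.state == IBV_PORT_ACTIVE &&
                    pattr.link_layer == IBV_LINK_LAYER_INFINIBAND)
                    printf("usable IB port: %s port %d\n",
                           ibv_get_device_name(devs[d]), p);
            }
        }
        ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}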


On Mon, Jun 11, 2012 at 2:57 PM, Stephen Cousins
<steve.cousins AT maine.edu>
wrote:
Hi Gengbin,

Thanks a lot. I'll give it a try.

How is the new test done? Do you check the Link Layer to make sure it
actually is an IB device as opposed to Ethernet?

Steve

On Mon, Jun 11, 2012 at 3:00 PM, Gengbin Zheng
<gzheng AT illinois.edu>
wrote:

Hi Steve,

 The Charm ib layer always assumes ibport is 1; that is why it didn't work
when you have multiple interfaces.
 I checked in a fix to test the IB ports. It is in the latest main branch.
 Can you give it a try?

Gengbin

On Mon, Jun 11, 2012 at 11:12 AM, Stephen Cousins
<steve.cousins AT maine.edu>
wrote:
Hi,

Is this list active? Does anyone have any ideas about how charmrun can
specify which type of interface to use with ibverbs?

Thanks,

Steve

On Mon, Jun 4, 2012 at 1:24 PM, Stephen Cousins
<steve.cousins AT maine.edu>
wrote:

Hi,

We have 16 nodes with both IB and 10 GbE interfaces (both interfaces are
Mellanox). We also have 16 nodes that have just IB. I can run NAMD on the
IB-only nodes just fine; however, if the job is allocated a node that has
both IB and 10 GbE then it does not work.

charmrun output is:

Charmrun> IBVERBS version of charmrun
[0] Stack Traceback:
  [0:0] CmiAbort+0x5c  [0xcbef1e]
  [0:1] initInfiOtherNodeData+0x14a  [0xcbe488]
  [0:2] /opt/scyld/NAMD_2.9_Linux-x86_64-ibverbs/namd2 [0xcbcf8b]
  [0:3] /opt/scyld/NAMD_2.9_Linux-x86_64-ibverbs/namd2 [0xcbd9e5]
  [0:4] ConverseInit+0x1cd  [0xcbe001]
  [0:5] _ZN7BackEnd4initEiPPc+0x6f  [0x58ad13]
  [0:6] main+0x2f  [0x585fd7]
  [0:7] __libc_start_main+0xf4  [0x3633c1d9b4]
  [0:8] _ZNSt8ios_base4InitD1Ev+0x4a  [0x54105a]


And STDERR for the job is:

Charmrun> started all node programs in 1.544 seconds.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
Fatal error on PE 0> failed to change qp state to RTR


We are running Moab and Torque for the scheduler and resource manager. The
version of NAMD is:

    NAMD_2.9_Linux-x86_64-ibverbs

Do I need to specify that the link layer should be IB as opposed to
Ethernet?

ibstat for the nodes with both interconnects:

CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 2
        Firmware version: 2.10.0
        Hardware version: 0
        Node GUID: 0xffffffffffffffff
        System image GUID: 0xffffffffffffffff
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x0202c9fffe34e8f0
                Link layer: Ethernet
        Port 2:
                State: Down
                Physical state: Disabled
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x0202c9fffe34e8f1
                Link layer: Ethernet
CA 'mlx4_1'
        CA type: MT26428
        Number of ports: 1
        Firmware version: 2.9.1000
        Hardware version: b0
        Node GUID: 0x002590ffff16b658
        System image GUID: 0x002590ffff16b65b
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 21
                LMC: 0
                SM lid: 3
                Capability mask: 0x02510868
                Port GUID: 0x002590ffff16b659
                Link layer: InfiniBand


ibstat for a node with just IB:

CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 1
        Firmware version: 2.9.1000
        Hardware version: b0
        Node GUID: 0x002590ffff16bbe8
        System image GUID: 0x002590ffff16bbeb
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 14
                LMC: 0
                SM lid: 3
                Capability mask: 0x02510868
                Port GUID: 0x002590ffff16bbe9
                Link layer: InfiniBand




Thanks for your help.

Steve

--
______________________________________________________________________
Steve Cousins - Supercomputer Engineer/Administrator - Univ of Maine
Marine Sciences, 452 Aubert Hall       Target Tech, 20 Godfrey Drive
Orono, ME 04469    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~     Orono, ME 04473
(207) 581-4302     ~ steve.cousins at maine.edu ~     (207) 866-6552

_______________________________________________
charm mailing list
charm AT cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/charm

_______________________________________________
ppl mailing list
ppl AT cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/ppl


