[gpfsug-discuss] Problems with remote mount via routed IB

Zachary Mance zmance at ucar.edu
Mon Mar 12 22:10:06 GMT 2018


Since I am testing out remote mounting with EDR IB routers, I'll add to the
discussion.

In my lab environment I was seeing the same RDMA connections being
established and then disconnected shortly after. The remote file system
would eventually mount on the clients, but it took quite a while
(~2 minutes). Even after mounting, accessing files or any metadata operations
would take a while to execute, but they did eventually complete.

After enabling verbsRdmaCm, everything mounted just fine and in a timely
manner. Spectrum Scale was using the librdmacm.so library.

I would first double-check that both clusters can talk to each other on
their IPoIB addresses, then make sure verbsRdmaCm is enabled on both
clusters.
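
As a rough sketch (the IPoIB addresses below are taken from the subnets in
the original post and are only placeholders for your environment):

    # Verify IPoIB connectivity in both directions between the clusters
    ping -c 3 192.168.12.5    # from a cluster 1 node to a cluster 2 IPoIB address
    ping -c 3 192.168.11.1    # from a cluster 2 node to a cluster 1 IPoIB address

    # Enable the RDMA connection manager on each cluster, then verify
    mmchconfig verbsRdmaCm=enable
    mmlsconfig verbsRdmaCm

As noted further down the thread, the change most likely needs GPFS
restarted on the affected nodes before it takes effect.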


---------------------------------------------------------------------------------------------------------------
Zach Mance  zmance at ucar.edu  (303) 497-1883

HPC Data Infrastructure Group / CISL / NCAR
---------------------------------------------------------------------------------------------------------------

On Thu, Mar 1, 2018 at 1:41 AM, John Hearns <john.hearns at asml.com> wrote:

> In reply to Stuart,
> our setup is entirely Infiniband. We boot and install over IB, and rely
> heavily on IP over Infiniband.
>
> As for users being 'confused' due to multiple IPs, I would appreciate some
> more depth on that one.
> Sure, all batch systems are sensitive to hostnames (as I know to my cost!)
> but once you get that straightened out why should users care?
> I am not being aggressive, just keen to find out more.
>
>
>
> -----Original Message-----
> From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-
> bounces at spectrumscale.org] On Behalf Of Stuart Barkley
> Sent: Wednesday, February 28, 2018 6:50 PM
> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB
>
> The problem with CM is that it seems to require configuring IP over
> Infiniband.
>
> I'm rather strongly opposed to IP over IB.  We did run IPoIB years ago,
> but pulled it out of our environment as adding unneeded complexity.  It
> requires provisioning IP addresses across the Infiniband infrastructure and
> possibly adding routers to other portions of the IP infrastructure.  It was
> also confusing some users due to multiple IPs on the compute infrastructure.
>
> We have recently been in discussions with a vendor about their support for
> GPFS over IB and they kept directing us to using CM (which still didn't
> work).  CM wasn't necessary once we found out about the actual problem (we
> needed the undocumented verbsRdmaUseGidIndexZero configuration option among
> other things due to their use of SR-IOV based virtual IB interfaces).
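>
> For illustration only -- a sketch of how such an option would be set with
> mmchconfig (the parameter name is as above, but since the option is
> undocumented the yes/no value syntax here is an assumption):
>
>     mmchconfig verbsRdmaUseGidIndexZero=yes -N <affected nodes>
>
> followed by restarting GPFS on those nodes.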
>
> We don't use routed Infiniband and it might be that CM and IPoIB are
> required for IB routing, but I doubt it.  It sounds like the OP is keeping
> IB and IP infrastructure separate.
>
> Stuart Barkley
>
> On Mon, 26 Feb 2018 at 14:16 -0000, Aaron Knister wrote:
>
> > Date: Mon, 26 Feb 2018 14:16:34
> > From: Aaron Knister <aaron.s.knister at nasa.gov>
> > Reply-To: gpfsug main discussion list
> > <gpfsug-discuss at spectrumscale.org>
> > To: gpfsug-discuss at spectrumscale.org
> > Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB
> >
> > Hi Jan Erik,
> >
> > It was my understanding that the IB hardware router required RDMA CM to
> > work.
> > By default GPFS doesn't use the RDMA Connection Manager but it can be
> > enabled (e.g. verbsRdmaCm=enable). I think this requires a restart on
> > clients/servers (in both clusters) to take effect. Maybe someone else
> > on the list can comment in more detail-- I've been told folks have
> > successfully deployed IB routers with GPFS.
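> >
> > A rough sketch of that sequence (assumes you can take GPFS down on the
> > nodes; run it in both clusters):
> >
> >     mmchconfig verbsRdmaCm=enable
> >     mmshutdown -a     # or -N <nodes> to limit the restart
> >     mmstartup -a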
> >
> > -Aaron
> >
> > On 2/26/18 11:38 AM, Sundermann, Jan Erik (SCC) wrote:
> > >
> > > Dear all
> > >
> > > we are currently trying to remote mount a file system in a routed
> > > Infiniband test setup and face problems with dropped RDMA
> > > connections. The setup is the
> > > following:
> > >
> > > - Spectrum Scale Cluster 1 is set up on four servers which are
> > > connected to the same Infiniband network. Additionally they are
> > > connected to a fast Ethernet network providing IP communication in the
> > > network 192.168.11.0/24.
> > >
> > > - Spectrum Scale Cluster 2 is set up on four additional servers which
> > > are connected to a second Infiniband network. These servers have IPs
> > > on their IB interfaces in the network 192.168.12.0/24.
> > >
> > > - IP is routed between 192.168.11.0/24 and 192.168.12.0/24 on a
> > > dedicated machine.
> > >
> > > - We have a dedicated IB hardware router connected to both IB subnets.
> > >
> > >
> > > We tested that the routing, both IP and IB, is working between the
> > > two clusters without problems, and that RDMA is working fine for
> > > internal communication inside both cluster 1 and cluster 2.
> > >
> > > When trying to remote mount a file system from cluster 1 in cluster
> > > 2, RDMA communication is not working as expected. Instead we see
> > > error messages on the remote host (cluster 2)
> > >
> > >
> > > 2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to
> > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0
> > > port 1 fabnum 0 index 2
> > > 2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to
> > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0
> > > port 1 fabnum 0 sl 0 index 2
> > > 2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to
> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0
> > > port 1 fabnum 0 error 733 index 3
> > > 2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to
> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0
> > > port 1 fabnum 0 index 3
> > > 2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to
> > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0
> > > port 1 fabnum 0 error 733 index 1
> > > 2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to
> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0
> > > port 1 fabnum 0 sl 0 index 3
> > > 2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to
> > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0
> > > port 1 fabnum 0 index 1
> > > 2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to
> > > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0
> > > port 1 fabnum 0 sl 0 index 1
> > > 2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to
> > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0
> > > port 1 fabnum 0 error 733 index 0
> > > 2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to
> > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0
> > > port 1 fabnum 0 index 0
> > > 2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to
> > > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0
> > > port 1 fabnum 0 sl 0 index 0
> > > 2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to
> > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0
> > > port 1 fabnum 0 error 733 index 2
> > > 2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to
> > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0
> > > port 1 fabnum 0 index 2
> > > 2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to
> > > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0
> > > port 1 fabnum 0 sl 0 index 2
> > > 2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to
> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0
> > > port 1 fabnum 0 error 733 index 3
> > > 2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to
> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0
> > > port 1 fabnum 0 index 3
> > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to
> > > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0
> > > port 1 fabnum 0 sl 0 index 3
> > >
> > >
> > > and in the cluster with the file system (cluster 1)
> > >
> > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error
> > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in
> > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err
> > > 129
> > > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to
> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0
> > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
> > > 2018-02-23_13:47:47.161+0100: [I] VERBS RDMA accepted and connected
> > > to
> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0
> > > port 1 fabnum 0 sl 0 index 3
> > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error
> > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in
> > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err
> > > 129
> > > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to
> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0
> > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
> > > 2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and connected
> > > to
> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0
> > > port 1 fabnum 0 sl 0 index 3
> > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error
> > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in
> > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err
> > > 129
> > > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to
> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0
> > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
> > > 2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and connected
> > > to
> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0
> > > port 1 fabnum 0 sl 0 index 3
> > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error
> > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in
> > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err
> > > 129
> > > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to
> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0
> > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
> > > 2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and connected
> > > to
> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0
> > > port 1 fabnum 0 sl 0 index 3
> > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error
> > > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in
> > > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err
> > > 129
> > > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed connection to
> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0
> > > port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
> > > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and connected
> > > to
> > > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0
> > > port 1 fabnum 0 sl 0 index 3
> > >
> > >
> > >
> > > Any advice on how to configure the setup in a way that would allow
> > > the remote mount via routed IB would be much appreciated.
> > >
> > >
> > > Thank you and best regards
> > > Jan Erik
> > >
> > >
> > >
> > >
> >
> > --
> > Aaron Knister
> > NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight
> > Center
> > (301) 286-2776
>
> --
> I've never been lost; I was once bewildered for three days, but never lost!
>                                         --  Daniel Boone
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>