[gpfsug-discuss] Problems with remote mount via routed IB

Stuart Barkley stuartb at 4gh.net
Wed Feb 28 17:49:47 GMT 2018


The problem with CM is that it seems to require configuring IP over
InfiniBand.
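
(To illustrate what that configuration entails -- the interface name
and addresses below are placeholders, not from our environment:)

    # RDMA CM resolves peers by IP address, so each HCA port involved
    # needs an IPoIB interface brought up with an address, e.g.:
    ip link set ib0 up
    ip addr add 192.168.21.101/24 dev ib0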

I'm rather strongly opposed to IP over IB.  We did run IPoIB years
ago, but pulled it out of our environment because it added unneeded
complexity.  It requires provisioning IP addresses across the
InfiniBand infrastructure and possibly adding routers to other
portions of the IP infrastructure.  It also confused some users,
since nodes in the compute infrastructure ended up with multiple IPs.

We have recently been in discussions with a vendor about their support
for GPFS over IB, and they kept directing us to use CM (which still
didn't work).  CM wasn't necessary once we found the actual problem:
among other things, we needed the undocumented
verbsRdmaUseGidIndexZero configuration option because of their use of
SR-IOV based virtual IB interfaces.
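
For anyone who hits the same thing, the change amounted to something
like this (syntax from memory -- verify against your release):

    # undocumented option; needed here because of the SR-IOV virtual
    # IB interfaces presented to the hosts
    mmchconfig verbsRdmaUseGidIndexZero=yes
    # plus a GPFS restart on the nodes involved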

We don't use routed InfiniBand, and it may be that CM and IPoIB are
required for IB routing, but I doubt it.  It sounds like the OP is
keeping the IB and IP infrastructure separate.
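
(For the OP: mmlsconfig will show whether verbs RDMA and CM are
currently enabled on a cluster, e.g.:)

    mmlsconfig verbsRdma
    mmlsconfig verbsRdmaCm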

Stuart Barkley

On Mon, 26 Feb 2018 at 14:16 -0000, Aaron Knister wrote:

> Date: Mon, 26 Feb 2018 14:16:34
> From: Aaron Knister <aaron.s.knister at nasa.gov>
> Reply-To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> To: gpfsug-discuss at spectrumscale.org
> Subject: Re: [gpfsug-discuss] Problems with remote mount via routed IB
>
> Hi Jan Erik,
>
> It was my understanding that the IB hardware router required RDMA CM to work.
> By default GPFS doesn't use the RDMA Connection Manager but it can be enabled
> (e.g. verbsRdmaCm=enable). I think this requires a restart on clients/servers
> (in both clusters) to take effect.  Maybe someone else on the list can comment
> in more detail -- I've been told folks have successfully deployed IB routers
> with GPFS.
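>
> A rough sketch of what I mean (this is from memory -- check the docs
> for your release before running anything):
>
>    # enable the RDMA connection manager cluster-wide
>    mmchconfig verbsRdmaCm=enable
>    # restart GPFS (here: on all nodes) for the change to take effect
>    mmshutdown -a
>    mmstartup -a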
>
> -Aaron
>
> On 2/26/18 11:38 AM, Sundermann, Jan Erik (SCC) wrote:
> >
> > Dear all,
> >
> > We are currently trying to remote mount a file system in a routed
> > InfiniBand test setup and are facing problems with dropped RDMA
> > connections.  The setup is the following:
> >
> > - Spectrum Scale cluster 1 is set up on four servers which are connected to
> > the same InfiniBand network.  Additionally, they are connected to a fast
> > Ethernet network providing IP communication in the network 192.168.11.0/24.
> >
> > - Spectrum Scale cluster 2 is set up on four additional servers which are
> > connected to a second InfiniBand network.  These servers have IPs on their
> > IB interfaces in the network 192.168.12.0/24.
> >
> > - IP is routed between 192.168.11.0/24 and 192.168.12.0/24 on a dedicated
> > machine.
> >
> > - We have a dedicated IB hardware router connected to both IB subnets.
> >
> >
> > We tested that the routing, both IP and IB, is working between the two
> > clusters without problems, and that RDMA is working fine for internal
> > communication inside both cluster 1 and cluster 2.
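> >
> > (To give an idea of the kind of checks involved -- the invocations
> > below are illustrative; ib_read_bw is from the perftest package:)
> >
> >    ping 192.168.12.5            # IP path through the routing machine
> >    # with "ib_read_bw" started on iccn005-ib, from a cluster 1 node:
> >    ib_read_bw 192.168.12.5      # RDMA read across the IB router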
> >
> > When trying to remote mount a file system from cluster 1 in cluster 2,
> > RDMA communication is not working as expected.  Instead we see error
> > messages on the remote host (cluster 2):
> >
> >
> > 2018-02-23_13:48:47.037+0100: [I] VERBS RDMA connecting to 192.168.11.4
> > (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 2
> > 2018-02-23_13:48:49.890+0100: [I] VERBS RDMA connected to 192.168.11.4
> > (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0
> > index 2
> > 2018-02-23_13:48:53.138+0100: [E] VERBS RDMA closed connection to
> > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1
> > fabnum 0 error 733 index 3
> > 2018-02-23_13:48:53.854+0100: [I] VERBS RDMA connecting to 192.168.11.1
> > (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 3
> > 2018-02-23_13:48:54.954+0100: [E] VERBS RDMA closed connection to
> > 192.168.11.3 (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1
> > fabnum 0 error 733 index 1
> > 2018-02-23_13:48:55.601+0100: [I] VERBS RDMA connected to 192.168.11.1
> > (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0
> > index 3
> > 2018-02-23_13:48:57.775+0100: [I] VERBS RDMA connecting to 192.168.11.3
> > (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 1
> > 2018-02-23_13:48:59.557+0100: [I] VERBS RDMA connected to 192.168.11.3
> > (iccn003-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0
> > index 1
> > 2018-02-23_13:48:59.876+0100: [E] VERBS RDMA closed connection to
> > 192.168.11.2 (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1
> > fabnum 0 error 733 index 0
> > 2018-02-23_13:49:02.020+0100: [I] VERBS RDMA connecting to 192.168.11.2
> > (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 0
> > 2018-02-23_13:49:03.477+0100: [I] VERBS RDMA connected to 192.168.11.2
> > (iccn002-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0
> > index 0
> > 2018-02-23_13:49:05.119+0100: [E] VERBS RDMA closed connection to
> > 192.168.11.4 (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1
> > fabnum 0 error 733 index 2
> > 2018-02-23_13:49:06.191+0100: [I] VERBS RDMA connecting to 192.168.11.4
> > (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 2
> > 2018-02-23_13:49:06.548+0100: [I] VERBS RDMA connected to 192.168.11.4
> > (iccn004-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0
> > index 2
> > 2018-02-23_13:49:11.578+0100: [E] VERBS RDMA closed connection to
> > 192.168.11.1 (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1
> > fabnum 0 error 733 index 3
> > 2018-02-23_13:49:11.937+0100: [I] VERBS RDMA connecting to 192.168.11.1
> > (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 index 3
> > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA connected to 192.168.11.1
> > (iccn001-gpfs in gpfsstorage.localdomain) on mlx4_0 port 1 fabnum 0 sl 0
> > index 3
> >
> >
> > and in the cluster with the file system (cluster 1):
> >
> > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA rdma read error
> > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in
> > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
> > 2018-02-23_13:47:36.112+0100: [E] VERBS RDMA closed connection to
> > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1
> > fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
> > 2018-02-23_13:47:47.161+0100: [I] VERBS RDMA accepted and connected to
> > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1
> > fabnum 0 sl 0 index 3
> > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA rdma read error
> > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in
> > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
> > 2018-02-23_13:48:04.317+0100: [E] VERBS RDMA closed connection to
> > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1
> > fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
> > 2018-02-23_13:48:11.560+0100: [I] VERBS RDMA accepted and connected to
> > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1
> > fabnum 0 sl 0 index 3
> > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA rdma read error
> > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in
> > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
> > 2018-02-23_13:48:32.523+0100: [E] VERBS RDMA closed connection to
> > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1
> > fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
> > 2018-02-23_13:48:35.398+0100: [I] VERBS RDMA accepted and connected to
> > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1
> > fabnum 0 sl 0 index 3
> > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA rdma read error
> > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in
> > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
> > 2018-02-23_13:48:53.135+0100: [E] VERBS RDMA closed connection to
> > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1
> > fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
> > 2018-02-23_13:48:55.600+0100: [I] VERBS RDMA accepted and connected to
> > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1
> > fabnum 0 sl 0 index 3
> > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA rdma read error
> > IBV_WC_RETRY_EXC_ERR to 192.168.12.5 (iccn005-ib in
> > gpfsremoteclients.localdomain) on mlx4_0 port 1 fabnum 0 vendor_err 129
> > 2018-02-23_13:49:11.577+0100: [E] VERBS RDMA closed connection to
> > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1
> > fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 3
> > 2018-02-23_13:49:11.939+0100: [I] VERBS RDMA accepted and connected to
> > 192.168.12.5 (iccn005-ib in gpfsremoteclients.localdomain) on mlx4_0 port 1
> > fabnum 0 sl 0 index 3
> >
> >
> >
> > Any advice on how to configure the setup in a way that allows the
> > remote mount via routed IB would be much appreciated.
> >
> >
> > Thank you and best regards
> > Jan Erik
> >
> >
> >
> >
>
> --
> Aaron Knister
> NASA Center for Climate Simulation (Code 606.2)
> Goddard Space Flight Center
> (301) 286-2776

-- 
I've never been lost; I was once bewildered for three days, but never lost!
                                        --  Daniel Boone


