[gpfsug-discuss] RDMA read error IBV_WC_RETRY_EXC_ERR

Philipp Helo Rehs Philipp.Rehs at uni-duesseldorf.de
Mon Mar 12 20:09:14 GMT 2018


Hello,
I have been reading this mailing list for some weeks and I am quite
impressed by the knowledge and information shared here.

We have a GPFS cluster with 4 NSDs and 120 clients on InfiniBand.

Our NSD servers have two InfiniBand ports on separate cards,
mlx5_0 and mlx5_1. We have RDMA-CM and IPv6 enabled on all
nodes, and we have added an IPoIB IP address to every interface.

But when we enable the second interface we get the following error from
all nodes:

2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to 10.100.0.83 (hilbert83-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 45
2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 vendor_err 129
2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 31
2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 10.100.0.134 (hilbert134-ib) on mlx5_1 port 1 fabnum 0 vendor_err 129
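
To see whether the errors cluster on one local device or on particular peers, the messages can be tallied with a small script. The following is only a sketch: the regular expression is inferred from the four log entries quoted here (each rejoined onto a single line) and may not match other GPFS releases, and the function name tally_verbs_errors is made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the log excerpt in this post (an assumption, not a
# documented GPFS log format). It covers both message variants seen there.
VERBS_ERR = re.compile(
    r"\[E\] VERBS RDMA (?:closed connection to|rdma read error \S+ to) "
    r"(?P<ip>\S+) \((?P<host>[^)]+)\) on (?P<dev>\S+) port (?P<port>\d+)"
)

def tally_verbs_errors(lines):
    """Count matching error lines per (local device, remote host)."""
    counts = Counter()
    for line in lines:
        m = VERBS_ERR.search(line)
        if m:
            counts[(m.group("dev"), m.group("host"))] += 1
    return counts

# The four log entries from the post, one entry per line.
sample = [
    "2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to 10.100.0.83 (hilbert83-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 45",
    "2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 vendor_err 129",
    "2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 31",
    "2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 10.100.0.134 (hilbert134-ib) on mlx5_1 port 1 fabnum 0 vendor_err 129",
]

counts = tally_verbs_errors(sample)
for (dev, host), n in sorted(counts.items()):
    print(dev, host, n)
```

In the excerpt every entry is on mlx5_1, the second interface; running the same tally over a full log would show whether mlx5_0 is ever affected.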

I have read that this issue can occur when verbsRdmasPerConnection is
too low. We tried increasing the value, which improved things, but the
problem is not fixed.


Current config:
minReleaseLevel 4.2.3.0
maxblocksize 16m
cipherList AUTHONLY
cesSharedRoot /ces
ccrEnabled yes
failureDetectionTime 40
leaseRecoveryWait 40
[hilbert1-ib,hilbert2-ib]
worker1Threads 256
maxReceiverThreads 256
[common]
tiebreakerDisks vd3;vd5;vd7
minQuorumNodes 2
verbsLibName libibverbs.so.1
verbsRdma enable
verbsRdmasPerNode 256
verbsRdmaSend no
scatterBufferSize 262144
pagepool 16g
verbsPorts mlx4_0/1
[nsdNodes]
verbsPorts mlx5_0/1 mlx5_1/1
[hilbert200-ib,hilbert201-ib,hilbert202-ib,hilbert203-ib,hilbert204-ib,hilbert205-ib,hilbert206-ib]
verbsPorts mlx4_0/1 mlx4_1/1
[common]
maxMBpS 11200
[common]
verbsRdmaCm enable
verbsRdmasPerConnection 14
adminMode central
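
Because verbsPorts appears in several bracketed sections, it can help to check which value a given node ends up with. The sketch below parses a shortened excerpt of the dump above and looks up one key. It rests on assumptions: that a section naming a node or node class takes precedence over [common], and that lines before the first bracketed section are cluster-wide. The helper names parse_mmlsconfig and effective_value are made up for illustration and are not GPFS tools.

```python
def parse_mmlsconfig(text):
    """Return {section: {key: value}}; repeated sections are merged.

    Lines before the first [section] header are stored under 'cluster'.
    """
    sections = {}
    current = "cluster"
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            continue
        key, _, value = line.partition(" ")
        sections.setdefault(current, {})[key] = value.strip()
    return sections

def effective_value(sections, key, names):
    """Look up key for a node, given its name plus any node classes it is in.

    Assumption: a section that lists the node or one of its classes
    overrides [common], which in turn overrides the cluster-wide lines.
    """
    wanted = set(names)
    for header, settings in sections.items():
        if header in ("cluster", "common"):
            continue
        if key in settings and wanted & set(header.split(",")):
            return settings[key]
    for header in ("common", "cluster"):
        if key in sections.get(header, {}):
            return sections[header][key]
    return None

# Shortened excerpt of the configuration in this post.
config_text = """\
minReleaseLevel 4.2.3.0
[common]
verbsRdma enable
verbsPorts mlx4_0/1
[nsdNodes]
verbsPorts mlx5_0/1 mlx5_1/1
[hilbert200-ib,hilbert201-ib]
verbsPorts mlx4_0/1 mlx4_1/1
[common]
verbsRdmasPerConnection 14
"""

sections = parse_mmlsconfig(config_text)
nsd_ports = effective_value(sections, "verbsPorts", ["nsdNodes"])
h200_ports = effective_value(sections, "verbsPorts", ["hilbert200-ib"])
# "hilbert50-ib" is a hypothetical client name with no section of its own.
client_ports = effective_value(sections, "verbsPorts", ["hilbert50-ib"])
print(nsd_ports, "|", h200_ports, "|", client_ports)
```

Under those assumptions, members of nsdNodes use both mlx5 ports while ordinary clients fall back to the single mlx4_0/1 port from [common].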


Kind regards
 Philipp Rehs

---------------------------

Zentrum für Informations- und Medientechnologie
Kompetenzzentrum für wissenschaftliches Rechnen und Speichern

Heinrich-Heine-Universität Düsseldorf
Universitätsstr. 1
Raum 25.41.00.51
40225 Düsseldorf / Germany
Tel: +49-211-81-15557


