[gpfsug-discuss] nodes being ejected out of the cluster

Damir Krstic damir.krstic at gmail.com
Wed Jan 11 14:39:13 GMT 2017


We are running GPFS 4.2 on our cluster (around 700 compute nodes). Our
storage (ESS GL6) is also running GPFS 4.2. Compute nodes and storage are
connected via InfiniBand (FDR14). When the ESS was implemented, we were
instructed to enable RDMA in addition to IPoIB. Previously we ran only
IPoIB on our GPFS 3.5 cluster.

Ever since the implementation (back in July 2016) we have seen a lot of
compute nodes being ejected. The ejection is usually preceded by the
following messages:

Jan 11 02:03:15 quser13 mmfs: [E] VERBS RDMA rdma send error
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum
0 vendor_err 135
Jan 11 02:03:15 quser13 mmfs: [E] VERBS RDMA closed connection to
172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error
IBV_WC_RNR_RETRY_EXC_ERR index 2
Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA rdma send error
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum
0 vendor_err 135
Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA closed connection to
172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error
IBV_WC_WR_FLUSH_ERR index 1
Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA rdma send error
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum
0 vendor_err 135
Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA closed connection to
172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error
IBV_WC_RNR_RETRY_EXC_ERR index 2
Jan 11 02:06:38 quser11 mmfs: [E] VERBS RDMA rdma send error
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum
0 vendor_err 135
Jan 11 02:06:38 quser11 mmfs: [E] VERBS RDMA closed connection to
172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error
IBV_WC_WR_FLUSH_ERR index 400
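
For anyone who wants to tally these errors per node and peer, here is a
rough sketch in Python. It only assumes the syslog layout shown in the
excerpt above; the function names (unwrap, count_send_errors) and the
SAMPLE string are my own illustration and not part of any GPFS tooling.
Because the mail client wrapped each log entry across several lines, the
sketch first rejoins continuation lines before matching.

```python
import re
from collections import Counter

# A new syslog entry starts with a timestamp such as "Jan 11 02:03:15 ".
ENTRY_START = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s")

# Matches a rejoined "rdma send error" entry and captures the reporting
# node, the verbs completion status, and the peer name in parentheses.
SEND_ERROR = re.compile(
    r"^\S+\s+\d+\s+\S+\s+(?P<node>\S+)\s+mmfs:\s+\[E\]\s+"
    r"VERBS RDMA rdma send error\s+(?P<code>IBV_WC_\w+)\s+to\s+\S+\s+"
    r"\((?P<peer>[^)]+)\)"
)

# Hypothetical input: three entries copied from the excerpt above.
SAMPLE = """\
Jan 11 02:03:15 quser13 mmfs: [E] VERBS RDMA rdma send error
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum
0 vendor_err 135
Jan 11 02:03:15 quser13 mmfs: [E] VERBS RDMA closed connection to
172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error
IBV_WC_RNR_RETRY_EXC_ERR index 2
Jan 11 02:06:38 quser11 mmfs: [E] VERBS RDMA rdma send error
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum
0 vendor_err 135
"""


def unwrap(lines):
    """Rejoin entries that were wrapped across several physical lines."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)          # start of a new entry
        else:
            entries[-1] += " " + line     # continuation of the previous one
    return entries


def count_send_errors(lines):
    """Count send errors keyed by (reporting node, peer, status code)."""
    counts = Counter()
    for entry in unwrap(lines):
        match = SEND_ERROR.match(entry)
        if match:
            counts[(match["node"], match["peer"], match["code"])] += 1
    return counts


if __name__ == "__main__":
    for key, n in count_send_errors(SAMPLE.splitlines()).most_common():
        print(n, *key)
```

Run against a full syslog file instead of SAMPLE, this would show whether
the errors cluster on a particular peer (for example one IO server) or are
spread evenly across the fabric.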

Even our ESS IO servers sometimes end up being ejected (case in point:
yesterday morning):

Jan 10 11:23:42 gssio2 mmfs: [E] VERBS RDMA rdma send error
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_1 port 1 fabnum
0 vendor_err 135
Jan 10 11:23:42 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.2.1
(gssio1-fdr) on mlx5_1 port 1 fabnum 0 due to send error
IBV_WC_RNR_RETRY_EXC_ERR index 3001
Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA rdma send error
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_1 port 2 fabnum
0 vendor_err 135
Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.2.1
(gssio1-fdr) on mlx5_1 port 2 fabnum 0 due to send error
IBV_WC_RNR_RETRY_EXC_ERR index 2671
Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA rdma send error
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_0 port 2 fabnum
0 vendor_err 135
Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.2.1
(gssio1-fdr) on mlx5_0 port 2 fabnum 0 due to send error
IBV_WC_RNR_RETRY_EXC_ERR index 2495
Jan 10 11:23:44 gssio2 mmfs: [E] VERBS RDMA rdma send error
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_0 port 1 fabnum
0 vendor_err 135
Jan 10 11:23:44 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.2.1
(gssio1-fdr) on mlx5_0 port 1 fabnum 0 due to send error
IBV_WC_RNR_RETRY_EXC_ERR index 3077
Jan 10 11:24:11 gssio2 mmfs: [N] Node 172.41.2.1 (gssio1-fdr) lease renewal
is overdue. Pinging to check if it is alive

I've had multiple PMRs open for this issue, and I am told that our ESS
needs code-level upgrades in order to fix it. Looking at the errors, I
think the issue is InfiniBand-related, and I am wondering whether anyone
on this list has seen similar issues.

Thanks for your help in advance.

Damir