[gpfsug-discuss] nodes being ejected out of the cluster

Jan-Frode Myklebust janfrode at tanso.net
Wed Jan 11 15:10:03 GMT 2017


My first guess would also be verbsRdmaSend, which gssClientConfig.sh enables
by default but which isn't scalable to large clusters. It fits with your error
messages:

https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Best%20Practices%20RDMA%20Tuning

   - """For GPFS version 3.5.0.11 and later, IB error IBV_WC_RNR_RETRY_EXC_ERR
   may occur if the cluster is too large when  verbsRdmaSend is enabled Idf
   these errors are observed in the mmfs log, disable verbsRdmaSend on all
   nodes.. Additionally, out of memory errors may occur if verbsRdmaSend is
   enabled on very large clusters.  If out of memory errors are observed,
   disabled verbsRdmaSend on all nodes in the cluster."""


Otherwise, it would be nice if you could post your mmlsconfig output to see
if something else sticks out.
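
The RDMA-related attributes are probably the most interesting part; something
along these lines should pull them out (attribute names vary a bit between
releases, so treat the pattern as a rough filter):

    mmlsconfig | grep -iE 'verbs|rdma|pagepool|maxMBpS'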


  -jf
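
PS: regarding Olaf's IPoIB question further down - on a node with an IPoIB
interface (assuming it is called ib0) the mode can be read straight from
sysfs:

    cat /sys/class/net/ib0/mode    # prints "connected" or "datagram"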



On Wed, Jan 11, 2017 at 4:03 PM, Olaf Weiser <olaf.weiser at de.ibm.com> wrote:

> most likely, there's something wrong with your IB fabric ...
> you say you run ~700 nodes? ...
> Are you running with *verbsRdmaSend* enabled? If so, please consider
> disabling it - and discuss this within the PMR.
> Another thing you may want to check: are you running IPoIB in connected
> mode or datagram? But as I said, please discuss this within the PMR ..
> there are too many dependencies to discuss this here ..
>
>
> cheers
>
>
> Mit freundlichen Grüßen / Kind regards
>
>
> Olaf Weiser
>
> EMEA Storage Competence Center Mainz, Germany / IBM Systems, Storage
> Platform
> ------------------------------------------------------------------------
> IBM Deutschland
> IBM Allee 1
> 71139 Ehningen
> Phone: +49-170-579-44-66
> E-Mail: olaf.weiser at de.ibm.com
> ------------------------------------------------------------------------
> IBM Deutschland GmbH / Chairman of the Supervisory Board: Martin Jetter
> Management Board: Martina Koederitz (Chair), Susanne Peter, Norbert
> Janzen, Dr. Christian Keller, Ivo Koerner, Markus Koerner
> Registered office: Ehningen / Commercial register: Amtsgericht Stuttgart,
> HRB 14562 / WEEE-Reg.-No. DE 99369940
>
>
>
> From:        Damir Krstic <damir.krstic at gmail.com>
> To:        gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Date:        01/11/2017 03:39 PM
> Subject:        [gpfsug-discuss] nodes being ejected out of the cluster
> Sent by:        gpfsug-discuss-bounces at spectrumscale.org
> ------------------------------
>
>
>
> We are running GPFS 4.2 on our cluster (around 700 compute nodes). Our
> storage (ESS GL6) is also running GPFS 4.2. Compute nodes and storage are
> connected via Infiniband (FDR14). At the time of implementation of ESS, we
> were instructed to enable RDMA in addition to IPoIB. Previously we only ran
> IPoIB on our GPFS 3.5 cluster.
>
> Ever since the implementation (sometime back in July 2016) we have seen a
> lot of compute nodes being ejected. What usually precedes the ejections are
> messages like the following:
>
> Jan 11 02:03:15 quser13 mmfs: [E] VERBS RDMA rdma send error
> IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum
> 0 vendor_err 135
> Jan 11 02:03:15 quser13 mmfs: [E] VERBS RDMA closed connection to
> 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error
> IBV_WC_RNR_RETRY_EXC_ERR index 2
> Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA rdma send error
> IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum
> 0 vendor_err 135
> Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA closed connection to
> 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error
> IBV_WC_WR_FLUSH_ERR index 1
> Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA rdma send error
> IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum
> 0 vendor_err 135
> Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA closed connection to
> 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error
> IBV_WC_RNR_RETRY_EXC_ERR index 2
> Jan 11 02:06:38 quser11 mmfs: [E] VERBS RDMA rdma send error
> IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum
> 0 vendor_err 135
> Jan 11 02:06:38 quser11 mmfs: [E] VERBS RDMA closed connection to
> 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error
> IBV_WC_WR_FLUSH_ERR index 400
>
> Even our ESS IO server sometimes ends up being ejected (case in point -
> yesterday morning):
>
> Jan 10 11:23:42 gssio2 mmfs: [E] VERBS RDMA rdma send error
> IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_1 port 1 fabnum
> 0 vendor_err 135
> Jan 10 11:23:42 gssio2 mmfs: [E] VERBS RDMA closed connection to
> 172.41.2.1 (gssio1-fdr) on mlx5_1 port 1 fabnum 0 due to send error
> IBV_WC_RNR_RETRY_EXC_ERR index 3001
> Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA rdma send error
> IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_1 port 2 fabnum
> 0 vendor_err 135
> Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA closed connection to
> 172.41.2.1 (gssio1-fdr) on mlx5_1 port 2 fabnum 0 due to send error
> IBV_WC_RNR_RETRY_EXC_ERR index 2671
> Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA rdma send error
> IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_0 port 2 fabnum
> 0 vendor_err 135
> Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA closed connection to
> 172.41.2.1 (gssio1-fdr) on mlx5_0 port 2 fabnum 0 due to send error
> IBV_WC_RNR_RETRY_EXC_ERR index 2495
> Jan 10 11:23:44 gssio2 mmfs: [E] VERBS RDMA rdma send error
> IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_0 port 1 fabnum
> 0 vendor_err 135
> Jan 10 11:23:44 gssio2 mmfs: [E] VERBS RDMA closed connection to
> 172.41.2.1 (gssio1-fdr) on mlx5_0 port 1 fabnum 0 due to send error
> IBV_WC_RNR_RETRY_EXC_ERR index 3077
> Jan 10 11:24:11 gssio2 mmfs: [N] Node 172.41.2.1 (gssio1-fdr) lease
> renewal is overdue. Pinging to check if it is alive
>
> I've had multiple PMRs open for this issue, and I am told that our ESS
> needs code-level upgrades in order to fix it. Looking at the errors, I
> think the problem is InfiniBand related, and I am wondering if anyone on
> this list has seen similar issues?
>
> Thanks for your help in advance.
>
> Damir
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>

