[gpfsug-discuss] unusual node expels?

Alex Chekholko chekh at stanford.edu
Tue Dec 15 20:34:38 GMT 2015


Hi all,

I had a RHEL 6.3 / MLNX OFED 1.5.3 / GPFS 3.5.0.10 cluster that was
working fine.

We upgraded several components (our mistake!): the Mellanox firmware
and the OS, and we switched to the OFED stack bundled with CentOS.

So now I have a CentOS 6.7 / GPFS 3.5.0.29 cluster where the GPFS
client nodes refuse to stay connected.  Here is a typical log:


[root@cn1 ~]# cat /var/adm/ras/mmfs.log.latest
Tue Dec 15 12:21:38 PST 2015: runmmfs starting
Removing old /var/adm/ras/mmfs.log.* files:
Unloading modules from /lib/modules/2.6.32-573.8.1.el6.x86_64/extra
Loading modules from /lib/modules/2.6.32-573.8.1.el6.x86_64/extra
Module                  Size  Used by
mmfs26               1836054  0
mmfslinux             330095  1 mmfs26
tracedev               43757  2 mmfs26,mmfslinux
Tue Dec 15 12:21:39.230 2015: mmfsd initializing. {Version: 3.5.0.29 
Built: Nov  6 2015 15:28:46} ...
Tue Dec 15 12:21:40.847 2015: VERBS RDMA starting.
Tue Dec 15 12:21:40.849 2015: VERBS RDMA library libibverbs.so.1 
(version >= 1.1) loaded and initialized.
Tue Dec 15 12:21:40.850 2015: VERBS RDMA verbsRdmasPerNode reduced from 
128 to 98 to match (nsdMaxWorkerThreads 96 + (nspdThreadsPerQueue 2 * 
nspdQueues 1)).
Tue Dec 15 12:21:41.122 2015: VERBS RDMA device mlx4_0 port 1 fabnum 0 
opened, lid 10, 4x FDR INFINIBAND.
Tue Dec 15 12:21:41.123 2015: VERBS RDMA started.
Tue Dec 15 12:21:41.626 2015: Connecting to 10.210.16.40 hs-gs-01 <c0p0>
Tue Dec 15 12:21:41.627 2015: Connected to 10.210.16.40 hs-gs-01 <c0p0>
Tue Dec 15 12:21:41.628 2015: Connecting to 10.210.16.41 hs-gs-02 <c0p1>
Tue Dec 15 12:21:41.629 2015: Connected to 10.210.16.41 hs-gs-02 <c0p1>
Tue Dec 15 12:21:41.630 2015: Node 10.210.16.41 (hs-gs-02) is now the 
Group Leader.
Tue Dec 15 12:21:41.641 2015: mmfsd ready
Tue Dec 15 12:21:41 PST 2015: mmcommon mmfsup invoked. Parameters: 
10.210.17.1 10.210.16.41 all
Tue Dec 15 12:21:41 PST 2015: mounting /dev/hsgs
Tue Dec 15 12:21:41.918 2015: Command: mount hsgs
Tue Dec 15 12:21:42.131 2015: Connecting to 10.210.16.42 hs-gs-03 <c0n2>
Tue Dec 15 12:21:42.132 2015: Connecting to 10.210.16.43 hs-gs-04 <c0n3>
Tue Dec 15 12:21:42.133 2015: Connected to 10.210.16.42 hs-gs-03 <c0n2>
Tue Dec 15 12:21:42.134 2015: Connected to 10.210.16.43 hs-gs-04 <c0n3>
Tue Dec 15 12:21:42.148 2015: VERBS RDMA connecting to 10.210.16.41 
(hs-gs-02) on mlx4_0 port 1 fabnum 0 index 0
Tue Dec 15 12:21:42.149 2015: VERBS RDMA connected to 10.210.16.41 
(hs-gs-02) on mlx4_0 port 1 fabnum 0 sl 0 index 0
Tue Dec 15 12:21:42.153 2015: VERBS RDMA connecting to 10.210.16.40 
(hs-gs-01) on mlx4_0 port 1 fabnum 0 index 1
Tue Dec 15 12:21:42.154 2015: VERBS RDMA connected to 10.210.16.40 
(hs-gs-01) on mlx4_0 port 1 fabnum 0 sl 0 index 1
Tue Dec 15 12:21:42.171 2015: Connecting to 10.210.16.11 hs-ln01.local 
<c0n5>
Tue Dec 15 12:21:42.173 2015: Close connection to 10.210.16.11 
hs-ln01.local <c0n5> (No route to host)
Tue Dec 15 12:21:42.174 2015: Retry connection to 10.210.16.11 
hs-ln01.local <c0n5>
Tue Dec 15 12:21:42.173 2015: Close connection to 10.210.16.11 
hs-ln01.local <c0n5> (No route to host)
Tue Dec 15 12:22:55.322 2015: Request sent to 10.210.16.41 (hs-gs-02) to 
expel 10.210.16.11 (hs-ln01.local) from cluster HS-GS-Cluster.hs-gs-01
Tue Dec 15 12:22:55.323 2015: This node will be expelled from cluster 
HS-GS-Cluster.hs-gs-01 due to expel msg from 10.210.17.1 (cn1.local)
Tue Dec 15 12:22:55.324 2015: This node is being expelled from the cluster.
Tue Dec 15 12:22:55.323 2015: Lost membership in cluster 
HS-GS-Cluster.hs-gs-01. Unmounting file systems.
Tue Dec 15 12:22:55.325 2015: VERBS RDMA closed connection to 
10.210.16.41 (hs-gs-02) on mlx4_0 port 1 fabnum 0 index 0
Tue Dec 15 12:22:55.327 2015: Cluster Manager connection broke. Probing 
cluster HS-GS-Cluster.hs-gs-01
Tue Dec 15 12:22:55.328 2015: VERBS RDMA closed connection to 
10.210.16.40 (hs-gs-01) on mlx4_0 port 1 fabnum 0 index 1
Tue Dec 15 12:22:56.419 2015: Command: err 2: mount hsgs
Tue Dec 15 12:22:56.420 2015: Specified entity, such as a disk or file 
system, does not exist.
mount: No such file or directory
Tue Dec 15 12:22:56 PST 2015: finished mounting /dev/hsgs
Tue Dec 15 12:22:56.587 2015: Quorum loss. Probing cluster 
HS-GS-Cluster.hs-gs-01
Tue Dec 15 12:22:57.087 2015: Connecting to 10.210.16.40 hs-gs-01 <c0p0>
Tue Dec 15 12:22:57.088 2015: Connected to 10.210.16.40 hs-gs-01 <c0p0>
Tue Dec 15 12:22:57.089 2015: Connecting to 10.210.16.41 hs-gs-02 <c0p1>
Tue Dec 15 12:22:57.090 2015: Connected to 10.210.16.41 hs-gs-02 <c0p1>
Tue Dec 15 12:23:02.090 2015: Connecting to 10.210.16.42 hs-gs-03 <c0p2>
Tue Dec 15 12:23:02.092 2015: Connected to 10.210.16.42 hs-gs-03 <c0p2>
Tue Dec 15 12:23:49.604 2015: Node 10.210.16.41 (hs-gs-02) is now the 
Group Leader.
Tue Dec 15 12:23:49.614 2015: mmfsd ready
Tue Dec 15 12:23:49 PST 2015: mmcommon mmfsup invoked. Parameters: 
10.210.17.1 10.210.16.41 all
Tue Dec 15 12:23:49 PST 2015: mounting /dev/hsgs
Tue Dec 15 12:23:49.866 2015: Command: mount hsgs
Tue Dec 15 12:23:49.949 2015: Connecting to 10.210.16.43 hs-gs-04 <c0n3>
Tue Dec 15 12:23:49.950 2015: Connected to 10.210.16.43 hs-gs-04 <c0n3>
Tue Dec 15 12:23:49.957 2015: VERBS RDMA connecting to 10.210.16.41 
(hs-gs-02) on mlx4_0 port 1 fabnum 0 index 1
Tue Dec 15 12:23:49.958 2015: VERBS RDMA connected to 10.210.16.41 
(hs-gs-02) on mlx4_0 port 1 fabnum 0 sl 0 index 1
Tue Dec 15 12:23:49.962 2015: VERBS RDMA connecting to 10.210.16.40 
(hs-gs-01) on mlx4_0 port 1 fabnum 0 index 0
Tue Dec 15 12:23:49.963 2015: VERBS RDMA connected to 10.210.16.40 
(hs-gs-01) on mlx4_0 port 1 fabnum 0 sl 0 index 0
Tue Dec 15 12:23:49.980 2015: Close connection to 10.210.16.11 
hs-ln01.local <c0n5> (No route to host)
Tue Dec 15 12:23:49.981 2015: Retry connection to 10.210.16.11 
hs-ln01.local <c0n5>
Tue Dec 15 12:23:49.980 2015: Close connection to 10.210.16.11 
hs-ln01.local <c0n5> (No route to host)
Tue Dec 15 12:25:05.321 2015: Request sent to 10.210.16.41 (hs-gs-02) to 
expel 10.210.16.11 (hs-ln01.local) from cluster HS-GS-Cluster.hs-gs-01
Tue Dec 15 12:25:05.322 2015: This node will be expelled from cluster 
HS-GS-Cluster.hs-gs-01 due to expel msg from 10.210.17.1 (cn1.local)
Tue Dec 15 12:25:05.323 2015: This node is being expelled from the cluster.
Tue Dec 15 12:25:05.324 2015: Lost membership in cluster 
HS-GS-Cluster.hs-gs-01. Unmounting file systems.
Tue Dec 15 12:25:05.325 2015: VERBS RDMA closed connection to 
10.210.16.41 (hs-gs-02) on mlx4_0 port 1 fabnum 0 index 1
Tue Dec 15 12:25:05.326 2015: VERBS RDMA closed connection to 
10.210.16.40 (hs-gs-01) on mlx4_0 port 1 fabnum 0 index 0
Tue Dec 15 12:25:05.327 2015: Cluster Manager connection broke. Probing 
cluster HS-GS-Cluster.hs-gs-01
Tue Dec 15 12:25:06.413 2015: Command: err 2: mount hsgs
Tue Dec 15 12:25:06.414 2015: Specified entity, such as a disk or file 
system, does not exist.
mount: No such file or directory
Tue Dec 15 12:25:06 PST 2015: finished mounting /dev/hsgs
Tue Dec 15 12:25:06.569 2015: Quorum loss. Probing cluster 
HS-GS-Cluster.hs-gs-01
Tue Dec 15 12:25:07.069 2015: Connecting to 10.210.16.40 hs-gs-01 <c0p0>
Tue Dec 15 12:25:07.070 2015: Connected to 10.210.16.40 hs-gs-01 <c0p0>
Tue Dec 15 12:25:07.071 2015: Connecting to 10.210.16.41 hs-gs-02 <c0p1>
Tue Dec 15 12:25:07.072 2015: Connected to 10.210.16.41 hs-gs-02 <c0p1>
Tue Dec 15 12:25:12.072 2015: Connecting to 10.210.16.42 hs-gs-03 <c0p2>
Tue Dec 15 12:25:12.073 2015: Connected to 10.210.16.42 hs-gs-03 <c0p2>
Tue Dec 15 12:25:59.585 2015: Node 10.210.16.41 (hs-gs-02) is now the 
Group Leader.
Tue Dec 15 12:25:59.596 2015: mmfsd ready
Tue Dec 15 12:25:59 PST 2015: mmcommon mmfsup invoked. Parameters: 
10.210.17.1 10.210.16.41 all
Tue Dec 15 12:25:59 PST 2015: mounting /dev/hsgs
Tue Dec 15 12:25:59.856 2015: Command: mount hsgs
Tue Dec 15 12:25:59.934 2015: Connecting to 10.210.16.43 hs-gs-04 <c0n3>
Tue Dec 15 12:25:59.935 2015: Connected to 10.210.16.43 hs-gs-04 <c0n3>
Tue Dec 15 12:25:59.941 2015: VERBS RDMA connecting to 10.210.16.41 
(hs-gs-02) on mlx4_0 port 1 fabnum 0 index 0
Tue Dec 15 12:25:59.942 2015: VERBS RDMA connected to 10.210.16.41 
(hs-gs-02) on mlx4_0 port 1 fabnum 0 sl 0 index 0
Tue Dec 15 12:25:59.945 2015: VERBS RDMA connecting to 10.210.16.40 
(hs-gs-01) on mlx4_0 port 1 fabnum 0 index 1
Tue Dec 15 12:25:59.947 2015: VERBS RDMA connected to 10.210.16.40 
(hs-gs-01) on mlx4_0 port 1 fabnum 0 sl 0 index 1
Tue Dec 15 12:25:59.963 2015: Close connection to 10.210.16.11 
hs-ln01.local <c0n5> (No route to host)
Tue Dec 15 12:25:59.964 2015: Retry connection to 10.210.16.11 
hs-ln01.local <c0n5>
Tue Dec 15 12:25:59.965 2015: Close connection to 10.210.16.11 
hs-ln01.local <c0n5> (No route to host)
Tue Dec 15 12:27:15.457 2015: Request sent to 10.210.16.41 (hs-gs-02) to 
expel 10.210.16.11 (hs-ln01.local) from cluster HS-GS-Cluster.hs-gs-01
Tue Dec 15 12:27:15.458 2015: This node will be expelled from cluster 
HS-GS-Cluster.hs-gs-01 due to expel msg from 10.210.17.1 (cn1.local)
Tue Dec 15 12:27:15.459 2015: This node is being expelled from the cluster.
Tue Dec 15 12:27:15.460 2015: Lost membership in cluster 
HS-GS-Cluster.hs-gs-01. Unmounting file systems.
Tue Dec 15 12:27:15.461 2015: VERBS RDMA closed connection to 
10.210.16.41 (hs-gs-02) on mlx4_0 port 1 fabnum 0 index 0
Tue Dec 15 12:27:15.462 2015: Cluster Manager connection broke. Probing 
cluster HS-GS-Cluster.hs-gs-01
Tue Dec 15 12:27:15.463 2015: VERBS RDMA closed connection to 
10.210.16.40 (hs-gs-01) on mlx4_0 port 1 fabnum 0 index 1
Tue Dec 15 12:27:16.578 2015: Command: err 2: mount hsgs
Tue Dec 15 12:27:16.579 2015: Specified entity, such as a disk or file 
system, does not exist.
mount: No such file or directory
Tue Dec 15 12:27:16 PST 2015: finished mounting /dev/hsgs
Tue Dec 15 12:27:16.938 2015: Quorum loss. Probing cluster 
HS-GS-Cluster.hs-gs-01
Tue Dec 15 12:27:17.439 2015: Connecting to 10.210.16.40 hs-gs-01 <c0p0>
Tue Dec 15 12:27:17.440 2015: Connected to 10.210.16.40 hs-gs-01 <c0p0>
Tue Dec 15 12:27:17.441 2015: Connecting to 10.210.16.41 hs-gs-02 <c0p1>
Tue Dec 15 12:27:17.442 2015: Connected to 10.210.16.41 hs-gs-02 <c0p1>
Tue Dec 15 12:27:22.442 2015: Connecting to 10.210.16.42 hs-gs-03 <c0p2>
Tue Dec 15 12:27:22.443 2015: Connected to 10.210.16.42 hs-gs-03 <c0p2>
Tue Dec 15 12:28:09.955 2015: Node 10.210.16.41 (hs-gs-02) is now the 
Group Leader.
Tue Dec 15 12:28:09.965 2015: mmfsd ready
Tue Dec 15 12:28:10 PST 2015: mmcommon mmfsup invoked. Parameters: 
10.210.17.1 10.210.16.41 all
Tue Dec 15 12:28:10 PST 2015: mounting /dev/hsgs
Tue Dec 15 12:28:10.222 2015: Command: mount hsgs
Tue Dec 15 12:28:10.314 2015: Connecting to 10.210.16.43 hs-gs-04 <c0n3>
Tue Dec 15 12:28:10.315 2015: Connected to 10.210.16.43 hs-gs-04 <c0n3>
Tue Dec 15 12:28:10.322 2015: VERBS RDMA connecting to 10.210.16.41 
(hs-gs-02) on mlx4_0 port 1 fabnum 0 index 1
Tue Dec 15 12:28:10.323 2015: VERBS RDMA connected to 10.210.16.41 
(hs-gs-02) on mlx4_0 port 1 fabnum 0 sl 0 index 1
Tue Dec 15 12:28:10.326 2015: VERBS RDMA connecting to 10.210.16.40 
(hs-gs-01) on mlx4_0 port 1 fabnum 0 index 0
Tue Dec 15 12:28:10.328 2015: VERBS RDMA connected to 10.210.16.40 
(hs-gs-01) on mlx4_0 port 1 fabnum 0 sl 0 index 0
Tue Dec 15 12:28:10.344 2015: Close connection to 10.210.16.11 
hs-ln01.local <c0n5> (No route to host)
Tue Dec 15 12:28:10.345 2015: Retry connection to 10.210.16.11 
hs-ln01.local <c0n5>
Tue Dec 15 12:28:10.346 2015: Close connection to 10.210.16.11 
hs-ln01.local <c0n5> (No route to host)
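
The repeating pattern in this log (a closed connection to a peer, an
expel request, then this node's own expulsion) can be tallied across
several client logs with a small script. What follows is only a
minimal sketch: the function names `join_wrapped` and `summarize` are
made up for illustration, and the regular expressions assume the
message wording and timestamp layout shown in the log above, which may
differ in other GPFS releases.

```python
import re
from collections import Counter

# Assumption: each record starts with a timestamp such as
# "Tue Dec 15 12:21:42.173 2015:"; any other non-empty line is treated
# as the wrapped tail of the previous record.
RECORD_START = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}")
# Assumption: message wording as it appears in the log above.
CLOSED = re.compile(r"Close connection to \S+ (\S+) <\w+> \(([^)]+)\)")
EXPEL_REQUEST = re.compile(r"to expel \S+ \(([^)]+)\)")
EXPELLED_BY = re.compile(r"due to expel msg from \S+ \(([^)]+)\)")


def join_wrapped(lines):
    """Rejoin records that were wrapped onto a second line."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def summarize(lines):
    """Count closed peer connections and expel events in one log."""
    summary = {
        "closed": Counter(),           # (peer hostname, reason) -> count
        "expel_requested": Counter(),  # peer this node asked to expel
        "expelled_by": Counter(),      # node whose expel msg removed us
    }
    for record in join_wrapped(lines):
        match = CLOSED.search(record)
        if match:
            summary["closed"][(match.group(1), match.group(2))] += 1
        match = EXPEL_REQUEST.search(record)
        if match:
            summary["expel_requested"][match.group(1)] += 1
        match = EXPELLED_BY.search(record)
        if match:
            summary["expelled_by"][match.group(1)] += 1
    return summary


# Example use on a client node:
#   with open("/var/adm/ras/mmfs.log.latest") as fh:
#       print(summarize(fh))
```

Wrapped lines are rejoined before matching because the messages of
interest are split across two lines in the pasted log.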



The IB / RDMA side looks OK to me, but as soon as the GPFS clients
connect, they try to expel each other.  In the log above, cn1 cannot
reach hs-ln01.local (No route to host), asks hs-gs-02 to expel it, and
is then expelled itself.  The 4 NSD servers seem fine, though.
Switching to Mellanox OFED 3.x gives the same result, so I suspect this
is not an IB issue.
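
Since the failures in the log are reported as "No route to host" on
plain connections to a peer, one way to test client-to-client
reachability independently of IB is a TCP probe run from a client node.
The sketch below is illustrative only: `probe` and `errno_name` are
hypothetical helper names, and port 1191 is assumed to be the GPFS
daemon port (the usual default, as far as I know; adjust it if the
cluster overrides it).

```python
import errno
import socket

# Assumption: the GPFS daemon listens on its usual default port.
GPFS_DAEMON_PORT = 1191


def errno_name(code):
    """Map an errno value to its symbolic name, e.g. EHOSTUNREACH."""
    return errno.errorcode.get(code, "errno {}".format(code))


def probe(host, port=GPFS_DAEMON_PORT, timeout=3.0):
    """Try one TCP connection and report how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # EHOSTUNREACH corresponds to the "No route to host" text in the log.
        return errno_name(exc.errno)


# Example use from cn1, with peer addresses taken from the log above:
#   for peer in ("10.210.16.11", "10.210.16.41"):
#       print(peer, probe(peer))
```

A result of EHOSTUNREACH for a client peer but "connected" for the NSD
servers would match what the log shows.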

[root@cn1 ~]# uname -r
2.6.32-573.8.1.el6.x86_64
[root@cn1 ~]# rpm -qa | grep gpfs
gpfs.gpl-3.5.0-29.noarch
gpfs.docs-3.5.0-29.noarch
gpfs.msg.en_US-3.5.0-29.noarch
gpfs.base-3.5.0-29.x86_64

Does anyone have any suggestions?

Regards,
-- 
chekh at stanford.edu 347-401-4860 chekh at stanford.edu



