[gpfsug-discuss] Wrong nodename after server restart
IBM Spectrum Scale
scale at us.ibm.com
Tue Sep 12 16:01:21 BST 2017
Michal,
When a node is added to a cluster and its domain differs from that of the
rest of the nodes, the GPFS daemons running on the various nodes can
develop an inconsistent understanding of what the common suffix of all
the domain names is. The symptoms you show in the "tsctl shownodes up"
output, in particular the incorrect node names of the two nodes you
restarted as seen on a node you did not restart, are consistent with this
problem. I also note that your cluster appears to have the necessary
precondition to trip on this problem: whale.img.cas.cz does not share a
common suffix with the other nodes in the cluster, whose common suffix is
".img.local". Was whale.img.cas.cz recently added to the cluster?
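
To illustrate what "common suffix" means here, the following is a minimal Python sketch. It is not the actual GPFS logic, which is not shown in this thread; the function name common_domain_suffix is hypothetical, and the node names are taken from the mmlscluster output quoted below. It only shows that the ".img.local" nodes share a domain suffix, and that the shared suffix becomes empty once whale.img.cas.cz is included.

```python
def common_domain_suffix(names):
    """Return the longest dot-separated domain suffix shared by all names.

    Illustrative only: this is an assumption about what "common suffix"
    means, not the algorithm the GPFS daemon uses.
    """
    # Drop the host label, then reverse the remaining domain labels so that
    # the rightmost label (e.g. "local") sits at index 0 for every name.
    reversed_domains = [name.rstrip(".").split(".")[1:][::-1] for name in names]
    shared = []
    for labels in zip(*reversed_domains):
        if len(set(labels)) != 1:
            break  # the names diverge at this label
        shared.append(labels[0])
    return "." + ".".join(reversed(shared)) if shared else ""


local_nodes = [
    "gpfs-n1.img.local", "gpfs-n2.img.local", "gpfs-n3.img.local",
    "gpfs-n4.img.local", "gpfs-quorum.img.local", "tau.img.local",
]

print(repr(common_domain_suffix(local_nodes)))
print(repr(common_domain_suffix(local_nodes + ["whale.img.cas.cz"])))
```

The first call yields ".img.local"; the second yields an empty string, because "cz" and "local" already differ at the rightmost label.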
Unfortunately, the general work-around is to recycle all the nodes at
once: mmshutdown -a, followed by mmstartup -a.
I hope this helps.
Regards, The Spectrum Scale (GPFS) team
------------------------------------------------------------------------------------------------------------------
If you feel that your question can benefit other users of Spectrum Scale
(GPFS), then please post it to the public IBM developerWorks Forum at
https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479
If your query concerns a potential software error in Spectrum Scale (GPFS)
and you have an IBM software maintenance contract please contact
1-800-237-5511 in the United States or your local IBM Service Center in
other countries.
The forum is informally monitored as time permits and should not be used
for priority messages to the Spectrum Scale (GPFS) team.
From: Michal Zacek <zacekm at img.cas.cz>
To: gpfsug-discuss at spectrumscale.org
Date: 09/12/2017 05:41 AM
Subject: [gpfsug-discuss] Wrong nodename after server restart
Sent by: gpfsug-discuss-bounces at spectrumscale.org
Hi,
I had to restart two of my GPFS servers (gpfs-n4 and gpfs-quorum), and
after that I was unable to move the CES IP address back. The command
failed with a strange error: "mmces address move: GPFS is down on this
node". After I double-checked that the GPFS state is active on all nodes,
I dug deeper and I think I found the problem, but I don't really know how
this could happen.
Look at the names of nodes:
[root@gpfs-n2 ~]# mmlscluster # Looks good
GPFS cluster information
========================
GPFS cluster name: gpfscl1.img.local
GPFS cluster id: 17792677515884116443
GPFS UID domain: img.local
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type: CCR
 Node  Daemon node name       IP address       Admin node name        Designation
------------------------------------------------------------------------------------
   1   gpfs-n4.img.local      192.168.20.64    gpfs-n4.img.local      quorum-manager
   2   gpfs-quorum.img.local  192.168.20.60    gpfs-quorum.img.local  quorum
   3   gpfs-n3.img.local      192.168.20.63    gpfs-n3.img.local      quorum-manager
   4   tau.img.local          192.168.1.248    tau.img.local
   5   gpfs-n1.img.local      192.168.20.61    gpfs-n1.img.local      quorum-manager
   6   gpfs-n2.img.local      192.168.20.62    gpfs-n2.img.local      quorum-manager
   8   whale.img.cas.cz       147.231.150.108  whale.img.cas.cz
[root@gpfs-n2 ~]# mmlsmount gpfs01 -L # not so good
File system gpfs01 is mounted on 7 nodes:
192.168.20.63 gpfs-n3
192.168.20.61 gpfs-n1
192.168.20.62 gpfs-n2
192.168.1.248 tau
192.168.20.64 gpfs-n4.img.local
192.168.20.60 gpfs-quorum.img.local
147.231.150.108 whale.img.cas.cz
[root@gpfs-n2 ~]# tsctl shownodes up | tr ',' '\n' # very wrong
whale.img.cas.cz.img.local
tau.img.local
gpfs-quorum.img.local.img.local
gpfs-n1.img.local
gpfs-n2.img.local
gpfs-n3.img.local
gpfs-n4.img.local.img.local
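
The mismatch between the two listings can also be checked mechanically. The following is a minimal Python sketch, assuming the configured daemon node names (from mmlscluster) and the names reported by "tsctl shownodes up" have already been collected into lists; the helper name misnamed_nodes and its parameters are hypothetical and not part of any GPFS tool. It flags every reported name that is not a configured name and, where stripping one trailing ".img.local" recovers a configured name, reports that name as the likely intended one.

```python
def misnamed_nodes(reported, configured, suffix):
    """Map each unexpected reported name to the configured name it likely means.

    The value is None when stripping one trailing suffix does not yield
    a configured name. Hypothetical helper for illustration only.
    """
    configured = set(configured)
    result = {}
    for name in reported:
        if name in configured:
            continue  # reported name matches the cluster configuration
        stripped = None
        if suffix and name.endswith(suffix):
            stripped = name[: -len(suffix)]
        result[name] = stripped if stripped in configured else None
    return result


# Daemon node names as listed by mmlscluster in this message.
configured = [
    "gpfs-n4.img.local", "gpfs-quorum.img.local", "gpfs-n3.img.local",
    "tau.img.local", "gpfs-n1.img.local", "gpfs-n2.img.local",
    "whale.img.cas.cz",
]

# Names as listed by "tsctl shownodes up" on gpfs-n2 in this message.
reported = [
    "whale.img.cas.cz.img.local", "tau.img.local",
    "gpfs-quorum.img.local.img.local", "gpfs-n1.img.local",
    "gpfs-n2.img.local", "gpfs-n3.img.local",
    "gpfs-n4.img.local.img.local",
]

for wrong, expected in misnamed_nodes(reported, configured, ".img.local").items():
    print(f"{wrong} -> expected {expected}")
```

With the data above, three names are flagged: whale, gpfs-quorum and gpfs-n4, each carrying one extra ".img.local".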
The "tsctl shownodes up" output is the reason why I'm not able to move the
CES address back to the gpfs-n4 node, but the real problem is the differing
node names. I think the OS is configured correctly:
[root@gpfs-n4 /]# hostname
gpfs-n4
[root@gpfs-n4 /]# hostname -f
gpfs-n4.img.local
[root@gpfs-n4 /]# cat /etc/resolv.conf
nameserver 192.168.20.30
nameserver 147.231.150.2
search img.local
domain img.local
[root@gpfs-n4 /]# cat /etc/hosts | grep gpfs-n4
192.168.20.64 gpfs-n4.img.local gpfs-n4
[root@gpfs-n4 /]# host gpfs-n4
gpfs-n4.img.local has address 192.168.20.64
[root@gpfs-n4 /]# host 192.168.20.64
64.20.168.192.in-addr.arpa domain name pointer gpfs-n4.img.local.
Can someone help me with this?
Thanks,
Michal
p.s. gpfs version: 4.2.3-2 (CentOS 7)