[gpfsug-discuss] Fw: Wrong nodename after server restart
IBM Spectrum Scale
scale at us.ibm.com
Wed Sep 13 22:33:30 BST 2017
----- Forwarded by Eric Agar/Poughkeepsie/IBM on 09/13/2017 05:32 PM -----
From: IBM Spectrum Scale/Poughkeepsie/IBM
To: Michal Zacek <zacekm at img.cas.cz>
Date: 09/13/2017 05:29 PM
Subject: Re: [gpfsug-discuss] Wrong nodename after server restart
Sent by: Eric Agar
Hello Michal,
It should not be necessary to delete whale.img.cas.cz and rename it. But,
that is an option you can take, if you prefer it. If you decide to take
that option, please see the last paragraph of this response.
The confusion starts at the moment a node is added to the active cluster
where the new node does not have the same common domain suffix as the
nodes that were already in the cluster. The confusion increases when the
GPFS daemons on some nodes, but not all nodes, are recycled. Doing
mmshutdown -a, followed by mmstartup -a, once after the new node has been
added allows all GPFS daemons on all nodes to come up at the same time and
arrive at the same answer to the question, "what is the common domain
suffix for all the nodes in the cluster now?" In the case of your
cluster, the answer will be "the common domain suffix is the empty string"
or, put another way, "there is no common domain suffix"; that is okay, as
long as all the GPFS daemons come to the same conclusion.
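To make the failure mode concrete, the suffix calculation can be sketched in shell. This is only an illustration of the idea, not the actual computation GPFS performs, and the `common_suffix` helper is hypothetical:

```shell
# Sketch of the "common domain suffix" computation (an assumption about
# what the daemon effectively computes, not actual GPFS code).
# Compares the dot-separated labels of each name from the right-hand end.
common_suffix() {
    printf '%s\n' "$@" | awk -F. '
    NR == 1 { n = NF; for (i = 1; i <= NF; i++) keep[i] = $(NF - i + 1); next }
    {
        m = (NF < n) ? NF : n
        for (i = 1; i <= m; i++)
            if (keep[i] != $(NF - i + 1)) { m = i - 1; break }
        n = m
    }
    END {
        s = ""
        for (i = 1; i <= n; i++) s = (s == "") ? keep[i] : keep[i] "." s
        print s   # empty output means "no common domain suffix"
    }'
}

common_suffix gpfs-n4.img.local gpfs-quorum.img.local tau.img.local   # img.local
common_suffix gpfs-n4.img.local whale.img.cas.cz                      # (empty)
```

As the second call shows, adding a single node such as whale.img.cas.cz collapses the common suffix to the empty string, which is why every daemon must recompute it at the same time.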
After you recycle the cluster, you can verify that all is well by running
"tsctl shownodes up" on every node and confirming that the output is
correct on each one.
If the mmshutdown -a / mmstartup -a recycle works, the problem should not
recur with the current set of nodes in the cluster. Even as individual
GPFS daemons are recycled going forward, they should still understand the
cluster's nodes have no common domain suffix.
However, I can imagine sequences of events that would cause the issue to
occur again after nodes are deleted or added to the cluster while the
cluster is active. For example, if whale.img.cas.cz were to be deleted
from the current cluster, that action would restore the cluster to having
a common domain suffix of ".img.local", but already running GPFS daemons
would not realize it. If the delete of whale occurred while the cluster
was active, subsequent recycling of the GPFS daemon on just a subset of
the nodes would cause the recycled daemons to understand the common domain
suffix to now be ".img.local". But, daemons that had not been recycled
would still think there is no common domain suffix. The confusion would
occur again.
On the other hand, adding and deleting nodes to/from the cluster should
not cause the issue to occur again as long as the cluster continues to
have the same (in this case, no) common domain suffix.
If you decide to delete whale.img.cas.cz, rename it to have the
".img.local" domain suffix, and add it back to the cluster, it would be
best to do so after all the GPFS daemons are shut down with mmshutdown -a,
but before any of the daemons are restarted with mmstartup. This would
allow all the subsequent running daemons to come to the conclusion that
".img.local" is now the common domain suffix.
I hope this helps.
Regards,
Eric Agar
Regards, The Spectrum Scale (GPFS) team
------------------------------------------------------------------------------------------------------------------
If you feel that your question can benefit other users of Spectrum Scale
(GPFS), then please post it to the public IBM developerWorks Forum at
https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479
.
If your query concerns a potential software error in Spectrum Scale (GPFS)
and you have an IBM software maintenance contract please contact
1-800-237-5511 in the United States or your local IBM Service Center in
other countries.
The forum is informally monitored as time permits and should not be used
for priority messages to the Spectrum Scale (GPFS) team.
From: Michal Zacek <zacekm at img.cas.cz>
To: IBM Spectrum Scale <scale at us.ibm.com>
Date: 09/13/2017 03:42 AM
Subject: Re: [gpfsug-discuss] Wrong nodename after server restart
Hello
Yes, you are correct, Whale was added two days ago. Is it necessary to
delete whale.img.cas.cz from the cluster before mmshutdown/mmstartup? If
the two domains may cause problems in the future, I can rename whale (and
all planned nodes) to the img.local suffix.
Many thanks for the prompt reply.
Regards
Michal
On 12.9.2017 at 17:01, IBM Spectrum Scale wrote:
Michal,
When a node is added to a cluster that has a different domain than the
rest of the nodes in the cluster, the GPFS daemons running on the various
nodes can develop an inconsistent understanding of what the common suffix
of all the domain names is. The symptoms you show in the "tsctl
shownodes up" output, and in particular the incorrect node names of the
two nodes you restarted as seen on a node you did not restart, are
consistent with this problem. I also note your cluster appears to have
the necessary pre-condition to trip on this problem: whale.img.cas.cz does
not share a common suffix with the other nodes in the cluster, whose
common suffix is ".img.local". Was whale.img.cas.cz recently added to the
cluster?
Unfortunately, the general work-around is to recycle all the nodes at
once: mmshutdown -a, followed by mmstartup -a.
I hope this helps.
Regards, The Spectrum Scale (GPFS) team
From: Michal Zacek <zacekm at img.cas.cz>
To: gpfsug-discuss at spectrumscale.org
Date: 09/12/2017 05:41 AM
Subject: [gpfsug-discuss] Wrong nodename after server restart
Sent by: gpfsug-discuss-bounces at spectrumscale.org
Hi,
I had to restart two of my GPFS servers (gpfs-n4 and gpfs-quorum) and
after that I was unable to move the CES IP address back, with the strange
error "mmces address move: GPFS is down on this node". After I
double-checked that the GPFS state is active on all nodes, I dug deeper
and I think I found the problem, but I don't really know how this could
happen.
Look at the names of nodes:
[root@gpfs-n2 ~]# mmlscluster # Looks good
GPFS cluster information
========================
GPFS cluster name: gpfscl1.img.local
GPFS cluster id: 17792677515884116443
GPFS UID domain: img.local
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type: CCR
Node  Daemon node name       IP address       Admin node name        Designation
----------------------------------------------------------------------------------
   1  gpfs-n4.img.local      192.168.20.64    gpfs-n4.img.local      quorum-manager
   2  gpfs-quorum.img.local  192.168.20.60    gpfs-quorum.img.local  quorum
   3  gpfs-n3.img.local      192.168.20.63    gpfs-n3.img.local      quorum-manager
   4  tau.img.local          192.168.1.248    tau.img.local
   5  gpfs-n1.img.local      192.168.20.61    gpfs-n1.img.local      quorum-manager
   6  gpfs-n2.img.local      192.168.20.62    gpfs-n2.img.local      quorum-manager
   8  whale.img.cas.cz       147.231.150.108  whale.img.cas.cz
[root@gpfs-n2 ~]# mmlsmount gpfs01 -L # not so good
File system gpfs01 is mounted on 7 nodes:
192.168.20.63 gpfs-n3
192.168.20.61 gpfs-n1
192.168.20.62 gpfs-n2
192.168.1.248 tau
192.168.20.64 gpfs-n4.img.local
192.168.20.60 gpfs-quorum.img.local
147.231.150.108 whale.img.cas.cz
[root@gpfs-n2 ~]# tsctl shownodes up | tr ',' '\n' # very wrong
whale.img.cas.cz.img.local
tau.img.local
gpfs-quorum.img.local.img.local
gpfs-n1.img.local
gpfs-n2.img.local
gpfs-n3.img.local
gpfs-n4.img.local.img.local
The "tsctl shownodes up" output is the reason why I'm not able to move
the CES address back to the gpfs-n4 node, but the real problem is the
incorrect node names. I think the OS is configured correctly:
[root@gpfs-n4 /]# hostname
gpfs-n4
[root@gpfs-n4 /]# hostname -f
gpfs-n4.img.local
[root@gpfs-n4 /]# cat /etc/resolv.conf
nameserver 192.168.20.30
nameserver 147.231.150.2
search img.local
domain img.local
[root@gpfs-n4 /]# cat /etc/hosts | grep gpfs-n4
192.168.20.64 gpfs-n4.img.local gpfs-n4
[root@gpfs-n4 /]# host gpfs-n4
gpfs-n4.img.local has address 192.168.20.64
[root@gpfs-n4 /]# host 192.168.20.64
64.20.168.192.in-addr.arpa domain name pointer gpfs-n4.img.local.
Can someone help me with this?
Thanks,
Michal
p.s. gpfs version: 4.2.3-2 (CentOS 7)
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
--
Michal Žáček | Information Technologies
+420 296 443 128
+420 296 443 333
michal.zacek at img.cas.cz
www.img.cas.cz
Institute of Molecular Genetics of the ASCR, v. v. i., Vídeňská 1083, 142
20 Prague 4, Czech Republic
ID: 68378050 | VAT ID: CZ68378050