[gpfsug-discuss] Wrong nodename after server restart

IBM Spectrum Scale scale at us.ibm.com
Tue Sep 12 16:01:21 BST 2017


Michal,

When a node is added to a cluster that has a different domain than the 
rest of the nodes in the cluster, the GPFS daemons running on the various 
nodes can develop an inconsistent understanding of what the common suffix 
of all the domain names is.  The symptoms you show in the "tsctl 
shownodes up" output, and in particular the incorrect node names of the 
two nodes you restarted, as seen from a node you did not restart, are 
consistent with this problem.  I also note that your cluster appears to 
have the necessary pre-condition to trip on this problem: whale.img.cas.cz 
does not share a common suffix with the other nodes in the cluster.  The 
common suffix of the other nodes in the cluster is ".img.local".  Was 
whale.img.cas.cz recently added to the cluster?
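To illustrate the point about the common suffix (this is a minimal sketch of the idea, not the actual GPFS implementation), the longest shared trailing domain-label sequence can be computed like this:

```python
def common_domain_suffix(names):
    """Longest run of trailing domain labels shared by every node name."""
    # Split each FQDN into labels and reverse, so comparison starts at the TLD.
    reversed_labels = [name.split(".")[::-1] for name in names]
    suffix = []
    for labels in zip(*reversed_labels):
        if len(set(labels)) == 1:      # this label is identical on every node
            suffix.append(labels[0])
        else:
            break
    return ".".join(reversed(suffix))

nodes = ["gpfs-n4.img.local", "gpfs-quorum.img.local", "gpfs-n3.img.local",
         "tau.img.local", "gpfs-n1.img.local", "gpfs-n2.img.local"]
print(common_domain_suffix(nodes))                         # -> img.local
print(common_domain_suffix(nodes + ["whale.img.cas.cz"]))  # -> "" (empty)
```

With only the .img.local nodes the suffix is well defined; adding whale.img.cas.cz leaves no common suffix at all, which is the pre-condition described above.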

Unfortunately, the general work-around is to recycle all the nodes at 
once: mmshutdown -a, followed by mmstartup -a.

I hope this helps.

Regards, The Spectrum Scale (GPFS) team

------------------------------------------------------------------------------------------------------------------
If you feel that your question can benefit other users of Spectrum Scale 
(GPFS), then please post it to the public IBM developerWorks Forum at 
https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479
. 

If your query concerns a potential software error in Spectrum Scale (GPFS) 
and you have an IBM software maintenance contract please contact 
1-800-237-5511 in the United States or your local IBM Service Center in 
other countries. 

The forum is informally monitored as time permits and should not be used 
for priority messages to the Spectrum Scale (GPFS) team.



From:   Michal Zacek <zacekm at img.cas.cz>
To:     gpfsug-discuss at spectrumscale.org
Date:   09/12/2017 05:41 AM
Subject:        [gpfsug-discuss] Wrong nodename after server restart
Sent by:        gpfsug-discuss-bounces at spectrumscale.org



Hi,

I had to restart two of my gpfs servers (gpfs-n4 and gpfs-quorum), and 
after that I was unable to move the CES IP address back; it failed with 
the strange error "mmces address move: GPFS is down on this node". After 
I double-checked that the GPFS state is active on all nodes, I dug deeper 
and I think I found the problem, but I don't really know how this could 
have happened.

Look at the names of nodes:

[root at gpfs-n2 ~]# mmlscluster     # Looks good

GPFS cluster information
========================
   GPFS cluster name:         gpfscl1.img.local
   GPFS cluster id:           17792677515884116443
   GPFS UID domain:           img.local
   Remote shell command:      /usr/bin/ssh
   Remote file copy command:  /usr/bin/scp
   Repository type:           CCR

  Node  Daemon node name       IP address       Admin node name        Designation
 -----------------------------------------------------------------------------------
    1   gpfs-n4.img.local      192.168.20.64    gpfs-n4.img.local      quorum-manager
    2   gpfs-quorum.img.local  192.168.20.60    gpfs-quorum.img.local  quorum
    3   gpfs-n3.img.local      192.168.20.63    gpfs-n3.img.local      quorum-manager
    4   tau.img.local          192.168.1.248    tau.img.local
    5   gpfs-n1.img.local      192.168.20.61    gpfs-n1.img.local      quorum-manager
    6   gpfs-n2.img.local      192.168.20.62    gpfs-n2.img.local      quorum-manager
    8   whale.img.cas.cz       147.231.150.108  whale.img.cas.cz


[root at gpfs-n2 ~]# mmlsmount gpfs01 -L   # not so good

File system gpfs01 is mounted on 7 nodes:
   192.168.20.63   gpfs-n3
   192.168.20.61   gpfs-n1
   192.168.20.62   gpfs-n2
   192.168.1.248   tau
   192.168.20.64   gpfs-n4.img.local
   192.168.20.60   gpfs-quorum.img.local
   147.231.150.108 whale.img.cas.cz

[root at gpfs-n2 ~]# tsctl shownodes up | tr ','  '\n'   # very wrong
whale.img.cas.cz.img.local
tau.img.local
gpfs-quorum.img.local.img.local
gpfs-n1.img.local
gpfs-n2.img.local
gpfs-n3.img.local
gpfs-n4.img.local.img.local
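One plausible way names like "gpfs-n4.img.local.img.local" can arise (purely illustrative, assuming a suffix-compression scheme; this is not a claim about the actual tsctl internals) is that nodes with inconsistent views of the common suffix exchange node names: a node that believes the suffix is empty advertises the full FQDN, and a node that still believes the suffix is ".img.local" re-appends it:

```python
def expand(advertised_name, assumed_suffix):
    """Re-append the suffix this node believes was stripped from the name."""
    return f"{advertised_name}.{assumed_suffix}" if assumed_suffix else advertised_name

# Consistent views: the short name is advertised, one expansion is correct.
print(expand("gpfs-n4", "img.local"))            # -> gpfs-n4.img.local

# Inconsistent views: the full name is advertised, the suffix is added again.
print(expand("gpfs-n4.img.local", "img.local"))  # -> gpfs-n4.img.local.img.local
```

This would also explain why only the restarted nodes (whose daemons recomputed the suffix after whale joined) show doubled names when listed from a node that was not restarted.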

The "tsctl shownodes up" output is the reason why I'm not able to move 
the CES address back to the gpfs-n4 node, but the real problem is the 
incorrect node names. I think the OS is configured correctly:

[root at gpfs-n4 /]# hostname
gpfs-n4

[root at gpfs-n4 /]# hostname -f
gpfs-n4.img.local

[root at gpfs-n4 /]# cat /etc/resolv.conf
nameserver 192.168.20.30
nameserver 147.231.150.2
search img.local
domain img.local

[root at gpfs-n4 /]# cat /etc/hosts | grep gpfs-n4
192.168.20.64    gpfs-n4.img.local gpfs-n4

[root at gpfs-n4 /]# host gpfs-n4
gpfs-n4.img.local has address 192.168.20.64

[root at gpfs-n4 /]# host 192.168.20.64
64.20.168.192.in-addr.arpa domain name pointer gpfs-n4.img.local.

Can someone help me with this?

Thanks,
Michal

p.s.  gpfs version: 4.2.3-2 (CentOS 7)
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss





