[gpfsug-discuss] NFS issues

Peter Serocka peserocka at gmail.com
Wed Apr 26 18:53:51 BST 2017


> On 2017 Apr 26 Wed, at 16:20, Simon Thompson (IT Research Support) <S.J.Thompson at bham.ac.uk> wrote:
> 
> Nope, the clients are all L3 connected, so not an arp issue.


...not on the client, but the server-facing L3 switch
still needs to manage its ARP table, and it might miss
the IP moving to a new MAC.

Cisco switches have a default ARP cache timeout of 4 hours, fwiw.

Can your network team provide you with the ARP status
from the switch when you see a fail-over getting stuck?
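
If they can, something along these lines on a Cisco IOS switch would show
whether the entry still points at the old MAC (the IP below is just a
placeholder for one of your CES IPs):

  show ip arp 10.10.10.21      (age and MAC the switch holds for the CES IP)
  clear ip arp 10.10.10.21     (force it to re-learn after a fail-over)

And if a stale entry turns out to be the culprit, "arp timeout" on the
server-facing interface would shorten that 4-hour default.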

— Peter


> 
> Two things we have observed:
> 
> 1. It triggers when one of the CES IPs moves and quickly moves back again.
> The move occurs because the NFS server goes into grace:
> 
> 2017-04-25 20:36:49 : epoch 00040183 : <NODENAME> : ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 60
> 2017-04-25 20:36:49 : epoch 00040183 : <NODENAME> : ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server recovery event 2 nodeid -1 ip <CESIP>
> 2017-04-25 20:36:49 : epoch 00040183 : <NODENAME> : ganesha.nfsd-1261[dbus] nfs_release_v4_client :STATE :EVENT :NFS Server V4 recovery release ip <CESIP>
> 2017-04-25 20:36:49 : epoch 00040183 : <NODENAME> : ganesha.nfsd-1261[dbus] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
> 2017-04-25 20:37:42 : epoch 00040183 : <NODENAME> : ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 60
> 2017-04-25 20:37:44 : epoch 00040183 : <NODENAME> : ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 60
> 2017-04-25 20:37:44 : epoch 00040183 : <NODENAME> : ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server recovery event 4 nodeid 2 ip
> 
> 
> 
> We can't see in any of the logs WHY ganesha is going into grace. Any
> suggestions on how to debug this further? (I.e. if we can stop the grace
> events, we can mostly solve the problem.)
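> 
> (One thing we plan to try is correlating the timestamps above with the
> CES/GPFS side of things, since the grace events arrive over dbus. Roughly,
> on the node concerned - log paths are as on our systems, yours may differ,
> and mmhealth only if your release has it:
> 
>   grep -i grace /var/log/ganesha.log
>   grep <CESIP> /var/adm/ras/mmfs.log.latest
>   mmhealth node eventlog
> 
> If anyone knows a better way to see what is poking ganesha into grace,
> that would be very welcome.)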
> 
> 
> 2. Our clients are using LDAP which is bound to the CES IPs. If we
> shut down nslcd on the client, we can get the client to recover once all
> the TIME_WAIT connections have gone. Maybe this was a bad choice on our
> side to bind to the CES IPs - we figured it would handily move the IPs
> for us, but I guess mmcesfuncs isn't aware of this and so doesn't kill
> the connections to the IP as it goes away.
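> 
> (For reference, the nslcd change we have in mind is just pointing it at a
> stable LDAP name rather than the CES IPs - hostnames below are made up:
> 
>   # /etc/nslcd.conf
>   uri ldap://ldap0.example.ac.uk/
>   uri ldap://ldap1.example.ac.uk/
>   # give up on a dead server quickly so lookups move on
>   bind_timelimit 10
>   reconnect_retrytime 10
> 
> rather than uri ldap://<CESIP>/ as we have now.)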
> 
> 
> So there are two approaches we are going to try. First, reconfigure nslcd
> on a couple of clients and see if they still show the issues when
> fail-over occurs. Second, work out why the NFS servers are going into
> grace in the first place.
> 
> Simon
> 
> On 26/04/2017, 00:46, "gpfsug-discuss-bounces at spectrumscale.org on behalf
> of Greg.Lehmann at csiro.au" <gpfsug-discuss-bounces at spectrumscale.org on
> behalf of Greg.Lehmann at csiro.au> wrote:
> 
>> Are you using InfiniBand or Ethernet? I'm wondering if IBM have solved
>> the gratuitous ARP issue which we see with our non-protocols NFS
>> implementation.
>> 
>> -----Original Message-----
>> From: gpfsug-discuss-bounces at spectrumscale.org
>> [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon
>> Thompson (IT Research Support)
>> Sent: Wednesday, 26 April 2017 3:31 AM
>> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
>> Subject: Re: [gpfsug-discuss] NFS issues
>> 
>> I did some digging in the mmcesfuncs to see what happens server side on
>> fail over.
>> 
>> Basically the server losing the IP is supposed to terminate all sessions
>> and the receiving server sends ACK tickles.
>> 
>> My current supposition is that, for whatever reason, the losing server
>> isn't releasing something and the client still holds on to a connection
>> which is mostly dead. The tickle from the new server to the client then
>> fails.
>> 
>> This would explain why failing the IP back to the original server usually
>> brings the client back to life.
>> 
>> This is only my working theory at the moment as we can't reliably
>> reproduce this. Next time it happens we plan to grab some netstat from
>> each side. 
>> 
>> Then we plan to issue "mmcmi tcpack $cesIpPort $clientIpPort" on the
>> server that received the IP and see if that fixes it (i.e. the receiving
>> server didn't tickle properly). (Usage extracted from mmcesfuncs, which
>> is ksh of course.) ... cesIpPort is the colon-separated IP:portnumber
>> (of NFSd), for anyone interested.
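>> 
>> Roughly what we intend to capture (ports/IPs illustrative, and the mmcmi
>> usage is as lifted from mmcesfuncs, so treat it with care):
>> 
>>   # on the client and on both CES nodes involved
>>   netstat -tno | grep 2049
>>   ss -tan | grep 2049
>> 
>>   # then on the node that received the IP
>>   mmcmi tcpack <CESIP>:2049 <clientIP>:<clientPort>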
>> 
>> Then try and kill the sessions on the losing server to check if there is
>> stuff still open, and re-tickle the client.
>> 
>> If we can get steps to work around it, I'll log a PMR. I suppose I could
>> do that now, but given it's non-deterministic and we want to be 100% sure
>> it's not us doing something wrong, I'm inclined to wait until we do some
>> more testing.
>> 
>> I agree with the suggestion that it's probably nodes with pending IO that
>> are affected, but I don't have any data to back that up yet. We did try
>> with a read workload on a client, but maybe we need either long
>> IO-blocked reads or writes (from the GPFS end).
>> 
>> We also originally had soft as the default option, but saw issues then
>> and the docs suggested hard, so we switched and also enabled sync (we
>> figured maybe it was the NFS client with uncommitted writes), but neither
>> has resolved the issues entirely. Difficult for me to say whether they
>> improved things though, given it's sporadic.
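>> 
>> (For context, the client mounts are currently along these lines - export
>> path made up, options per the IBM docs:
>> 
>>   mount -t nfs -o vers=4,hard,sync MYNFSSERVER.bham.ac.uk:/gpfs/export /mnt/export
>> 
>> having previously run with soft.)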
>> 
>> Appreciate people's suggestions!
>> 
>> Thanks
>> 
>> Simon
>> ________________________________________
>> From: gpfsug-discuss-bounces at spectrumscale.org
>> [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jan-Frode
>> Myklebust [janfrode at tanso.net]
>> Sent: 25 April 2017 18:04
>> To: gpfsug main discussion list
>> Subject: Re: [gpfsug-discuss] NFS issues
>> 
>> I *think* I've seen this, and that we then had open TCP connections from
>> the client to the NFS server according to netstat, but these connections
>> were not visible in netstat on the NFS-server side.
>> 
>> Unfortunately I don't remember what the fix was...
>> 
>> 
>> 
>> -jf
>> 
>> On Tue, 25 Apr 2017 at 16:06, Simon Thompson (IT Research Support)
>> <S.J.Thompson at bham.ac.uk> wrote:
>> Hi,
>> 
>> From what I can see, Ganesha uses the Export_Id option in the config file
>> (which is managed by CES) for this. I did find some reference on the
>> Ganesha devs list that if it's not set, then it would read the FSID from
>> the GPFS file-system; either way they should surely be consistent across
>> all the nodes. The posts I found were from someone with an IBM email
>> address, so I guess someone in the IBM teams.
>> 
>> I checked a couple of my protocol nodes and they use the same Export_Id
>> consistently, though I guess that might not be the same as the FSID value.
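>> 
>> For anyone not familiar with the ganesha export format, a hand-written
>> sketch of a block - paths and the Filesystem_Id value here are made up,
>> and ours is generated by CES rather than edited by hand:
>> 
>>   EXPORT {
>>     Export_Id = 1;              # what CES keeps consistent
>>     Path = "/gpfs/rds";
>>     Pseudo = "/rds";
>>     Filesystem_Id = 666.666;    # would pin the FSID explicitly, if set
>>     FSAL { Name = GPFS; }
>>   }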
>> 
>> Perhaps someone from IBM could comment on whether FSID is likely to be
>> the cause of my problems?
>> 
>> Thanks
>> 
>> Simon
>> 
>> On 25/04/2017, 14:51, "gpfsug-discuss-bounces at spectrumscale.org on
>> behalf of Ouwehand, JJ" <gpfsug-discuss-bounces at spectrumscale.org on
>> behalf of j.ouwehand at vumc.nl> wrote:
>> 
>>> Hello,
>>> 
>>> First, a short introduction. My name is Jaap Jan Ouwehand; I work at
>>> a Dutch hospital, the "VU Medical Center" in Amsterdam. We make daily
>>> use of IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our
>>> critical (office, research and clinical data) business processes. We
>>> have three large GPFS filesystems for different purposes.
>>> 
>>> We also had such a situation with cNFS. A failover (IP takeover) worked
>>> fine technically, but clients experienced "stale file handle" errors. We
>>> opened a PMR with IBM and, after testing, delivering logs and tcpdumps,
>>> and a few months of waiting, the solution turned out to be the fsid
>>> option.
>>> 
>>> An NFS filehandle is built from a combination of the fsid and a hash of
>>> the inode. After a failover, the fsid value can be different and the
>>> client gets a "stale filehandle". To avoid this, the fsid value can be
>>> specified statically. See:
>>> 
>>> https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adm_nfslin.htm
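>>> 
>>> For the kernel NFS server (cNFS) that comes down to pinning the fsid in
>>> /etc/exports, for example (export path, network and fsid value are just
>>> placeholders):
>>> 
>>>   /gpfs/fs1  10.10.0.0/16(rw,sync,no_root_squash,fsid=745)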
>>> 
>>> Maybe there is also a value in Ganesha that changes after a failover,
>>> especially since most sessions will be re-established after a failback.
>>> Maybe tcpdump will show you more debug information.
>>> 
>>> 
>>> Kind regards,
>>> 
>>> Jaap Jan Ouwehand
>>> ICT Specialist (Storage & Linux)
>>> VUmc - ICT
>>> E: jj.ouwehand at vumc.nl
>>> W: www.vumc.com
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: gpfsug-discuss-bounces at spectrumscale.org
>>> [mailto:gpfsug-discuss-bounces at spectrumscale.org]
>>> On Behalf Of Simon Thompson (IT Research Support)
>>> Sent: Tuesday, 25 April 2017 13:21
>>> To: gpfsug-discuss at spectrumscale.org
>>> Subject: [gpfsug-discuss] NFS issues
>>> 
>>> Hi,
>>> 
>>> We have recently started deploying NFS in addition to our existing SMB
>>> exports on our protocol nodes.
>>> 
>>> We use an RR DNS name that points to 4 VIPs for SMB services, and
>>> failover seems to work fine with SMB clients. We figured we could use
>>> the same name and IPs and run Ganesha on the protocol servers; however,
>>> we are seeing issues with NFS clients when IP failover occurs.
>>> 
>>> In normal operation on a client, we might see several mounts from
>>> different IPs obviously due to the way the DNS RR is working, but it
>>> all works fine.
>>> 
>>> In a failover situation, the IP will move to another node and some
>>> clients will carry on, while others will hang IO to the mount points
>>> referred to by the IP which has moved. We can *sometimes* trigger this
>>> by manually suspending a CES node, but not always, and some clients
>>> mounting from the moving IP will be fine while others won't.
>>> 
>>> If we resume a node and it fails back, the clients that are hanging will
>>> usually recover fine. We can reboot a client prior to failback and it
>>> will be fine; stopping and starting the ganesha service on a protocol
>>> node will also sometimes resolve the issues.
>>> 
>>> So, has anyone seen this sort of issue, and any suggestions for how we
>>> could either debug further or work around it?
>>> 
>>> We are currently running the nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64
>>> packages (the 4.2.2-2 release ones).
>>> 
>>> At one point we were seeing it a lot, and could track it back to an
>>> underlying GPFS network issue that was causing protocol nodes to be
>>> expelled occasionally. We resolved that and the issues became less
>>> apparent, but maybe we just fixed one failure mode and so see it less
>>> often.
>>> 
>>> On the clients, we use -o sync,hard BTW as in the IBM docs.
>>> 
>>> On a client showing the issues, we'll see NFS-related messages in dmesg
>>> like:
>>> [Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not responding, timed out
>>> 
>>> Which explains the client hang on certain mount points.
>>> 
>>> The symptoms feel very much like those logged in this Gluster/ganesha
>>> bug:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1354439
>>> 
>>> 
>>> Thanks
>>> 
>>> Simon
>>> 
>> 
> 



