[gpfsug-discuss] NFS issues

Ouwehand, JJ j.ouwehand at vumc.nl
Tue Apr 25 14:51:22 BST 2017


Hello,

First, a short introduction. My name is Jaap Jan Ouwehand; I work at the Dutch hospital "VU Medical Center" (VUmc) in Amsterdam. We make daily use of IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our critical business processes (office, research and clinical data). We have three large GPFS filesystems for different purposes.

We also had such a situation with cNFS. A failover (IP takeover) worked fine technically, but clients experienced "stale file handle" errors. We opened a PMR with IBM and, after testing, delivering logs and tcpdumps, and a few months of waiting, the solution turned out to be the fsid option.

An NFS filehandle is built from a combination of the fsid and a hash of the inode. After a failover, the fsid value can be different and the client then gets a "stale filehandle". To avoid this, the fsid value can be specified statically. See:

https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adm_nfslin.htm
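
For cNFS (kernel NFS) the static fsid is set per export in /etc/exports; a minimal sketch, where the path and fsid value are only illustrative and must be identical on all cNFS nodes:

    /gpfs/fs1    *(rw,sync,fsid=745,no_root_squash)

With a fixed fsid, the filehandle stays the same regardless of which node serves the export after an IP takeover.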

Maybe there is also a value in Ganesha that changes after a failover, especially since most sessions will be re-established after a failback. You may be able to see more debug information with tcpdump.
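
For example, something along these lines on the client (the interface name is only a placeholder):

    tcpdump -i eth0 -s 0 -w nfs-failover.pcap port 2049

Comparing the filehandles in the NFS traffic captured before and after the failover should show whether they really change.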


Kind regards,
 
Jaap Jan Ouwehand
ICT Specialist (Storage & Linux)
VUmc - ICT
E: jj.ouwehand at vumc.nl
W: www.vumc.com



-----Original message-----
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On behalf of Simon Thompson (IT Research Support)
Sent: Tuesday 25 April 2017 13:21
To: gpfsug-discuss at spectrumscale.org
Subject: [gpfsug-discuss] NFS issues

Hi,

We have recently started deploying NFS in addition to our existing SMB exports on our protocol nodes.

We use an RR DNS name that points to 4 VIPs for SMB services, and failover seems to work fine with SMB clients. We figured we could use the same name and IPs and run Ganesha on the protocol servers; however, we are seeing issues with NFS clients when IP failover occurs.

In normal operation on a client, we might see mounts from several different IPs, obviously due to the way the DNS RR is working, but it all works fine.
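
(One way to see this is nfsstat -m on the client, which lists the addr= each mount point is actually using.)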

In a failover situation, the IP will move to another node; some clients will carry on, while others will hang IO to the mount points referred to by the IP that has moved. We can *sometimes* trigger this by manually suspending a CES node, but not always, and some clients mounted from the moving IP will be fine while others won't.

If we resume a node and it fails back, the clients that are hanging will usually recover fine. We can reboot a client prior to failback and it will be fine, and stopping and starting the ganesha service on a protocol node will also sometimes resolve the issues.
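
For reference, the operations involved are roughly the following (node names are placeholders):

    mmces node suspend -N <protocol-node>       # trigger a manual failover
    mmces node resume -N <protocol-node>        # fail the addresses back
    mmces service stop NFS -N <protocol-node>   # bounce ganesha on one node
    mmces service start NFS -N <protocol-node>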

So, has anyone seen this sort of issue, and does anyone have suggestions for how we could either debug it further or work around it?

We are currently running the packages
nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (the 4.2.2-2 release ones).

At one point we were seeing it a lot, and could track it back to an underlying GPFS network issue that was causing protocol nodes to be expelled occasionally. We resolved that and the issues became less apparent, but maybe we just fixed one failure mode and so see it less often.

On the clients we use -o sync,hard, by the way, as in the IBM docs.
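
That is, mounts roughly along these lines, where the export path is only an example:

    mount -t nfs -o sync,hard MYNFSSERVER.bham.ac.uk:/gpfs/export /mnt/export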

On a client showing the issues, we'll see NFS-related messages in dmesg like:
[Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not responding, timed out

Which explains the client hang on certain mount points.

The symptoms feel very much like those logged in this Gluster/ganesha bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1354439


Thanks

Simon

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


