[gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
Alexander Saupp
Alexander.Saupp at de.ibm.com
Wed Oct 17 12:44:41 BST 2018
Dear Mailing List readers,
I've come to a preliminary conclusion that explains the behavior in an
appropriate manner, so I'm summarizing my current thinking for this
audience.
Problem statement:
Large performance deviation between native GPFS (fast) and a loopback NFS
mount on the same node (way slower) for a single client, single thread,
small files workload.
Current explanation:
tar calls close() on files without a preceding fsync(). That is an
application choice and common behavior. The idea is to let OS write
caching speed up process run time.
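A minimal C sketch of that pattern (a hypothetical helper, not tar's
actual code):

    /* Sketch: write a file and close() it without fsync().
     * write() and close() return once the data is in the page
     * cache; the kernel destages it to disk asynchronously. */
    #include <fcntl.h>
    #include <unistd.h>

    int extract_file(const char *path, const char *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len) { /* cached, not yet on disk */
            close(fd);
            return -1;
        }
        return close(fd); /* returns quickly; no durability guarantee */
    }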
When running locally on ext3 / xfs / GPFS / ... this allows async
destaging of data down to disk, trading some data safety for better
performance.
As we're talking about write caching on the same node that the
application runs on, a crash is a misfortune, but it stays within the
same failure domain.
E.g. if you run a compile job that includes extracting a tar and the
node crashes, you'll have to restart the entire job anyhow.
The NFSv2 spec defined NFS IOs to be 'sync', probably because the
failure domains differ: the compile job on the NFS client should survive
a crash of the NFS server.
NFSv3 (RFC 1813, quoted below) acknowledged the performance impact and
introduced unstable ('async') writes plus a COMMIT operation, which
handle IOs similarly to local IOs and allow destaging in the background.
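A simplified sketch of that RFC 1813 mechanism (arguments reduced to the
essentials):

    client: WRITE(file, offset, data, stable=UNSTABLE)
    server: replies once the data is in server memory, returning a
            write verifier
    client: ...more UNSTABLE WRITEs...
    client: COMMIT(file, offset, count)
    server: replies once the data has reached stable storage; if the
            verifier changed (server reboot), the client must resend
            the writes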
Keep in mind: applications, independent of whether they run locally or
via NFS, can always decide to call fsync(), which ensures that
data is destaged to persistent storage right away.
But it's the application's choice whether that's really mandatory or
whether performance has higher priority.
The Linux 'sync' tool (man sync) allows syncing 'dirty' memory cache
down to disk, largely independent of the filesystem.
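A small C sketch of the two flush primitives involved (hypothetical
path; see also the references below):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/example", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        (void)write(fd, "data", 4); /* lands in the page cache */

        fsync(fd);  /* blocks until this file's data is on stable storage */
        close(fd);

        sync();     /* schedules writeback of ALL dirty pages; per POSIX
                       it may return before the writing is complete */
        return 0;
    }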
-> A single client, single thread, small files workload on GPFS can be
destaged async, hiding latency and parallelizing disk IOs.
-> NFS client IOs are sync, so the second IO can only be started after the
first one has hit non-volatile storage -> much higher latency.
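A hypothetical back-of-envelope illustration (assumed numbers, not
measurements): extracting 10,000 small files over a sync NFS mount costs
at least one stable-storage round trip per file; at an assumed 1 ms per
round trip that is 10+ seconds of purely serialized latency, while the
cached local case can overlap those IOs and finish far faster.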
The Spectrum Scale NFS implementation (based on Ganesha) does not
support the async export option, which is a bit of a pity. There might
also be implementation differences compared to kernel-nfs; I did not
investigate in that direction.
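For comparison, kernel-nfs exposes that choice per export in
/etc/exports (man exports); the line below is just an illustration with
a made-up path:

    # 'async' lets the server reply before data has reached stable storage
    /gpfs/fs1  *(rw,async)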
However, for me the behavior above explains the principle behind the
difference.
One workaround that I saw working well for multiple customers was to
replace the NFS client with a Spectrum Scale NSD client.
That has two advantages, but is certainly not suitable in all cases:
- Improved speed through the efficient NSD protocol and NSD client-side
write caching
- Write caching in the same failure domain as the application (on the
NSD client), which seems more reasonable than NFS server-side write
caching.
References:
NFS sync vs async
https://tools.ietf.org/html/rfc1813
The write throughput bottleneck caused by the synchronous definition
of write in the NFS version 2 protocol has been addressed by adding support
so that the NFS server can do unsafe writes.
Unsafe writes are writes which have not been committed to stable
storage before the operation returns. This specification defines a method
for committing these unsafe writes to stable storage in a reliable way.
sync() vs fsync()
https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/com.ibm.aix.performance/using_sync_fsync_calls.htm
- An application program makes an fsync() call for a specified file.
This causes all of the pages that contain modified data for that file to be
written to disk. The writing is complete when the fsync() call returns to
the program.
- An application program makes a sync() call. This causes all of the
file pages in memory that contain modified data to be scheduled for writing
to disk. The writing is not necessarily complete when the sync() call
returns to the program.
- A user can enter the sync command, which in turn issues a sync()
call. Again, some of the writes may not be complete when the user is
prompted for input (or the next command in a shell script is processed).
close() vs fsync()
A successful close does not guarantee that the data has been
successfully saved to disk, as the kernel defers writes. It is not
common for a filesystem to flush the buffers when the stream is closed.
If you need to be sure that the data is physically stored, use fsync(2).
(It will depend on the disk hardware at this point.)
Kind regards
Alexander Saupp
IBM Systems, Storage Platform, EMEA Storage Competence Center
Phone: +49 7034-643-1512
Mobile: +49-172 7251072
Email: alexander.saupp at de.ibm.com
IBM Deutschland GmbH, Am Weiher 24, 65451 Kelsterbach, Germany