[gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS

Stijn De Weirdt stijn.deweirdt at ugent.be
Wed Oct 17 14:42:02 BST 2018


hi all,

has anyone tried tools like eatmydata that allow the user to "ignore"
the syncs? (there's another tool with a less explicit name, if that
would make you feel better ;)
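
for reference, a minimal sketch of how such an LD_PRELOAD shim could work
(hypothetical code, not the actual eatmydata implementation; it simply
turns the application's explicit syncs into no-ops):

    /* nosync.c - illustrative only */
    #include <unistd.h>

    /* the preloaded definitions win over libc, so the application's
     * fsync()/fdatasync() calls return immediately without touching disk */
    int fsync(int fd)
    {
        (void)fd;
        return 0;
    }

    int fdatasync(int fd)
    {
        (void)fd;
        return 0;
    }

    /* build and use (paths illustrative):
     *   gcc -shared -fPIC -o nosync.so nosync.c
     *   LD_PRELOAD=./nosync.so tar xf archive.tar
     */

the real eatmydata goes further (e.g. it also handles sync() and O_SYNC
opens), but the idea is the same.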

stijn

On 10/17/2018 03:26 PM, Tomer Perry wrote:
> Just to clarify (from man exports):
> 
>   "async  This option allows the NFS server to violate the NFS protocol
>    and reply to requests before any changes made by that request have
>    been committed to stable storage (e.g. disc drive).
> 
>    Using this option usually improves performance, but at the cost that
>    an unclean server restart (i.e. a crash) can cause data to be lost or
>    corrupted."
> 
> With the Ganesha implementation in Spectrum Scale, it was decided not to 
> allow this violation - so this async export option wasn't exposed.
> I believe that for those customers that agree to take the risk, using the 
> async mount option (from the client) will achieve similar behavior.
> 
> Regards,
> 
> Tomer Perry
> Scalable I/O Development (Spectrum Scale)
> email: tomp at il.ibm.com
> 1 Azrieli Center, Tel Aviv 67021, Israel
> Global Tel:    +1 720 3422758
> Israel Tel:      +972 3 9188625
> Mobile:         +972 52 2554625
> 
> 
> 
> 
> From:   "Olaf Weiser" <olaf.weiser at de.ibm.com>
> To:     gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Date:   17/10/2018 16:16
> Subject:        Re: [gpfsug-discuss] Preliminary conclusion: single 
> client, single thread, small files - native Scale vs NFS
> Sent by:        gpfsug-discuss-bounces at spectrumscale.org
> 
> 
> 
> Hallo Jan, 
> you can expect slightly improved numbers from the lower response times of 
> HAWC ... but the loss of performance comes from the fact that GPFS (or 
> async kNFS) writes with multiple parallel threads - as opposed to e.g. tar 
> via Ganesha NFS, which ends up as a single thread with an fsync on each 
> file.
> 
> you'll never outperform e.g. 128 parallel (maybe individually slower) 
> threads running write-behind with one single, but fast, thread ...
> 
> so, as Alex suggests: if possible, use the GPFS client or kNFS for those 
> types of workloads.
> 
> From:        Jan-Frode Myklebust <janfrode at tanso.net>
> To:        gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Date:        10/17/2018 02:24 PM
> Subject:        Re: [gpfsug-discuss] Preliminary conclusion: single 
> client, single thread, small files - native Scale vs NFS
> Sent by:        gpfsug-discuss-bounces at spectrumscale.org
> 
> 
> 
> Do you know if the slow throughput is caused by the network/nfs-protocol 
> layer, or does it help to use faster storage (ssd)? If on storage, have 
> you considered if HAWC can help?
> 
> I'm thinking about adding an SSD pool as a first tier to hold the active 
> dataset for a similar setup, but that's mainly to solve the small file 
> read workload (i.e. random I/O).
> 
> 
> -jf
> On Wed, 17 Oct 2018 at 07:47, Alexander Saupp <Alexander.Saupp at de.ibm.com> wrote:
> Dear Mailing List readers,
> 
> I've come to a preliminary conclusion that explains the behavior in an 
> appropriate manner, so I'm trying to summarize my current thinking with 
> this audience.
> 
> Problem statement: 
> Big performance deviation between native GPFS (fast) and loopback NFS 
> mount on the same node (way slower) for a single-client, single-thread, 
> small-files workload.
> 
> 
> Current explanation:
> tar close()s files without calling fsync() on them. That is an application 
> choice and common behavior; the idea is to let OS write caching speed up 
> process run time.
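> 
> For illustration, a minimal sketch (hypothetical code, error handling 
> omitted - not tar's actual source) of the two patterns:
> 
>     #include <fcntl.h>
>     #include <unistd.h>
> 
>     /* what tar effectively does: write and close(), no fsync();
>      * destaging is left to the OS write cache */
>     void write_cached(const char *path, const void *buf, size_t len)
>     {
>         int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
>         (void)write(fd, buf, len);
>         close(fd);                 /* may return before data is on disk */
>     }
> 
>     /* the durable variant: fsync() before close() */
>     void write_durable(const char *path, const void *buf, size_t len)
>     {
>         int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
>         (void)write(fd, buf, len);
>         fsync(fd);                 /* blocks until the data is on stable storage */
>         close(fd);
>     }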
> 
> When running locally on ext3 / xfs / GPFS / ... this allows async destaging 
> of data down to disk, trading some data safety for better performance. 
> As we're talking about write caching on the same node that the application 
> runs on, a crash is a misfortune, but it stays within the same failure 
> domain. E.g. if you run a compile job that includes extraction of a tar 
> and the node crashes, you'll have to restart the entire job anyhow.
> 
> The NFSv2 spec defined that NFS I/Os are to be 'sync', probably so that the 
> compile job on the NFS client would survive a crash of the NFS server - the 
> failure domain is different.
> 
> NFSv3 (RFC 1813, see below) acknowledged the performance impact and 
> introduced asynchronous ("unsafe") writes, which are handled similarly to 
> local I/Os and can be destaged in the background.
> 
> Keep in mind: applications, whether running locally or via NFS, can always 
> decide to call fsync() before close(), which ensures that data is destaged 
> to persistent storage right away.
> But it's the application's choice whether that is really mandatory or 
> whether performance has higher priority.
> 
> The Linux 'sync' tool (man sync) flushes 'dirty' memory cache down to 
> disk - largely filesystem independent.
> 
> -> A single-client, single-thread, small-files workload on GPFS can be 
> destaged asynchronously, hiding latency and parallelizing disk I/Os.
> -> NFS client I/Os are sync, so the second I/O can only be started after 
> the first one has hit non-volatile storage -> much higher latency.
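> 
> To put a (purely hypothetical) number on it: at ~1 ms per synchronous 
> commit, 10,000 small files cost at least 10,000 x 1 ms = 10 s in commit 
> waits alone, regardless of bandwidth; with write-behind those waits 
> overlap and largely disappear.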
> 
> 
> The Spectrum Scale NFS implementation (based on Ganesha) does not support 
> the async mount option, which is a bit of a pity. There might also be 
> implementation differences compared to kernel NFS; I did not investigate 
> in that direction.
> 
> However, the principle behind the difference is explained for me by the 
> above behavior. 
> 
> One workaround that I have seen working well for multiple customers was to 
> replace the NFS client with a Spectrum Scale NSD client.
> That has two advantages, but is certainly not suitable in all cases:
> - Improved speed thanks to the efficient NSD protocol and NSD client-side 
> write caching
> - Write caching in the same failure domain as the application (on the NSD 
> client), which seems more reasonable than NFS server-side write caching.
> 
> References:
> 
> NFS sync vs async
> https://tools.ietf.org/html/rfc1813
> The write throughput bottleneck caused by the synchronous definition of 
> write in the NFS version 2 protocol has been addressed by adding support 
> so that the NFS server can do unsafe writes.
> Unsafe writes are writes which have not been committed to stable storage 
> before the operation returns. This specification defines a method for 
> committing these unsafe writes to stable storage in a reliable way.
> 
> 
> sync() vs fsync()
> https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/com.ibm.aix.performance/using_sync_fsync_calls.htm
> 
> - An application program makes an fsync() call for a specified file. This 
> causes all of the pages that contain modified data for that file to be 
> written to disk. The writing is complete when the fsync() call returns to 
> the program.
> 
> - An application program makes a sync() call. This causes all of the file 
> pages in memory that contain modified data to be scheduled for writing to 
> disk. The writing is not necessarily complete when the sync() call returns 
> to the program.
> 
> - A user can enter the sync command, which in turn issues a sync() call. 
> Again, some of the writes may not be complete when the user is prompted 
> for input (or the next command in a shell script is processed).
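> 
> A minimal (hypothetical) fragment showing the difference in scope:
> 
>     #include <unistd.h>
> 
>     void flush_examples(int fd)
>     {
>         fsync(fd);   /* returns only after this file's modified pages are written */
>         sync();      /* schedules all modified pages system-wide for writing;
>                       * may return before the writes complete */
>     }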
> 
> 
> close() vs fsync()
> A successful close does not guarantee that the data has been successfully 
> saved to disk, as the kernel defers writes. It is not common for a file 
> system to flush the buffers when the stream is closed. If you need to be 
> sure that the data is physically stored, use fsync(2). (It will depend on 
> the disk hardware at this point.)
> 
> 
> Mit freundlichen Grüßen / Kind regards
> 
> Alexander Saupp
> 
> IBM Systems, Storage Platform, EMEA Storage Competence Center
> 
> 
> Phone:  +49 7034-643-1512
> Mobile: +49-172 7251072
> Email:  alexander.saupp at de.ibm.com
> 
> IBM Deutschland GmbH
> Am Weiher 24
> 65451 Kelsterbach
> Germany
> 
> IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter
> Geschäftsführung: Matthias Hartmann (Vorsitzender), Norbert Janzen, Stefan 
> Lutz, Nicole Reimer, Dr. Klaus Seifert, Wolfgang Wendt
> Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, 
> HRB 14562 / WEEE-Reg.-Nr. DE 99369940 
> 
> 
> 
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> 



More information about the gpfsug-discuss mailing list