[gpfsug-discuss] Metadata with GNR code

Sven Oehme oehmes at gmail.com
Fri Sep 21 17:45:34 BST 2018


somebody did listen and remembered what I said :-D

... and you are absolutely correct: there is no need for SSDs to get great
zero-length mdtest results. Most people don't know that a create workload,
unless carefully constructed, almost exclusively stresses the filesystem
client and puts almost no load on the storage or the server side,
UNLESS any of the following is true:

1. your mdtest run lasts longer than the OS sync interval; the exact duration
depends on the OS, but it is typically 60 or 5 seconds.
2. the number of files you create exceeds your file cache setting
(maxFilesToCache)
3. you have filled your filesystem recovery log to the point where log
wrap kicks in

there are other possible reasons, but those are the top 3. the network is
kind of critical, but less so as long as there are enough tasks and nodes
running in parallel; for a small number of nodes you need a fast network
(fast as in low latency, not throughput) to handle the parallel token
traffic. given that Scale has an unused inode/token prefetcher, some inodes
are already 'owned' by the client before you even create your first file,
so network speed is less relevant as long as you stay under that limit.
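
to give a rough idea of which knobs are involved, something like the
following (just a sketch -- 'fs1', the 'clients' node class and the values
are placeholders, and the exact syntax can vary between Scale releases, so
check the docs for your level):

  mmlsconfig maxFilesToCache maxStatCache        # current client-side cache limits
  mmchconfig maxFilesToCache=1000000 -N clients  # needs to cover the files each node creates
  mmlsfs fs1 -L                                  # current recovery log size
  mmchfs fs1 -L 128M                             # a bigger log pushes out the log wrap point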

all of the above is obviously a burst, short-period statement. if you have a
sustained create workload, then obviously all of this needs to be written to
the storage devices in the right place, and this is the point where the
storage controller write cache, followed by the sustained de-staging rate to
the media, is the most critical piece, not the speed of the individual media,
e.g. NL-SAS or SSD. as long as the write cache can coalesce data well enough
and/or there is enough bandwidth to the media, combined with the fact that
Scale tries to align things in data and metadata blocks where possible,
NL-SAS drives are just fine for many cases.
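
if you want to see which block and subblock sizes Scale actually has to work
with on your filesystem, something like this shows it (again just a sketch,
'fs1' is a placeholder):

  mmlsfs fs1 -B    # data block size
  mmlsfs fs1 -f    # fragment / subblock size
  mmlsfs fs1       # full attribute listing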

But there is one particular area where flash in any form is unbeatable, and
that is the main reason people should buy it for metadata: eventually one
needs to read all that metadata back, do a directory listing, delete a
folder, etc. In all these cases you need to READ stuff back. while write
caches and smart logic in a filesystem client can help significantly with
writes, on reads there is no magic that can be done (except massive caches
again, but that's unrealistic for larger systems), so you have to get the
data from the media, and now a 100 usec response time on flash vs a 10 ms
average on disk will make a significant difference for real-world applications.
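
to put rough numbers behind that (back of the envelope, assuming one
synchronous inode read per file and no cache hits):

  ~10 ms per read   ->     ~100 inode reads/s per thread  ->  ~1 million inodes in roughly 3 hours
  ~100 us per read  ->  ~10,000 inode reads/s per thread  ->  the same million inodes in under 2 minutes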

btw, I offered to speak at the SC18 event in Dallas about Scale-related
work, even though I don't work for IBM anymore. if my talk gets accepted I
will see some of you in Dallas :-D

Sven


On Fri, Sep 21, 2018 at 3:14 AM Jan-Frode Myklebust <janfrode at tanso.net>
wrote:

> That reminds me of a point Sven made when I was trying to optimize mdtest
> results with metadata on FlashSystem... He sent me the following:
>
> -- started at 11/15/2015 15:20:39 --
> mdtest-1.9.3 was launched with 138 total task(s) on 23 node(s)
> Command line used: /ghome/oehmes/mpi/bin/mdtest-pcmpi9131-existingdir -d /ibm/fs2-4m-02/shared/mdtest-ec -i 1 -n 70000 -F -i 1 -w 0 -Z -u
> Path: /ibm/fs2-4m-02/shared
> FS: 32.0 TiB   Used FS: 6.7%   Inodes: 145.4 Mi   Used Inodes: 22.0%
> 138 tasks, 9660000 files
> SUMMARY: (of 1 iterations)
>    Operation                      Max            Min           Mean        Std Dev
>    ---------                      ---            ---           ----        -------
>    File creation     :     650440.486     650440.486     650440.486          0.000
>    File stat         :   23599134.618   23599134.618   23599134.618          0.000
>    File read         :    2171391.097    2171391.097    2171391.097          0.000
>    File removal      :    1007566.981    1007566.981    1007566.981          0.000
>    Tree creation     :          3.072          3.072          3.072          0.000
>    Tree removal      :          1.471          1.471          1.471          0.000
> -- finished at 11/15/2015 15:21:10 --
>
> from a GL6 -- only spinning disks -- pointing out that mdtest doesn't
> really require Flash/SSD. The keys to good results are:
>
> a) a large GPFS recovery log (mmchfs -L 128m)
>
> b) a high maxFilesToCache (you need to be able to cache all entries, so for
> 10 million files across 20 nodes you need at least 750k per node)
>
> c) a fast network; that's key to handle the token requests and metadata
> operations that need to go over the network.
>
>
>
>   -jf
>
> On Fri, Sep 21, 2018 at 10:22 AM Olaf Weiser <olaf.weiser at de.ibm.com>
> wrote:
>
>> see an mdtest run for a default block size filesystem ...
>> 4 MB blocksize.
>> metadata is on SSD;
>> data is on HDD ... which is not really relevant for this mdtest ;-)
>>
>>
>> -- started at 09/07/2018 06:54:54 --
>>
>> mdtest-1.9.3 was launched with 40 total task(s) on 20 node(s)
>> Command line used: mdtest -n 25000 -i 3 -u -d /homebrewed/gh24_4m_4m/mdtest
>> Path: /homebrewed/gh24_4m_4m
>> FS: 10.0 TiB   Used FS: 0.0%   Inodes: 12.0 Mi   Used Inodes: 2.3%
>>
>> 40 tasks, 1000000 files/directories
>>
>> SUMMARY: (of 3 iterations)
>>   Operation                      Max            Min           Mean        Std Dev
>>   ---------                      ---            ---           ----        -------
>>   Directory creation:     449160.409     430869.822     437002.187       8597.272
>>   Directory stat    :    6664420.560    5785712.544    6324276.731     385192.527
>>   Directory removal :     398360.058     351503.369     371630.648      19690.580
>>   File creation     :     288985.217     270550.129     279096.800       7585.659
>>   File stat         :    6720685.117    6641301.499    6674123.407      33833.182
>>   File read         :    3055661.372    2871044.881    2945513.966      79479.638
>>   File removal      :     215187.602     146639.435     179898.441      28021.467
>>   Tree creation     :         10.215          3.165          6.603          2.881
>>   Tree removal      :          5.484          0.880          2.418          2.168
>>
>> -- finished at 09/07/2018 06:55:42 --
>>
>>
>>
>>
>> Mit freundlichen Grüßen / Kind regards
>>
>>
>> Olaf Weiser
>>
>> EMEA Storage Competence Center Mainz, Germany / IBM Systems, Storage
>> Platform
>>
>> -------------------------------------------------------------------------------------------------------------------------------------------
>> IBM Deutschland
>> IBM Allee 1
>> 71139 Ehningen
>> Phone: +49-170-579-44-66
>> E-Mail: olaf.weiser at de.ibm.com
>>
>> -------------------------------------------------------------------------------------------------------------------------------------------
>> IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter
>> Geschäftsführung: Martina Koederitz (Vorsitzende), Susanne Peter, Norbert
>> Janzen, Dr. Christian Keller, Ivo Koerner, Markus Koerner
>> Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart,
>> HRB 14562 / WEEE-Reg.-Nr. DE 99369940
>>
>>
>>
>> From:        "Andrew Beattie" <abeattie at au1.ibm.com>
>> To:        gpfsug-discuss at spectrumscale.org
>> Date:        09/21/2018 02:34 AM
>> Subject:        Re: [gpfsug-discuss] Metadata with GNR code
>> Sent by:        gpfsug-discuss-bounces at spectrumscale.org
>> ------------------------------
>>
>>
>>
>> Simon,
>>
>> My recommendation is still very much to use SSD for metadata and NL-SAS
>> for data, and the GH14 / GH24 building blocks certainly make this much easier.
>>
>> Unless your filesystem is massive (Summit-sized) you will typically still
>> continue to benefit from the random IO performance of SSD (even RI SSD) in
>> comparison to NL-SAS.
>>
>> It still makes more sense to me to continue to use 2-copy or 3-copy
>> replication for metadata even in ESS / GNR style environments. The read
>> performance for metadata using 3-copy is still significantly better than in
>> any other scenario.
>>
>> As with anything there are exceptions to the rule, but my experience with
>> ESS, including ESS with SSD, so far still suggests that the standard thinking
>> on managing metadata and small-file IO remains the same -- even with the
>> improvements around sub-blocks in Scale V5.
>>
>> MDtest is still the typical benchmark for this comparison, and it shows some
>> very clear differences, even on SSD, between a large filesystem block size
>> with more sub-blocks and a smaller block size with 1/32 sub-blocks.
>>
>> This only gets worse if you change the storage media from SSD to NL-SAS.
>> Andrew Beattie
>> Software Defined Storage - IT Specialist
>> Phone: 614-2133-7927
>> E-mail: abeattie at au1.ibm.com
>>
>>
>> ----- Original message -----
>> From: Simon Thompson <S.J.Thompson at bham.ac.uk>
>> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>> To: "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
>> Cc:
>> Subject: [gpfsug-discuss] Metadata with GNR code
>> Date: Fri, Sep 21, 2018 3:29 AM
>>
>> Just wondering if anyone has any strong views/recommendations with
>> metadata when using GNR code?
>>
>>
>>
>> I know in “san” based GPFS, there is a recommendation to have data and
>> metadata split with the metadata on SSD.
>>
>>
>>
>> I’ve also heard that with GNR there isn’t much difference in splitting
>> data and metadata.
>>
>>
>>
>> We're looking at two systems and want to replicate metadata, but not data
>> (mostly), between them, so I'm not really sure how we'd do this without
>> having a separate system pool (and then NSDs in different failure groups)….
>>
>>
>>
>> If we used 8+2P vdisks for metadata only, would we still see no
>> difference in performance compared to mixed (I guess the 8+2P is still
>> spread over a DA so we’d get half the drives in the GNR system active…).
>>
>>
>>
>> Or should we stick SSD based storage in as well for the metadata pool?
>> (Which brings an interesting question about RAID code related to the recent
>> discussions on mirroring vs RAID5…)
>>
>>
>>
>> Thoughts welcome!
>>
>>
>>
>> Simon
>>