[gpfsug-discuss] Write performances and filesystem size

Ivano Talamo Ivano.Talamo at psi.ch
Wed Nov 22 08:23:22 GMT 2017


Hello Olaf,

thank you for your reply and for confirming that this is not expected, as we
also thought. We repeated the test with only 2 vdisks, without dedicated ones
for metadata, but the result did not change.
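
For completeness, the comparison used one dataAndMetadata vdisk per recovery
group; the stanza and commands below are only a sketch of that kind of setup,
with illustrative names, sizes and block size rather than the exact values we
used:

  %vdisk: vdiskName=sf_g_01_vdisk_dm
    rg=sf-g-01
    da=DA1
    blocksize=16m
    size=540t
    raidCode=8+2p
    diskUsage=dataAndMetadata
    failureGroup=1
    pool=system

  mmcrvdisk -F vdisk_dm.stanza   # create the vdisk in the recovery group
  mmcrnsd -F vdisk_dm.stanza     # register it as an NSD
  mmcrfs sf_fs -F vdisk_dm.stanza -j scatter -B 16M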

We have now opened a PMR.

Thanks,
Ivano

On 16/11/17 17:08, Olaf Weiser wrote:
> Hi Ivano,
> so from this output the performance degradation is not explainable .. in my
> current environments, having multiple file systems (so multiple vdisks on
> one BB), it works fine ..
>
> as said .. just open a PMR .. I wouldn't consider this the "expected
> behavior".
> The only thing is .. the MD vdisks are a bit small .. so maybe redo your
> tests and, for a simple comparison between 1/1, 1/2 or 1/4 capacity, test
> with 2 vdisks only and dataAndMetadata
> cheers
>
>
>
>
>
> From:        Ivano Talamo <Ivano.Talamo at psi.ch>
> To:        gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Date:        11/16/2017 08:52 AM
> Subject:        Re: [gpfsug-discuss] Write performances and filesystem size
> Sent by:        gpfsug-discuss-bounces at spectrumscale.org
> ------------------------------------------------------------------------
>
>
>
> Hi,
>
> as additional information I paste the recovery group details for the
> full-size and half-size cases.
> In both cases:
> - data is on sf_g_01_vdisk01
> - metadata on sf_g_01_vdisk02
> - sf_g_01_vdisk07 is not used in the filesystem.
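>
> Both listings are the recovery group details as reported by
> mmlsrecoverygroup, i.e. something along the lines of (exact invocation from
> memory):
>
>   mmlsrecoverygroup sf-g-01 -L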
>
> This is with the full-space filesystem:
>
>                     declustered                     current       allowable
>  recovery group       arrays     vdisks  pdisks  format version  format version
>  -----------------  -----------  ------  ------  --------------  --------------
>  sf-g-01                      3       6      86  4.2.2.0         4.2.2.0
>
>
>  declustered   needs                            replace                           scrub      background activity
>     array     service  vdisks  pdisks  spares  threshold  free space  duration   task   progress  priority
>  -----------  -------  ------  ------  ------  ---------  ----------  --------  -------------------------
>  NVR          no            1       2     0,0          1    3632 MiB   14 days  scrub       95%  low
>  DA1          no            4      83    2,44          1      57 TiB   14 days  scrub        0%  low
>  SSD          no            1       1     0,0          1     372 GiB   14 days  scrub       79%  low
>
>                                            declustered                           checksum
>  vdisk                 RAID code              array     vdisk size  block size  granularity  state  remarks
>  --------------------  ------------------  -----------  ----------  ----------  -----------  -----  -------
>  sf_g_01_logTip        2WayReplication     NVR              48 MiB       2 MiB      4096      ok     logTip
>  sf_g_01_logTipBackup  Unreplicated        SSD              48 MiB       2 MiB      4096      ok     logTipBackup
>  sf_g_01_logHome       4WayReplication     DA1             144 GiB       2 MiB      4096      ok     log
>  sf_g_01_vdisk02       3WayReplication     DA1             103 GiB       1 MiB     32 KiB     ok
>  sf_g_01_vdisk07       3WayReplication     DA1             103 GiB       1 MiB     32 KiB     ok
>  sf_g_01_vdisk01       8+2p                DA1             540 TiB      16 MiB     32 KiB     ok
>
>  config data         declustered array   spare space    remarks
>  ------------------  ------------------  -------------  -------
>  rebuild space       DA1                 53 pdisk       increasing VCD spares is suggested
>
>  config data         disk group fault tolerance         remarks
>  ------------------  ---------------------------------  -------
>  rg descriptor       1 enclosure + 1 drawer + 2 pdisk   limited by rebuild space
>  system index        1 enclosure + 1 drawer + 2 pdisk   limited by rebuild space
>
>  vdisk                 disk group fault tolerance         remarks
>  --------------------  ---------------------------------  -------
>  sf_g_01_logTip        1 pdisk
>  sf_g_01_logTipBackup  0 pdisk
>  sf_g_01_logHome       1 enclosure + 1 drawer + 1 pdisk   limited by rebuild space
>  sf_g_01_vdisk02       1 enclosure + 1 drawer             limited by rebuild space
>  sf_g_01_vdisk07       1 enclosure + 1 drawer             limited by rebuild space
>  sf_g_01_vdisk01       2 pdisk
>
>
> This is with the half-space filesystem:
>
>                     declustered                     current       allowable
>  recovery group       arrays     vdisks  pdisks  format version  format version
>  -----------------  -----------  ------  ------  --------------  --------------
>  sf-g-01                      3       6      86  4.2.2.0         4.2.2.0
>
>
>  declustered   needs                            replace                           scrub      background activity
>     array     service  vdisks  pdisks  spares  threshold  free space  duration   task   progress  priority
>  -----------  -------  ------  ------  ------  ---------  ----------  --------  -------------------------
>  NVR          no            1       2     0,0          1    3632 MiB   14 days  scrub        4%  low
>  DA1          no            4      83    2,44          1     395 TiB   14 days  scrub        0%  low
>  SSD          no            1       1     0,0          1     372 GiB   14 days  scrub       79%  low
>
>                                            declustered                           checksum
>  vdisk                 RAID code              array     vdisk size  block size  granularity  state  remarks
>  --------------------  ------------------  -----------  ----------  ----------  -----------  -----  -------
>  sf_g_01_logTip        2WayReplication     NVR              48 MiB       2 MiB      4096      ok     logTip
>  sf_g_01_logTipBackup  Unreplicated        SSD              48 MiB       2 MiB      4096      ok     logTipBackup
>  sf_g_01_logHome       4WayReplication     DA1             144 GiB       2 MiB      4096      ok     log
>  sf_g_01_vdisk02       3WayReplication     DA1             103 GiB       1 MiB     32 KiB     ok
>  sf_g_01_vdisk07       3WayReplication     DA1             103 GiB       1 MiB     32 KiB     ok
>  sf_g_01_vdisk01       8+2p                DA1             270 TiB      16 MiB     32 KiB     ok
>
>  config data         declustered array   spare space    remarks
>  ------------------  ------------------  -------------  -------
>  rebuild space       DA1                 68 pdisk       increasing VCD spares is suggested
>
>  config data         disk group fault tolerance         remarks
>  ------------------  ---------------------------------  -------
>  rg descriptor       1 node + 3 pdisk                   limited by rebuild space
>  system index        1 node + 3 pdisk                   limited by rebuild space
>
>  vdisk                 disk group fault tolerance         remarks
>  --------------------  ---------------------------------  -------
>  sf_g_01_logTip        1 pdisk
>  sf_g_01_logTipBackup  0 pdisk
>  sf_g_01_logHome       1 node + 2 pdisk                   limited by rebuild space
>  sf_g_01_vdisk02       1 node + 1 pdisk                   limited by rebuild space
>  sf_g_01_vdisk07       1 node + 1 pdisk                   limited by rebuild space
>  sf_g_01_vdisk01       2 pdisk
>
>
> Thanks,
> Ivano
>
>
>
>
> On 16/11/17 13:03, Olaf Weiser wrote:
>> Thx, that makes it a bit clearer .. as your vdisk is big enough to span
>> over all pdisks, each of your tests (1/1, 1/2 or 1/4 of capacity) should
>> bring the same performance ..
>>
>> You mentioned something about the vdisk layout ..
>> So in your test, for the full-capacity case, you use just one vdisk per
>> RG - so 2 in total for 'data' - right?
>>
>> What about MD .. did you create a separate vdisk for MD, and what size?
>>
>> Sent from IBM Verse
>>
>> Ivano Talamo --- Re: [gpfsug-discuss] Write performances and filesystem
>> size ---
>>
>> From:                 "Ivano Talamo" <Ivano.Talamo at psi.ch>
>> To:                 "gpfsug main discussion list" <gpfsug-discuss at spectrumscale.org>
>> Date:                 Thu 16.11.2017 03:49
>> Subject:                 Re: [gpfsug-discuss] Write performances and filesystem size
>>
>> ------------------------------------------------------------------------
>>
>> Hello Olaf,
>>
>> yes, I confirm it is the Lenovo version of the ESS GL2, so 2 enclosures /
>> 4 drawers / 166 disks in total.
>>
>> Each recovery group has one declustered array with all disks inside, so
>> vdisks use all the physical ones, even in the case of a vdisk that is
>> 1/4 of the total size.
>>
>> Regarding the layout allocation we used scatter.
>>
>> The tests were done on the freshly created filesystem, so there is no
>> close-to-full effect. We ran gpfsperf write seq.
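>>
>> For the record, the invocation was along these lines (target file, record
>> size, total size and thread count are indicative only, not the exact
>> parameters of every run):
>>
>>   gpfsperf create seq /gpfs/sf_fs/testfile -r 16m -n 200g -th 16
>>   gpfsperf write seq /gpfs/sf_fs/testfile -r 16m -n 200g -th 16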
>>
>> Thanks,
>> Ivano
>>
>>
>> On 16/11/17 04:42, Olaf Weiser wrote:
>>> Sure... as long as we assume that really all physical disks are used .. the
>>> fact that it was 1/2 or 1/4 of the capacity might turn out to mean that one
>>> or two complete enclosures are eliminated ... ? .. that's why I was asking
>>> for more details ..
>>>
>>> I don't see this degradation in my environments .. as long as the vdisks
>>> are big enough to span over all pdisks (which should be the case for
>>> capacities in the range of TB) ... the performance stays the same
>>>
>>> Sent from IBM Verse
>>>
>>> Jan-Frode Myklebust --- Re: [gpfsug-discuss] Write performances and
>>> filesystem size ---
>>>
>>> From:    "Jan-Frode Myklebust" <janfrode at tanso.net>
>>> To:    "gpfsug main discussion list" <gpfsug-discuss at spectrumscale.org>
>>> Date:    Wed 15.11.2017 21:35
>>> Subject:    Re: [gpfsug-discuss] Write performances and filesystem size
>>>
>>> ------------------------------------------------------------------------
>>>
>>> Olaf, this looks like a Lenovo «ESS GLxS» version. It should be using the
>>> same number of spindles for any size filesystem, so I would also expect
>>> them to perform the same.
>>>
>>>
>>>
>>> -jf
>>>
>>>
>>> On Wed 15 Nov 2017 at 11:26, Olaf Weiser <olaf.weiser at de.ibm.com> wrote:
>>>
>>>     to add a comment ... very simply ... depending on how you allocate the
>>>     physical block storage ... if you simply use fewer physical resources
>>>     when reducing the capacity (in the same ratio) .. you get what you
>>>     see ....
>>>
>>>     so you need to tell us how you allocate your block storage .. (Are you
>>>     using RAID controllers, where are your LUNs coming from, are fewer
>>>     RAID groups involved when reducing the capacity? ...)
>>>
>>>     GPFS can be configured to give you pretty much what the hardware can
>>>     deliver .. if you reduce resources ... you'll get less; if you enhance
>>>     your hardware .. you get more ... almost regardless of the total
>>>     capacity in #blocks ..
>>>
>>>
>>>
>>>
>>>
>>>
>>>     From:        "Kumaran Rajaram" <kums at us.ibm.com>
>>>     To:        gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
>>>     Date:        11/15/2017 11:56 AM
>>>     Subject:        Re: [gpfsug-discuss] Write performances and filesystem size
>>>     Sent by:        gpfsug-discuss-bounces at spectrumscale.org
>>>
>> ------------------------------------------------------------------------
>>>
>>>
>>>
>>>     Hi,
>>>
>>>     >>Am I missing something? Is this an expected behaviour and someone
>>>     has an explanation for this?
>>>
>>>     Based on your scenario, write degradation as the file-system is
>>>     populated is possible if you formatted the file-system with
>>>     "-j cluster".
>>>
>>>     For consistent file-system performance, we recommend the mmcrfs
>>>     "-j scatter" layoutMap. Also, we need to ensure that the mmcrfs "-n"
>>>     option is set properly.
>>>
>>>     [snip]
>>>     # mmlsfs <fs> | egrep 'Block allocation| Estimated number'
>>>     -j                 scatter                  Block allocation type
>>>     -n                 128                      Estimated number of nodes that will mount file system
>>>     [/snip]
>>>
>>>
>>>     [snip from man mmcrfs]
>>>     layoutMap={scatter|cluster}
>>>                      Specifies the block allocation map type. When
>>>                      allocating blocks for a given file, GPFS first
>>>                      uses a round-robin algorithm to spread the data
>>>                      across all disks in the storage pool. After a
>>>                      disk is selected, the location of the data
>>>                      block on the disk is determined by the block
>>>                      allocation map type. If cluster is
>>>                      specified, GPFS attempts to allocate blocks in
>>>                      clusters. Blocks that belong to a particular
>>>                      file are kept adjacent to each other within
>>>                      each cluster. If scatter is specified,
>>>                      the location of the block is chosen randomly.
>>>
>>>                      The cluster allocation method may provide
>>>                      better disk performance for some disk
>>>                      subsystems in relatively small installations.
>>>                      The benefits of clustered block allocation
>>>                      diminish when the number of nodes in the
>>>                      cluster or the number of disks in a file system
>>>                      increases, or when the file system's free space
>>>                      becomes fragmented. The cluster allocation
>>>                      method is the default for GPFS clusters with
>>>                      eight or fewer nodes and for file systems with
>>>                      eight or fewer disks.
>>>
>>>                      The scatter allocation method provides
>>>                      more consistent file system performance by
>>>                      averaging out performance variations due to
>>>                      block location (for many disk subsystems, the
>>>                      location of the data relative to the disk edge
>>>                      has a substantial effect on performance). This
>>>                      allocation method is appropriate in most cases
>>>                      and is the default for GPFS clusters with more
>>>                      than eight nodes or file systems with more than
>>>                      eight disks.
>>>
>>>                      The block allocation map type cannot be changed
>>>                      after the storage pool has been created.
>>>
>>>     -n NumNodes
>>>             The estimated number of nodes that will mount the file
>>>             system in the local cluster and all remote clusters.
>>>             This is used as a best guess for the initial size of
>>>             some file system data structures. The default is 32.
>>>             This value can be changed after the file system has been
>>>             created but it does not change the existing data
>>>             structures. Only the newly created data structure is
>>>             affected by the new value. For example, new storage
>>>             pool.
>>>
>>>             When you create a GPFS file system, you might want to
>>>             overestimate the number of nodes that will mount the
>>>             file system. GPFS uses this information for creating
>>>             data structures that are essential for achieving maximum
>>>             parallelism in file system operations (for more
>>>             information, see GPFS architecture in IBM Spectrum
>>>             Scale: Concepts, Planning, and Installation Guide). If
>>>             you are sure there will never be more than 64 nodes,
>>>             allow the default value to be applied. If you are
>>>             planning to add nodes to your system, you should specify
>>>             a number larger than the default.
>>>
>>>     [/snip from man mmcrfs]
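>>>
>>>     As an illustration, both options are set at file-system creation time,
>>>     e.g. (device name, stanza file and block size below are placeholders):
>>>
>>>       mmcrfs fs1 -F nsd.stanza -j scatter -n 128 -B 16M
>>>
>>>     -j cannot be changed afterwards; -n can be changed later, but as noted
>>>     above only newly created data structures pick up the new value.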
>>>
>>>     Regards,
>>>     -Kums
>>>
>>>
>>>
>>>
>>>
>>>     From:        Ivano Talamo <Ivano.Talamo at psi.ch>
>>>     To:        <gpfsug-discuss at spectrumscale.org>
>>>     Date:        11/15/2017 11:25 AM
>>>     Subject:        [gpfsug-discuss] Write performances and filesystem size
>>>     Sent by:        gpfsug-discuss-bounces at spectrumscale.org
>>>
>> ------------------------------------------------------------------------
>>>
>>>
>>>
>>>     Hello everybody,
>>>
>>>     together with my colleagues I am currently running some tests on a new
>>>     DSS G220 system and we see some unexpected behaviour.
>>>
>>>     What we see is that write performance (we have not tested reads yet)
>>>     decreases as the filesystem size decreases.
>>>
>>>     I will not go into the details of the tests, but here are some numbers:
>>>
>>>     - with a filesystem using the full 1.2 PB space we get 14 GB/s as the
>>>       sum of the disk activity on the two IO servers;
>>>     - with a filesystem using half of the space we get 10 GB/s;
>>>     - with a filesystem using 1/4 of the space we get 5 GB/s.
>>>
>>>     We also saw that performance is not affected by the vdisk layout,
>>>     i.e. taking the full space with one big vdisk or with 2 half-size
>>>     vdisks per RG gives the same performance.
>>>
>>>     To our understanding the IO should be spread evenly across all the
>>>     pdisks in the declustered array, and looking at iostat all disks seem
>>>     to be accessed. So there must be some other element that affects
>>>     performance.
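>>>
>>>     For the iostat check we simply watched per-device utilisation on both
>>>     IO servers during a run, roughly like this (the device name pattern is
>>>     site-specific):
>>>
>>>       iostat -xm 2 | grep '^sd'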
>>>
>>>     Am I missing something? Is this expected behaviour, and does someone
>>>     have an explanation for it?
>>>
>>>     Thank you,
>>>     Ivano
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>


