[gpfsug-discuss] Write performances and filesystem size

Ivano Talamo Ivano.Talamo at psi.ch
Thu Nov 16 13:51:51 GMT 2017


Hi,

as additional information I'm pasting the recovery group information for the 
full-size and half-size cases; the command used is sketched below.
In both cases:
- data is on sf_g_01_vdisk01
- metadata on sf_g_01_vdisk02
- sf_g_01_vdisk07 is not used in the filesystem.
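
For reference, the listings below were taken with the recovery group detail 
command, roughly:

   mmlsrecoverygroup sf-g-01 -L

(shown only as a sketch of the invocation; the -L output lists the 
declustered arrays, the vdisks and the disk group fault tolerance of the 
recovery group).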

This is with the full-space filesystem:

                     declustered                     current       allowable
  recovery group       arrays     vdisks  pdisks  format version  format version
  -----------------  -----------  ------  ------  --------------  --------------
  sf-g-01                      3       6      86  4.2.2.0         4.2.2.0


  declustered   needs                            replace                scrub     background activity
     array     service  vdisks  pdisks  spares  threshold  free space  duration   task   progress  priority
  -----------  -------  ------  ------  ------  ---------  ----------  --------  -------------------------
  NVR          no            1       2     0,0          1    3632 MiB   14 days  scrub       95%  low
  DA1          no            4      83    2,44          1      57 TiB   14 days  scrub        0%  low
  SSD          no            1       1     0,0          1     372 GiB   14 days  scrub       79%  low

                                            declustered                           checksum
  vdisk                 RAID code             array      vdisk size  block size  granularity  state  remarks
  --------------------  ------------------  -----------  ----------  ----------  -----------  -----  -------
  sf_g_01_logTip        2WayReplication     NVR              48 MiB       2 MiB     4096      ok     logTip
  sf_g_01_logTipBackup  Unreplicated        SSD              48 MiB       2 MiB     4096      ok     logTipBackup
  sf_g_01_logHome       4WayReplication     DA1             144 GiB       2 MiB     4096      ok     log
  sf_g_01_vdisk02       3WayReplication     DA1             103 GiB       1 MiB    32 KiB     ok
  sf_g_01_vdisk07       3WayReplication     DA1             103 GiB       1 MiB    32 KiB     ok
  sf_g_01_vdisk01       8+2p                DA1             540 TiB      16 MiB    32 KiB     ok

  config data         declustered array   spare space    remarks
  ------------------  ------------------  -------------  -------
  rebuild space       DA1                 53 pdisk       increasing VCD spares is suggested

  config data         disk group fault tolerance         remarks
  ------------------  ---------------------------------  -------
  rg descriptor       1 enclosure + 1 drawer + 2 pdisk   limited by rebuild space
  system index        1 enclosure + 1 drawer + 2 pdisk   limited by rebuild space

  vdisk                 disk group fault tolerance         remarks
  --------------------  ---------------------------------  -------
  sf_g_01_logTip        1 pdisk
  sf_g_01_logTipBackup  0 pdisk
  sf_g_01_logHome       1 enclosure + 1 drawer + 1 pdisk   limited by rebuild space
  sf_g_01_vdisk02       1 enclosure + 1 drawer             limited by rebuild space
  sf_g_01_vdisk07       1 enclosure + 1 drawer             limited by rebuild space
  sf_g_01_vdisk01       2 pdisk


This is with the half-space filesystem:

                     declustered                     current       allowable
  recovery group       arrays     vdisks  pdisks  format version  format version
  -----------------  -----------  ------  ------  --------------  --------------
  sf-g-01                      3       6      86  4.2.2.0         4.2.2.0


  declustered   needs                            replace                scrub     background activity
     array     service  vdisks  pdisks  spares  threshold  free space  duration   task   progress  priority
  -----------  -------  ------  ------  ------  ---------  ----------  --------  -------------------------
  NVR          no            1       2     0,0          1    3632 MiB   14 days  scrub        4%  low
  DA1          no            4      83    2,44          1     395 TiB   14 days  scrub        0%  low
  SSD          no            1       1     0,0          1     372 GiB   14 days  scrub       79%  low

                                            declustered                           checksum
  vdisk                 RAID code             array      vdisk size  block size  granularity  state  remarks
  --------------------  ------------------  -----------  ----------  ----------  -----------  -----  -------
  sf_g_01_logTip        2WayReplication     NVR              48 MiB       2 MiB     4096      ok     logTip
  sf_g_01_logTipBackup  Unreplicated        SSD              48 MiB       2 MiB     4096      ok     logTipBackup
  sf_g_01_logHome       4WayReplication     DA1             144 GiB       2 MiB     4096      ok     log
  sf_g_01_vdisk02       3WayReplication     DA1             103 GiB       1 MiB    32 KiB     ok
  sf_g_01_vdisk07       3WayReplication     DA1             103 GiB       1 MiB    32 KiB     ok
  sf_g_01_vdisk01       8+2p                DA1             270 TiB      16 MiB    32 KiB     ok

  config data         declustered array   spare space    remarks
  ------------------  ------------------  -------------  -------
  rebuild space       DA1                 68 pdisk       increasing VCD spares is suggested

  config data         disk group fault tolerance         remarks
  ------------------  ---------------------------------  -------
  rg descriptor       1 node + 3 pdisk                   limited by rebuild space
  system index        1 node + 3 pdisk                   limited by rebuild space

  vdisk                 disk group fault tolerance         remarks
  --------------------  ---------------------------------  -------
  sf_g_01_logTip        1 pdisk
  sf_g_01_logTipBackup  0 pdisk
  sf_g_01_logHome       1 node + 2 pdisk                   limited by rebuild space
  sf_g_01_vdisk02       1 node + 1 pdisk                   limited by rebuild space
  sf_g_01_vdisk07       1 node + 1 pdisk                   limited by rebuild space
  sf_g_01_vdisk01       2 pdisk


Thanks,
Ivano




On 16/11/17 13:03, Olaf Weiser wrote:
> Thanks, that makes it a bit clearer. Since your vdisk is big enough to span
> all pdisks in each of your tests, 1/1, 1/2 or 1/4 of the capacity should
> deliver the same performance.
>
> You mentioned something about the vdisk layout: so in your test, for the
> full-capacity case, you used just one vdisk per RG - i.e. 2 in total for
> 'data' - right?
>
> What about metadata? Did you create a separate vdisk for MD, and if so, at
> what size?
>
> Sent from IBM Verse
>
> Ivano Talamo --- Re: [gpfsug-discuss] Write performances and filesystem
> size ---
>
> From:	"Ivano Talamo" <Ivano.Talamo at psi.ch>
> To:	"gpfsug main discussion list" <gpfsug-discuss at spectrumscale.org>
> Date:	Thu 16.11.2017 03:49
> Subject:	Re: [gpfsug-discuss] Write performances and filesystem size
>
> ------------------------------------------------------------------------
>
> Hello Olaf,
>
> yes, I confirm it is the Lenovo version of the ESS GL2, so 2
> enclosures/4 drawers/166 disks in total.
>
> Each recovery group has one declustered array with all disks inside, so
> vdisks use all the physical ones, even in the case of a vdisk that is
> 1/4 of the total size.
>
> Regarding the block allocation layout, we used scatter.
>
> The tests were done on the freshly created filesystem, so there is no
> close-to-full effect. And we ran gpfsperf write seq.
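>
> For the record, the invocation was roughly along these lines (the target
> path, data size, record size and thread count here are placeholders, not
> the exact values we used):
>
>    /usr/lpp/mmfs/samples/perf/gpfsperf write seq /gpfs/sf/testfile -n 200g -r 16m -th 8
>
> i.e. a multi-threaded sequential write of a large file, with the record
> size matching the 16 MiB data vdisk block size.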
>
> Thanks,
> Ivano
>
>
> On 16/11/17 04:42, Olaf Weiser wrote:
>> Sure... as long as we assume that really all physical disks are used. The
>> mention of 1/2 or 1/4 of the capacity might mean that one or two complete
>> enclosures were eliminated - that's why I was asking for more details.
>>
>> I don't see this degradation in my environments. As long as the vdisks are
>> big enough to span all pdisks (which should be the case for capacities in
>> the TB range), the performance stays the same.
>>
>> Sent from IBM Verse
>>
>> Jan-Frode Myklebust --- Re: [gpfsug-discuss] Write performances and
>> filesystem size ---
>>
>> From:    "Jan-Frode Myklebust" <janfrode at tanso.net>
>> To:    "gpfsug main discussion list" <gpfsug-discuss at spectrumscale.org>
>> Date:    Wed 15.11.2017 21:35
>> Subject:    Re: [gpfsug-discuss] Write performances and filesystem size
>>
>> ------------------------------------------------------------------------
>>
>> Olaf, this looks like a Lenovo «ESS GLxS» version. It should be using the
>> same number of spindles for any filesystem size, so I would also expect
>> them to perform the same.
>>
>>
>>
>> -jf
>>
>>
>> On Wed 15 Nov 2017 at 11:26, Olaf Weiser <olaf.weiser at de.ibm.com
>> <mailto:olaf.weiser at de.ibm.com>> wrote:
>>
>>     To add a comment... very simply, it depends on how you allocate the
>>     physical block storage. If you are simply using fewer physical
>>     resources when reducing the capacity (in the same ratio), you get what
>>     you see.
>>
>>     So you need to tell us how you allocate your block storage. (Are you
>>     using RAID controllers? Where are your LUNs coming from? Are fewer RAID
>>     groups involved when the capacity is reduced?)
>>
>>     GPFS can be configured to give you pretty much what the hardware can
>>     deliver. If you reduce resources you'll get less; if you enhance your
>>     hardware you get more - almost regardless of the total capacity in
>>     #blocks.
>>
>>
>>
>>
>>
>>
>>     From:        "Kumaran Rajaram" <kums at us.ibm.com
>>     <mailto:kums at us.ibm.com>>
>>     To:        gpfsug main discussion list
>>     <gpfsug-discuss at spectrumscale.org
>>     <mailto:gpfsug-discuss at spectrumscale.org>>
>>     Date:        11/15/2017 11:56 AM
>>     Subject:        Re: [gpfsug-discuss] Write performances and
>>     filesystem size
>>     Sent by:        gpfsug-discuss-bounces at spectrumscale.org
>>     <mailto:gpfsug-discuss-bounces at spectrumscale.org>
>>
>> ------------------------------------------------------------------------
>>
>>
>>
>>     Hi,
>>
>>     >>Am I missing something? Is this an expected behaviour and someone
>>     has an explanation for this?
>>
>>     Based on your scenario, write degradation as the file-system is
>>     populated is possible if you had formatted the file-system with "-j
>>     cluster".
>>
>>     For consistent file-system performance, we recommend the mmcrfs "-j
>>     scatter" layoutMap. Also, we need to ensure the mmcrfs "-n" is set
>>     properly.
>>
>>     [snip from mmlsfs]
>>     # mmlsfs <fs> | egrep 'Block allocation| Estimated number'
>>      -j                 scatter                  Block allocation type
>>      -n                 128                      Estimated number of nodes that will mount file system
>>     [/snip]
>>
>>
>>     [snip from man mmcrfs]
>>     layoutMap={scatter|cluster}
>>                      Specifies the block allocation map type. When
>>                      allocating blocks for a given file, GPFS first
>>                      uses a round-robin algorithm to spread the data
>>                      across all disks in the storage pool. After a
>>                      disk is selected, the location of the data
>>                      block on the disk is determined by the block
>>                      allocation map type. If cluster is
>>                      specified, GPFS attempts to allocate blocks in
>>                      clusters. Blocks that belong to a particular
>>                      file are kept adjacent to each other within
>>                      each cluster. If scatter is specified,
>>                      the location of the block is chosen randomly.
>>
>>                      The cluster allocation method may provide
>>                      better disk performance for some disk
>>                      subsystems in relatively small installations.
>>                      The benefits of clustered block allocation
>>                      diminish when the number of nodes in the
>>                      cluster or the number of disks in a file system
>>                      increases, or when the file system's free space
>>                      becomes fragmented. The cluster
>>                      allocation method is the default for GPFS
>>                      clusters with eight or fewer nodes and for file
>>                      systems with eight or fewer disks.
>>
>>                      The scatter allocation method provides
>>                      more consistent file system performance by
>>                      averaging out performance variations due to
>>                      block location (for many disk subsystems, the
>>                      location of the data relative to the disk edge
>>                      has a substantial effect on performance). This
>>                      allocation method is appropriate in most cases
>>                      and is the default for GPFS clusters with more
>>                      than eight nodes or file systems with more than
>>                      eight disks.
>>
>>                      The block allocation map type cannot be changed
>>                      after the storage pool has been created.
>>
>>     -n NumNodes
>>             The estimated number of nodes that will mount the file
>>             system in the local cluster and all remote clusters.
>>             This is used as a best guess for the initial size of
>>             some file system data structures. The default is 32.
>>             This value can be changed after the file system has been
>>             created but it does not change the existing data
>>             structures. Only the newly created data structure is
>>             affected by the new value. For example, a new storage
>>             pool.
>>
>>             When you create a GPFS file system, you might want to
>>             overestimate the number of nodes that will mount the
>>             file system. GPFS uses this information for creating
>>             data structures that are essential for achieving maximum
>>             parallelism in file system operations (for more
>>             information, see GPFS architecture in IBM Spectrum
>>             Scale: Concepts, Planning, and Installation Guide). If
>>             you are sure there will never be more than 64 nodes,
>>             allow the default value to be applied. If you are
>>             planning to add nodes to your system, you should specify
>>             a number larger than the default.
>>
>>     [/snip from man mmcrfs]
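>>
>>     As an illustration only (the device name, stanza file, block size and
>>     mount point below are placeholders, not values from this thread), a
>>     creation command applying the above recommendations would look roughly
>>     like:
>>
>>     # mmcrfs fs1 -F /tmp/nsd.stanza -j scatter -n 128 -B 16M -T /gpfs/fs1
>>
>>     where "-j scatter" selects the scatter block allocation map and
>>     "-n 128" deliberately overestimates the number of mounting nodes so
>>     that the internal data structures are sized for future growth.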
>>
>>     Regards,
>>     -Kums
>>
>>
>>
>>
>>
>>     From:        Ivano Talamo <Ivano.Talamo at psi.ch
>>     <mailto:Ivano.Talamo at psi.ch>>
>>     To:        <gpfsug-discuss at spectrumscale.org
>>     <mailto:gpfsug-discuss at spectrumscale.org>>
>>     Date:        11/15/2017 11:25 AM
>>     Subject:        [gpfsug-discuss] Write performances and filesystem size
>>     Sent by:        gpfsug-discuss-bounces at spectrumscale.org
>>     <mailto:gpfsug-discuss-bounces at spectrumscale.org>
>>
>> ------------------------------------------------------------------------
>>
>>
>>
>>     Hello everybody,
>>
>>     together with my colleagues we are currently running some tests on a
>>     new DSS G220 system and we see some unexpected behaviour.
>>
>>     What we see is that write performance (we have not tested reads yet)
>>     decreases as the filesystem size decreases.
>>
>>     I will not go into the details of the tests, but here are some numbers:
>>
>>     - with a filesystem using the full 1.2 PB space we get 14 GB/s as the
>>     sum of the disk activity on the two IO servers;
>>     - with a filesystem using half of the space we get 10 GB/s;
>>     - with a filesystem using 1/4 of the space we get 5 GB/s.
>>
>>     We also saw that performance is not affected by the vdisk layout, i.e.
>>     taking the full space with one big vdisk or two half-size vdisks per RG
>>     gives the same performance.
>>
>>     To our understanding the IO should be spread evenly across all the
>>     pdisks in the declustered array, and looking at iostat all disks seem
>>     to be accessed. So there must be some other element that affects
>>     performance.
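>>
>>     For reference, the per-disk activity was watched with something along
>>     the lines of (exact options incidental):
>>
>>        iostat -xm 5
>>
>>     on both IO servers while the benchmark was running.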
>>
>>     Am I missing something? Is this expected behaviour, and does someone
>>     have an explanation for it?
>>
>>     Thank you,
>>     Ivano
>>     _______________________________________________
>>     gpfsug-discuss mailing list
>>     gpfsug-discuss at spectrumscale.org <http://spectrumscale.org>
>>     http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>>
>>
>>     _______________________________________________
>>     gpfsug-discuss mailing list
>>     gpfsug-discuss at spectrumscale.org <http://spectrumscale.org>
>>     http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>>
>>
>>
>>     _______________________________________________
>>     gpfsug-discuss mailing list
>>     gpfsug-discuss at spectrumscale.org <http://spectrumscale.org>
>>     http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>>
>>
>>
>>
>> _______________________________________________
>> gpfsug-discuss mailing list
>> gpfsug-discuss at spectrumscale.org
>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>



More information about the gpfsug-discuss mailing list