[gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN

Giovanni Bracco giovanni.bracco at enea.it
Thu Jun 11 15:06:45 BST 2020


256K is the logical RAID block size.
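
(One way to cross-check what a LUN advertises from the NSD server side, as
a sketch; /dev/sdX is a placeholder for the RAID6 LUN, and the values
reported depend on the storage controller:)

# cat /sys/block/sdX/queue/minimum_io_size   # often the RAID strip/chunk size
# cat /sys/block/sdX/queue/optimal_io_size   # often the full RAID stripe size
# lsblk -t /dev/sdX                          # MIN-IO, OPT-IO and RA columns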

Giovanni

On 11/06/20 10:01, Luis Bolinches wrote:
> On that RAID 6 what is the logical RAID block size? 128K, 256K, other?
> --
> Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations 
> / Salutacions
> Luis Bolinches
> Consultant IT Specialist
> IBM Spectrum Scale development
> ESS & client adoption teams
> Mobile Phone: +358503112585
> https://www.youracclaim.com/user/luis-bolinches
> Ab IBM Finland Oy
> Laajalahdentie 23
> 00330 Helsinki
> Uusimaa - Finland
> 
> *"If you always give you will always have" --  Anonymous*
> 
>     ----- Original message -----
>     From: Giovanni Bracco <giovanni.bracco at enea.it>
>     Sent by: gpfsug-discuss-bounces at spectrumscale.org
>     To: Jan-Frode Myklebust <janfrode at tanso.net>, gpfsug main discussion
>     list <gpfsug-discuss at spectrumscale.org>
>     Cc: Agostino Funel <agostino.funel at enea.it>
>     Subject: [EXTERNAL] Re: [gpfsug-discuss] very low read performance
>     in simple spectrum scale/gpfs cluster with a storage-server SAN
>     Date: Thu, Jun 11, 2020 10:53
>     Comments and updates in the text:
> 
>     On 05/06/20 19:02, Jan-Frode Myklebust wrote:
>      > fre. 5. jun. 2020 kl. 15:53 skrev Giovanni Bracco
>      > <giovanni.bracco at enea.it <mailto:giovanni.bracco at enea.it>>:
>      >
>      >     answer in the text
>      >
>      >     On 05/06/20 14:58, Jan-Frode Myklebust wrote:
>      >      >
>      >      > Could it maybe be interesting to drop the NSD servers, and
>      >      > let all nodes access the storage via SRP?
>      >
>      >     no, we cannot: the production clusters' fabric is a mix of a
>      >     QDR based cluster and an OPA based cluster, and the NSD nodes
>      >     provide the service to both.
>      >
>      >
>      > You could potentially still do SRP from the QDR nodes, and go via
>      > NSD for your Omni-Path nodes. Going via NSD seems like a bit of a
>      > pointless indirection.
> 
>     not really: both clusters, the 400 OPA nodes and the 300 QDR nodes,
>     share the same data lake in Spectrum Scale/GPFS, so the NSD servers
>     support the flexibility of the setup.
> 
>     The NSD servers make use of an IB SAN fabric (a Mellanox FDR switch)
>     to which, at the moment, 3 different generations of DDN storage
>     systems are connected: 9900/QDR, 7700/FDR and 7990/EDR. The idea was
>     to be able to add some less expensive storage, to be used when
>     performance is not the first priority.
> 
>      >
>      >
>      >
>      >      >
>      >      > Maybe turn off readahead, since it can cause performance
>      >      > degradation when GPFS reads 1 MB blocks scattered on the
>      >      > NSDs, so that read-ahead always reads too much. This might
>      >      > be the cause of the slow read seen; maybe you'll also
>      >      > overflow it if reading from both NSD servers at the same
>      >      > time?
>      >
>      >     I have switched the readahead off and this produced a small
>      >     (~10%) increase in performance when reading from an NSD server,
>      >     but no change in the bad behaviour for the GPFS clients.
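>      >
>      >     (One way to toggle the block-device readahead on the NSD
>      >     servers, as a sketch; /dev/sdX is a placeholder for the RAID6
>      >     LUN, and the storage controller's own prefetch is a separate
>      >     setting:)
>      >
>      >     # blockdev --getra /dev/sdX    # current readahead in 512-byte sectors
>      >     # blockdev --setra 0 /dev/sdX  # switch readahead off for this device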
>      >
>      >
>      >      >
>      >      >
>      >      > Plus, it's always nice to give a bit more pagepool to the
>      >      > clients than the default. I would prefer to start with 4 GB.
>      >
>      >     we'll also do that and we'll let you know!
>      >
>      >
>      > Could you show your mmlsconfig? Likely you should set maxMBpS to
>      > indicate what kind of throughput a client can do (it affects GPFS
>      > readahead/writebehind). I would typically also increase
>      > workerThreads on your NSD servers.
> 
>     At this moment this is the output of mmlsconfig
> 
>     # mmlsconfig
>     Configuration data for cluster GPFSEXP.portici.enea.it:
>     -------------------------------------------------------
>     clusterName GPFSEXP.portici.enea.it
>     clusterId 13274694257874519577
>     autoload no
>     dmapiFileHandleSize 32
>     minReleaseLevel 5.0.4.0
>     ccrEnabled yes
>     cipherList AUTHONLY
>     verbsRdma enable
>     verbsPorts qib0/1
>     [cresco-gpfq7,cresco-gpfq8]
>     verbsPorts qib0/2
>     [common]
>     pagepool 4G
>     adminMode central
> 
>     File systems in cluster GPFSEXP.portici.enea.it:
>     ------------------------------------------------
>     /dev/vsd_gexp2
>     /dev/vsd_gexp3
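>
>     (pagepool is already at 4G above; a sketch of how the remaining
>     suggestions could be applied, with illustrative values and placeholder
>     node names for the NSD servers. Some of these settings only take
>     effect after GPFS is restarted on the affected nodes:)
>
>     # mmchconfig maxMBpS=6000 -N nsdserver1,nsdserver2
>     # mmchconfig workerThreads=1024 -N nsdserver1,nsdserver2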
> 
> 
>      >
>      >
>      > A 1 MB blocksize is a bit bad for your 9+p+q RAID with a 256 KB
>      > strip size. When you write one GPFS block, less than half a RAID
>      > stripe is written, which means you need to read back some data to
>      > calculate the new parities. I would prefer a 4 MB block size, and
>      > maybe also change to 8+p+q so that one GPFS block is a multiple of
>      > a full 2 MB stripe.
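>      >
>      > (Spelling the arithmetic out: a 9+p+q array with 256 KB strips has
>      > a full data stripe of 9 x 256 KB = 2304 KB, so a 1 MB GPFS block
>      > fills only 4 of the 9 data strips and writing a block forces a
>      > read-modify-write to update the parities. With 8+p+q the full data
>      > stripe is 8 x 256 KB = 2048 KB = 2 MB, so a 4 MB GPFS block covers
>      > exactly two full stripes and the parities can be computed without
>      > reading anything back.)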
>      >
>      >
>      >     -jf
> 
>     we have now added another file system based on 2 NSDs on RAID6 8+p+q,
>     keeping the 1 MB block size so as not to change too many things at the
>     same time, but there is no substantial change in the very low read
>     performance, which is still on the order of 50 MB/s, while write
>     performance is 1000 MB/s.
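>
>     (In case it is useful, a sketch of a simple streaming test of this
>     kind; /gexp2/testfile is a placeholder path on the GPFS file system,
>     and the read should be done on a different node, or after a remount,
>     so that it does not come from the pagepool. "mmdiag --iohist" also
>     shows the size and latency of the most recent NSD I/Os:)
>
>     # dd if=/dev/zero of=/gexp2/testfile bs=1M count=10000   # ~10 GB write
>     # dd if=/gexp2/testfile of=/dev/null bs=1M count=10000   # read it back
>     # mmdiag --iohist | tail -30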
> 
>     Any other suggestion is welcomed!
> 
>     Giovanni
> 
> 
> 
>     --
>     Giovanni Bracco
>     phone  +39 351 8804788
>     E-mail  giovanni.bracco at enea.it
>     WWW http://www.afs.enea.it/bracco
> 
> 

-- 
Giovanni Bracco
phone  +39 351 8804788
E-mail  giovanni.bracco at enea.it
WWW http://www.afs.enea.it/bracco


