[gpfsug-discuss] GPFS, Pagepool and Block size -> Performance reduces with larger block size

valleru at cbio.mskcc.org valleru at cbio.mskcc.org
Tue Sep 18 18:13:44 BST 2018


Hello All,

This is a continuation of the previous discussion I had with Sven.
However, contrary to what I mentioned previously, I realize that this is “not” related to mmap; I see it when doing random freads.

I see that the block size of the filesystem matters when reading from the pagepool.
I see a major difference in performance between 1M and 16M when doing a lot of random small freads with all of the data in the pagepool.

Performance for 1M is an order of magnitude better than the performance I see for 16M.

The GPFS setup that we currently have is:
Version : 5.0.1-0.5
Filesystem version: 19.01 (5.0.1.0)
Block-size : 16M

I had set the filesystem block size to 16M, thinking that I would get more performance for both random and sequential reads from 16M than from smaller block sizes.
With GPFS 5.0, I made use of 1024 sub-blocks instead of 32, and thus do not lose a lot of storage space even with 16M.
I had run a few benchmarks, and I did see that 16M performed better “when hitting storage/disks” with respect to bandwidth for random/sequential on small/large reads.

However, with this particular workload, which freads chunks of data randomly from hundreds of files, I see that the number of page faults increases with block size and actually reduces the performance.
1M performs a lot better than 16M, and maybe I would get even better performance with less than 1M.
It gives the best performance when reading from local disk, with a 4K block-size filesystem.

What I mean by performance for this workload is not the bandwidth, but the amount of time it takes to do each iteration/read batch of data.

I figure what is happening is:
fread is trying to read a full block size of 16M, which is good in a way when it hits the hard disk.
But the application could be using just a small part of that 16M. Thus, when randomly reading (freads) a lot of data in 16M chunks, it is page faulting a lot more and causing the performance to drop.
I could try to make the application do read instead of fread, but I fear that could be bad too, since it might be hitting the disk with a very small block size, and that is not good.
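
For instance, here is a minimal sketch of the two paths I am weighing against each other (the path, offset and buffer cap are placeholders and my own assumptions, not the user's actual application): a buffered fread, where the stdio buffer can at least be capped with setvbuf, versus a plain pread of only the bytes that are actually needed.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define READ_SIZE 4096   /* what the application actually consumes per call */

    int main(void)
    {
        char buf[READ_SIZE];
        long offset = 18599189L;          /* some arbitrary offset in the file */

        /* Buffered path: each fread may fill stdio's internal buffer with far
         * more than READ_SIZE bytes, touching more of the pagepool than the
         * application needs.  Capping the buffer with setvbuf (before the
         * first read) limits how much is pulled in per call. */
        FILE *fp = fopen("/gpfs/fs16m/testfile", "rb");   /* placeholder path */
        if (!fp)
            return 1;
        static char stdio_buf[1 << 20];                   /* e.g. cap at 1M */
        setvbuf(fp, stdio_buf, _IOFBF, sizeof(stdio_buf));
        fseek(fp, offset, SEEK_SET);
        if (fread(buf, 1, READ_SIZE, fp) != READ_SIZE)
            fprintf(stderr, "short fread\n");
        fclose(fp);

        /* Unbuffered path: pread pulls exactly READ_SIZE bytes, but on a
         * pagepool miss that means a very small I/O going to disk. */
        int fd = open("/gpfs/fs16m/testfile", O_RDONLY);
        if (fd < 0)
            return 1;
        if (pread(fd, buf, READ_SIZE, offset) != READ_SIZE)
            fprintf(stderr, "short pread\n");
        close(fd);
        return 0;
    }

If the stdio buffer ends up being sized from the block size that the filesystem reports, a single fread could touch 16M of pagepool data for a 4K request, which would match the page-fault behaviour I am seeing.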

The way I see things now:
I believe it would be best if the application did random reads of 4K/1M from the pagepool but somehow did 16M reads from the rotating disks.

I don’t see any way of doing the above other than a different approach, where I create a filesystem with a smaller block size (1M or less) on SSDs as a tier.

May I please ask for advice on whether what I am understanding/seeing is right, and what the best possible solution is for the above scenario?

Regards,
Lohit

On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru <valleru at cbio.mskcc.org>, wrote:
> Hey Sven,
>
> This is regarding mmap issues and GPFS.
> We had discussed previously of experimenting with GPFS 5.
>
> I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2
>
> I have yet to experiment with mmap performance, but before that, I am seeing weird hangs with GPFS 5 and I think they could be related to mmap.
>
> Have you seen GPFS ever hang on this syscall?
> [Tue Apr 10 04:20:13 2018] [<ffffffffa0a92155>] _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26]
>
> I see the above when the kernel hangs and throws out a series of trace calls.
>
> I somehow think the above trace is related to processes hanging on GPFS forever. There are no errors in GPFS, however.
>
> Also, I think the above happens only when the mmap threads go above a particular number.
>
> We had faced a similar issue in 4.2.3, and it was resolved in a patch to 4.2.3.2. At that time, the issue happened when the number of mmap threads went above worker1threads. According to the ticket, it was an mmap race condition that GPFS was not handling well.
>
> I am not sure if this issue is a repeat, and I have yet to isolate the incident and test with an increasing number of mmap threads.
>
> I am not 100 percent sure this is related to mmap yet, but I just wanted to ask if you have seen anything like the above.
>
> Thanks,
>
> Lohit
>
> On Feb 22, 2018, 3:59 PM -0500, Sven Oehme <oehmes at gmail.com>, wrote:
> > Hi Lohit,
> >
> > I am working with Ray on an mmap performance improvement right now, which most likely has the same root cause as yours; see -->  http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html
> > The thread above has gone silent after a couple of back-and-forths, but Ray and I have active communication in the background and will repost as soon as there is something new to share.
> > I am happy to look at this issue after we finish with Ray's workload; but first let's finish his, get you to try the same fix, and see if there is still something missing.
> >
> > By the way, if people would share how they use mmap and what applications they use (home-grown, something like LMDB that uses mmap under the covers, etc.), please let me know, so I get a better picture of how wide the usage is with GPFS. I know a lot of the ML/DL workloads are using it, but I would like to know what else is out there that I might not think about. Feel free to drop me a personal note; I might not reply right away, but eventually I will.
> >
> > thx. sven
> >
> >
> > > On Thu, Feb 22, 2018 at 12:33 PM <valleru at cbio.mskcc.org> wrote:
> > > > Hi all,
> > > >
> > > > I wanted to know how mmap interacts with the GPFS pagepool with respect to the filesystem block size.
> > > > Does the efficiency depend on the mmap read size and the block size of the filesystem, even if all the data is cached in the pagepool?
> > > >
> > > > GPFS 4.2.3.2 and CentOS 7.
> > > >
> > > > Here is what I observed:
> > > >
> > > > I was testing a user script that uses mmap to read from files of 100MB to 500MB.
> > > >
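> > > > Roughly, the read pattern is along the lines of this minimal sketch (the path is a placeholder and the 4K read size is my assumption; it is not the user's actual script):
> > > >
> > > >     #include <fcntl.h>
> > > >     #include <stdlib.h>
> > > >     #include <string.h>
> > > >     #include <sys/mman.h>
> > > >     #include <sys/stat.h>
> > > >     #include <unistd.h>
> > > >
> > > >     int main(void)
> > > >     {
> > > >         const char *path = "/gpfs/fs4m/sample.dat";   /* placeholder path */
> > > >         int fd = open(path, O_RDONLY);
> > > >         if (fd < 0)
> > > >             return 1;
> > > >
> > > >         struct stat st;
> > > >         if (fstat(fd, &st) < 0)
> > > >             return 1;
> > > >
> > > >         char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
> > > >         if (base == MAP_FAILED)
> > > >             return 1;
> > > >
> > > >         /* Random 4K reads across the mapping; every first touch of a page
> > > >          * is a page fault that GPFS satisfies from the pagepool (or from
> > > >          * disk on a miss). */
> > > >         char buf[4096];
> > > >         volatile unsigned long sum = 0;
> > > >         srand(42);
> > > >         for (int i = 0; i < 100000; i++) {
> > > >             size_t span = (size_t)st.st_size - sizeof(buf);
> > > >             size_t off = ((size_t)rand() % span) & ~(size_t)0xFFF;
> > > >             memcpy(buf, base + off, sizeof(buf));
> > > >             sum += (unsigned char)buf[0];   /* keep reads from being optimized away */
> > > >         }
> > > >
> > > >         munmap(base, st.st_size);
> > > >         close(fd);
> > > >         return 0;
> > > >     }
> > > >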
> > > > The above files are stored on 3 different filesystems.
> > > >
> > > > Compute nodes - 10G pagepool and 5G seqdiscardthreshold.
> > > >
> > > > 1. 4M block-size GPFS filesystem, with separate metadata and data. Data on nearline and metadata on SSDs.
> > > > 2. 1M block-size GPFS filesystem as an AFM cache cluster, "with all the required files fully cached" from the above GPFS cluster as home. Data and metadata together on SSDs.
> > > > 3. 16M block-size GPFS filesystem, with separate metadata and data. Data on nearline and metadata on SSDs.
> > > >
> > > > When I run the script the first time for "each" filesystem:
> > > > I see from mmdiag --iohist that GPFS reads from the files and caches them into the pagepool as it reads.
> > > >
> > > > When I run it the second time, I see that there are no IO requests from the compute node to the GPFS NSD servers, which is expected since all the data from the 3 filesystems is cached.
> > > >
> > > > However, the time taken for the script to run is different for the files in the 3 different filesystems, although I know that they are just "mmapping"/reading from pagepool/cache and not from disk.
> > > >
> > > > Here is the difference in time, for IO just from pagepool:
> > > >
> > > > 20s - 4M block size
> > > > 15s - 1M block size
> > > > 40s - 16M block size
> > > >
> > > > Why do I see a difference when doing mmap reads from different block-size filesystems, even though I see that the IO requests are not hitting the disks, just the pagepool?
> > > >
> > > > I am willing to share the strace output and mmdiag outputs if needed.
> > > >
> > > > Thanks,
> > > > Lohit
> > > >
> > > > _______________________________________________
> > > > gpfsug-discuss mailing list
> > > > gpfsug-discuss at spectrumscale.org
> > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> > _______________________________________________
> > gpfsug-discuss mailing list
> > gpfsug-discuss at spectrumscale.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss