[gpfsug-discuss] GPFS, Pagepool and Block size -> Perfomance reduces with larger block size

Sven Oehme oehmes at gmail.com
Wed Sep 19 19:11:42 BST 2018


seems like you never read my performance presentation from a few years ago
;-)

you can control this on a per-node basis, either for all i/o:

   prefetchAggressiveness = X

or individually for reads or writes:

   prefetchAggressivenessRead = X
   prefetchAggressivenessWrite = X

for a start i would turn it off completely via:

mmchconfig prefetchAggressiveness=0 -I -N nodename

that will turn it off only for that node and only until you restart the
node.
then see what happens

sven


On Wed, Sep 19, 2018 at 11:07 AM <valleru at cbio.mskcc.org> wrote:

> Thank you Sven.
>
> I mostly think it could be 1. or some other issue.
> I don’t think it could be 2., because i can replicate this issue no
> matter what the size of the dataset is. It happens even for a few files
> that could easily fit in the page pool.
>
> I do see a lot more page faults for 16M compared to 1M, so it could be
> related to many threads trying to compete for the same buffer space.
>
> I will try to take the trace with the trace=io option and see if i can
> find something.
>
> How do i turn off prefetching? Can i turn it off for a single node/client?
>
> Regards,
> Lohit
>
> On Sep 18, 2018, 5:23 PM -0400, Sven Oehme <oehmes at gmail.com>, wrote:
>
> Hi,
>
> taking a trace would tell for sure, but i suspect you might be hitting
> one or even multiple issues which have similar negative performance
> impacts but different root causes.
>
> 1. this could be serialization around buffer locks. the larger your
> blocksize gets, the larger the amount of data one of these pagepool buffers
> will maintain. if there is a lot of concurrency on a smaller amount of data,
> more threads potentially compete for the same buffer lock to copy stuff in
> and out of a particular buffer, hence things go slower compared to the same
> amount of data spread across more buffers, each of smaller size.
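A back-of-the-envelope way to see point 1 (purely illustrative; the real buffer-lock behaviour inside mmfsd is more complex than uniform random access) is birthday-problem arithmetic over the number of pagepool buffers:

```python
def collision_prob(threads, buffers):
    """P(at least two of `threads` simultaneous accesses land on the same
    pagepool buffer), assuming uniform random access (birthday problem)."""
    p_all_distinct = 1.0
    for i in range(threads):
        p_all_distinct *= (buffers - i) / buffers
    return 1.0 - p_all_distinct

# the same 16 GiB of hot data, held either as 16M or as 1M buffers
few_big = collision_prob(32, (16 * 1024) // 16)    # 1024 buffers of 16M
many_small = collision_prob(32, (16 * 1024) // 1)  # 16384 buffers of 1M
print(few_big, many_small)  # contention is far likelier with fewer, bigger buffers
```

The exact probabilities are not meaningful for GPFS itself; the point is only that halving the number of buffers (by doubling their size) more than doubles the chance that two threads want the same buffer lock at once.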
>
> 2. your data set is small'ish, let's say a couple of times bigger than the
> pagepool, and you random access it with multiple threads. what will happen
> is that because it doesn't fit into the cache it will be read from the
> backend. if multiple threads hit the same 16 mb block at once with multiple
> 4k random reads, it will read the whole 16 mb block because it thinks it
> will benefit from it later on out of cache, but because the access is fully
> random the same happens with the next block and the next and so on, and
> before you get back to this block it was pushed out of the cache for lack
> of enough pagepool.
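The thrashing in scenario 2 can be illustrated with a toy block-granular LRU cache (all numbers are made up; this models read amplification, not actual GPFS caching behaviour):

```python
from collections import OrderedDict
import random

def backend_bytes_read(dataset_mb, pagepool_mb, block_mb, reads=20000, seed=1):
    """Simulate 4k random reads through a block-granular LRU cache.

    Every cache miss fetches a whole block from the backend, so larger
    blocks amplify backend traffic when access is fully random.
    """
    rng = random.Random(seed)
    n_blocks = dataset_mb // block_mb
    capacity = pagepool_mb // block_mb          # blocks that fit in the cache
    cache = OrderedDict()
    fetched_mb = 0
    for _ in range(reads):
        blk = rng.randrange(n_blocks)
        if blk in cache:
            cache.move_to_end(blk)              # LRU hit
        else:
            fetched_mb += block_mb              # miss: read the full block
            cache[blk] = True
            if len(cache) > capacity:
                cache.popitem(last=False)       # evict least recently used
    return fetched_mb

# dataset a couple of times bigger than the pagepool, as in the scenario
small = backend_bytes_read(dataset_mb=20480, pagepool_mb=10240, block_mb=1)
large = backend_bytes_read(dataset_mb=20480, pagepool_mb=10240, block_mb=16)
print(small, large)  # the 16M case fetches far more data for the same 4k reads
```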
>
> i could think of multiple other scenarios, which is why it's so hard to
> accurately benchmark an application: you will design a benchmark to
> test an application, but it almost always behaves differently than
> you think it does :-)
>
> so best is to run the real application and see under which configuration
> it works best.
>
> you could also take a trace with trace=io and then look at
>
> TRACE_VNOP: READ:
> TRACE_VNOP: WRITE:
>
> and compare them to
>
> TRACE_IO: QIO: read
> TRACE_IO: QIO: write
>
> and see if the numbers summed up for both are somewhat equal. if
> TRACE_VNOP is significantly smaller than TRACE_IO you most likely do more
> i/o than you should, and turning prefetching off might actually make things
> faster.
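A rough sketch of that comparison (note: the trace line layout below is a hypothetical stand-in for illustration, not the exact mmfs trace format):

```python
import re

def sum_trace_counts(lines):
    """Count application-level reads (TRACE_VNOP) vs queued disk i/os
    (TRACE_IO) in a trace dump; the line format here is a guess."""
    vnop = sum(1 for l in lines if re.search(r'TRACE_VNOP:\s+READ:', l))
    qio = sum(1 for l in lines if re.search(r'TRACE_IO:\s+QIO:\s+read', l))
    return vnop, qio

# hypothetical trace excerpt: one application read, two queued backend i/os
sample = [
    "TRACE_VNOP: READ: inode 1234 offset 0 len 4096",
    "TRACE_IO: QIO: read data tag 1 1",
    "TRACE_IO: QIO: read data tag 1 2",
]
vnop, qio = sum_trace_counts(sample)
print(vnop, qio)  # qio noticeably larger than vnop -> prefetch is a suspect
```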
>
> keep in mind i am no longer working for IBM, so all i say might be obsolete
> by now; i no longer have access to the one and only truth aka the source
> code ... but if i am wrong i am sure somebody will point this out soon ;-)
>
> sven
>
>
>
>
> On Tue, Sep 18, 2018 at 10:31 AM <valleru at cbio.mskcc.org> wrote:
>
>> Hello All,
>>
>> This is a continuation of the previous discussion that i had with Sven.
>> However, contrary to what i had mentioned previously - i realize that this
>> is “not” related to mmap, and i see it when doing random freads.
>>
>> I see that block-size of the filesystem matters when reading from Page
>> pool.
>> I see a major difference in performance when comparing 1M to 16M, when
>> doing a lot of random small freads with all of the data in pagepool.
>>
>> Performance for 1M is an order of magnitude “better” than the performance
>> that i see for 16M.
>>
>> The GPFS that we have currently is :
>> Version : 5.0.1-0.5
>> Filesystem version: 19.01 (5.0.1.0)
>> Block-size : 16M
>>
>> I had made the filesystem block-size 16M, thinking that i would get
>> the most performance for both random/sequential reads from 16M rather than
>> the smaller block-sizes.
>> With GPFS 5.0, i made use of the 1024 sub-blocks instead of 32 and thus do
>> not lose a lot of storage space even with 16M.
>> I had run a few benchmarks and i did see that 16M was performing better
>> “when hitting storage/disks” with respect to bandwidth for
>> random/sequential on small/large reads.
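The sub-block arithmetic behind that choice, as a quick sketch (sub-block counts are those stated above: 32 for the older filesystem format, 1024 with the 5.x format at 16M):

```python
def subblock_size_kib(block_size_mib, subblocks_per_block):
    """Smallest allocation unit = block size / number of sub-blocks."""
    return block_size_mib * 1024 // subblocks_per_block

# older filesystem format: 32 sub-blocks per block
old = subblock_size_kib(16, 32)     # minimum allocation in KiB
# GPFS 5.x format with a 16M block size: 1024 sub-blocks per block
new = subblock_size_kib(16, 1024)   # minimum allocation in KiB
print(old, new)  # 512 16
```

So a small file wastes at most 16 KiB instead of up to 512 KiB, which is why the 16M block size no longer costs a lot of storage space.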
>>
>> However, with this particular workload - where it freads chunks of data
>> randomly from hundreds of files - I see that the number of page-faults
>> increases with block-size and actually reduces the performance.
>> 1M performs a lot better than 16M, and maybe i will get better
>> performance with less than 1M.
>> It gives the best performance when reading from local disk, with a 4K
>> block-size filesystem.
>>
>> What i mean by performance for this workload is not bandwidth, but the
>> amount of time that it takes to do each iteration/read of a batch of data.
>>
>> I figure what is happening is:
>> fread is trying to read a full block size of 16M - which is good in a
>> way when it hits the hard disk.
>> But the application could be using just a small part of that 16M. Thus,
>> when randomly reading (freads) a lot of data in 16M chunks - it is page
>> faulting a lot more and causing the performance to drop.
>> I could try to make the application do read instead of freads, but i fear
>> that could be bad too, since it might be hitting the disk with a very
>> small block size and that is not good.
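If changing the application is on the table, one hedged middle ground is capping the stdio buffer rather than abandoning fread entirely. In C that would be setvbuf(); the equivalent knob in Python (used here purely as a self-contained illustration, against a temp file rather than GPFS) is the buffering argument to open():

```python
import os
import tempfile

# create a small test file to read from
fd, path = tempfile.mkstemp()
os.write(fd, os.urandom(1 << 20))  # 1 MiB of data
os.close(fd)

# buffering=256*1024 caps the library's internal read buffer at 256 KiB,
# instead of letting it pick a buffer based on the filesystem block size
with open(path, 'rb', buffering=256 * 1024) as f:
    f.seek(512 * 1024)
    chunk = f.read(4096)           # a small "random" read

print(len(chunk))  # 4096
os.remove(path)
```

Whether a smaller stdio buffer actually helps on GPFS would need to be measured; the sketch only shows where the knob lives.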
>>
>> With the way i see things now -
>> I believe it could be best if the application does random reads of 4k/1M
>> from pagepool but somehow does 16M reads from rotating disks.
>>
>> I don’t see any way of doing the above other than following a different
>> approach where i create a filesystem with a smaller block size (1M or less
>> than 1M), on SSDs as a tier.
>>
>> May i please ask for advice on whether what i am understanding/seeing is
>> right, and what the best possible solution is for the above scenario.
>>
>> Regards,
>> Lohit
>>
>> On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru <valleru at cbio.mskcc.org>,
>> wrote:
>>
>> Hey Sven,
>>
>> This is regarding mmap issues and GPFS.
>> We had discussed previously of experimenting with GPFS 5.
>>
>> I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2
>>
>> I am yet to experiment with mmap performance, but before that - I am
>> seeing weird hangs with GPFS 5 and I think it could be related to mmap.
>>
>> Have you seen GPFS ever hang on this syscall?
>> [Tue Apr 10 04:20:13 2018] [<ffffffffa0a92155>]
>> _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26]
>>
>> I see the above when the kernel hangs and throws out a series of trace calls.
>>
>> I somehow think the above trace is related to processes hanging on GPFS
>> forever. There are no errors in GPFS however.
>>
>> Also, I think the above happens only when the mmap threads go above a
>> particular number.
>>
>> We had faced a similar issue in 4.2.3 and it was resolved in a patch to
>> 4.2.3.2. At that time, the issue happened when mmap threads went above
>> worker1threads. According to the ticket - it was a mmap race condition
>> that GPFS was not handling well.
>>
>> I am not sure if this issue is a repeat, and I am yet to isolate the
>> incident and test with an increasing number of mmap threads.
>>
>> I am not 100 percent sure if this is related to mmap yet, but i just
>> wanted to ask if you have seen anything like the above.
>>
>> Thanks,
>>
>> Lohit
>>
>> On Feb 22, 2018, 3:59 PM -0500, Sven Oehme <oehmes at gmail.com>, wrote:
>>
>> Hi Lohit,
>>
>> i am working with ray on a mmap performance improvement right now, which
>> most likely has the same root cause as yours, see -->
>> http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html
>> the thread above is silent after a couple of back and forth, but ray and
>> i have active communication in the background and will repost as soon as
>> there is something new to share.
>> i am happy to look at this issue after we finish with ray's workload if
>> there is something missing, but first let's finish his, get you to try the
>> same fix and see if there is something missing.
>>
>> btw. if people would share their use of MMAP - what applications they use
>> (home grown, just use lmdb which uses mmap under the covers, etc) - please
>> let me know so i get a better picture of how wide the usage is with GPFS.
>> i know a lot of the ML/DL workloads are using it, but i would like to know
>> what else is out there i might not think about. feel free to drop me a
>> personal note; i might not reply to it right away, but eventually.
>>
>> thx. sven
>>
>>
>> On Thu, Feb 22, 2018 at 12:33 PM <valleru at cbio.mskcc.org> wrote:
>>
>>> Hi all,
>>>
>>> I wanted to know, how does mmap interact with GPFS pagepool with respect
>>> to filesystem block-size?
>>> Does the efficiency depend on the mmap read size and the block-size of
>>> the filesystem even if all the data is cached in pagepool?
>>>
>>> GPFS 4.2.3.2 and CentOS7.
>>>
>>> Here is what i observed:
>>>
>>> I was testing a user script that uses mmap to read from 100MB to 500MB
>>> files.
>>>
>>> The above files are stored on 3 different filesystems.
>>>
>>> Compute nodes - 10G pagepool and 5G seqdiscardthreshold.
>>>
>>> 1. 4M block size GPFS filesystem, with separate metadata and data. Data
>>> on Near line and metadata on SSDs
>>> 2. 1M block size GPFS filesystem as a AFM cache cluster, "with all the
>>> required files fully cached" from the above GPFS cluster as home. Data and
>>> Metadata together on SSDs
>>> 3. 16M block size GPFS filesystem, with separate metadata and data. Data
>>> on Near line and metadata on SSDs
>>>
>>> When i run the script first time for “each" filesystem:
>>> I see that GPFS reads from the files and caches into the pagepool as it
>>> reads, per mmdiag --iohist
>>>
>>> When i run the second time, i see that there are no IO requests from the
>>> compute node to GPFS NSD servers, which is expected since all the data from
>>> the 3 filesystems is cached.
>>>
>>> However - the time taken for the script to run for the files in the 3
>>> different filesystems is different - although i know that they are just
>>> "mmapping"/reading from pagepool/cache and not from disk.
>>>
>>> Here is the difference in time, for IO just from pagepool:
>>>
>>> 20s - 4M block size
>>> 15s - 1M block size
>>> 40s - 16M block size
>>>
>>> Why do i see a difference when trying to mmap reads from different
>>> block-size filesystems, although i see that the IO requests are not hitting
>>> disks and just the pagepool?
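For what it's worth, the pagepool-only comparison can be approximated outside the user script with a small mmap micro-benchmark (a sketch against a temp file in page cache; the file size and read pattern are illustrative, not the user's actual workload):

```python
import mmap
import os
import random
import tempfile
import time

def time_random_mmap_reads(path, read_size=4096, n_reads=10000, seed=7):
    """Time n_reads random read_size-byte reads through an mmap mapping."""
    rng = random.Random(seed)
    size = os.path.getsize(path)
    with open(path, 'rb') as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        start = time.perf_counter()
        total = 0
        for _ in range(n_reads):
            off = rng.randrange(0, size - read_size)
            total += len(m[off:off + read_size])  # slicing faults pages in
        elapsed = time.perf_counter() - start
    return elapsed, total

# build a test file; in the real comparison this would live on each filesystem
fd, path = tempfile.mkstemp()
os.write(fd, os.urandom(8 << 20))   # 8 MiB test file
os.close(fd)
elapsed, total = time_random_mmap_reads(path)
print(elapsed, total)
os.remove(path)
```

Running the same function on identical files stored on the 1M, 4M and 16M filesystems would isolate the block-size effect from the rest of the script.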
>>>
>>> I am willing to share the strace output and mmdiag outputs if needed.
>>>
>>> Thanks,
>>> Lohit
>>>
>>> _______________________________________________
>>> gpfsug-discuss mailing list
>>> gpfsug-discuss at spectrumscale.org
>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>>>

