[gpfsug-discuss] Questions about mmap GPFS and compression

Zachary Giles zgiles at gmail.com
Wed Feb 15 16:43:37 GMT 2017


Just checked: we are definitely using PROT_READ, and the users have only
read permission on the files, so it should be purely reads. I guess that
deepens the concern, since we shouldn't be seeing the IO overhead you
mentioned.
We also use madvise... not sure whether that helps or hurts.
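
For reference, the mapping in our app is set up roughly like this (a
simplified, hypothetical sketch, not our actual code; the library makes
the real calls, and MADV_RANDOM here is just the hint that seems to
match our access pattern):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* Read-only mapping: no PROT_WRITE, so (per Leo's note) page-ins
         * should not go down the decompress-for-write path. */
        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        /* We also call madvise; MADV_RANDOM matches our access pattern,
         * though the exact hint the library passes may differ. */
        if (madvise(map, st.st_size, MADV_RANDOM) != 0)
            perror("madvise");

        /* ... the library reads from 'map' here ... */

        munmap(map, st.st_size);
        close(fd);
        return 0;
    }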

On Tue, Feb 14, 2017 at 7:14 PM, Leo Luan <leoluan at us.ibm.com> wrote:
> Does your internally developed application do only reads during its
> monthly run?  If so, can you change it to use the PROT_READ flag in the
> mmap call?  That way you will not get the 10-block decompression IO overhead
> and your files will remain compressed.  The decompression happens upon
> page-ins only if the mmap call includes the PROT_WRITE flag (or upon actual
> writes for non-mmap IOs).
>
> Leo
>
>
> ----- Original message -----
> From: Zachary Giles <zgiles at gmail.com>
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Cc:
> Subject: Re: [gpfsug-discuss] Questions about mmap GPFS and compression
> Date: Tue, Feb 14, 2017 8:10 AM
>
> Hi Leo,
>
> I agree with your view on compression and what it should be used for,
> in general. The read bandwidth amplification is definitely something
> we're seeing.
>
> Just a little more background on the files:
> The files themselves are not "cold" (archive); however, they are very
> lightly used. The data set is thousands of files that are each
> 100-200GB, totaling about a PB. The read pattern is a few GB from
> about 20% of the files once a month, so the total read is only several
> TB out of a PB every month (approximately). We can get a compression
> ratio of about 5:1 on these files with GPFS, so we can gain back ~800TB
> with compression. The total run time of the app (reading all those
> chunks, when uncompressed) is maybe an hour.
>
> Although leaving the files uncompressed would let the app work,
> there's a huge gain to be had if we can make compression work, by
> saving ~800TB. As it's such a small amount of data read each time, and
> also not too predictable (it's semi-random historical access), and as
> the length of the job is short enough, it's hard to justify
> decompressing large chunks of the system to run one job. I would have
> to decompress 200TB to read 10TB, recompress it, and then decompress a
> different (overlapping) 200TB next month. Compressing / decompressing
> sizable portions of the data takes days.
>
> I think there may be more of an issue than just performance, though.
> The decompression thread is running, internal file metadata is read
> fine, and most of the file is read fine. Just at times it gets stuck:
> the decompression thread is running in GPFS, the app is polling, and it
> just never comes back with the block.  I feel like there's a race
> condition here where a block is read and made available for the app,
> but thrown away before the app can read it, only to be decompressed
> again.
> It's strange how some block positions are slow (expected) and others
> just never come back (it will poll for days on a certain address).
> However, reading the file in order is fine.
>
> Is this a block caching issue? Can we tune up the number of blocks kept?
> I think with mmap the blocks are not kept in the page pool, correct?
>
> -Zach
>
> On Sat, Feb 11, 2017 at 5:23 PM, Leo Luan <leoluan at us.ibm.com> wrote:
>> Hi Zachary,
>>
>> When a compressed file is mmapped, each 4K read in your tests causes the
>> accessed part of the file to be decompressed (at the granularity of 10
>> GPFS
>> blocks).  For usual file sizes, the parts being accessed will be
>> decompressed and IO speed will be normal except for the first 4K IO in
>> each
>> 10-GPFS-block group.  For very large files, a large percentage of small
>> random IOs may keep getting amplified into 10-block decompression IOs for
>> a long time.  This is probably what happened in your mmap application run.
>>
>> The suggestion is to not compress files until they have become cold (not
>> likely to be accessed any time soon) and to avoid compressing very large
>> files
>> that may be accessed through mmap later.  The product already has a
>> built-in
>> protection preventing compression of files that are mmapped at compression
>> time.  You can add an exclude rule to the compression policy for files
>> that are identified to have mmap performance issues (in case they get
>> mmapped after being compressed in a periodic policy run).
>>
>> Leo Luan
>>
>> From: Zachary Giles <zgiles at gmail.com>
>> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
>> Date: 02/10/2017 01:57 PM
>> Subject: [gpfsug-discuss] Questions about mmap GPFS and compression
>> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>> ________________________________
>>
>>
>>
>> Hello All,
>>
>> I've been seeing some less-than-desirable behavior with mmap and
>> compression in GPFS. Curious whether others see something similar, or
>> have any ideas about whether this is accurate.
>> The guys here want me to open an IBM ticket, but I figured I'd see if
>> anyone has had this experience before.
>>
>> We have an internally developed app that runs on our cluster
>> referencing data sitting in GPFS. It is using mmap to access the files
>> due to a library we're using that requires it.
>>
>> If we run the app against some data on GPFS, it performs well,
>> finishing in a few minutes -- great. However, if we compress the
>> file (in GPFS), the app is still running after two days.
>> stracing the app shows that it is polling on a file descriptor forever,
>> as if a data block is still pending.
>>
>> I know mmap is supported with compression according to the manual
>> (with some stipulations), and that performance is expected to be much
>> lower since it's more large-block oriented due to data being
>> decompressed in groups -- no problem. But it seems like some data
>> should get returned.
>>
>> I'm surprised to find that only a very small amount of data is sitting in
>> the buffers (mmfsadm dump buffers) for the inodes in question. The
>> decompression thread is running continuously, while the app is still
>> polling for data from memory and sleeping, retrying, sleeping, repeat.
>>
>> What I believe is happening is that an mmap read request pulls a 4K
>> page out of a large decompression group and puts it in the buffer, then
>> the compression-group data is thrown away since GPFS has the result it
>> wants, only for the app to need another piece of data that would have
>> been in that group slightly later, which is decompressed again, put in
>> the buffer, and so on -- thus an effectively infinite slowdown. Perhaps
>> the data is also expiring out of the buffer before the app has a chance
>> to read it; I can't tell.  In any case, the app makes zero progress.
>>
>> I tried without our app, using fio: mmap on an uncompressed file with
>> 1 thread, iodepth 1, random reads, 4K blocks yields ~76MB/s (not
>> impressive). However, on a compressed file it is only ~20KB/s at best
>> (far less impressive). Reading a file using aio etc. is over 3GB/s on a
>> single thread without even trying.
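>>
>> For anyone who wants to approximate the mmap leg of that test without
>> fio, a rough standalone sketch along these lines (hypothetical test
>> code, not what we actually ran; the 10,000-read sample count is
>> arbitrary) does random 4K page reads from an mmapped file:
>>
>>     #include <fcntl.h>
>>     #include <stdio.h>
>>     #include <stdlib.h>
>>     #include <sys/mman.h>
>>     #include <sys/stat.h>
>>     #include <time.h>
>>     #include <unistd.h>
>>
>>     int main(int argc, char **argv)
>>     {
>>         if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
>>
>>         int fd = open(argv[1], O_RDONLY);
>>         if (fd < 0) { perror("open"); return 1; }
>>
>>         struct stat st;
>>         if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
>>
>>         unsigned char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
>>         if (map == MAP_FAILED) { perror("mmap"); return 1; }
>>
>>         size_t npages = (size_t)st.st_size / 4096;
>>         if (npages == 0) { fprintf(stderr, "file too small\n"); return 1; }
>>
>>         const long reads = 10000;        /* arbitrary sample count */
>>         volatile unsigned char sink = 0; /* keep reads from being optimized away */
>>         srand(1);
>>
>>         struct timespec t0, t1;
>>         clock_gettime(CLOCK_MONOTONIC, &t0);
>>         for (long i = 0; i < reads; i++) {
>>             size_t page = (size_t)rand() % npages;
>>             /* Touching one byte faults in that 4K page, which is the part
>>              * that matters here; fio's mmap engine copies the whole 4K
>>              * block, so the numbers are only roughly comparable. */
>>             sink ^= map[page * 4096];
>>         }
>>         clock_gettime(CLOCK_MONOTONIC, &t1);
>>
>>         double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
>>         printf("%ld random 4K page-ins in %.2fs (~%.2f MB/s)\n",
>>                reads, secs, reads * 4096.0 / (1024 * 1024) / secs);
>>
>>         munmap(map, st.st_size);
>>         close(fd);
>>         return 0;
>>     }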
>>
>> What do you think?
>> Has anyone seen anything like this? Perhaps there are some tunables
>> that would let us spend a bit more memory on cached blocks rather than
>> recycling the decompression work?
>>
>> I've searched back through the archives a bit. There's a May 2013
>> thread about slowness as well; I think we're seeing much, much less
>> than that, and our page pools are a decent size. It's not just
>> slowness, it's as if the app never gets a block back at all. (We could
>> handle slowness.)
>>
>> Thanks. Open to ideas..
>>
>> -Zach Giles
>>
>>
>>
>
>
>
> --
> Zach Giles
> zgiles at gmail.com
>
>
>
>
>
>



-- 
Zach Giles
zgiles at gmail.com


