[gpfsug-discuss] Questions about mmap GPFS and compression

Zachary Giles zgiles at gmail.com
Tue Feb 14 16:10:13 GMT 2017

Hi Leo,

I agree with your view on compression and what it should be used for,
in general. The read bandwidth amplification is definitely something
we're seeing.

Just a little more background on the files:
The files themselves are not "cold" (archive data); however, they are
very lightly used. The data set is thousands of files that are each
100-200GB, totaling about a PB. The read pattern is a few GB from
about 20% of the files once a month, so the total read is only several
TB out of a PB every month (approximately). We can get compression of
about 5:1 using GPFS on these files, so we can gain back 800TB with
compression. The total run time of the app (reading all those chunks,
when uncompressed) is maybe an hour.

Although leaving the files uncompressed would let the app work,
there's a huge gain to be had if we can make compression work, saving
~800TB. As it's such a small amount of data read each time, and also
not very predictable (it's semi-random historical access), and as the
length of the job is short enough, it's hard to justify decompressing
large chunks of the system to run one job. I would have to decompress
200TB to read 10TB, recompress it, and then decompress a different
(overlapping) 200TB next month. Compressing or decompressing sizable
portions of the data takes days.
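The arithmetic above can be sanity-checked in a few lines (a rough
sketch; the 1 PB total, 5:1 ratio, and 20%-of-files figures are the
approximations from this thread, not measured values):

```python
# Back-of-envelope numbers from the thread (all approximate).
total_tb = 1000                  # ~1 PB across thousands of 100-200GB files
ratio = 5                        # ~5:1 compression with GPFS
compressed_tb = total_tb / ratio
saved_tb = total_tb - compressed_tb
print(f"compressed footprint: {compressed_tb:.0f} TB, saved: {saved_tb:.0f} TB")

# Cost of the decompress-before-read workaround:
read_tb = 10                     # several TB actually read per month
touched_tb = 0.20 * total_tb     # ~20% of files would need decompressing
amplification = touched_tb / read_tb
print(f"decompress {touched_tb:.0f} TB to read {read_tb} TB "
      f"({amplification:.0f}x amplification)")
```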

I think there may be more of an issue here than just performance,
though. The decompression thread is running, internal file metadata is
read fine, and most of the file is read fine. Just at times it gets
stuck: the decompression thread is running in GPFS, the app is
polling, and the block just never comes back. I feel like there's a
race condition here where a block is read and made available to the
app, but thrown away before the app can read it, only to be
decompressed again.
It's strange how some block positions are slow (expected) and others
just never come back (it will poll for days on a certain address).
However, reading the file in order is fine.
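A toy simulation of the thrashing I suspect (purely illustrative: the
10-block decompression-group size and the tiny one-group cache are my
assumptions for the model, not how GPFS is actually implemented):

```python
import random

BLOCK = 4 * 1024           # app reads 4K pages via mmap
GROUP = 10 * 1024 * 1024   # assume a ~10-block decompression group
FILE_SIZE = 1 * 1024**3    # 1 GiB toy file

def decompressions(offsets, cache_groups=1):
    """Count group decompressions for a sequence of 4K reads,
    with an LRU cache holding `cache_groups` decompressed groups."""
    cache, count = [], 0
    for off in offsets:
        g = off // GROUP
        if g in cache:
            cache.remove(g)            # refresh LRU position
        else:
            count += 1                 # group must be re-decompressed
            if len(cache) >= cache_groups:
                cache.pop(0)           # evict least recently used group
        cache.append(g)
    return count

random.seed(0)
n = FILE_SIZE // BLOCK
seq = range(0, FILE_SIZE, BLOCK)                     # in-order reads
rnd = [random.randrange(n) * BLOCK for _ in range(n)]  # random reads

# Sequential reads decompress each group once; random reads keep
# re-decompressing groups that were just thrown away.
print("sequential:", decompressions(seq))
print("random:    ", decompressions(rnd))
```

In this model the sequential pass decompresses each group exactly
once, while the random pass re-decompresses groups thousands of times,
which matches the "decompression thread running continuously with no
progress" symptom.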

Is this a block caching issue? Can we tune up the number of blocks kept?
I think that with mmap the blocks are not kept in the page pool, correct?
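For anyone who wants to reproduce this without our app or fio, a
minimal mmap random-read loop looks like the following (a sketch; the
GPFS file path in the usage comment is a placeholder -- run it against
an uncompressed and then a compressed copy of the same file and
compare the rates):

```python
import mmap
import os
import random
import time

def mmap_random_read(path, block=4096, reads=1000):
    """Time `reads` random 4K reads through mmap, similar to a
    single-threaded, iodepth-1 fio randread job."""
    size = os.path.getsize(path)
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        t0 = time.monotonic()
        total = 0
        for _ in range(reads):
            off = random.randrange(0, max(1, size - block))
            total += len(m[off:off + block])   # fault the page(s) in
        dt = time.monotonic() - t0
    return total, dt

# Usage (placeholder path -- substitute a real GPFS file):
#   total, dt = mmap_random_read("/gpfs/somefs/somefile")
#   print(f"{total / dt / 1e6:.1f} MB/s")
```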


On Sat, Feb 11, 2017 at 5:23 PM, Leo Luan <leoluan at us.ibm.com> wrote:
> Hi Zachary,
> When a compressed file is mmapped, each 4K read in your tests causes the
> accessed part of the file to be decompressed (at the granularity of 10 GPFS
> blocks).  For usual file sizes, the parts being accessed get
> decompressed and IO speed is normal except for the first 4K IO in each
> 10-GPFS-block group.  For very large files, a large percentage of small
> random IOs may keep getting amplified to 10-block decompression IOs for a
> long time.  This is probably what happened in your mmap application run.
> The suggestion is to not compress files until they have become cold (not
> likely to be accessed any time soon) and to avoid compressing very large files
> that may be accessed through mmap later.  The product already has built-in
> protection preventing compression of files that are mmapped at compression
> time.  You can add an exclude rule to the compression policy run for files
> that are identified to have mmap performance issues (in case they get
> mmapped after being compressed in a periodic policy run).
> Leo Luan
> From: Zachary Giles <zgiles at gmail.com>
> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Date: 02/10/2017 01:57 PM
> Subject: [gpfsug-discuss] Questions about mmap GPFS and compression
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
> ________________________________
> Hello All,
> I've been seeing some less-than-desirable behavior with mmap and
> compression in GPFS. I'm curious whether others see something similar or
> have any ideas about whether this analysis is accurate.
> The guys here want me to open an IBM ticket, but I figured I'd see if
> anyone has had this experience before.
> We have an internally developed app that runs on our cluster
> referencing data sitting in GPFS. It is using mmap to access the files
> due to a library we're using that requires it.
> If we run the app against some data on GPFS, it performs well,
> finishing in a few minutes -- great. However, if we compress the
> file (in GPFS), the app is still running after two days.
> stracing the app shows that it is polling on a file descriptor forever,
> as if a data block is still pending.
> I know mmap is supported with compression according to the manual
> (with some stipulations), and that performance is expected to be much
> lower since access is more large-block oriented, with data decompressed
> in groups -- no problem. But it seems like some data should get returned.
> I'm surprised to find that a very small amount of data is sitting in
> the buffers (mmfsadm dump buffers) in reference to the inodes. The
> decompression thread is running continuously, while the app is still
> polling for data from memory and sleeping, retrying, sleeping, repeat.
> What I believe is happening is that the 4K pages are being pulled out
> of large decompression groups by an mmap read request and put in the
> buffer, then the decompression-group data is thrown away since GPFS has
> the result it wants, only for the app to need another piece of data that
> would have been in that group slightly later, which is decompressed
> again, put in the buffer, etc. Thus an effectively infinite slowdown.
> Perhaps the data is also expiring out of the buffer before the app has a
> chance to read it; I can't tell. In any case, the app makes zero progress.
> I tried without our app, using fio: mmap on an uncompressed file with
> 1 thread, iodepth 1, random reads, 4K blocks yields ~76MB/s (not
> impressive). However, on a compressed file it is only 20KB/s max
> (far less impressive). Reading a file using AIO etc. is over 3GB/s on a
> single thread without even trying.
> What do you think?
> Anyone see anything like this? Perhaps there are some tunings to waste
> a bit more memory on cached blocks rather than make decompression
> recycle?
> I've searched back through the archives a bit. There's a May 2013 thread
> about slowness as well; I think we're seeing something much worse than
> that. Our page pools are of decent size. It's not just slowness; it's as
> if the app never gets a block back at all. (We could handle slowness.)
> Thanks. Open to ideas..
> -Zach Giles
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

Zach Giles
zgiles at gmail.com
