[gpfsug-discuss] Fwd: Blocksize

Yuri L Volobuev volobuev at us.ibm.com
Thu Sep 29 19:00:40 BST 2016


> to the question.  If I were to hypothetically use a 256K metadata
> block size, does the “1/32nd of a block” come into play like it does
> for “not metadata”?  I.e. 256 / 32 = 8K, so am I reading / writing
> *2* inodes (assuming 4K inode size) minimum?

I think the point of confusion here is minimum allocation size vs minimum
IO size -- those are not one and the same.  In fact, in GPFS the two are
largely unrelated.  For low-level metadata files where multiple records are
packed into the same block, it is possible to read/write either an
individual record (such as an inode) or an entire block of records (which
is what happens, for example, during inode copy-on-write).  The minimum IO
size in GPFS is 512 bytes; on a "4K-aligned" file system, GPFS vows to do
IOs only in multiples of 4KiB.  For data, GPFS tracks which portion of a
given block is valid/dirty using an in-memory bitmap, so if 4KiB in the
middle of a 16MiB block is modified, only that 4KiB gets written, not
16MiB (this is more complicated for sparse file writes and appends, where
some areas need to be zeroed out).  For metadata writes, entire metadata
objects are written, using the actual object size rounded up to the
nearest 512B or 4KiB boundary, as needed.
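
For reference, here is a quick way to check the relevant attributes on an
existing file system (the file system name "gpfs0" is just a placeholder):

  # show inode size and block size for file system gpfs0
  mmlsfs gpfs0 -i -B

Running mmlsfs gpfs0 with no attribute flags lists everything, including a
flag showing whether the file system is 4K-aligned.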

So a single modified inode results in a single inode write, regardless of
the metadata block size.  If you have snapshots, the inode being modified
needs to be copied to the latest snapshot, and if it happens to be the
first inode in its block to need a COW, the entire block of inodes is
copied to that snapshot, as an optimization.
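
To put rough numbers on it (assuming the 4KiB inodes and the hypothetical
256KiB metadata block size from the question above): 256KiB / 4KiB = 64
inodes per block, so a COW of the first inode in a block can copy up to 64
inodes in one operation, while an ordinary inode update still writes just
that one 4KiB inode.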

> And here’s a really off the wall question … yesterday we were
> discussing the fact that there is now a single inode file.
> Historically, we have always used RAID 1 mirrors (first with
> spinning disk, as of last fall now on SSD) for metadata and then use
> GPFS replication on top of that.  But given that there is a single
> inode file is that “old way” of doing things still the right way?
> In other words, could we potentially be better off by using a couple
> of 8+2P RAID 6 LUNs?

The old way is also the modern way in this case.  Using RAID1 LUNs for GPFS
metadata is still the right approach.  You don't want to use RAID erasure
codes that trigger read-modify-write for small IOs, which are typical for
metadata (unless your RAID array has so much cache as to make RMW a moot
point).
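
For illustration, this is roughly what the NSD stanzas for a pair of RAID1
metadata LUNs could look like (device names, NSD names, and server names
below are placeholders); putting the two arrays in different failure groups
lets GPFS replication keep the metadata copies on separate hardware, and
reversing the server order on the second NSD spreads the primary-server
load:

  %nsd: device=/dev/mapper/meta_lun_a
    nsd=meta_nsd_a
    servers=nsdserver1,nsdserver2
    usage=metadataOnly
    failureGroup=1
    pool=system

  %nsd: device=/dev/mapper/meta_lun_b
    nsd=meta_nsd_b
    servers=nsdserver2,nsdserver1
    usage=metadataOnly
    failureGroup=2
    pool=system

The stanza file is passed to mmcrnsd -F, and -m 2 / -M 2 at mmcrfs time
enables the metadata replication on top of the RAID1 mirrors.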

> One potential downside of that would be that we would then only have
> two NSD servers serving up metadata, so we discussed the idea of
> taking each RAID 6 LUN and splitting it up into multiple logical
> volumes (all that done on the storage array, of course) and then
> presenting those to GPFS as NSDs???

Like most performance questions, this one can ultimately only be answered
definitively by running tests, but offhand I would suspect that the
performance impact of RAID6, combined with extra contention for physical
disks, is going to more than offset the benefits of using more NSD servers.
Keep in mind that you aren't limited to 2 NSD servers per LUN.  If you
actually have the connectivity for more than 2 nodes on your RAID
controller, GPFS allows up to 8 simultaneously active NSD servers per NSD.
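
If you do go down that path, the server list for an existing NSD can be
extended with mmchnsd, roughly like this (NSD and node names are
placeholders; check the mmchnsd documentation for any restrictions that
apply on your release):

  # contents of a stanza file, e.g. more_servers.stanza
  %nsd: nsd=meta_nsd_a
    servers=nsds01,nsds02,nsds03,nsds04

  mmchnsd -F more_servers.stanza

By default, NSD clients try the servers in the order listed and fail over
down the list as needed.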

yuri

> On Sep 28, 2016, at 10:23 AM, Marc A Kaplan <makaplan at us.ibm.com> wrote:
>
> OKAY, I'll say it again.  inodes are PACKED into a single inode
> file.  So a 4KB inode takes 4KB, REGARDLESS of metadata blocksize.
> There is no wasted space.
>
> (Of course, if you have metadata replication = 2, then yes, double
> that.  And yes, there is overhead for indirect blocks (indices),
> allocation maps, etc.)
>
> And your choice is not just 512 or 4096.  Maybe 1KB or 2KB is a good
> choice for your data distribution, to optimize packing of data
> and/or directories into inodes...
>
> Hmmm... I don't know why the doc leaves out 2048, perhaps a typo...
>
> mmcrfs x2K -i 2048
>
> [root at n2 charts]# mmlsfs x2K -i
> flag                value                    description
> ------------------- ------------------------ -----------------------------------
>  -i                 2048                     Inode size in bytes
>
> Works for me!