[gpfsug-discuss] Fwd: Blocksize

Marc A Kaplan makaplan at us.ibm.com
Thu Sep 29 16:32:47 BST 2016


Frankly, I just don't "get" what it is you seem not to be "getting"  - 
perhaps someone else who does "get" it can rephrase it:  FORGET about 
subblocks when thinking about inodes being packed into the file of all 
inodes. 

Additional facts that may address some of the other concerns:

I started working on GPFS at version 3.1 or so.  AFAIK GPFS has always had, 
and still has, one file of inodes, "packed", with no wasted space between 
inodes.  Period.  Full stop.
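
(A minimal arithmetic sketch of what "packed" means -- the inode number and 
the 4 KiB inode size below are assumed purely for illustration:)

# with 4 KiB inodes and no padding, inode N starts at byte N * 4096
# of the inode file, regardless of the metadata blocksize
N=1000000
echo $(( N * 4096 ))     # -> 4096000000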

RAID!  Now we come to a mistake that I've seen made by more than a handful 
of customers!

It is generally a mistake to use RAID with parity (such as classic RAID5) 
to store metadata.

Why?  Because metadata is often updated with "small writes" - for example, 
suppose we have to update some fields in an inode, or an indirect block, or 
append a log record...
With RAID-with-parity and large stripe sizes, this means that updating just 
one disk sector can cost a full stripe read plus writing the changed data 
and parity sectors.
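
(A rough back-of-the-envelope sketch of that amplification, assuming a 
hypothetical 8+1P RAID5 array with 256 KiB chunks:)

# KiB that may have to be read to recompute parity for one full stripe
echo $(( 8 * 256 ))        # -> 2048 KiB read
# versus the ~4 KiB of metadata actually being changed
echo $(( 8 * 256 / 4 ))    # -> roughly 512x read amplification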

SO, if you want protection against storage failures for your metadata, use 
RAID mirroring/replication and/or GPFS metadata replication.  (Belt and/or 
suspenders.)
(Arguments against relying solely on RAID mirroring:  single enclosure/box 
failure (fire!), single hardware design (bugs or defects), single 
firmware/microcode (bugs).)
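
(A minimal sketch of the GPFS-replication half of that advice -- the 
filesystem name, stanza file, and blocksize are made up, but -m/-M are the 
mmcrfs options for default/maximum metadata replicas:)

mmcrfs gpfs1 -F nsd.stanza -B 1M -m 2 -M 2 -r 1 -R 2
# -m 2 / -M 2 : keep two copies of all metadata via GPFS replication
# -r 1 / -R 2 : one copy of data for now, with room to raise it later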

Yes, GPFS is part of "the cyber."  We're making it stronger every day.  But 
it already is great. 

--marc



From:   "Buterbaugh, Kevin L" <Kevin.Buterbaugh at Vanderbilt.Edu>
To:     gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:   09/29/2016 11:03 AM
Subject:        [gpfsug-discuss] Fwd:  Blocksize
Sent by:        gpfsug-discuss-bounces at spectrumscale.org



Resending from the right e-mail address...

Begin forwarded message:

From: gpfsug-discuss-owner at spectrumscale.org
Subject: Re: [gpfsug-discuss] Blocksize
Date: September 29, 2016 at 10:00:36 AM CDT
To: klb at accre.vanderbilt.edu

You are not allowed to post to this mailing list, and your message has
been automatically rejected.  If you think that your messages are
being rejected in error, contact the mailing list owner at
gpfsug-discuss-owner at spectrumscale.org.


From: "Kevin L. Buterbaugh" <klb at accre.vanderbilt.edu>
Subject: Re: [gpfsug-discuss] Blocksize
Date: September 29, 2016 at 10:00:29 AM CDT
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>


Hi Marc and others, 

I understand … I guess I did a poor job of wording my question, so I’ll 
try again.  The IBM recommendation for metadata block size seems to be 
somewhere between 256 KB and 1 MB, depending on who responds to the question. 
If I were to hypothetically use a 256 KB metadata block size, does the 
“1/32nd of a block” subblock come into play like it does for “not metadata”? 
I.e., 256 KB / 32 = 8 KB, so am I reading / writing a minimum of *2* inodes 
(assuming a 4 KB inode size)?
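
(The arithmetic behind that, assuming a 256 KB blocksize and 4 KB inodes:)

echo $(( 256 * 1024 / 32 ))    # subblock size in bytes -> 8192
echo $(( 8192 / 4096 ))        # 4 KB inodes per subblock -> 2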

And here’s a really off the wall question … yesterday we were discussing 
the fact that there is now a single inode file.  Historically, we have 
always used RAID 1 mirrors (first with spinning disk, as of last fall now 
on SSD) for metadata, and then used GPFS replication on top of that.  But 
given that there is a single inode file, is that “old way” of doing things 
still the right way?  In other words, could we potentially be better off 
by using a couple of 8+2P RAID 6 LUNs?

One potential downside of that would be that we would then only have two 
NSD servers serving up metadata, so we discussed the idea of taking each 
RAID 6 LUN and splitting it up into multiple logical volumes (all that 
done on the storage array, of course) and then presenting those to GPFS as 
NSDs???
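
(Something like the following hypothetical NSD stanza is what we have in 
mind -- the device, server, and NSD names are made up -- one metadata-only 
NSD per logical volume carved out of a RAID 6 LUN:)

%nsd: device=/dev/mapper/raid6lun1_lv1
  nsd=meta_nsd01
  servers=nsdserver1,nsdserver2
  usage=metadataOnly
  failureGroup=1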

Or have I gone from merely asking stupid questions to Trump-level 
craziness????  ;-)

Kevin

On Sep 28, 2016, at 10:23 AM, Marc A Kaplan <makaplan at us.ibm.com> wrote:

OKAY, I'll say it again.  Inodes are PACKED into a single inode file.  So 
a 4 KB inode takes 4 KB, REGARDLESS of metadata blocksize.  There is no 
wasted space.

(Of course if you have metadata replication = 2, then yes, double that. 
And yes, there is overhead for indirect blocks (indices), allocation maps, 
etc., etc.)

And your choice is not just 512 or 4096.  Maybe 1KB or 2KB is a good 
choice for your data distribution, to optimize packing of data and/or 
directories into inodes...

Hmmm... I don't know why the doc leaves out 2048, perhaps a typo...

mmcrfs x2K -i 2048

[root at n2 charts]# mmlsfs x2K -i
flag                value                    description
------------------- ------------------------ -----------------------------------
 -i                 2048                     Inode size in bytes

Works for me!
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss







