[gpfsug-discuss] RAID type for system pool

Aaron Knister aaron.s.knister at nasa.gov
Thu Sep 6 18:06:17 BST 2018


Answers inline based on my recollection of experiences we've had here:

On 9/6/18 12:19 PM, Bryan Banister wrote:
> I have questions about how the GPFS metadata replication of 3 works.
> 
>  1. Is it basically the same as replication of 2 but just have one more
>     copy, making recovery much more likely?

That's my understanding.
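
(For completeness, here's roughly how you'd dial that in -- a sketch, assuming
the filesystem was created with a maximum metadata replication of 3 (-M 3),
since as far as I know the maximum can't be raised after the fact; the device
name is just the example one I use below:)

# bump the default metadata replication factor to 3
mmchfs mysuperawesomespacefs -m 3
# re-replicate existing metadata to match the new default
mmrestripefs mysuperawesomespacefs -R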

>  2. If there is nothing that is checking that the data was correctly
>     read off of the device (e.g. CRC checking ON READS like the DDNs do,
>     T10PI or Data Integrity Field) then how does GPFS handle a corrupted
>     read of the data?
>     - unlikely with SSD but head could be off on a NLSAS read, no
>     errors, but you get some garbage instead, plus no auto retries

The inode itself is checksummed:

# /usr/lpp/mmfs/bin/tsdbfs mysuperawesomespacefs
Enter command or null to read next sector.  Type ? for help.
inode 20087366
Inode 20087366 [20087366] snap 0 (index 582 in block 9808):
   Inode address: 30:263275078 32:263264838 size 512 nAddrs 32
   indirectionLevel=3 status=USERFILE
   objectVersion=49352 generation=0x2B519B3 nlink=1
   owner uid=8675309 gid=999 mode=0200100600: -rw-------
   blocksize code=5 (32 subblocks)
   lastBlockSubblocks=1
   checksum=0xF2EF3427 is Valid
...
   Disk pointers [32]:
     0:  31:217629376    1:  30:217632960    2: (null)         ...
    31: (null)

as are indirect blocks (I'm sure that's not an exhaustive list of 
checksummed metadata structures):

ind 31:217629376
Indirect block starting in sector 31:217629376:
   magic=0x112DF307 generation=0x2B519B3 blockNum=0 inodeNum=20087366
   indirection level=2
   checksum=0x6BDAA92A
   CalcChecksum(0x5B6DC9FC000, 32768, 20)=0x6BDAA92A
   Data pointers:

>  3. Does GPFS read at least two of the three replicas and compares them
>     to ensure the data is correct?
>     - expensive operation, so very unlikely

I don't know for certain, but I do know it verifies the checksum, and I 
believe that if the checksum doesn't match it will try another replica.
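
(If you want to poke at the replicas yourself, I *think* tsdbfs will dump the
raw copies from its prompt -- the two addresses below are the ones from the
"Inode address:" line in the dump above, and I'm going from memory on the
subcommand name, so check "?" at the prompt first:)

sector 30:263275078
sector 32:263264838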

>  4. If not reading multiple replicas for comparison, are reads round
>     robin across all three copies?

I feel like we see a pretty even distribution of reads across all 
replicas of our metadata LUNs, although that observation is at the 
overall array level, so it may be a red herring.
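
One way to sanity-check that at the GPFS level rather than from the array is
the recent I/O history on an NSD server (or client) -- the exact column layout
varies a bit by release, so eyeball it or awk it to taste:

# recent I/O history; shows read/write direction and which disk each I/O hit
mmdiag --iohist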

>  5. If one replica is corrupted (bad blocks) what does GPFS do to
>     recover this metadata copy?  Is this automatic or does this require
>     a manual `mmrestripefs -c` operation or something?
>     - If not, seems like a pretty simple idea and maybe an RFE worthy
>     submission

My experience has been that it will attempt to correct it (and maybe log 
an FSSTRUCT error?). That was back in the GPFS 3.5 days, though.
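
FSSTRUCT errors are raised through the error log, so something like this
(paths assume a typical Linux install) will tell you whether GPFS has tripped
over corrupted metadata:

# look for file system structure errors reported by the daemon
grep -i fsstruct /var/log/messages /var/adm/ras/mmfs.log.latest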

>  6. Would the idea of an option to run “background scrub/verifies” of
>     the data/metadata be worthwhile to ensure no hidden bad blocks?
>     - Using QoS this should be relatively painless

If you don't have array-level background scrubbing, this is what I'd 
suggest (e.g. mmrestripefs -c --metadata-only).
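
Roughly the shape of it (a sketch -- the IOPS cap is a made-up placeholder
you'd tune for your hardware):

# cap maintenance-class I/O so the scrub doesn't trample user traffic
mmchqos mysuperawesomespacefs --enable pool=system,maintenance=500IOPS,other=unlimited
# verify/repair metadata replicas; with QoS enabled this runs in the maintenance class
mmrestripefs mysuperawesomespacefs -c --metadata-only
# watch what QoS is actually allowing while it runs
mmlsqos mysuperawesomespacefs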

>  7. With a drive failure do you have to delete the NSD from the file
>     system and cluster, recreate the NSD, add it back to the FS, then
>     again run the `mmrestripefs -c` operation to restore the replication?
>     - As Kevin mentions this will end up being a FULL file system scan
>     vs. a block-based scan and replication.  That could take a long time
>     depending on number of inodes and type of storage!
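
For what it's worth, the usual shape of that dance is roughly the following
(a sketch -- the NSD name and stanza file are made up):

# remove the dead disk from the filesystem and then from the cluster
mmdeldisk mysuperawesomespacefs meta_ssd_07
mmdelnsd meta_ssd_07
# once the drive is physically replaced, recreate the NSD and add it back
mmcrnsd -F /tmp/replacement_nsd.stanza
mmadddisk mysuperawesomespacefs -F /tmp/replacement_nsd.stanza
# restore full replication -- this is the file-system-wide scan mentioned above
mmrestripefs mysuperawesomespacefs -r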
> 
> Thanks for any insight,
> 
> -Bryan
> 
> From: gpfsug-discuss-bounces at spectrumscale.org 
> <gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Buterbaugh, 
> Kevin L
> Sent: Thursday, September 6, 2018 9:59 AM
> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Subject: Re: [gpfsug-discuss] RAID type for system pool
> 
> Hi All,
> 
> Wow - my query got more responses than I expected and my sincere thanks 
> to all who took the time to respond!
> 
> At this point in time we do have two GPFS filesystems … one which is 
> basically “/home” and some software installations and the other which is 
> “/scratch” and “/data” (former backed up, latter not).  Both of them 
> have their metadata on SSDs set up as RAID 1 mirrors and replication set 
> to two.  But at this point in time all of the SSDs are in a single 
> storage array (albeit with dual redundant controllers) … so the storage 
> array itself is my only SPOF.
> 
> As part of the hardware purchase we are in the process of making we will 
> be buying a 2nd storage array that can house 2.5” SSDs.  Therefore, we 
> will be splitting our SSDs between chassis and eliminating that last 
> SPOF.  Of course, this includes the new SSDs we are getting for our new 
> /home filesystem.
> 
> Our plan right now is to buy 10 SSDs, which will allow us to test 3 
> configurations:
> 
> 1) two 4+1P RAID 5 LUNs split up into a total of 8 LVs (with each of my 
> 8 NSD servers as primary for one of those LVs and the other 7 as 
> backups) and GPFS metadata replication set to 2.
> 
> 2) four RAID 1 mirrors (which obviously leaves 2 SSDs unused) and GPFS 
> metadata replication set to 2.  This would mean that only 4 of my 8 NSD 
> servers would be primaries.
> 
> 3) nine RAID 0 / bare drives with GPFS metadata replication set to 3 
> (which leaves 1 SSD unused).  Each of my 8 NSD servers would be primary 
> for one SSD, with one of them serving two.
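
[Illustrative aside: an NSD stanza for option 3 might look something like the
following -- the NSD name, device path, and server names are placeholders. The
first server in the list is the preferred one, so you'd rotate the list per
NSD to spread primaries across the eight servers; and since each replica has
to land in a different failure group, metadata replication of 3 needs at least
three distinct failure groups spread across the two chassis.]

%nsd:
  nsd=meta_ssd_01
  device=/dev/mapper/ssd01
  servers=nsd01,nsd02,nsd03,nsd04,nsd05,nsd06,nsd07,nsd08
  usage=metadataOnly
  failureGroup=1
  pool=system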
> 
> The responses I received concerning RAID 5 and performance were not a 
> surprise to me.  The main advantage that option gives is the most usable 
> storage space for the money (in fact, it gives us way more storage space 
> than we currently need) … but if it tanks performance, then that’s a 
> deal breaker.
> 
> Personally, I like the four RAID 1 mirrors config like we’ve been using 
> for years, but it has the disadvantage of giving us the least usable 
> storage space … that config would give us the minimum we need for right 
> now, but doesn’t really allow for much future growth.
> 
> I have no experience with metadata replication of 3 (but had actually 
> thought of that option, so feel good that others suggested it), so option 
> 3 will be a brand new experience for us.  It is the best fit in terms of 
> meeting current needs plus allowing for future growth without giving us 
> way more space than we are likely to need.  I will be curious to see how 
> long it takes GPFS to re-replicate the data when we simulate a drive 
> failure as opposed to how long a RAID rebuild takes.
> 
> I am a big believer in Murphy’s Law (Sunday I paid off a bill, Wednesday 
> my refrigerator died!) … and also believe that the definition of a 
> pessimist is “someone with experience” <grin> … so we will definitely 
> not set GPFS metadata replication to less than two, nor will we use 
> non-Enterprise class SSDs for metadata … but I do still appreciate the 
> suggestions.
> 
> If there is interest, I will report back on our findings.  If anyone has 
> any additional thoughts or suggestions, I’d also appreciate hearing 
> them.  Again, thank you!
> 
> Kevin
> 
> Kevin Buterbaugh - Senior System Administrator
> 
> Vanderbilt University - Advanced Computing Center for Research and Education
> 
> Kevin.Buterbaugh at vanderbilt.edu 
> <mailto:Kevin.Buterbaugh at vanderbilt.edu> - (615)875-9633
> 
> 

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776


