[gpfsug-discuss] Migration to separate metadata and data disks

Yuri L Volobuev volobuev at us.ibm.com
Wed Sep 7 17:38:03 BST 2016


Hi Miroslav,

The mmdf output is very helpful.  It suggests very strongly what the
problem is:

> ssd_5_5                   80G        3 Yes      No           22.31G ( 28%)        7.155G ( 9%)
> ssd_4_4                   80G        3 Yes      No           22.21G ( 28%)        7.196G ( 9%)
> ssd_3_3                   80G        3 Yes      No            22.2G ( 28%)        7.239G ( 9%)
> ssd_2_2                   80G        3 Yes      No           22.24G ( 28%)        7.146G ( 9%)
> ssd_1_1                   80G        3 Yes      No           22.29G ( 28%)        7.134G ( 9%)
>...
> ==================== ===================
> (data)                 1.535P                                698.9T ( 44%)         4.17T ( 0%)
> (metadata)             262.3T                                92.96T ( 35%)        3.621T ( 1%)
>...
> Number of allocated inodes:  161947648
> Maximum number of inodes:   1342177280

You have five 80G SSDs.  That's not enough.  Even with metadata spread
across a couple dozen more SATA disks, the SSDs are already about 3/4 full.
There's no way to accurately estimate the amount of metadata in this file
system with the data at hand, but if we (very conservatively) assume that
each SATA disk holds only as much metadata as each SSD, i.e. ~57G, the
total would greatly exceed the free space available on your SSDs.  You need
more free metadata space.  Another way to look at this: you have 1.5PB of
data under management.  A reasonable rule-of-thumb estimate for the amount
of metadata is 1-2% of the data.  This is a typical ratio, but of course
every file system is different, and large deviations are possible; a
degenerate case is an fs containing nothing but directories, where metadata
usage is 100%.  So you need at least a few TB of metadata storage.  Five
80G SSDs aren't enough for an fs of this size.
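
For anyone who wants to redo the arithmetic, here is a rough sketch in
Python.  The SSD and data figures are copied from the mmdf output above.
The SATA disk count of 24 is only a stand-in for "a couple dozen" (the real
number isn't in this thread), and treating G/T/P as powers of 1024 is also
an assumption.  Treat the result as a sanity check, not a sizing tool.

```python
# Figures copied from the mmdf output quoted in this message.
SSD_SIZE_G = 80.0
SSD_FREE_G = [22.31, 22.21, 22.2, 22.24, 22.29]
DATA_TOTAL_P = 1.535

# Hypothetical: "a couple dozen" SATA disks is taken to mean 24.
ASSUMED_SATA_DISKS = 24


def ssd_used_per_disk(size_g, free_g):
    """Average used space per SSD, in G."""
    return size_g - sum(free_g) / len(free_g)


def metadata_on_sata(used_per_ssd_g, sata_disks):
    """Metadata held on SATA, in G, if each SATA disk holds as much as one SSD."""
    return used_per_ssd_g * sata_disks


def rule_of_thumb_metadata_t(data_p, low=0.01, high=0.02):
    """(low, high) metadata estimate in T, taking 1-2% of the data figure."""
    data_t = data_p * 1024  # assumes 1P = 1024T
    return data_t * low, data_t * high


used = ssd_used_per_disk(SSD_SIZE_G, SSD_FREE_G)
free_total = sum(SSD_FREE_G)
to_move = metadata_on_sata(used, ASSUMED_SATA_DISKS)
low_t, high_t = rule_of_thumb_metadata_t(DATA_TOTAL_P)

print(f"used per SSD:         {used:.1f}G of {SSD_SIZE_G:.0f}G")
print(f"free on all SSDs:     {free_total:.1f}G")
print(f"metadata on SATA:     {to_move:.0f}G (assuming {ASSUMED_SATA_DISKS} disks)")
print(f"rule of thumb (1-2%): {low_t:.1f}T to {high_t:.1f}T")
print(f"fits on SSDs:         {to_move <= free_total}")
```

Both estimates land far above the free space on the five SSDs, whichever
SATA disk count you plug in for "a couple dozen".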

> Could you please evaluate more on the performance overhead with
> having metadata
> on SSD+SATA? Are the read operations automatically directed to
> faster disks by GPFS?
> Is each write operation waiting for write to be finished by SATA disks?

Mixing disks with sharply different performance characteristics within a
single storage pool is detrimental to performance.  GPFS stripes blocks
across all disks in a storage pool, expecting all of them to be equally
suitable.  If SSDs are mixed with SATA disks, the overall metadata write
performance is going to be bottlenecked by SATA drives.  On reads, given a
choice of two replicas, GPFS V4.1.1+ picks the replica residing on the
fastest disk, but given that SSDs hold only a small fraction of your
total metadata, this likely doesn't help much.  You're on the right track
in trying to shift all metadata to SSDs and away from SATA; overall file
system performance will improve as a result.
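
To make the write bottleneck concrete, here is a toy model in Python.  It
is not how GPFS schedules I/O.  It only assumes blocks are striped
round-robin across the disks of a pool, all disks work in parallel, and each
disk handles its own blocks one after another.  The per-block latencies
(0.2 ms for SSD, 10 ms for SATA) and the count of 24 SATA disks are made-up
illustrative numbers, not measurements from this file system.

```python
def striped_write_time(num_blocks, disk_latencies_ms):
    """Time in ms to write num_blocks striped round-robin across the disks.

    Toy model: disks run in parallel, each disk writes its share serially,
    so the batch finishes when the slowest disk finishes.
    """
    n = len(disk_latencies_ms)
    per_disk_ms = []
    for i, latency in enumerate(disk_latencies_ms):
        # Round-robin share: the first (num_blocks % n) disks get one extra.
        blocks = num_blocks // n + (1 if i < num_blocks % n else 0)
        per_disk_ms.append(blocks * latency)
    return max(per_disk_ms)


# Hypothetical per-block latencies, for illustration only.
SSD_MS = 0.2
SATA_MS = 10.0

mixed_pool = [SSD_MS] * 5 + [SATA_MS] * 24   # 5 SSDs + 24 SATA (assumed count)
ssd_only_pool = [SSD_MS] * 5

mixed_ms = striped_write_time(2900, mixed_pool)
ssd_ms = striped_write_time(2900, ssd_only_pool)
print(f"mixed pool: {mixed_ms:.0f} ms, SSD-only pool: {ssd_ms:.0f} ms")
```

In this model the mixed pool is paced entirely by its SATA members, even
though it has many more spindles to share the blocks.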

yuri

>
> Thank you,
> --
> Miroslav Bauer
> On 09/06/2016 09:06 PM, Yuri L Volobuev wrote:
> The correct way to accomplish what you're looking for (in
> particular, changing the fs-wide level of replication) is
> mmrestripefs -R. This command also takes care of moving data off
> disks now marked metadataOnly.
>
> The restripe job hits an error trying to move blocks of the inode
> file, i.e. before it gets to actual user data blocks. Note that at
> this point the metadata replication factor is still 2. This suggests
> one of two possibilities: (1) there isn't enough actual free space
> on the remaining metadataOnly disks, (2) there isn't enough space in
> some failure groups to allocate two replicas.
>
> All of this assumes you're operating within a single storage pool.
> If multiple storage pools are in play, there are other possibilities.
>
> 'mmdf' output would help in providing more specific advice.
> With the information at hand, I can only suggest trying to
> accomplish the task in two phases: (a) deallocate extra metadata
> replicas, by doing mmchfs -m 1 + mmrestripefs -R (b) move metadata
> off SATA disks. I do want to point out that metadata replication is
> a highly recommended insurance policy to have for your file system.
> As with other kinds of insurance, you may or may not need it, but if
> you do end up needing it, you'll be very glad you have it. The
> costs, in terms of extra metadata space and performance overhead,
> are very reasonable.
>
> yuri
>
>
>
> From: Miroslav Bauer <bauer at cesnet.cz>
> To: gpfsug-discuss at spectrumscale.org,
> Date: 09/01/2016 07:29 AM
> Subject: Re: [gpfsug-discuss] Migration to separate metadata and data disks
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>
>
>
> Yes, failure group id is exactly what I meant :). Unfortunately,
> mmrestripefs with -R behaves the same as with -r. I also believed that
> mmrestripefs -R is the correct tool for fixing the replication settings
> on inodes (according to manpages), but I will try possible solutions
> you and Marc suggested and let you know how it went.
>
> Thank you,
> --
> Miroslav Bauer
>
> On 09/01/2016 04:02 PM, Aaron Knister wrote:
> > Oh! I think you've already provided the info I was looking for :) I
> > thought that failGroup=3 meant there were 3 failure groups within the
> > SSDs. I suspect that's not at all what you meant and that actually is
> > the failure group of all of those disks. That I think explains what's
> > going on-- there's only one failure group's worth of metadata-capable
> > disks available and as such GPFS can't place the 2nd replica for
> > existing files.
> >
> > Here's what I would suggest:
> >
> > - Create at least 2 failure groups within the SSDs
> > - Put the default metadata replication factor back to 2
> > - Run mmrestripefs -R to shuffle files around and restore the metadata
> > replication factor of 2 to any files created while it was set to 1
> >
> > If you're not interested in replication for metadata then perhaps all
> > you need to do is the mmrestripefs -R. I think that should
> > un-replicate the file from the SATA disks leaving the copy on the SSDs.
> >
> > Hope that helps.
> >
> > -Aaron
> >
> > On 9/1/16 9:39 AM, Aaron Knister wrote:
> >> By the way, I suspect the no space on device errors are because GPFS
> >> believes for some reason that it is unable to maintain the metadata
> >> replication factor of 2 that's likely set on all previously created
> >> inodes.
> >>
> >> On 9/1/16 9:36 AM, Aaron Knister wrote:
> >>> I must admit, I'm curious as to the reason you're dropping the
> >>> replication factor from 2 down to 1. There are some serious advantages
> >>> we've seen to having multiple metadata replicas, as far as error
> >>> recovery is concerned.
> >>>
> >>> Could you paste an output of mmlsdisk for the filesystem?
> >>>
> >>> -Aaron
> >>>
> >>> On 9/1/16 9:30 AM, Miroslav Bauer wrote:
> >>>> Hello,
> >>>>
> >>>> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the
> >>>> filesystem metadata from state:
> >>>> -m = 2 (default metadata replicas)
> >>>> - SATA disks (dataAndMetadata, failGroup=1)
> >>>> - SSDs (metadataOnly, failGroup=3)
> >>>> to the desired state:
> >>>> -m = 1
> >>>> - SATA disks (dataOnly, failGroup=1)
> >>>> - SSDs (metadataOnly, failGroup=3)
> >>>>
> >>>> I have done the following steps in the following order:
> >>>> 1) change SATA disks to dataOnly (stanza file modifies the 'usage'
> >>>> attribute only):
> >>>> # mmchdisk fs1 change -F dataOnly_disks.stanza
> >>>> Attention: Disk parameters were changed.
> >>>>   Use the mmrestripefs command with the -r option to relocate data and metadata.
> >>>> Verifying file system configuration information ...
> >>>> mmchdisk: Propagating the cluster configuration data to all
> >>>>   affected nodes.  This is an asynchronous process.
> >>>>
> >>>> 2) change default metadata replicas number 2->1
> >>>> # mmchfs fs1 -m 1
> >>>>
> >>>> 3) run mmrestripefs as suggested by output of 1)
> >>>> # mmrestripefs fs1 -r
> >>>> Scanning file system metadata, phase 1 ...
> >>>> Error processing inodes.
> >>>> No space left on device
> >>>> mmrestripefs: Command failed.  Examine previous error messages to
> >>>> determine cause.
> >>>>
> >>>> It is, however, still possible to create new files on the filesystem.
> >>>> When I return one of the SATA disks as a dataAndMetadata disk, the
> >>>> mmrestripefs command stops complaining about No space left on device.
> >>>> Both df and mmdf say that there is enough space both for data (SATA)
> >>>> and metadata (SSDs).
> >>>> Does anyone have an idea why it is complaining?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> --
> >>>> Miroslav Bauer
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> gpfsug-discuss mailing list
> >>>> gpfsug-discuss at spectrumscale.org
> >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> >>>>
> >>>
> >>
> >
>
>