[gpfsug-discuss] Disabling individual Storage Pools by themselves?

Zachary Giles zgiles at gmail.com
Thu Jun 18 17:02:54 BST 2015


Sorry to hear about the problems you've had recently. It's frustrating
when that happens.

I didn't have exactly the same situation, but we had something
similar that may shed some light on a missing-disks situation:
We had a dataOnly storage pool backed by a few building blocks, each
consisting of several RAID controllers that were each direct-attached
to a few servers. We had several of these sets all in one pool, so if
a server failed it was fine, and if a single link failed it was fine.
Potentially we could have run copies=2 with multiple failure groups in
the single pool. Anything failing inside the RAID arrays themselves
was OK, but a whole RAID controller going down would take that
section of disks with it. The number of copies was set to 1 on this
pool.

One RAID controller went down, but the file system as a whole stayed
online. Our user experience was that some users got I/O errors of a
"file inaccessible" type (I don't remember the exact code). Other
users, especially those working mostly in other tiers, continued as
normal. Since this tier held mostly small files (much smaller than
the GPFS block size), most files lived entirely on one RAID
controller or another rather than really being striped, so even files
on other controllers in the same tier stayed accessible. Bottom line:
only the files that were missing gave errors; the others were fine.
Additionally, errors were reported for the missing files, which apps
could capture and act on, wait, or retry later -- not a D-state
process hanging forever or stale file handles.

I'm not saying this is the best way; we didn't intend for this to
happen. I suspect that stopping the disks first would give a similar
experience, but more safely -- something like the sketch below.
We asked the GPFS devs whether we needed to fsck afterwards, since the
tier went offline abruptly and we kept using the rest of the system
while it was gone; they said no, it should be fine and the missing
blocks would be taken care of. I assume that's true, but I have no
explicit proof, except that it's still working and nothing seems to
be missing.
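To Ed's original question, the closest thing I know of to "turning
down" a single pool is to stop all of that pool's disks in one shot.
A rough, untested sketch of what I mean -- the filesystem and pool
names are placeholders, and you'd want to double-check the mmlsdisk
column layout on your release before trusting the awk:

  FS=gpfs0
  POOL=data1

  # grab the NSDs that belong to the pool (storage pool is the last
  # column of mmlsdisk output on the releases I've used)
  DISKS=$(mmlsdisk $FS | awk -v p="$POOL" '$NF == p {print $1}' | paste -sd';' -)

  # stop I/O to just those disks; the rest of the filesystem stays up
  mmchdisk $FS stop -d "$DISKS"

  # ... hardware repair ...

  # bring them back and let GPFS catch the stopped disks up
  mmchdisk $FS start -d "$DISKS"

  # optional, for peace of mind: an online, report-only check
  # (mmfsck -o -n on the releases I've used)
  mmfsck $FS -o -n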

I guess some questions for the devs would be:
* Is it safe / advisable to do the above, either directly or by
stopping the disks and then downing the array?
* Given that there is some client-side write caching in GPFS, if a
file is being written and its intended final destination goes offline
mid-write, where does the block go?
  + If a whole pool goes offline, will it pick another pool or error out?
  + If it's a single disk in a pool, will it re-evaluate and
round-robin to the next disk, or just fail since it had already
decided where to write?

Hope this helps a little.



On Thu, Jun 18, 2015 at 10:08 AM, Wahl, Edward <ewahl at osc.edu> wrote:
>  We had a unique situation with one of our many storage arrays occur in the past couple of days and it brought up a question I've had before.   Is there a better way to disable a Storage Pool by itself rather than 'mmchdisk stop' the entire list of disks from that pool or mmfsctl and exclude things, etc?  Thoughts?
>
>  In our case our array lost all raid protection in a certain pool (8+2) due to a hardware failure, and started showing drive checkcondition errors on other drives in the array. Yikes!  This pool itself is only about 720T and is backed by tape, but who wants to restore that?  Even with SOBAR/HSM that would be a loooong week. ^_^  We made the decision to take the entire file system offline during the repair/rebuild, but I would like to have run all other pools save this one in a simpler manner than we took to get there.
>
>  I'm interested in people's experiences here for future planning and disaster recovery.  GPFS itself worked exactly as we had planned and expected but I think there is room to improve the file system if I could "turn down" an entire Storage Pool that did not have metadata for other pools on it, in a simpler manner.
>
> I may not be expressing myself here in the best manner possible.  Bit of sleep deprivation after the last couple of days. ;)
>
> Ed Wahl
> OSC
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss



-- 
Zach Giles
zgiles at gmail.com


