[gpfsug-discuss] SOBAR

Jonathan Buzzard jonathan at buzzard.me.uk
Fri Feb 8 09:40:27 GMT 2013


On Thu, 2013-02-07 at 13:56 +0000, Orlando Richards wrote: 
> On 07/02/13 13:51, Orlando Richards wrote:
> > On 07/02/13 13:40, Jonathan Buzzard wrote:
> >> On Thu, 2013-02-07 at 12:47 +0000, Orlando Richards wrote:
> >>
> >> [SNIP]
> >>
> >>> Nice - good to see this kind of thing coming from IBM - restore of huge
> >>> filesystems from traditional backup really doesn't make much sense
> >>> nowadays - it'd just take too long.
> >>
> >> Define too long?
> 
> I can tell you speak from (bitter?) experience :)

Done two large GPFS restores. The first was to migrate an HSM file system
to completely new hardware, a new TSM version and a new GPFS version. IBM
would not warrant an upgrade procedure, so we "restored" from tape onto
the new hardware and then ran rsync passes to get it "identical". The big
problem was that the TSM server hardware at the time (a p630) repeatedly
gave up the ghost about 5TB into the restore. I had to do it a user at a
time, which made it take *much* longer as I was repeatedly going over the
same tapes.
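
(Roughly what I mean by the rsync pass, with made-up paths for
illustration:

  rsync -aH --numeric-ids --delete /old_gpfs/home/ /new_gpfs/home/

-H and --numeric-ids matter for hard links and uid/gid fidelity. Note
that stock rsync will not carry over GPFS NFSv4 ACLs, so anything like
that needs a separate pass with mmgetacl/mmputacl.)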

The second was from bitter experience. Someone else, in a moment of
complete and utter stupidity, wiped the descriptors from some ~30 NSDs.
Two file systems were an instant and complete loss. Well, not strictly
true: it was several days before it manifested itself, when one of the
NSD servers was rebooted. A day that could have gone to the restore was
then wasted working out what the hell had happened to the file system.

Took about three weeks to get back completely. It could have been done a
lot faster if I had had more tape drives on day one (or made a better job
of getting more in as the restore progressed), had not messed about
prioritizing restores for particular individuals, and had not had
capacity issues on the TSM server to boot (it was scheduled for an
upgrade anyway, and a CPU failed mid-restore).

I think TSM 6.x would have been faster as well, as it has better DB
performance. The restore consisted of some 50 million files in about
30TB, and it was the number of files, not the data volume, that was the
killer for speed. It would be nice in a disaster scenario if TSM would
also use the tapes in the copy pools for restore, especially when they
are in a different library. Not sure if the automatic failover procedure
in 6.3 does that.
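
(You can force TSM's hand in that situation by marking the primary pool
volumes destroyed, after which restores fall back to the copy pool;
something like this from the admin command line, pool name illustrative:

  update volume * access=destroyed wherestgpool=TAPEPOOL

but that is a manual DR step rather than automatic failover.)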

For large file systems I would seriously consider using virtual mount
points in TSM and then collocating the resulting filespaces. I would also
look to match my virtual mount points to GPFS file sets.
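
As a sketch of what I mean, with fileset paths and pool name invented for
the example: on the client, carve the file system up with virtual mount
points in dsm.sys,

  VIRTUALMOUNTPOINT /gpfs/fileset1
  VIRTUALMOUNTPOINT /gpfs/fileset2

so each fileset shows up as its own filespace on the server, then
collocate by filespace on the tape pool:

  update stgpool TAPEPOOL collocate=filespace

A restore of any one fileset then only has to mount that fileset's tapes
rather than trawling through the whole lot.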

The basic problem is that most people don't have the spare hardware to
even try disaster recovery, and even then you are not going to be doing
it under the same pressure; hindsight is always 20/20.

> Oh - for us, this is rapidly approaching "anything more than a day, and 
> can you do it faster than that please". Not much appetite for the costs 
> of full replication though.
> 

Remember you can have any two of cheap, fast and reliable. If you want
it back in a day or less then that almost certainly requires a full
mirror and is going to be expensive.

Noting of course that if it ain't offline it ain't backed up. See above:
if some numpty can wipe the NSD descriptors on your file systems then
they can do it to your replicated file system at the same time.

JAB.

-- 
Jonathan A. Buzzard                 Email: jonathan (at) buzzard.me.uk
Fife, United Kingdom.
