[gpfsug-discuss] ILM and Backup Question

Simon Thompson (Research Computing - IT Services) S.J.Thompson at bham.ac.uk
Mon Oct 26 20:15:26 GMT 2015


Hi Kristy,

Yes thanks for picking this up.

So we (UoB) have 3 GPFS environments, each with different approaches.

1. OpenStack (GPFS as infrastructure) - we don't back this up at all.
Partly this is because we are still in pilot phase, and partly because we
also have ~7PB CEPH over 4 sites for this project, and the longer term aim
is for us to ensure data sets and important VM images are copied into the
CEPH store (and then replicated to at least 1 other site).

We have some challenges with this: how should we do it? We're thinking
about maybe going down the iRODS route for this: policy scan the FS, add
an xattr onto important data, and use that to get iRODS to send copies
into CEPH (somehow). So this would be a bit of a hybrid home-grown
solution. Anyone got suggestions about how to approach this? I know IBM
is now an iRODS consortium member, so is there any magic coming from IBM
to integrate GPFS and iRODS?
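To make the policy-scan idea concrete, here's a rough sketch of the sort of thing I have in mind (the xattr name, list name, and helper script are all made up for illustration, nothing that exists yet):

```
/* Sketch only: tag important files with an xattr, e.g.
     mmchattr --set-attr user.replicate=ceph /gpfs/os/images/vm01.qcow2
   then have mmapplypolicy hand the tagged files to an external
   script that pushes copies into the CEPH store. */
RULE EXTERNAL LIST 'ceph_copy' EXEC '/usr/local/bin/push-to-ceph.sh'
RULE 'find_tagged' LIST 'ceph_copy'
  WHERE XATTR('user.replicate') = 'ceph'
```

Something like `mmapplypolicy <device> -P ceph-copy.pol` would then drive the scan, with the iRODS integration hiding behind the external script.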


2. HPC. We differentiate on our HPC file-system between backed-up and non
backed-up space. Mostly it's non backed-up, where we encourage users to
keep scratch data sets. We provide a small(ish) home directory which is
backed up with TSM to tape, and we also back up applications and system
configs. We use a bunch of jobs to sync some configs (switch configs,
Icinga config and the like) into a local git repository which is stored in
the backed-up part of the FS, so those can be backed up sanely.
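For the backed-up/non-backed-up split, the include-exclude list we feed the TSM client looks conceptually like this (paths are illustrative, not our real layout):

```
* Hypothetical TSM include-exclude fragment: skip scratch entirely,
* back up home directories and the synced system/config area
exclude.dir /gpfs/hpc/scratch
include     /gpfs/hpc/home/.../*
include     /gpfs/hpc/system/.../*
```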


3. Research Data Storage. This is a large bulk data storage solution. So
far it's not actually that large (a few hundred TB), so we take the
traditional TSM back-to-tape approach (it's also sync replicated between
data centres). We're already starting to see some possible slowness on
this with data ingest, and we've only just launched the service; perhaps
the high ingest is simply a consequence of having just launched. We are
also experimenting with HSM to tape, but other than that we have no other
ILM policies - only two tiers of disk, SAS for metadata and NL-SAS for
bulk data. I'd like to see a flash tier in there for metadata, which
would free up the SAS drives, and then we might get more into ILM
policies. We have some more testing with snapshots to do, and have some
questions about recovery of HSM files if the FS is snapshotted. Anyone
any experience with this on 4.1-and-upwards versions of GPFS? Straight
TSM backup for us means we can end up with six copies of data: one per
data centre, backup plus offsite backup tape set, and HSM pool plus an
offsite copy of the HSM pool. (If an HSM tape fails, how do we know what
to restore from backup? Hence we make copies of the HSM tapes as well.)
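For reference, the two-tier setup is about as simple as ILM gets: metadata sits in the system pool on SAS, so the only rule needed is data placement. If a flash metadata tier freed up the SAS drives, the sort of rules we'd be looking at are along these lines (pool names and thresholds are illustrative):

```
/* Today: all file data lands on the bulk NL-SAS pool by default */
RULE 'default_placement' SET POOL 'nlsas'

/* If the SAS drives became a fast data pool, spill to NL-SAS as it
   fills: start migrating at 85% occupancy, stop again at 70% */
RULE 'spill_fast' MIGRATE FROM POOL 'sas'
  THRESHOLD(85,70) TO POOL 'nlsas'
```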


As our backups run on TSM, we use the policy engine and mmbackup, so we
only back up changed and new files, and never back up the same file
twice from the FS.
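(For anyone not doing this yet, the nightly job is essentially a single mmbackup run per file-system; the server and node names here are placeholders for site-specific values.)

```
# Policy-engine-driven incremental: only files changed since the
# last run are sent to the TSM server.
mmbackup /gpfs/rds -t incremental --tsm-servers TSMSERVER1 -N backupnodes
```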

Does anyone know how TSM backups handle xattrs? This is one of the
questions that was raised at meet the devs. Or even other attributes like
immutability: unless you are in compliant mode, it's possible for
immutable files to be deleted in some cases. In fact this is an
interesting topic; it just occurred to me, what happens if your HSM tape
fails and it contained immutable files? Would it be possible to recover
these files if you don't have a copy of the HSM tape? Can you do a
synthetic recreate of the TSM HSM tape from backups?


We typically tell users that backups are for DR purposes, but that we'll
make efforts to try and restore files subject to resource availability.

Is anyone using SOBAR? What is your rationale for this? I can see that at
scale there are a lot of benefits to it. But how do you handle users
corrupting or deleting files, etc.? My understanding of SOBAR is that it
doesn't give you the same ability to recover versions of files,
deletions, etc. that straight TSM backup does. (This is something I've
been meaning to raise here for a while.)


So what do others do? Do you have similar approaches of not backing up
some types of data/areas? Do you use TSM, home-grown solutions, or even
other commercial backup products? What are your rationales for deciding
on backup approaches? Has anyone built their own DMAPI-type interface for
doing these sorts of things? Snapshots only? Do you allow users to
restore files themselves? If you are using ILM, are you doing it with
straight policy, or is TSM part of the game?

(If people want to comment anonymously on this without committing their
company on list, happy to take email to the chair@ address and forward on
anonymously to the group).

Simon

On 26/10/2015, 02:38, "gpfsug-discuss-bounces at spectrumscale.org on behalf
of Kallback-Rose, Kristy A" <gpfsug-discuss-bounces at spectrumscale.org on
behalf of kallbac at iu.edu> wrote:

>Simon wrote recently in the GPFS UG Blog: "We also got into discussion on
>backup and ILM, and I think its amazing how everyone does these things in
>their own slightly different way. I think this might be an interesting
>area for discussion over on the group mailing list. There's a lot of
>options and different ways to do things!"
>
>Yes, please! I'm *very* interested in what others are doing.
>
>We (IU) are currently doing a POC with GHI for DR backups (GHI = GPFS
>HPSS Integration; we have had HPSS for a very long time), but I'm
>interested in what others are doing with either ILM or other methods to
>brew their own backup solutions, how much they are backing up and with
>what regularity, what resources it takes, etc.
>
>If you have anything going on at your site that's relevant, can you
>please share?
>
>Thanks,
>Kristy
>
>Kristy Kallback-Rose
>Manager, Research Storage
>Indiana University



