[gpfsug-discuss] GPFS(snapshot, backup) vs. GPFS(backup scripts) vs. TSM(backup)

Yuri L Volobuev volobuev at us.ibm.com
Wed Mar 16 20:03:09 GMT 2016


> Under both 3.2 and 3.3 mmbackup would always lock up our cluster when
> using snapshot. I never understood the behavior without snapshot, and
> the lock up was intermittent in the carved-out small test cluster, so
> I never felt confident enough to deploy over the larger 4000+ clients
> cluster.

Back then, the GPFS code had a deficiency: migrating very large files didn't
work well with snapshots (and with some other mm commands).  In order to
create a snapshot, we have to have the file system in a consistent state
for a moment, and we get there by performing a "quiesce" operation.  This
is done by flushing all dirty buffers to disk, stopping any new incoming
file system operations at the gates, and waiting for all in-flight
operations to finish.  This works well when all in-flight operations
actually finish reasonably quickly.
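
To make the quiesce pattern concrete, here is a minimal, self-contained C
sketch of the gate-and-drain idea described above.  This is not GPFS
source; all names are made up for illustration.  Compile with -pthread.

    /* Gate-and-drain sketch: new operations wait at a gate while a
     * quiescer waits for the in-flight count to drain to zero. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static int gate_closed = 0;  /* set while a snapshot is being taken */
    static int in_flight   = 0;  /* operations currently executing */

    /* Called at the start of every file system operation. */
    static void op_enter(void)
    {
        pthread_mutex_lock(&lock);
        while (gate_closed)             /* new ops stopped at the gate */
            pthread_cond_wait(&cond, &lock);
        in_flight++;
        pthread_mutex_unlock(&lock);
    }

    /* Called when the operation finishes. */
    static void op_exit(void)
    {
        pthread_mutex_lock(&lock);
        if (--in_flight == 0)
            pthread_cond_broadcast(&cond);  /* wake a waiting quiescer */
        pthread_mutex_unlock(&lock);
    }

    /* Quiesce: close the gate, then wait for in-flight ops to drain.
     * If one op runs for hours (say, migrating a multi-TB file in a
     * single call), this wait -- and every new operation -- stalls. */
    static void quiesce(void)
    {
        pthread_mutex_lock(&lock);
        gate_closed = 1;
        while (in_flight > 0)
            pthread_cond_wait(&cond, &lock);
        /* ... flush dirty buffers and take the snapshot here ... */
        pthread_mutex_unlock(&lock);
    }

    static void resume(void)
    {
        pthread_mutex_lock(&lock);
        gate_closed = 0;
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
    }

    int main(void)
    {
        op_enter();   /* an operation starts ... */
        op_exit();    /* ... and finishes */
        quiesce();    /* gate closes; in-flight count is already zero */
        resume();
        puts("quiesce/resume completed");
        return 0;
    }
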
That assumption was broken if an
external utility, e.g. mmapplypolicy, used the gpfs_restripe_file API on a
very large file, e.g. to migrate the file's blocks to a different storage
pool.
The quiesce operation would need to wait for that API call to finish, as
it's an in-flight operation, but migrating a multi-TB file could take a
while, and during this time all new file system ops would be blocked.  This
was solved several years ago by changing the API and its callers to do the
migration one block range at a time, thus making each individual syscall
short and allowing quiesce to barge in and do its thing.  All currently
supported levels of GPFS have this fix.  I believe mmbackup was affected by
the same GPFS deficiency and benefited from the same fix.
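
For illustration, here is a rough C sketch of that fix.  The helper names
migrate_block_range() and file_block_count() are hypothetical stand-ins
(the real call referred to above is gpfs_restripe_file; its actual
signature is not shown here):

    #include <stdint.h>
    #include <stdio.h>

    #define RANGE_BLOCKS 1024          /* blocks per call; illustrative */

    /* Stub stand-ins so the sketch compiles and runs on its own. */
    static uint64_t file_block_count(int fd) { (void)fd; return 5000; }

    static int migrate_block_range(int fd, uint64_t first, uint64_t count,
                                   int pool)
    {
        printf("migrate fd=%d blocks [%llu, %llu) -> pool %d\n",
               fd, (unsigned long long)first,
               (unsigned long long)(first + count), pool);
        return 0;
    }

    /* The old behavior was one call covering the whole file, holding off
     * quiesce for the entire migration.  The fix: loop over short block
     * ranges, so each individual call finishes quickly and a pending
     * quiesce can barge in between iterations. */
    static int migrate_file(int fd, int target_pool)
    {
        uint64_t nblocks = file_block_count(fd);
        uint64_t start;

        for (start = 0; start < nblocks; start += RANGE_BLOCKS) {
            uint64_t n = nblocks - start;
            if (n > RANGE_BLOCKS)
                n = RANGE_BLOCKS;
            if (migrate_block_range(fd, start, n, target_pool) != 0)
                return -1;             /* stop on the first error */
        }
        return 0;
    }

    int main(void)
    {
        return migrate_file(3, 1);     /* fd and pool id are made up */
    }
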

yuri