[gpfsug-discuss] gpfs waiters debugging

Tue Jun 6 19:15:22 BST 2017

All,

mmlsnode -N waiters is great … I also appreciate the “-s” option to it.  Very helpful when you know the problem started say, slightly more than half an hour ago and you therefore don’t care about sub-1800 second waiters…

Kevin

On Jun 6, 2017, at 11:54 AM, Frederick Stock <stockf at us.ibm.com> wrote:

On recent releases you can accomplish the same with the command, "mmlsnode -N waiters -L".

Fred
__________________________________________________
Fred Stock | IBM Pittsburgh Lab | 720-430-8821
stockf at us.ibm.com

From:        valdis.kletnieks at vt.edu
To:        gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:        06/06/2017 12:46 PM
Subject:        Re: [gpfsug-discuss] gpfs waiters debugging
Sent by:        gpfsug-discuss-bounces at spectrumscale.org
________________________________

On Tue, 06 Jun 2017 15:06:57 +0200, Stijn De Weirdt said:
> oh sure, i meant waiters that last > 300 seconds or so (something that
> could trigger deadlock). obviously we're not interested in debugging the
> short ones, it's not that gpfs doesn't work or anything ;)

At least at one time, a lot of the mm(whatever) administrative commands
would leave one dangling waiter for the duration of the command - which
could be a while if the command was mmdeldisk or mmrestripefs. I admit
not having specifically checked for gpfs 4.2, but it was true for 3.2 through
4.1....

And my addition to the collective debugging knowledge:  A bash one-liner to
dump all the waiters across a cluster, sorted by wait time.  Note that
our clusters tend to be 5-8 servers, so this may be painful for those of you
who have 400+ node clusters. :)

#!/bin/bash
# Prefix each node's waiters with the node name, then sort by wait time.
for i in $(mmlsnode | tail -1 | sed 's/^[ ]*[^ ]*[ ]*//'); do ssh "$i" /usr/lpp/mmfs/bin/mmfsadm dump waiters | sed "s/^/$i /"; done | sort -n -r -k 3 -t' '

We've found it useful - if you have 1 waiter on one node that's 1278 seconds
old, and 3 other nodes have waiters that are 1275 seconds old, there's a good
chance the other 3 nodes' waiters are waiting on the first node's waiter to
resolve itself....
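
A rough sketch of that comparison, in case it helps: the function below
reduces the one-liner's output to the oldest waiter per node, oldest first,
so a node that is a few seconds ahead of the rest stands out. The name
oldest_waiters and the 300-second default are made up for illustration and
are not part of GPFS. It also assumes each input line starts with the node
name and that the wait time is the first thing on the line that looks like
"<number> sec" or "<number> seconds"; check that against what your release
actually prints before trusting it.

```shell
#!/bin/bash
# Hypothetical helper (not a GPFS command): summarize the oldest waiter per node.
# Input on stdin: "<node> <waiter line>", i.e. waiter lines prefixed with the node name.
# Assumption: the wait time is the first "<number> sec" string on each line.
oldest_waiters() {
    local min_secs="${1:-300}"   # ignore waiters younger than this many seconds
    awk -v min="$min_secs" '
        match($0, /[0-9]+(\.[0-9]+)? sec/) {
            # Adding 0 keeps only the leading number of the matched text.
            secs = substr($0, RSTART, RLENGTH) + 0
            if (secs >= min && secs > oldest[$1]) oldest[$1] = secs
        }
        END {
            # One line per node: whole seconds (truncated), then the node name.
            for (node in oldest) printf "%d %s\n", oldest[node], node
        }' | sort -k1,1nr
}
```

To use it, source the function and pipe the output of the loop above (with or
without the final sort) into "oldest_waiters 300". The node at the top of the
list is the first place to look.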

—
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633
