[gpfsug-discuss] Problem Determination

Zachary Giles zgiles at gmail.com
Fri Oct 2 21:27:17 BST 2015


I would like to see better performance metrics / counters from GPFS.
I know we already have mmpmon, which is generally really good -- I've
done some fun things with it and it has been a great tool. And I
realize that there is supposedly a new monitoring framework in 4.x,
which I haven't played with yet.

But generally, it would be extremely helpful to get synchronized
(across all nodes), high-accuracy counters of data flow, number of
waiters, page pool stats, the distribution of data from one layer to
another down to the NSDs, and so on.  I believe many of these counters
already exist, but they're hidden behind some mmfsadm subcommand that
one has to troll through, with possible performance implications.
mmpmon can do some of this, but it only covers a handful of counters,
it's hard to say how well synchronized the counters are across nodes,
and I've personally seen an mmpmon run go bad and take down a cluster.
It would be nice if the data were pushed out, or provided in a safe
manner with the design and expectation of "log-everything forever
continuously".

As GSS/ESS systems start popping up, I realize they have this other
monitoring framework to watch the VD throughputs, which is great.
But that doesn't let us monitor the more traditional setups.
It would be nice to monitor it all together in the same way, so we
don't miss out on monitoring half the infrastructure, or end up buying
a cluster with some fancy GUI that can't do what we want.

-Zach


On Fri, Oct 2, 2015 at 2:00 PM, Wahl, Edward <ewahl at osc.edu> wrote:
> I'm not yet in the 4.x release stream so this may be taken with a grain (or
> more) of salt as we say.
>
> PLEASE keep the ability of commands to run with set -x or dump debug output
> when the DEBUG=1 environment variable is set.  This has been extremely useful
> over the years.   Granted, I've never worked out why we sometimes see odd
> little things like machines deciding they suddenly need an FPO license, or
> one NSD server deciding its name is part of the FQDN instead of just its
> hostname, and only for certain commands, but it's DAMN useful.  Minor issues
> especially can be tracked down with it.
>
> Undocumented features and logged items abound.  I'd say start there.  This
> is one area where it is definitely more art than science with Spectrum Scale
> (meh GPFS still sounds better. So does Shark. Can we go back to calling it
> the Shark Server Project?)
>
>   A complete failure of the verbs layer and fallback to the other defined
> networks would be nice to know about during operation. GPFS is excellent
> about telling you at startup, but not so much during operation, at least in 3.5.
>
>  I imagine that with the 'automated compatibility layer building' I'll be
> looking for some serious amounts of problem determination (PD) for the issues
> we _will_ see there.  We frequently build against kernels we are not yet
> running at this site, so this needs well-documented PD and resolution.
>
> Ed Wahl
> OSC
>
>
> ________________________________
> From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org]
> on behalf of Patrick Byrne [PATBYRNE at uk.ibm.com]
> Sent: Thursday, October 01, 2015 6:09 AM
> To: gpfsug-discuss at gpfsug.org
> Subject: [gpfsug-discuss] Problem Determination
>
> Hi all,
>
> As I'm sure some of you are aware, problem determination is an area where we
> are looking to make significant improvements over the coming releases of
> Spectrum Scale. To help us target the areas we work on and make the
> improvements as useful as possible, I am trying to get as much feedback as I
> can about the different problems users have, and how people go about solving
> them.
>
> I am interested in hearing everything from day-to-day annoyances to problems
> that have caused major frustration in trying to track down the root cause.
> Where possible, it would be great to hear how the problems were dealt with as
> well, so that others can benefit from your experience. Feel free to reply to
> the mailing list - maybe others have seen similar problems and can provide
> tips for the future - or to me directly if you'd prefer
> (patbyrne at uk.ibm.com).
>
> On a related note, in 4.1.1 a component was added that monitors the state of
> the various protocols that are now supported (NFS, SMB, Object). The output
> from this is available with the 'mmces state' and 'mmces events' CLIs, and I
> would like to get feedback from anyone who has had the chance to make use of
> it. Is it useful? How could it be improved? We are looking at the possibility
> of extending this component to cover more than just protocols, so any
> feedback would be greatly appreciated.
>
> Thanks in advance,
>
> Patrick Byrne
> IBM Spectrum Scale - Development Engineer
> IBM Systems - Manchester Lab
> IBM UK Limited
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>



-- 
Zach Giles
zgiles at gmail.com


