[gpfsug-discuss] Potential problems - leaving trace enabled in over-write mode?

Wed Mar 8 06:53:17 GMT 2017

We're in the same boat... gpfs.snap hangs when the cluster or node is unresponsive, but they don't know how to give us a root cause without a snap. Very frustrating.

-----Original Message-----
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of valdis.kletnieks at vt.edu
Sent: 07 March 2017 21:37
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] Potential problems - leaving trace enabled in over-write mode?

On Tue, 07 Mar 2017 21:17:35 +0000, Bryan Banister said:

> Just depends on how your problem is detected... is it in a log?  Is it
> found by running a command (e.g. mm*)?  Is it discovered in `ps`
> output?  Is your scheduler failing jobs?

I think the problem here is that if you have a sudden cataclysmic event, you want to have been in flight-recorder mode and be able to look at the last 5 or 10 seconds of trace *before* you became aware that your filesystem just went walkies.  Sure, you can start tracing when the filesystem dies - but at that point you just get a whole mess of failed I/O requests in the trace, and no hint of where things went south...
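
For anyone unfamiliar with the term, "flight-recorder mode" here just means a fixed-size buffer that keeps overwriting its oldest entries, so the moments leading up to a failure are always on hand. A minimal Python sketch of the idea follows. It is only an illustration, not how GPFS tracing is implemented; the FlightRecorder class, its capacity, and the millisecond timestamps are all made up for the example.

```python
from collections import deque


class FlightRecorder:
    """Hypothetical over-write trace buffer: keeps only the newest records."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry once it is full.
        self._records = deque(maxlen=capacity)

    def record(self, timestamp_ms, message):
        self._records.append((timestamp_ms, message))

    def dump(self, now_ms, window_ms):
        """Return the records from the last window_ms before now_ms."""
        cutoff = now_ms - window_ms
        return [r for r in self._records if r[0] >= cutoff]


# Trace continuously: one event every 10 ms for 50 seconds.
recorder = FlightRecorder(capacity=1000)
for i in range(5000):
    recorder.record(i * 10, f"event {i}")

# The failure is noticed at t = 50 s; pull the 5 seconds leading up to it.
recent = recorder.dump(now_ms=50_000, window_ms=5_000)
```

The buffer never grows past its capacity no matter how long tracing stays on, and the dump contains what happened before the failure was noticed, not the flood of failed requests after it.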