[gpfsug-discuss] High I/O wait times

Buterbaugh, Kevin L Kevin.Buterbaugh at Vanderbilt.Edu
Mon Jul 9 22:44:07 BST 2018


Hi All,

Time for a daily update on this saga…

First off, responses to those who have responded to me:

Yaron - we have QLogic switches, but I’ll RTFM and figure out how to clear the counters … from a quick look at the CLI on one of them I don’t see how to even view those counters, much less clear them, but I’ll do some digging.  QLogic does have a GUI app, but given that the Mac version is PowerPC-only I think that’s a dead end!  :-O

Jonathan - understood.  We were just trying to rule out as much of the hardware as potential culprits as we could.  The storage arrays will all get a power-cycle this Sunday when we take a downtime to do firmware upgrades on them … the vendor is basically refusing to assist further until we get on the latest firmware.

So … we had noticed that things seemed to calm down starting Friday evening and continuing throughout the weekend.  We have a script that runs every half hour, and if there are any NSD servers where “mmdiag --iohist” shows an I/O > 1,000 ms we get an alert (again, originally designed to alert us to a CBM failure).  We only got three alerts all weekend long (as opposed to last week, when they were coming every half hour round the clock).
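
For illustration, here is a minimal sketch of the kind of check that script does.  The server names, mail address and threshold are placeholders, and the position of the “time ms” field in the “mmdiag --iohist” output is an assumption that may need adjusting for your GPFS release:

    #!/bin/bash
    # Flag any I/O slower than THRESHOLD_MS reported by mmdiag --iohist.
    # NSD server names, mail address and the awk field number ($6 assumed
    # to be "time ms") are placeholders / assumptions.
    THRESHOLD_MS=1000
    for nsd in nsd01 nsd02 nsd03; do
        slow=$(ssh "$nsd" /usr/lpp/mmfs/bin/mmdiag --iohist 2>/dev/null | \
               awk -v t="$THRESHOLD_MS" '$1 ~ /^[0-9]+:[0-9]+:[0-9]+/ && $6+0 > t')
        if [ -n "$slow" ]; then
            printf 'Slow I/O on %s:\n%s\n' "$nsd" "$slow" | \
                mail -s "GPFS I/O > ${THRESHOLD_MS} ms on $nsd" root@example.com
        fi
    done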

Then, this morning I repeated the “dd” test that I had run before and after replacing the FC cables going to “eon34”, which had shown very typical I/O rates for all the NSDs except the 4 in eon34, which were quite poor (~1.5 - 10 MB/sec).  I ran the new tests this morning from different NSD servers and with a higher “count” passed to dd to eliminate any potential caching effects.  I ran the test twice from two different NSD servers, and this morning all NSDs - including those on eon34 - showed normal I/O rates!
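
For the record, the tests were along these lines (a sketch; the device paths and sizes are placeholders rather than our real multipath names, and iflag=direct is an alternative way to sidestep the page cache besides just raising the count):

    # Sequential read test against each NSD LUN from an NSD server.
    # Device paths below are hypothetical examples.
    for dev in /dev/mapper/eon34_lun1 /dev/mapper/eon34_lun2; do
        echo "== $dev =="
        dd if="$dev" of=/dev/null bs=1M count=16384 iflag=direct 2>&1 | tail -1
    done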

Argh - so do we have a hardware problem or not?!?

I still think we do, but I’m taking *nothing* for granted at this point!  So today we also used another script we’ve written to do some investigation … basically we took the script which runs “mmdiag --iohist” and added some options so that, for every I/O above the threshold, it determines which client issued the I/O.  It then queries SLURM to see what jobs are running on that client.
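
Roughly, the added logic looks like the following sketch.  The awk field numbers are assumptions (field 6 as “time ms”, the last field as the client address in the server-side iohist output), and the threshold and output format are just examples:

    # For each slow I/O in mmdiag --iohist, resolve the client node and ask
    # SLURM what is running there.  Field positions are assumptions; adjust
    # them to the iohist layout on your release.
    /usr/lpp/mmfs/bin/mmdiag --iohist | \
    awk -v t=1000 '$1 ~ /^[0-9]+:[0-9]+:[0-9]+/ && $6+0 > t {print $NF}' | \
    sort -u | while read -r client; do
        node=$(getent hosts "$client" | awk '{print $2}')   # IP -> hostname
        echo "== slow I/O issued by ${node:-$client} =="
        [ -n "$node" ] && squeue -h -w "${node%%.*}" -o '%i %u %j'   # jobid, user, job name
    done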

Interestingly enough, one user showed up waaaayyyyyy more often than anybody else.  Many times she was on a node with only one other user, who we know doesn’t access the GPFS filesystem, and other times she was the only user on the node.

We certainly recognize that correlation is not causation (she could be a victim and not the culprit), but she was on so many of the reported clients that we decided to investigate further … yet her jobs seem to have fairly modest I/O requirements.  Each one processes 4 input files, which are basically just gzip’d text files of 1.5 - 5 GB in size.  This is, however, what prompted my other query to the list about determining which NSDs a given file has its blocks on.  I couldn’t see how files of that size could have all their blocks on only a couple of NSDs in the pool (out of 19 total!) but wanted to verify that.  The files that I have looked at are evenly spread out across the NSDs.

So given that her files are spread across all 19 NSDs in the pool, and the high I/O wait times are almost always only on LUNs in eon34 (and, more specifically, on two of the four LUNs in eon34), I’m pretty well convinced it’s not her jobs causing the problems … I’m back to thinking it’s a weird hardware issue.

But if anyone wants to try to convince me otherwise, I’ll listen…

Thanks!

Kevin

On Jul 8, 2018, at 12:32 PM, Yaron Daniel <YARD at il.ibm.com> wrote:

Hi

Clear all counters on the FC switches and see which ports have errors.

For Brocade, run:

slotstatsclear
statsclear
porterrshow

For Cisco, run:

clear counters all

There might be a bad GBIC/cable/storage GBIC, which can affect performance; if there is something like that, you can see which ports’ error counters grow over time.
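
For example, one simple way to watch for growing counters (a sketch; the switch name, login and snapshot paths are placeholders) is to capture the output periodically and diff it:

    # Snapshot Brocade port error counters and compare with the previous run.
    # "fcswitch1" and the snapshot paths are hypothetical.
    ssh admin@fcswitch1 porterrshow > /tmp/porterr.new
    diff /tmp/porterr.old /tmp/porterr.new || true   # changed lines = counters growing
    mv /tmp/porterr.new /tmp/porterr.old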
Regards

________________________________



Yaron Daniel
Storage Architect – IL Lab Services (Storage)
IBM Global Markets, Systems HW Sales
94 Em Ha'Moshavot Rd, Petach Tiqva, 49527, Israel

Phone:  +972-3-916-5672
Fax:    +972-3-916-5672
Mobile: +972-52-8395593
e-mail: yard at il.ibm.com
IBM Israel: http://www.ibm.com/il/he/






From:        Jonathan Buzzard <jonathan.buzzard at strath.ac.uk>
To:        gpfsug-discuss at spectrumscale.org
Date:        07/07/2018 11:43 AM
Subject:        Re: [gpfsug-discuss] High I/O wait times
Sent by:        gpfsug-discuss-bounces at spectrumscale.org
________________________________



On 07/07/18 01:28, Buterbaugh, Kevin L wrote:

[SNIP]

>
> So, to try to rule out everything but the storage array we replaced the
> FC cables going from the SAN switches to the array, plugging the new
> cables into different ports on the SAN switches.  Then we repeated the
> dd tests from a different NSD server, which both eliminated the NSD
> server and its FC cables as a potential cause … and saw results
> virtually identical to the previous test.  Therefore, we feel pretty
> confident that it is the storage array and have let the vendor know all
> of this.

I was not thinking of doing anything quite as drastic as replacing
stuff, more looking into the logs on the switches in the FC network and
examining them for packet errors. The above testing didn't eliminate bad
optics in the storage array itself, for example, though it does appear
to be the storage arrays themselves. Sounds like they could do with a
power cycle...

JAB.

--
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss




_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


