[gpfsug-discuss] High I/O wait times

Frederick Stock stockf at us.ibm.com
Tue Jul 3 22:53:19 BST 2018


How many NSDs are served by the NSD servers and what is your maximum file 
system block size?  Have you confirmed that you have sufficient NSD worker 
threads to handle the maximum number of IOs you are configured to have 
active?  That would be the number of NSDs served times 12 (you have 12 
threads per queue).
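
As a purely illustrative back-of-the-envelope check (the NSD count below is 
made up; substitute your own values):

    # threads needed ~= NSDs served per server x threads per queue
    nsds_served = 24          # hypothetical NSD count for one server
    threads_per_queue = 12    # your nsdThreadsPerQueue setting
    print(nsds_served * threads_per_queue)   # compare with nsdMaxWorkerThreads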

Fred
__________________________________________________
Fred Stock | IBM Pittsburgh Lab | 720-430-8821
stockf at us.ibm.com



From:   "Buterbaugh, Kevin L" <Kevin.Buterbaugh at Vanderbilt.Edu>
To:     gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:   07/03/2018 05:41 PM
Subject:        Re: [gpfsug-discuss] High I/O wait times
Sent by:        gpfsug-discuss-bounces at spectrumscale.org



Hi Fred, 

Thanks for the response.  I have been looking at the “mmfsadm dump nsd” 
data from the two NSD servers that serve up the two NSDs that most 
commonly experience high wait times (although, again, this varies from 
time to time).  In addition, I have been reading:

https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/NSD%20Server%20Design%20and%20Tuning

And:

https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/NSD%20Server%20Tuning

Which seem to be the most relevant documents on the Wiki.

I would like to do a more detailed analysis of the “mmfsadm dump nsd” 
output, but my preliminary look at it seems to indicate that I/Os are 
queueing in the 50 - 100 range on the small queues and in the 60 - 200 
range on the large queues.

In addition, I am regularly seeing all 12 threads on the LARGE queues 
active, while it is much more rare that I see all - or even close to all - 
the threads on the SMALL queues active.
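
For what it’s worth, here is the quick-and-dirty sketch I’m using to pull 
the highest pending counts out of the dump.  It assumes each queue line 
mentions SMALL or LARGE and a “pending <n>” field - the exact format may 
well differ between releases, so treat the regex as a guess:

    import re, sys

    highest = {"SMALL": 0, "LARGE": 0}
    for line in sys.stdin:                 # pipe "mmfsadm dump nsd" into this
        m = re.search(r"(SMALL|LARGE).*?pending\s+(\d+)", line)
        if m:
            qtype, pending = m.group(1), int(m.group(2))
            highest[qtype] = max(highest[qtype], pending)
    print(highest)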

As far as the parameters Scott and Yuri mention, on our cluster they are 
set thusly:

[common]
nsdMaxWorkerThreads 640
[<all the GPFS servers listed here>]
nsdMaxWorkerThreads 1024
[common]
nsdThreadsPerQueue 4
[<all the GPFS servers listed here>]
nsdThreadsPerQueue 12
[common]
nsdSmallThreadRatio 3
[<all the GPFS servers listed here>]
nsdSmallThreadRatio 1
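
If I’m reading the tuning wiki correctly (and I may not be), the worker 
thread budget gets carved into queues of nsdThreadsPerQueue threads each, 
split between SMALL and LARGE queues according to nsdSmallThreadRatio, 
which would make the math on our servers roughly:

    nsd_max_worker_threads = 1024
    nsd_threads_per_queue  = 12
    nsd_small_thread_ratio = 1    # ratio of SMALL queues to LARGE queues

    total_queues = nsd_max_worker_threads // nsd_threads_per_queue
    small_queues = total_queues * nsd_small_thread_ratio // (nsd_small_thread_ratio + 1)
    large_queues = total_queues - small_queues
    print(total_queues, small_queues, large_queues)   # ~85 queues: ~42 SMALL, ~43 LARGE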

So to me it sounds like I need more resources on the LARGE queue side of 
things … i.e. it sure doesn’t sound like I want to change my small thread 
ratio.  Increasing the number of threads sounds like it might help, but 
that also takes more pagepool, and I’ve got limited RAM in these (old) NSD 
servers.  I do have nsdbufspace set to 70, but I’ve only got 16-24 GB of 
RAM in each of these NSD servers.  And a while back I did try increasing 
the pagepool on them (very slightly) and ended up causing problems because 
they then ran out of physical RAM.
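
Just to put numbers on that trade-off (the pagepool size here is 
hypothetical - ours is on the small side):

    pagepool_gib    = 8     # hypothetical pagepool on a 16 GB server
    nsdbufspace_pct = 70    # our setting: % of pagepool usable for NSD buffers
    print(pagepool_gib * nsdbufspace_pct / 100, "GiB available for NSD buffers")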

Thoughts?  Followup questions?  Thanks!

Kevin

On Jul 3, 2018, at 3:11 PM, Frederick Stock <stockf at us.ibm.com> wrote:

Are you seeing similar values for all the nodes or just some of them?  One 
possible issue is how the NSD queues are configured on the NSD servers. 
You can see this with the output of "mmfsadm dump nsd".  There are queues 
for LARGE IOs (greater than 64K) and queues for SMALL IOs (64K or less). 
Check the highest pending values to see if many IOs are queueing.  There 
are a couple of options to fix this but rather than explain them I suggest 
you look for information about NSD queueing on the developerWorks site. 
There has been information posted there that should prove helpful.
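
A trivial illustration of the split, just to make it concrete (sizes in 
bytes):

    # Which queue family an I/O lands on depends on its size vs. 64 KiB
    def queue_family(io_bytes):
        return "LARGE" if io_bytes > 64 * 1024 else "SMALL"

    print(queue_family(4096), queue_family(1024 * 1024))   # SMALL LARGE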

Fred
__________________________________________________
Fred Stock | IBM Pittsburgh Lab | 720-430-8821
stockf at us.ibm.com



From:        "Buterbaugh, Kevin L" <Kevin.Buterbaugh at Vanderbilt.Edu>
To:        gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:        07/03/2018 03:49 PM
Subject:        [gpfsug-discuss] High I/O wait times
Sent by:        gpfsug-discuss-bounces at spectrumscale.org



Hi all, 

We are experiencing some high I/O wait times (5 - 20 seconds!) on some of 
our NSDs as reported by “mmdiag --iohist” and are struggling to understand 
why.  One of the confusing things is that, while certain NSDs tend to show 
the problem more than others, the problem is not consistent … i.e. the 
problem tends to move around from NSD to NSD (and storage array to storage 
array) whenever we check … which is sometimes just a few minutes apart.
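
In case it helps, this is roughly how I’m flagging the slow I/Os.  It 
assumes the milliseconds value is the 7th whitespace-separated field of 
each data line, which matches what I see here but may need adjusting on 
other releases:

    import sys

    THRESHOLD_MS = 5000.0                 # flag anything over 5 seconds
    for line in sys.stdin:                # pipe "mmdiag --iohist" into this
        fields = line.split()
        try:
            wait_ms = float(fields[6])
        except (IndexError, ValueError):
            continue                      # skip headers / non-data lines
        if wait_ms >= THRESHOLD_MS:
            print(line.rstrip())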

In the past when I have seen “mmdiag --iohist” report high wait times like 
this it has *always* been hardware related.  In our environment, the most 
common cause has been a battery backup unit on a storage array controller 
going bad and the storage array switching to write straight to disk.  But 
that’s *not* happening this time. 

Is there anything within GPFS / outside of a hardware issue that I should 
be looking for??  Thanks!

—
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and 
Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss



_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss






