[gpfsug-discuss] iowait?

Aaron Knister aaron.s.knister at nasa.gov
Mon Aug 29 23:58:34 BST 2016


Thanks Yuri!

I thought calling io_schedule() was the right thing to do because the NFS 
client in the kernel did this directly until fairly recently. Now it 
calls wait_on_bit_io(), which I believe ultimately calls io_schedule().
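
For reference, here's my rough reading of that path, as a simplified 
sketch paraphrased from kernel source (exact signatures and helper names 
vary by kernel version, so treat it as illustrative only):

  /* wait_on_bit_io() ends up in a bit-wait "action" callback that sleeps
   * via io_schedule() rather than schedule(), so the wait gets charged
   * to iowait. Simplified sketch, not the literal kernel code. */
  static int bit_wait_io(struct wait_bit_key *key, int mode)
  {
    io_schedule();                            /* accounted as iowait */
    if (signal_pending_state(mode, current))  /* mode may allow signals */
      return -EINTR;
    return 0;
  }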

Do you see a more targeted approach to having GPFS register iowait that 
would be feasible? (E.g. not registering iowait for lock waits, as you 
suggested, but doing so for file/directory operations such as 
read/write/readdir?)
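
Purely to illustrate what I mean (ACCOUNT_IOWAIT is a made-up flag name, 
not something that exists in the GPL layer today), the check could be 
keyed on a flag that only the data-path waits set, rather than on 
INTERRUPTIBLE:

again:
  /* call the scheduler */
  if ( waitFlags & INTERRUPTIBLE )
    schedule();
  else if ( waitFlags & ACCOUNT_IOWAIT )  /* hypothetical flag, set only by
                                             read/write/readdir wait paths */
    io_schedule();                        /* charge the sleep to iowait */
  else
    schedule();                           /* lock waits etc. stay plain waits */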

-Aaron

On 8/29/16 4:31 PM, Yuri L Volobuev wrote:
> I would advise caution on using "mmdiag --iohist" heavily. In more
> recent code streams (V4.1, V4.2) there's a problem with internal locking
> that could, under certain conditions, lead to symptoms that look very
> similar to sporadic network blockage. Basically, if "mmdiag --iohist"
> gets blocked for long periods of time (e.g. due to local disk/NFS
> performance issues), this may end up blocking an mmfsd receiver thread,
> delaying RPC processing. The problem was discovered fairly recently, and
> the fix hasn't made it out to all service streams yet.
>
> More generally, IO history is a valuable tool for troubleshooting disk
> IO performance issues, but it doesn't have the right semantics for
> regular, systemic IO performance sampling and monitoring. The query
> operation is too expensive, the coverage is subject to load, and the
> output is somewhat unstructured. With some effort, one can still build
> some form of roll-your-own monitoring implementation, but this is
> certainly not an optimal way of approaching the problem. The data should
> be available in a structured form, through a channel that supports
> light-weight, flexible querying and doesn't impact mainline IO
> processing.
>
> In Spectrum Scale, this type of data is fed from mmfsd to Zimon, via an
> mmpmon interface, and end users can then query Zimon for raw or
> partially processed data. When it comes to high-volume stats, retaining
> raw data at its full resolution is only practical for relatively short
> periods of time (seconds, or perhaps a small number of minutes), and
> some form of aggregation is necessary for covering longer periods
> (hours to days).
>
> In the current versions of the product, a very similar type of data is
> already available this way: RPC stats. There are plans to make IO
> history data available in a similar fashion. The entire approach may
> need to be re-calibrated, however. Making RPC stats available doesn't
> appear to have generated a surge of user interest, probably because the
> data is too complex for casual processing: while a lot of very valuable
> insight can undoubtedly be gained by analyzing RPC stats, the effort
> required to do so is too much for most users. That is, we need to
> provide some tools for raw-data analytics. Largely the same argument
> applies to IO stats. In fact, on an NSD client, IO stats are effectively
> a subset of RPC stats: with some effort, one can perform a comprehensive
> analysis of NSD client IO by analyzing NSD client-to-server RPC traffic.
> One can certainly argue that the effort required is a bit much, though.
>
> Getting back to the original question: would the proposed
> cxiWaitEventWait() change work? It'll likely result in nr_iowait being
> incremented every time a thread in GPFS code performs an uninterruptible
> wait. That could be an actual IO request, or something else, e.g.
> waiting for a lock. Those may be the desirable semantics in some
> scenarios, but I wouldn't agree that it's the right behavior for every
> uninterruptible wait. io_schedule() is intended for block-device IO
> waits, so using it this way is not in line with the code's intent, which
> is never a good idea. Besides, relative to schedule(), io_schedule() has
> some overhead that could have performance implications of an uncertain
> nature.
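>
> (Roughly, and glossing over kernel-version differences, io_schedule() is
> schedule() wrapped in extra bookkeeping; the sketch below is paraphrased
> rather than copied from kernel source, but it shows where both the
> nr_iowait effect and the extra cost come from:)
>
>   /* simplified sketch of io_schedule(), not the literal kernel code */
>   void io_schedule(void)
>   {
>     struct rq *rq = raw_rq();        /* this CPU's runqueue */
>
>     delayacct_blkio_start();         /* per-task block-IO delay accounting */
>     atomic_inc(&rq->nr_iowait);      /* what /proc/stat reports as iowait */
>     blk_flush_plug(current);         /* push out any plugged block IO */
>     current->in_iowait = 1;
>     schedule();                      /* the actual context switch */
>     current->in_iowait = 0;
>     atomic_dec(&rq->nr_iowait);
>     delayacct_blkio_end();
>   }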
>
> yuri
>
> From: Bryan Banister <bbanister at jumptrading.com>
> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>,
> Date: 08/29/2016 11:06 AM
> Subject: Re: [gpfsug-discuss] iowait?
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>
> Try this:
>
> mmchconfig ioHistorySize=1024 # Or however big you want!
>
> Cheers,
> -Bryan
>
> -----Original Message-----
> From: gpfsug-discuss-bounces at spectrumscale.org
> [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Aaron Knister
> Sent: Monday, August 29, 2016 1:05 PM
> To: gpfsug main discussion list
> Subject: Re: [gpfsug-discuss] iowait?
>
> That's an interesting idea. I took a look at mmdiag --iohist on a busy
> node and it doesn't seem to capture more than literally 1 second of
> history. Is there a better way to grab the data, or to have GPFS capture
> more of it?
>
> Just to give some more context: as part of our monthly reporting
> requirements we calculate job efficiency by comparing the number of CPU
> cores requested by a given job with the CPU % utilization during that
> job's time window. Currently a job that's doing a "sleep 9000" would
> show up the same as a job blocked on I/O. Having GPFS wait time included
> in iowait would allow us to easily make this distinction.
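>
> (As a made-up example: a job that requested 16 cores for a 2-hour window
> but accumulated only 8 core-hours of CPU time shows 25% efficiency
> either way; with GPFS waits counted as iowait we could tell whether the
> missing 75% was genuinely idle or blocked on the filesystem.)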
>
> -Aaron
>
> On 8/29/16 1:56 PM, Bryan Banister wrote:
>> There is the iohist data that may have what you're looking for, -Bryan
>>
>> -----Original Message-----
>> From: gpfsug-discuss-bounces at spectrumscale.org
>> [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Aaron
>> Knister
>> Sent: Monday, August 29, 2016 12:54 PM
>> To: gpfsug-discuss at spectrumscale.org
>> Subject: Re: [gpfsug-discuss] iowait?
>>
>> Sure, we can and do use both iostat/sar and collectl to collect disk utilization on our NSD servers. That doesn't give us insight, though, into any individual client node, of which we've got 3500. We do log mmpmon data from each node, but that doesn't give us any insight into how much time is being spent waiting on I/O. Having GPFS report iowait on client nodes would give us this insight.
>>
>> On 8/29/16 1:50 PM, Alex Chekholko wrote:
>>> Any reason you can't just use iostat or collectl or any of a number
>>> of other standard tools to look at disk utilization?
>>>
>>> On 08/29/2016 10:33 AM, Aaron Knister wrote:
>>>> Hi Everyone,
>>>>
>>>> Would it be easy to have GPFS report iowait values in Linux? This
>>>> would be a huge help for us in determining whether a node's low
>>>> utilization is due to some issue with the code running on it or if
>>>> it's blocked on I/O, especially in a historical context.
>>>>
>>>> I naively tried, on a test system, changing schedule() in
>>>> cxiWaitEventWait() (line ~2832 in gpl-linux/cxiSystem.c) to this:
>>>>
>>>> again:
>>>>   /* call the scheduler */
>>>>   if ( waitFlags & INTERRUPTIBLE )
>>>>     schedule();
>>>>   else
>>>>     io_schedule();
>>>>
>>>> Seems to actually do what I'm after but generally bad things happen
>>>> when I start pretending I'm a kernel developer.
>>>>
>>>> Any thoughts? If I open an RFE would this be something that's
>>>> relatively easy to implement (not asking for a commitment *to*
>>>> implement it, just that I'm not asking for something seemingly
>>>> simple that's actually fairly hard to implement)?
>>>>
>>>> -Aaron
>>>>
>>>
>>
>> --
>> Aaron Knister
>> NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight
>> Center
>> (301) 286-2776
>>
>
> --
> Aaron Knister
> NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center
> (301) 286-2776
>

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776


