[gpfsug-discuss] Monitor NSD server queue?

Aaron Knister aaron.s.knister at nasa.gov
Fri Aug 19 23:06:57 BST 2016


Thanks everyone! I also have a PMR open for this, so hopefully the RFE 
gets some traction.

On 8/18/16 11:14 AM, McPheeters, Gordon wrote:
> Got my vote -  thanks Robert.
>
>
> Gordon McPheeters
> ALCF Storage
> (630) 252-6430
> gmcpheeters at anl.gov
>
>
>
>> On Aug 18, 2016, at 10:00 AM, Bryan Banister
>> <bbanister at jumptrading.com> wrote:
>>
>> Great stuff… I added my vote,
>> -Bryan
>>
>> *From:* gpfsug-discuss-bounces at spectrumscale.org *On Behalf Of* Oesterlin, Robert
>> *Sent:* Thursday, August 18, 2016 9:47 AM
>> *To:* gpfsug main discussion list
>> *Subject:* Re: [gpfsug-discuss] Monitor NSD server queue?
>>
>> Done.
>>
>> Notification generated at: 18 Aug 2016, 10:46 AM Eastern Time (ET)
>>
>> ID:           93260
>> Headline:     Give sysadmin insight into the inner workings of the NSD server machinery, in particular the queue dynamics
>> Submitted on: 18 Aug 2016, 10:46 AM Eastern Time (ET)
>> Brand:        Servers and Systems Software
>> Product:      Spectrum Scale (formerly known as GPFS) - Public RFEs
>>
>> Link:
>>  http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=93260
>>
>>
>> Bob Oesterlin
>> Sr Storage Engineer, Nuance HPC Grid
>> 507-269-0413
>>
>>
>> *From:* <gpfsug-discuss-bounces at spectrumscale.org> on behalf of Yuri L
>> Volobuev <volobuev at us.ibm.com>
>> *Reply-To:* gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
>> *Date:* Wednesday, August 17, 2016 at 3:34 PM
>> *To:* gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
>> *Subject:* [EXTERNAL] Re: [gpfsug-discuss] Monitor NSD server queue?
>>
>>
>> Unfortunately, at the moment there's no safe mechanism to show the
>> usage statistics for different NSD queues. "mmfsadm saferdump nsd" as
>> implemented doesn't acquire locks when parsing internal data
>> structures. Now, NSD data structures are fairly static, as such things
>> go, so the risk of following a stale pointer and hitting a segfault
>> isn't particularly significant. I don't think I remember ever seeing
>> mmfsd crash with NSD dump code on the stack. That said, this isn't
>> code that's tested and known to be safe for production use. I haven't
>> seen a case myself where an mmfsd thread gets stuck running this dump
>> command, either, but Bob has. If that condition ever reoccurs, I'd be
>> interested in seeing debug data.
>>
>> I agree that there's value in giving a sysadmin insight into the inner
>> workings of the NSD server machinery, in particular the queue
>> dynamics. mmdiag should be enhanced to allow this. That'd be a very
>> reasonable (and doable) RFE.
>>
>> yuri
>>
>>
>> From: "Oesterlin, Robert" <Robert.Oesterlin at nuance.com>
>> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
>> Date: 08/17/2016 04:45 AM
>> Subject: Re: [gpfsug-discuss] Monitor NSD server queue?
>> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>>
>> ------------------------------------------------------------------------
>>
>>
>>
>>
>> Hi Aaron
>>
>> You did a perfect job of explaining a situation I've run into time
>> after time - high latency on the disk subsystem causing a backup in
>> the NSD queues. I was doing what you suggested not to do - "mmfsadm
>> saferdump nsd" and looking at the queues. In my case "mmfsadm
>> saferdump" would usually work or hang, rather than kill mmfsd. But -
>> the hang usually resulted in a tied-up thread in mmfsd, so that's no
>> good either.
>>
>> I wish I had better news - this is the only way I've found to get
>> visibility into these queues. IBM hasn't seen fit to give us a way to
>> safely look at these. I personally think it's a bug that we can't
>> safely dump these structures, as they give insight as to what's
>> actually going on inside the NSD server.
>>
>> Yuri, Sven - thoughts?
>>
>>
>> Bob Oesterlin
>> Sr Storage Engineer, Nuance HPC Grid
>>
>>
>>
>> *From:* <gpfsug-discuss-bounces at spectrumscale.org> on behalf of
>> "Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]"
>> <aaron.s.knister at nasa.gov>
>> *Reply-To:* gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
>> *Date:* Tuesday, August 16, 2016 at 8:46 PM
>> *To:* gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
>> *Subject:* [EXTERNAL] [gpfsug-discuss] Monitor NSD server queue?
>>
>> Hi Everyone,
>>
>> We ran into a rather interesting situation over the past week. We had
>> a job that was pounding the ever loving crap out of one of our
>> filesystems (called dnb02) doing about 15GB/s of reads. We had other
>> jobs experience a slowdown on a different filesystem (called dnb41)
>> that uses entirely separate backend storage. What I can't figure out
>> is why this other filesystem was affected. I've checked IB bandwidth
>> and congestion, Fibre Channel bandwidth and errors, Ethernet bandwidth
>> and congestion, looked at the mmpmon nsd_ds counters (including disk
>> request wait time), and checked out the disk iowait values from
>> collectl. I simply can't account for the slowdown on the other
>> filesystem. The only thing I can think of is the high latency on
>> dnb02's NSDs caused the mmfsd NSD queues to back up.
>>
>> Here's my question-- how can I monitor the state of the NSD queues? I
>> can't find anything in mmdiag. An mmfsadm saferdump NSD shows me the
>> queues and their status. I'm just not sure calling saferdump NSD every
>> 10 seconds to monitor this data is going to end well. I've seen
>> saferdump NSD cause mmfsd to die and that's from a task we only run
>> every 6 hours that calls saferdump NSD.
>>
>> Any thoughts/ideas here would be great.
>>
>> Thanks!
>>
>> -Aaron
>>
>> _______________________________________________
>> gpfsug-discuss mailing list
>> gpfsug-discuss at spectrumscale.org <http://spectrumscale.org/>
>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>>
>>
>
>
>
>
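For anyone who ends up polling a diagnostic command anyway until the RFE lands, here is a minimal sketch of a guarded poller. It is not a tested monitoring tool. The names `run_probe`, `poll`, and `ProbeResult` are made up for illustration, and the output is returned unparsed because the dump format is not described anywhere in this thread. As discussed in the quoted messages, "mmfsadm saferdump nsd" is not known to be safe for production use, and a client-side timeout only bounds the caller: it cannot free a thread that is stuck inside mmfsd. The sketch therefore stops polling after the first timeout instead of retrying.

```python
import subprocess
import time
from dataclasses import dataclass
from typing import Iterator, Optional, Sequence


@dataclass
class ProbeResult:
    elapsed_s: float           # wall-clock time the command took
    timed_out: bool            # True if the command exceeded timeout_s
    returncode: Optional[int]  # None when the command timed out
    output: str                # raw, unparsed stdout and stderr


def run_probe(cmd: Sequence[str], timeout_s: float) -> ProbeResult:
    """Run one diagnostic command, killing it if it exceeds timeout_s."""
    start = time.monotonic()
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
        return ProbeResult(time.monotonic() - start, False,
                           done.returncode, done.stdout)
    except subprocess.TimeoutExpired as exc:
        # Partial output may be missing or bytes, depending on the platform.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return ProbeResult(time.monotonic() - start, True, None, out)


def poll(cmd: Sequence[str], interval_s: float, timeout_s: float,
         max_runs: Optional[int] = None) -> Iterator[ProbeResult]:
    """Yield one result per run, strictly one command at a time.

    Polling stops after the first timeout so that a hung command is
    never followed by further invocations.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        result = run_probe(cmd, timeout_s)
        runs += 1
        yield result
        if result.timed_out:
            return
        if max_runs is None or runs < max_runs:
            time.sleep(interval_s)
```

Iterating over something like `poll(["mmfsadm", "saferdump", "nsd"], interval_s=600, timeout_s=30)` would give one result every ten minutes; the interval and timeout here are arbitrary placeholders, not recommendations. Running the commands serially and bailing out on the first hang is a deliberate choice, since stacking new dumps on top of a stuck one seems like the worst outcome.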

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776


