[gpfsug-discuss] strange waiters + filesystem deadlock

Fosburgh,Jonathan jfosburg at mdanderson.org
Fri Mar 24 18:03:22 GMT 2017


7PB filesystem and only 28 million inodes in use? What is your average file size? Our large filesystem is 7.5P (currently 71% used) with over 1 billion inodes in use.

--

Jonathan Fosburgh
Principal Application Systems Analyst
Storage Team
IT Operations
jfosburg at mdanderson.org
(713) 745-9346

-----Original Message-----

Date: Fri, 24 Mar 2017 13:58:18 -0400
Subject: Re: [gpfsug-discuss] strange waiters + filesystem deadlock
To: gpfsug-discuss at spectrumscale.org
Reply-to: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
From: Aaron Knister <aaron.s.knister at nasa.gov>

I feel a little awkward about posting lists of IPs and hostnames on the
mailing list (even though they're all internal), but I'm happy to send them
to you directly. I've attached both an mmlsfs and an mmdf output of the fs
in question here, since that may be useful for others to see. Just a note
about disk d23_02_021 -- it's been evacuated for several weeks now due to a
hardware issue in the disk enclosure.

The fs is rather full percentage-wise (93%), but in terms of capacity
there's a good amount free: 93% full of a 7PB filesystem still leaves
551T. Metadata, as you'll see, is 31% free (roughly 800GB).

The fs has 40M inodes allocated and 12M free.
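
(For reference, the capacity and inode numbers above are what mmdf reports,
and the attributes come from mmlsfs; mmlsdisk should also show the state of
the evacuated disk. "fsdev" below is just a placeholder for the filesystem
device name:

    mmdf fsdev       # capacity and inode usage per pool, incl. metadata
    mmlsfs fsdev     # filesystem attributes (block size, inode limits, ...)
    mmlsdisk fsdev   # per-disk status; the evacuated d23_02_021 should
                     # stand out here
)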

-Aaron

On 3/24/17 1:41 PM, Sven Oehme wrote:


OK, that seems a different problem than I was thinking. Can you send the
output of mmlscluster, mmlsconfig, and mmlsfs all? Also, are you getting
close to full on inodes or capacity on any of the filesystems?
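
(Something like this, run from a node with the GPFS admin commands in its
PATH, should capture all of that in one pass -- the output file names and
the <fsdevice> placeholder are just for illustration:

    mmlscluster     > /tmp/mmlscluster.out
    mmlsconfig      > /tmp/mmlsconfig.out
    mmlsfs all      > /tmp/mmlsfs_all.out
    mmdf <fsdevice> > /tmp/mmdf.out   # fill grade for both capacity and inodes
)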

sven


On Fri, Mar 24, 2017 at 10:34 AM Aaron Knister <aaron.s.knister at nasa.gov> wrote:

    Here's the screenshot from the other node with the high cpu utilization.

    On 3/24/17 1:32 PM, Aaron Knister wrote:
    > Heh, yep, we're on SLES :)
    >
    > Here's a screenshot of the fs manager from the deadlocked filesystem. I
    > don't think there's an NSD server or manager node that's running full
    > throttle across all CPUs. There is one that's got relatively high CPU
    > utilization, though (300-400%). I'll send a screenshot of it in a sec.
    >
    > No ZIMon yet, but we do have other tools to see CPU utilization.
    >
    > -Aaron
    >
    > On 3/24/17 1:22 PM, Sven Oehme wrote:
    >> You must be on SLES, as this segfaults only on SLES to my knowledge :-)
    >>
    >> I am looking for an NSD or manager node in your cluster that runs at 100%
    >> CPU usage.
    >>
    >> Do you have ZIMon deployed to look at CPU utilization across your nodes?
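    >>
    >> (Without ZIMon, a quick-and-dirty sweep might be enough -- assuming
    >> mmdsh is usable in your environment and "nsdnodes" matches the nodes
    >> you care about:
    >>
    >>     mmdsh -N nsdnodes "top -b -n 1 | head -n 12" 2>/dev/null | less
    >>
    >> whichever node shows mmfsd pinned at ~100% of a core or more is the
    >> one to look at with perf top.)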
    >>
    >> sven
    >>
    >>
    >>
    >> On Fri, Mar 24, 2017 at 10:08 AM Aaron Knister
    >> <aaron.s.knister at nasa.gov> wrote:
    >>
    >>     Hi Sven,
    >>
    >>     Which NSD server should I run top on, the fs manager? If so, the CPU
    >>     load is about 155%. I'm working on perf top but not off to a great
    >>     start...
    >>
    >>     # perf top
    >>         PerfTop:    1095 irqs/sec  kernel:61.9%  exact:  0.0% [1000Hz
    >>         cycles],  (all, 28 CPUs)
    >>     -------------------------------------------------------------------
    >>     Segmentation fault
    >>
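    >>     (In case perf top keeps segfaulting, a recorded profile might work as
    >>     a fallback -- just a sketch, and the 10-second window is arbitrary:
    >>
    >>         perf record -a -g -- sleep 10
    >>         perf report --stdio | head -n 40
    >>
    >>     which should still show whether a spinlock is the top CPU consumer.)
    >>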
    >>     -Aaron
    >>
    >>     On 3/24/17 1:04 PM, Sven Oehme wrote:
    >>     > While this is happening, run top and see if there is very high CPU
    >>     > utilization at this time on the NSD server.
    >>     >
    >>     > If there is, run perf top (you might need to install the perf command)
    >>     > and see if the top CPU contender is a spinlock. If so, send a
    >>     > screenshot of perf top, as I may know what that is and how to fix it.
    >>     >
    >>     > sven
    >>     >
    >>     >
    >>     > On Fri, Mar 24, 2017 at 9:43 AM Aaron Knister
    >>     > <aaron.s.knister at nasa.gov> wrote:
    >>     >
    >>     >     Since yesterday morning we've noticed some deadlocks on one of our
    >>     >     filesystems that seem to be triggered by writing to it. The waiters
    >>     >     on the clients look like this:
    >>     >
    >>     >     0x19450B0 (   6730) waiting 2063.294589599 seconds, SyncHandlerThread:
    >>     >     on ThCond 0x1802585CB10 (0xFFFFC9002585CB10) (InodeFlushCondVar),
    >>     >     reason 'waiting for the flush flag to commit metadata'
    >>     >     0x7FFFDA65E200 (  22850) waiting 0.000246257 seconds,
    >>     >     AllocReduceHelperThread: on ThCond 0x7FFFDAC7FE28 (0x7FFFDAC7FE28)
    >>     >     (MsgRecordCondvar), reason 'RPC wait' for allocMsgTypeRelinquishRegion
    >>     >     on node 10.1.52.33 <c0n3271>
    >>     >     0x197EE70 (   6776) waiting 0.000198354 seconds,
    >>     >     FileBlockWriteFetchHandlerThread: on ThCond 0x7FFFF00CD598
    >>     >     (0x7FFFF00CD598) (MsgRecordCondvar), reason 'RPC wait' for
    >>     >     allocMsgTypeRequestRegion on node 10.1.52.33 <c0n3271>
    >>     >
    >>     >     (10.1.52.33/c0n3271 is the fs manager for the filesystem in question)
    >>     >
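    >>     >     (These look like mmdiag --waiters output; for anyone wanting the
    >>     >     same view, something along these lines should do it -- "fsdev" is
    >>     >     a placeholder for the filesystem device name:
    >>     >
    >>     >         mmdiag --waiters    # long waiters on a client or the manager
    >>     >         mmlsmgr fsdev       # confirms which node is the fs manager
    >>     >     )
    >>     >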
    >>     >     There's a single process running on this node writing to the
    >>     >     filesystem in question (well, trying to write; it's been blocked
    >>     >     doing nothing for half an hour now). There are ~10 other client
    >>     >     nodes in this situation right now. We had many more last night
    >>     >     before the problem seemed to disappear in the early hours of the
    >>     >     morning, and now it's back.
    >>     >
    >>     >     Waiters on the fs manager look like this. While each individual
    >>     >     waiter is short, it's a near-constant stream:
    >>     >
    >>     >     0x7FFF60003540 (   8269) waiting 0.001151588 seconds, Msg handler
    >>     >     allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0
    >>     >     (0xFFFFC9002163A2E0) (AllocManagerMutex)
    >>     >     0x7FFF601C8860 (  20606) waiting 0.001115712 seconds, Msg handler
    >>     >     allocMsgTypeRelinquishRegion: on ThMutex 0x1802163A2E0
    >>     >     (0xFFFFC9002163A2E0) (AllocManagerMutex)
    >>     >     0x7FFF91C10080 (  14723) waiting 0.000959649 seconds, Msg handler
    >>     >     allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0
    >>     >     (0xFFFFC9002163A2E0) (AllocManagerMutex)
    >>     >     0x7FFFB03C2910 (  12636) waiting 0.000769611 seconds, Msg handler
    >>     >     allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0
    >>     >     (0xFFFFC9002163A2E0) (AllocManagerMutex)
    >>     >     0x7FFF8C092850 (  18215) waiting 0.000682275 seconds, Msg handler
    >>     >     allocMsgTypeRelinquishRegion: on ThMutex 0x1802163A2E0
    >>     >     (0xFFFFC9002163A2E0) (AllocManagerMutex)
    >>     >     0x7FFF9423F730 (  12652) waiting 0.000641915 seconds, Msg handler
    >>     >     allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0
    >>     >     (0xFFFFC9002163A2E0) (AllocManagerMutex)
    >>     >     0x7FFF9422D770 (  12625) waiting 0.000494256 seconds, Msg handler
    >>     >     allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0
    >>     >     (0xFFFFC9002163A2E0) (AllocManagerMutex)
    >>     >     0x7FFF9423E310 (  12651) waiting 0.000437760 seconds, Msg handler
    >>     >     allocMsgTypeRelinquishRegion: on ThMutex 0x1802163A2E0
    >>     >     (0xFFFFC9002163A2E0) (AllocManagerMutex)
    >>     >
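    >>     >     (To watch that stream on the manager, sampling in a loop is one
    >>     >     rough option -- the one-second interval is arbitrary:
    >>     >
    >>     >         while true; do
    >>     >             mmdiag --waiters | grep -c allocMsgType
    >>     >             sleep 1
    >>     >         done
    >>     >
    >>     >     a count that never drains to zero matches the constant
    >>     >     AllocManagerMutex contention above.)
    >>     >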
    >>     >     I don't know if this data point is useful, but both yesterday and
    >>     >     today the metadata NSDs for this filesystem have had a constant
    >>     >     aggregate stream of reads -- roughly 25MB/s at 4k op/s -- during
    >>     >     each episode (very low latency though, so I don't believe the
    >>     >     storage is a bottleneck here). Writes are only a few hundred ops
    >>     >     and didn't strike me as odd.
    >>     >
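    >>     >     (For a per-I/O view with latencies on the NSD servers themselves,
    >>     >     mmdiag --iohist is one option -- just a suggestion, not how the
    >>     >     numbers above were gathered:
    >>     >
    >>     >         mmdiag --iohist | head -n 30
    >>     >     )
    >>     >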
    >>     >     I have a PMR open for this, but I'm curious if folks have seen
    >>     >     this in the wild and what it might mean.
    >>     >
    >>     >     -Aaron
    >>     >
    >>     >     --
    >>     >     Aaron Knister
    >>     >     NASA Center for Climate Simulation (Code 606.2)
    >>     >     Goddard Space Flight Center
    >>     >     (301) 286-2776

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


