[gpfsug-discuss] strange waiters + filesystem deadlock

Aaron Knister aaron.s.knister at nasa.gov
Fri Mar 24 18:05:26 GMT 2017


It's large, I do know that much. I'll defer to one of our other storage 
admins. Jordan, do you have that number handy?

-Aaron
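
(A rough back-of-the-envelope figure from the numbers quoted further down in this 
thread: about 93% of 7PB in use spread over roughly 28 million in-use inodes gives 
0.93 x 7 PB / 28 million ≈ 230 MB per file on average, assuming both figures describe 
the same filesystem at roughly the same time.)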

On 3/24/17 2:03 PM, Fosburgh,Jonathan wrote:
> 7PB filesystem and only 28 million inodes in use? What is your average
> file size? Our large filesystem is 7.5P (currently 71% used) with over 1
> billion inodes in use.
>
> --
>
> Jonathan Fosburgh
> Principal Application Systems Analyst
> Storage Team
> IT Operations
> jfosburg at mdanderson.org
> (713) 745-9346
>
> -----Original Message-----
>
> *Date*: Fri, 24 Mar 2017 13:58:18 -0400
> *Subject*: Re: [gpfsug-discuss] strange waiters + filesystem deadlock
> *To*: gpfsug-discuss at spectrumscale.org
> Reply-to: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> *From*: Aaron Knister <aaron.s.knister at nasa.gov>
>
> I feel a little awkward about posting lists of IPs and hostnames on
> the mailing list (even though they're all internal), but I'm happy to
> send them to you directly. I've attached both an lsfs and an mmdf output of
> the fs in question here, since that may be useful for others to see. Just
> a note about disk d23_02_021: it's been evacuated for several weeks now
> due to a hardware issue in the disk enclosure.
>
> The fs is rather full percentage-wise (93%), but in terms of capacity
> there's a good amount free: 93% full of a 7PB filesystem still leaves
> 551T. Metadata, as you'll see, is 31% free (roughly 800GB).
>
> The fs has 40M inodes allocated and 12M free.
>
> -Aaron
>
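
For anyone following this thread later: the configuration and fill-grade data Sven 
asks for in the quoted exchange below can be gathered with the standard commands. 
A minimal sketch; <fsname> and the mount path are placeholders:

    mmlscluster              # cluster layout and node roles
    mmlsconfig               # non-default configuration settings
    mmlsfs all               # attributes of every filesystem
    mmdf <fsname>            # per-pool capacity plus the inode summary (allocated/used/free)
    df -i /path/to/<fsname>  # quick inode view from a client mount
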
> On 3/24/17 1:41 PM, Sven Oehme wrote:
>> ok, that seems a different problem than i was thinking. can you send the
>> output of mmlscluster, mmlsconfig, and mmlsfs all? also, are you getting
>> close to fill grade on inodes or capacity on any of the filesystems?
>>
>> sven
>>
>> On Fri, Mar 24, 2017 at 10:34 AM Aaron Knister <aaron.s.knister at nasa.gov> wrote:
>>> Here's the screenshot from the other node with the high cpu utilization.
>>>
>>> On 3/24/17 1:32 PM, Aaron Knister wrote:
>>>> heh, yep we're on sles :)
>>>>
>>>> here's a screenshot of the fs manager from the deadlocked filesystem. I
>>>> don't think there's an nsd server or manager node that's running full
>>>> throttle across all cpus. There is one that's got relatively high CPU
>>>> utilization though (300-400%). I'll send a screenshot of it in a sec.
>>>>
>>>> no zimon yet but we do have other tools to see cpu utilization.
>>>>
>>>> -Aaron
>>>>
>>>> On 3/24/17 1:22 PM, Sven Oehme wrote:
>>>>> you must be on sles, as this segfaults only on sles to my knowledge :-)
>>>>>
>>>>> i am looking for an NSD or manager node in your cluster that runs at
>>>>> 100% cpu usage.
>>>>>
>>>>> do you have zimon deployed to look at cpu utilization across your nodes?
>>>>>
>>>>> sven
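
For readers reconstructing this later: without zimon, a quick way to get the 
cluster-wide CPU picture being discussed here is a one-shot batch-mode top fanned 
out over the NSD servers and manager candidates. A rough sketch; the hostnames are 
placeholders, and a parallel shell such as mmdsh or pdsh would do the same job:

    for h in nsd01 nsd02 nsd03 mgr01; do      # placeholder hostnames
        echo "== $h =="
        ssh "$h" 'top -bn1 | head -15'        # one batch-mode top snapshot per node
    done
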
>>>>> On Fri, Mar 24, 2017 at 10:08 AM Aaron Knister <aaron.s.knister at nasa.gov> wrote:
>>>>>> Hi Sven,
>>>>>>
>>>>>> Which NSD server should I run top on, the fs manager? If so the CPU load
>>>>>> is about 155%. I'm working on perf top but not off to a great start...
>>>>>>
>>>>>> # perf top
>>>>>>    PerfTop: 1095 irqs/sec  kernel:61.9%  exact: 0.0% [1000Hz cycles], (all, 28 CPUs)
>>>>>> ---------------------------------------------------------------------------
>>>>>>
>>>>>> Segmentation fault
>>>>>>
>>>>>> -Aaron
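
An aside for anyone following the perf angle: the spinlock check Sven describes 
below usually amounts to something like the sketch here. The package name and the 
record/report fallback (useful when perf top itself crashes, as it does above) are 
assumptions about a typical SLES setup, not details taken from the thread:

    zypper install perf              # if the perf tool isn't already there (package name may differ)
    perf top                         # look for a spin_lock-style symbol at the top of the list
    # fallback if perf top misbehaves: sample for a few seconds, then inspect offline
    perf record -a -g -- sleep 10
    perf report --sort symbol | head -30
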
>>>>>> On 3/24/17 1:04 PM, Sven Oehme wrote:
>>>>>>> while this is happening, run top and see if there is very high cpu
>>>>>>> utilization at this time on the NSD Server.
>>>>>>>
>>>>>>> if there is, run perf top (you might need to install the perf command) and
>>>>>>> see if the top cpu contender is a spinlock. if so, send a screenshot of
>>>>>>> perf top, as i may know what that is and how to fix it.
>>>>>>>
>>>>>>> sven
>>>>>>>
>>>>>>> On Fri, Mar 24, 2017 at 9:43 AM Aaron Knister <aaron.s.knister at nasa.gov> wrote:
>>>>>>>> Since yesterday morning we've noticed some deadlocks on one of our
>>>>>>>> filesystems that seem to be triggered by writing to it. The waiters on
>>>>>>>> the clients look like this:
>>>>>>>>
>>>>>>>> 0x19450B0 ( 6730) waiting 2063.294589599 seconds, SyncHandlerThread: on ThCond 0x1802585CB10 (0xFFFFC9002585CB10) (InodeFlushCondVar), reason 'waiting for the flush flag to commit metadata'
>>>>>>>> 0x7FFFDA65E200 ( 22850) waiting 0.000246257 seconds, AllocReduceHelperThread: on ThCond 0x7FFFDAC7FE28 (0x7FFFDAC7FE28) (MsgRecordCondvar), reason 'RPC wait' for allocMsgTypeRelinquishRegion on node 10.1.52.33 <c0n3271>
>>>>>>>> 0x197EE70 ( 6776) waiting 0.000198354 seconds, FileBlockWriteFetchHandlerThread: on ThCond 0x7FFFF00CD598 (0x7FFFF00CD598) (MsgRecordCondvar), reason 'RPC wait' for allocMsgTypeRequestRegion on node 10.1.52.33 <c0n3271>
>>>>>>>>
>>>>>>>> (10.1.52.33/c0n3271 is the fs manager for the filesystem in question)
>>>>>>>>
>>>>>>>> There's a single process running on this node writing to the filesystem
>>>>>>>> in question (well, trying to write; it's been blocked doing nothing for
>>>>>>>> half an hour now). There are ~10 other client nodes in this situation
>>>>>>>> right now. We had many more last night before the problem seemed to
>>>>>>>> disappear in the early hours of the morning, and now it's back.
>>>>>>>>
>>>>>>>> Waiters on the fs manager look like this. While the individual waiter is
>>>>>>>> short, it's a near constant stream:
>>>>>>>>
>>>>>>>> 0x7FFF60003540 ( 8269) waiting 0.001151588 seconds, Msg handler allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
>>>>>>>> 0x7FFF601C8860 ( 20606) waiting 0.001115712 seconds, Msg handler allocMsgTypeRelinquishRegion: on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
>>>>>>>> 0x7FFF91C10080 ( 14723) waiting 0.000959649 seconds, Msg handler allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
>>>>>>>> 0x7FFFB03C2910 ( 12636) waiting 0.000769611 seconds, Msg handler allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
>>>>>>>> 0x7FFF8C092850 ( 18215) waiting 0.000682275 seconds, Msg handler allocMsgTypeRelinquishRegion: on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
>>>>>>>> 0x7FFF9423F730 ( 12652) waiting 0.000641915 seconds, Msg handler allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
>>>>>>>> 0x7FFF9422D770 ( 12625) waiting 0.000494256 seconds, Msg handler allocMsgTypeRequestRegion: on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
>>>>>>>> 0x7FFF9423E310 ( 12651) waiting 0.000437760 seconds, Msg handler allocMsgTypeRelinquishRegion: on ThMutex 0x1802163A2E0 (0xFFFFC9002163A2E0) (AllocManagerMutex)
>>>>>>>>
>>>>>>>> I don't know if this data point is useful, but both yesterday and today
>>>>>>>> the metadata NSDs for this filesystem have had a constant aggregate
>>>>>>>> stream of 25MB/s 4kop/s reads during each episode (very low latency,
>>>>>>>> though, so I don't believe the storage is a bottleneck here). Writes are
>>>>>>>> only a few hundred ops and didn't strike me as odd.
>>>>>>>>
>>>>>>>> I have a PMR open for this, but I'm curious whether folks have seen this
>>>>>>>> in the wild and what it might mean.
>>>>>>>>
>>>>>>>> -Aaron
>>>>>>>>
>>>>>>>> --
>>>>>>>> Aaron Knister
>>>>>>>> NASA Center for Climate Simulation (Code 606.2)
>>>>>>>> Goddard Space Flight Center
>>>>>>>> (301) 286-2776
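
For anyone trying to reproduce this kind of triage on their own cluster: waiter 
listings like the ones above, the identity of the filesystem manager, and the recent 
metadata I/O pattern can all be pulled from the standard tools. A minimal sketch 
(run on a client or on the manager node as appropriate):

    mmlsmgr            # which node is currently the manager for each filesystem
    mmdiag --waiters   # waiters on the local node at this moment
    mmdiag --iohist    # recent I/O history, useful for spotting a stream of small metadata reads
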
>
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776


