[gpfsug-discuss] Hanging file-systems

Sven Oehme oehmes at gmail.com
Tue Nov 27 16:14:20 GMT 2018


If this happens, you should check a couple of things:

1. Are you under memory pressure, or worse, has the node started swapping?
2. Is any core running at ~0% idle? Run top, press 1, and check the idle column.
3. Is any single thread running at ~100%? Run top, press shift-h, and check what the CPU % shows for the top 5 processes.

If you want to go the extra mile, you could run perf top -p $PID_OF_MMFSD
and check what the top CPU consumers are; a quick check sequence is sketched below.
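
A minimal sketch of those checks, assuming mmfsd is running locally (the pgrep call is just one way to find its PID):

  free -m && vmstat 1 5              # swapping shows up in the si/so columns; they should stay near 0
  top                                # then press 1 for per-core idle, shift-h for per-thread CPU %
  perf top -p $(pgrep -o -x mmfsd)   # live view of the top CPU consumers inside the mmfsd daemon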

Confirming and providing data for any of the above could be the missing piece
nobody was able to find, as this is the kind of thing that unfortunately nobody
ever looks at. Even a trace won't help if any of the above is true, because all
the trace shows is a system behaving correctly; it just doesn't appear busy.

Sven





On Tue, Nov 27, 2018 at 8:03 AM Oesterlin, Robert <
Robert.Oesterlin at nuance.com> wrote:

> I have seen something like this in the past, and I have resorted to a
> cluster restart as well.  :-( IBM and I could never really track it down,
> because I could not get a dump at the time of occurrence. However, you
> might take a look at your NSD servers, one at a time. As I recall, we
> thought it was a stuck thread on one of the NSD servers, and when we
> restarted the “right” one it cleared the block.
>
>
>
> The other thing I’ve done in the past to isolate problems like this (since
> this is related to tokens) is to look at the “token revokes” on each node,
> looking for ones that are sticking around for a long time. I tossed
> together a quick script and ran it via mmdsh on all the nodes. Not pretty,
> but it got the job done. Run this a few times and see if any of the revokes
> are sticking around for a long time.
>
>
>
> #!/bin/sh
> # Look for outstanding token revoke requests on this node and, for each one,
> # print the most recent tscomm entry for the destination it is waiting on.
>
> rm -f /tmp/revokelist
>
> # grep's exit status tells us whether any revoke requests are pending
> /usr/lpp/mmfs/bin/mmfsadm dump tokenmgr | grep -A 2 'revokeReq list' > /tmp/revokelist 2> /dev/null
>
> if [ $? -eq 0 ]; then
>   # dump the comm layer once, then match each pending revoke message against it
>   /usr/lpp/mmfs/bin/mmfsadm dump tscomm > /tmp/tscomm.out
>   for n in `cat /tmp/revokelist | grep msgHdr | awk '{print $5}'`; do
>     grep $n /tmp/tscomm.out | tail -1
>   done
>   rm -f /tmp/tscomm.out
> fi
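>
> A rough usage sketch (the /tmp/revoke-check.sh path and the node names are
> placeholders; it assumes the script has already been copied to each node):
>
>   mmdsh -N nsd01,nsd02 /tmp/revoke-check.sh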
>
>
>
>
>
> Bob Oesterlin
>
> Sr Principal Storage Engineer, Nuance
>
>
>
>
>
>
>
> From: <gpfsug-discuss-bounces at spectrumscale.org> on behalf of Simon Thompson <S.J.Thompson at bham.ac.uk>
> Reply-To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Date: Tuesday, November 27, 2018 at 9:27 AM
> To: "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
> Subject: [EXTERNAL] [gpfsug-discuss] Hanging file-systems
>
>
>
> I have a file-system that has been hanging repeatedly over the past few weeks.
> Right now it's offline and has taken a bunch of services out with it.
>
>
>
> (I have a ticket with IBM open about this as well)
>
>
>
> We see for example:
>
> Waiting 305.0391 sec since 15:17:02, monitored, thread 24885
> SharedHashTabFetchHandlerThread: on ThCond 0x7FE30000B408 (MsgRecordCondvar),
> reason 'RPC wait' for tmMsgTellAcquire1 on node 10.10.12.42 <c1n9>
>
>
>
> and on that node:
>
> Waiting 292.4581 sec since 15:17:22, monitored, thread 20368
> SharedHashTabFetchHandlerThread: on ThCond 0x7F3C29297198 (TokenCondvar),
> reason 'wait for SubToken to become stable'
>
>
>
> On this node, if you dump tscomm, you see entries like:
>
> Pending messages:
>   msg_id 376617, service 13.1, msg_type 20 'tmMsgTellAcquire1', n_dest 1, n_pending 1
>   this 0x7F3CD800B930, n_xhold 1, cl 0, cbFn 0x0, age 303 sec
>     sent by 'SharedHashTabFetchHandlerThread' (0x7F3DD800A6C0)
>     dest <c0n9>          status pending, err 0, reply len 0 by TCP connection
>
>
>
> c0n9 is the node itself.
>
>
>
> This morning when this happened, the only way to get the FS back online
> was to shut down the entire cluster.
>
>
>
> Any pointers on where to look next, or how to fix this?
>
>
>
> Simon
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>

