[gpfsug-discuss] gpfs client expels

Salvatore Di Nardo sdinardo at ebi.ac.uk
Wed Aug 20 09:03:03 BST 2014


Another interesting case about a specific waiter:

was looking the waiters on GSS until i found those( i got those info 
collecting from all the servers with a script i did, so i was able to 
trace hanging connection while they was happening):

                gss03b.ebi.ac.uk:*235.373993397*(MsgRecordCondvar),
                reason 'RPC wait' for getData on node 10.7.37.109 <c0n675>
                gss03b.ebi.ac.uk:*235.152271998*(MsgRecordCondvar),
                reason 'RPC wait' for getData on node 10.7.37.109 <c0n675>
                gss02a.ebi.ac.uk:*214.079093620 *(MsgRecordCondvar),
                reason 'RPC wait' for tmMsgRevoke on node 10.7.34.109
                <c0n656>
                gss02a.ebi.ac.uk:*213.580199240 *(MsgRecordCondvar),
                reason 'RPC wait' for tmMsgRevoke on node 10.7.37.109
                <c0n675>
                gss03b.ebi.ac.uk:*132.375138082*(MsgRecordCondvar),
                reason 'RPC wait' for getData on node 10.7.37.109 <c0n675>
                gss03b.ebi.ac.uk:*132.374973884 *(MsgRecordCondvar),
                reason 'RPC wait' for commMsgCheckMessages on node
                10.7.37.109 <c0n675>



the bolted number are seconds. looking at this page:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+%28GPFS%29/page/Interpreting+GPFS+Waiter+Information

The web page claim that's, probably a network congestion, but i managed 
to login quick enough to the client and there the waiters was:

                [root at ebi5-236 ~]# mmdiag --waiters

                === mmdiag: waiters ===
                0x7F6690073460 waiting 147.973009173 seconds,
                RangeRevokeWorkerThread: on ThCond 0x1801E43F6A0
                (0xFFFFC9001E43F6A0) (LkObjCondvar), reason 'waiting for
                LX lock'
                0x7F65100036D0 waiting 140.458589856 seconds,
                WritebehindWorkerThread: on ThCond 0x7F6500000F98
                (0x7F6500000F98) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F63A0001080 waiting 245.153055801 seconds,
                WritebehindWorkerThread: on ThCond 0x7F65D8002CF8
                (0x7F65D8002CF8) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F674C03D3D0 waiting 245.750977203 seconds,
                CleanBufferThread: on ThCond 0x7F64880079E8
                (0x7F64880079E8) (LogFileBufferDescriptorCondvar),
                reason 'force wait for buffer write to complete'
                0x7F674802E360 waiting 244.159861966 seconds,
                WritebehindWorkerThread: on ThCond 0x7F65E0002358
                (0x7F65E0002358) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F674C038810 waiting 251.086748430 seconds,
                SGExceptionLogBufferFullThread: on ThCond 0x7F64EC001398
                (0x7F64EC001398) (MsgRecordCondvar), reason 'RPC wait'
                for I/O completion on node 10.7.28.35 <c1n5>
                0x7F674C036230 waiting 139.556735095 seconds,
                CleanBufferThread: on ThCond 0x7F65CC004C78
                (0x7F65CC004C78) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F674C031670 waiting 144.327593052 seconds,
                WritebehindWorkerThread: on ThCond 0x7F672402D1A8
                (0x7F672402D1A8) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F674C02A4D0 waiting 145.202712821 seconds,
                WritebehindWorkerThread: on ThCond 0x7F65440018F8
                (0x7F65440018F8) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F674C0291E0 waiting 247.131569232 seconds,
                PrefetchWorkerThread: on ThCond 0x7F65740016C8
                (0x7F65740016C8) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F6748025BD0 waiting 11.631381523 seconds,
                replyCleanupThread: on ThCond 0x7F65E000A1F8
                (0x7F65E000A1F8) (MsgRecordCondvar), reason 'RPC wait'
                0x7F6748022300 waiting 245.616267612 seconds,
                WritebehindWorkerThread: on ThCond 0x7F6470001468
                (0x7F6470001468) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F6748021010 waiting 230.769670930 seconds,
                InodeAllocRevokeWorkerThread: on ThCond 0x7F64880079E8
                (0x7F64880079E8) (LogFileBufferDescriptorCondvar),
                reason 'force wait for buffer write to complete'
                0x7F674801B160 waiting 245.830554594 seconds,
                UnusedInodePrefetchThread: on ThCond 0x7F65B8004438
                (0x7F65B8004438) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F674800A820 waiting 252.332932000 seconds, Msg
                handler getData: for poll on sock 109
                0x7F63F4023090 waiting 253.073535042 seconds,
                WritebehindWorkerThread: on ThCond 0x7F65C4000CC8
                (0x7F65C4000CC8) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F64A4000CE0 waiting 145.049659249 seconds,
                WritebehindWorkerThread: on ThCond 0x7F6560000A98
                (0x7F6560000A98) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F6778006D00 waiting 142.124664264 seconds,
                WritebehindWorkerThread: on ThCond 0x7F63DC000C08
                (0x7F63DC000C08) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F67780046D0 waiting 251.751439453 seconds,
                WritebehindWorkerThread: on ThCond 0x7F6454000A98
                (0x7F6454000A98) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F67780E4B70 waiting 142.431051232 seconds,
                WritebehindWorkerThread: on ThCond 0x7F63C80010D8
                (0x7F63C80010D8) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F67780E50D0 waiting 244.339624817 seconds,
                WritebehindWorkerThread: on ThCond 0x7F65BC001B98
                (0x7F65BC001B98) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F6434000B40 waiting 145.343700410 seconds,
                WritebehindWorkerThread: on ThCond 0x7F63B00036E8
                (0x7F63B00036E8) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F670C0187A0 waiting 244.903963969 seconds,
                WritebehindWorkerThread: on ThCond 0x7F65F0000FB8
                (0x7F65F0000FB8) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F671C04E2F0 waiting 245.837137631 seconds,
                PrefetchWorkerThread: on ThCond 0x7F65A4000A98
                (0x7F65A4000A98) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F671C04AA20 waiting 139.713993908 seconds,
                WritebehindWorkerThread: on ThCond 0x7F6454002478
                (0x7F6454002478) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F671C049730 waiting 252.434187472 seconds,
                WritebehindWorkerThread: on ThCond 0x7F65F4003708
                (0x7F65F4003708) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F671C044B70 waiting 131.515829048 seconds, Msg
                handler ccMsgPing: on ThCond 0x7F64DC1D4888
                (0x7F64DC1D4888) (InuseCondvar), reason 'waiting for
                exclusive use of connection for sending msg'
                0x7F6758008DE0 waiting 149.548547226 seconds, Msg
                handler getData: on ThCond 0x7F645C002458
                (0x7F645C002458) (InuseCondvar), reason 'waiting for
                exclusive use of connection for sending msg'
                0x7F67580071D0 waiting 149.548543118 seconds, Msg
                handler commMsgCheckMessages: on ThCond 0x7F6450001C48
                (0x7F6450001C48) (InuseCondvar), reason 'waiting for
                exclusive use of connection for sending msg'
                0x7F65A40052B0 waiting 11.498507001 seconds, Msg handler
                ccMsgPing: on ThCond 0x7F644C103F88 (0x7F644C103F88)
                (InuseCondvar), reason 'waiting for exclusive use of
                connection for sending msg'
                0x7F6448001620 waiting 139.844870446 seconds,
                WritebehindWorkerThread: on ThCond 0x7F65F0003098
                (0x7F65F0003098) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F63F4000F80 waiting 245.044791905 seconds,
                WritebehindWorkerThread: on ThCond 0x7F6450001188
                (0x7F6450001188) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F659C0033A0 waiting 243.464399305 seconds,
                PrefetchWorkerThread: on ThCond 0x7F6554002598
                (0x7F6554002598) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F6514001690 waiting 245.826160463 seconds,
                PrefetchWorkerThread: on ThCond 0x7F65A4004558
                (0x7F65A4004558) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F64800012B0 waiting 253.174835511 seconds,
                WritebehindWorkerThread: on ThCond 0x7F65E0000FB8
                (0x7F65E0000FB8) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F6510000EE0 waiting 140.746696039 seconds,
                WritebehindWorkerThread: on ThCond 0x7F647C000CC8
                (0x7F647C000CC8) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F6754001BB0 waiting 246.336055629 seconds,
                PrefetchWorkerThread: on ThCond 0x7F6594002498
                (0x7F6594002498) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F6420000930 waiting 140.606777450 seconds,
                WritebehindWorkerThread: on ThCond 0x7F6578002498
                (0x7F6578002498) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F6744009110 waiting 137.466372831 seconds,
                FileBlockReadFetchHandlerThread: on ThCond
                0x7F65F4007158 (0x7F65F4007158) (MsgRecordCondvar),
                reason 'RPC wait' for NSD I/O completion on node
                10.7.28.35 <c1n5>
                0x7F67280119F0 waiting 144.173427360 seconds,
                WritebehindWorkerThread: on ThCond 0x7F6504000AE8
                (0x7F6504000AE8) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F672800BB40 waiting 145.804301887 seconds,
                WritebehindWorkerThread: on ThCond 0x7F6550001038
                (0x7F6550001038) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F6728000910 waiting 252.601993452 seconds,
                WritebehindWorkerThread: on ThCond 0x7F6450000A98
                (0x7F6450000A98) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F6744007E20 waiting 251.603329204 seconds,
                WritebehindWorkerThread: on ThCond 0x7F6570004C18
                (0x7F6570004C18) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F64AC002EF0 waiting 139.205774422 seconds,
                FileBlockWriteFetchHandlerThread: on ThCond
                0x18020AF0260 (0xFFFFC90020AF0260)
                (FetchFlowControlCondvar), reason 'wait for buffer for
                fetch'
                0x7F6724013050 waiting 71.501580932 seconds, Msg handler
                ccMsgPing: on ThCond 0x7F6580006608 (0x7F6580006608)
                (InuseCondvar), reason 'waiting for exclusive use of
                connection for sending msg'
                0x7F661C000DA0 waiting 245.654985276 seconds,
                PrefetchWorkerThread: on ThCond 0x7F6570005288
                (0x7F6570005288) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F671C00F440 waiting 251.096002003 seconds,
                FileBlockReadFetchHandlerThread: on ThCond
                0x7F65BC002878 (0x7F65BC002878) (MsgRecordCondvar),
                reason 'RPC wait' for NSD I/O completion on node
                10.7.28.35 <c1n5>
                0x7F671C00E150 waiting 144.034006970 seconds,
                WritebehindWorkerThread: on ThCond 0x7F6528001548
                (0x7F6528001548) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F67A02FCD20 waiting 142.324070945 seconds,
                WritebehindWorkerThread: on ThCond 0x7F6580002A98
                (0x7F6580002A98) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F67A02FA330 waiting 200.670114385 seconds,
                EEWatchDogThread: on ThCond 0x7F65B0000A98
                (0x7F65B0000A98) (MsgRecordCondvar), reason 'RPC wait'
                0x7F67A02BF050 waiting 252.276161189 seconds,
                WritebehindWorkerThread: on ThCond 0x7F6584003998
                (0x7F6584003998) (MsgRecordCondvar), reason 'RPC wait'
                for NSD I/O completion on node 10.7.28.35 <c1n5>
                0x7F67A0004160 waiting 251.173651822 seconds,
                SyncHandlerThread: on ThCond 0x7F64880079E8
                (0x7F64880079E8) (LogFileBufferDescriptorCondvar),
                reason 'force wait on force active buffer write'


So from the client side its the client that's waiting the server. I 
managed also to ping, ssh, and   tcpdump each other before the node got 
expelled and discovered that ping works fine, ssh work fine , beside my 
tests there are  0 packet passing between them, LITERALLY.

So there is no congestion, no network issues, but the server waits for 
the client and the client waits the server. This happens until we reach 
350 secs ( 10 times the lease time) , then client get expelled.
There are no local io waiters that indicates that gss is struggling, 
there is plenty of bandwith and CPU resources and no network congestion.

Seems some sort of deadlock to me, but how can this be explained and 
hopefully fixed?

Regards,
Salvatore
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20140820/77aceb5a/attachment.htm>


More information about the gpfsug-discuss mailing list