[gpfsug-discuss] spontaneous tracing?
Aaron Knister
aaron.s.knister at nasa.gov
Tue Mar 13 04:49:33 GMT 2018
Thanks!
I admit, I'm confused because "/usr/lpp/mmfs/bin/mmcommon
notifyOverload" does in fact start tracing for me on one of our clusters
(technically 2, one in dev, one in prod). It did *not* start it on
another test cluster. It looks to me like the difference is the
mmsdrservport settings. On clusters where it's set to 0 tracing *does*
start. On clusters where it's set to the default of 1191 (didn't try any
other value) tracing *does not* start. I can toggle the behavior by
changing the value of mmsdrservport back and forth.
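For reference, the toggle I'm describing looks roughly like this (the parameter name and paths are the ones from this thread; treat the sequence as a sketch, not a verified procedure -- the config change may need to propagate before the behavior flips):

    /usr/lpp/mmfs/bin/mmlsconfig mmsdrservport        # check the current value
    /usr/lpp/mmfs/bin/mmchconfig mmsdrservport=0      # with 0, notifyOverload starts tracing
    /usr/lpp/mmfs/bin/mmcommon notifyOverload
    ls /tmp/mmfs/lxtrace.trc.*                        # trace files appear when tracing starts
    /usr/lpp/mmfs/bin/mmchconfig mmsdrservport=1191   # back at the default, tracing does not start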
I do have a PMR open for this so I'll follow up there too. Thanks again
for the help.
-Aaron
On 3/12/18 11:13 AM, IBM Spectrum Scale wrote:
> /usr/lpp/mmfs/bin/mmcommon notifyOverload will not cause tracing to be
> started. You can verify this by running the underlying command directly,
> as in the following example: /tmp/n contains the names of the nodes to
> be notified, one per line, and the IP address is that of the file system
> manager from which the command is issued.
>
> */usr/lpp/mmfs/bin/mmsdrcli notifyOverload /tmp/n 1191 192.168.117.131 3 8*
>
> The only case in which the deadlock detection code will initiate tracing
> is when debugDataControl is set to "heavy" and tracing is not already
> running. Then, on deadlock detection, tracing is turned on for 20 seconds
> and then turned off.
>
> That can be tested using a command like
> */usr/lpp/mmfs/bin/mmsdrcli notifyDeadlock /tmp/n 1191 192.168.117.131 3 8*
>
> mmfs.log will then tell you what's going on. It's not a silent action.
>
> *2018-03-12_10:16:11.243-0400: [N] sdrServ: Received deadlock
> notification from 192.168.117.131*
> *2018-03-12_10:16:11.243-0400: [N] GPFS will attempt to collect debug
> data on this node.*
> *2018-03-12_10:16:11.953-0400: [I] Tracing in overwrite mode <== tracing
> started*
> *Trace started: Wait 20 seconds before cut and stop trace*
> *2018-03-12_10:16:37.147-0400: [I] Tracing disabled <== tracing stopped
> 20 seconds later*
> *mmtrace: move /tmp/mmfs/lxtrace.trc.c69bc2xn01.cpu0
> /tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.cpu0*
> *mmtrace: formatting
> /tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01 to
> /tmp/mmfs/trcrpt.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.gz*
>
> > What's odd is there are no log events to indicate an overload occurred.
>
> The overload message is only seen in mmfs.log when debugDataControl is
> set to "heavy". Starting with release 4.2.3, mmdiag --deadlock shows
> overload-related information.
>
> *# mmdiag --deadlock*
>
> *=== mmdiag: deadlock ===*
>
> *Effective deadlock detection threshold on c69bc2xn01 is 1800 seconds*
> *Effective deadlock detection threshold on c69bc2xn01 is 360 seconds for
> short waiters*
>
> *Cluster c69bc2xn01.gpfs.net is overloaded. The overload index on
> c69bc2xn01 is 0.01812 <==*
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776