[gpfsug-discuss] spontaneous tracing?

Aaron Knister aaron.s.knister at nasa.gov
Tue Mar 13 04:49:33 GMT 2018


Thanks!

I admit, I'm confused because "/usr/lpp/mmfs/bin/mmcommon 
notifyOverload" does in fact start tracing for me on one of our clusters 
(technically 2, one in dev, one in prod). It did *not* start it on 
another test cluster. It looks to me like the difference is the 
mmsdrservport settings. On clusters where it's set to 0 tracing *does* 
start. On clusters where it's set to the default of 1191 (didn't try any 
other value) tracing *does not* start. I can toggle the behavior by 
changing the value of mmsdrservport back and forth.
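
For anyone wanting to check from the log whether tracing really was kicked
off on a node, here is a minimal Python sketch (not an IBM tool) that pairs
the "Tracing in overwrite mode" and "Tracing disabled" messages and reports
how long each trace ran. The message strings and timestamp layout are taken
from the mmfs.log excerpt quoted below; the function name and the assumption
that other releases word the messages the same way are mine.

```python
import re
from datetime import datetime

# Message fragments copied from the quoted mmfs.log excerpt. Other GPFS
# releases may word them differently (assumption).
TRACE_START = "Tracing in overwrite mode"
TRACE_STOP = "Tracing disabled"

# Matches lines such as "2018-03-12_10:16:11.953-0400: [I] Tracing ..."
STAMP = re.compile(
    r"^(\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4}): \[\w\] (.*)$"
)


def trace_windows(lines):
    """Return a list of (start, stop, seconds) for each start/stop pair."""
    windows = []
    started = None
    for line in lines:
        match = STAMP.match(line.strip())
        if not match:
            continue  # mmtrace output and wrapped lines carry no timestamp
        when = datetime.strptime(match.group(1), "%Y-%m-%d_%H:%M:%S.%f%z")
        message = match.group(2)
        if TRACE_START in message:
            started = when
        elif TRACE_STOP in message and started is not None:
            windows.append((started, when, (when - started).total_seconds()))
            started = None
    return windows


# Log lines from the quoted excerpt, unwrapped and without the annotations.
SAMPLE = [
    "2018-03-12_10:16:11.243-0400: [N] sdrServ: Received deadlock "
    "notification from 192.168.117.131",
    "2018-03-12_10:16:11.243-0400: [N] GPFS will attempt to collect debug "
    "data on this node.",
    "2018-03-12_10:16:11.953-0400: [I] Tracing in overwrite mode",
    "Trace started: Wait 20 seconds before cut and stop trace",
    "2018-03-12_10:16:37.147-0400: [I] Tracing disabled",
]

if __name__ == "__main__":
    for start, stop, seconds in trace_windows(SAMPLE):
        print(f"trace ran from {start} to {stop} ({seconds:.3f} s)")
```

On a real node one would pass an open log file instead of SAMPLE; an empty
result means no start/stop pair was logged in that file. For the quoted
excerpt the window comes out somewhat longer than the nominal 20 seconds.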

I do have a PMR open for this so I'll follow up there too. Thanks again 
for the help.

-Aaron

On 3/12/18 11:13 AM, IBM Spectrum Scale wrote:
> /usr/lpp/mmfs/bin/mmcommon notifyOverload will not cause tracing to be 
> started.  One can verify that using the underlying command being called, 
> as shown in the following example, where /tmp/n lists the names of the 
> nodes that will get the notification, one per line, and the IP address 
> is that of the file system manager from which the command is issued.
> 
> */usr/lpp/mmfs/bin/mmsdrcli notifyOverload /tmp/n 1191 192.168.117.131 3 8*
> 
> The only case in which the deadlock detection code will initiate tracing 
> is when debugDataControl is set to "heavy" and tracing is not already 
> running. Then, on deadlock detection, tracing is turned on for 20 seconds 
> and then turned off.
> 
> That can be tested using a command like
> */usr/lpp/mmfs/bin/mmsdrcli notifyDeadlock /tmp/n 1191 192.168.117.131 3 8*
> 
> And then mmfs.log will tell you what's going on. That's not a silent action.
> 
> *2018-03-12_10:16:11.243-0400: [N] sdrServ: Received deadlock 
> notification from 192.168.117.131*
> *2018-03-12_10:16:11.243-0400: [N] GPFS will attempt to collect debug 
> data on this node.*
> *2018-03-12_10:16:11.953-0400: [I] Tracing in overwrite mode <== tracing 
> started*
> *Trace started: Wait 20 seconds before cut and stop trace*
> *2018-03-12_10:16:37.147-0400: [I] Tracing disabled <== tracing stopped 
> 20 seconds later*
> *mmtrace: move /tmp/mmfs/lxtrace.trc.c69bc2xn01.cpu0 
> /tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.cpu0*
> *mmtrace: formatting 
> /tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01 to 
> /tmp/mmfs/trcrpt.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.gz*
> 
>  > What's odd is there are no log events to indicate an overload occurred.
> 
> The overload message is only seen in mmfs.log when debugDataControl is 
> "heavy". mmdiag --deadlock shows overload-related info starting from 4.2.3.
> 
> *# mmdiag --deadlock*
> 
> *=== mmdiag: deadlock ===*
> 
> *Effective deadlock detection threshold on c69bc2xn01 is 1800 seconds*
> *Effective deadlock detection threshold on c69bc2xn01 is 360 seconds for 
> short waiters*
> 
> *Cluster c69bc2xn01.gpfs.net is overloaded. The overload index on 
> c69bc2xn01 is 0.01812 <==*
> 
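
Since mmdiag --deadlock reports the overload state from 4.2.3 on, here is a
rough Python sketch for pulling the thresholds and the overload index out of
that output, e.g. to compare nodes. The patterns are built only from the
sample output quoted above; the function name and dictionary keys are made
up, and I am assuming (not verified) that other releases and a
non-overloaded cluster use compatible wording.

```python
import re

# Patterns derived from the quoted "mmdiag --deadlock" sample only.
THRESHOLD = re.compile(
    r"Effective deadlock detection threshold on (\S+) is (\d+) seconds"
    r"( for short waiters)?"
)
OVERLOAD = re.compile(r"The overload index on (\S+) is (\d+(?:\.\d+)?)")


def parse_deadlock_report(text):
    """Extract thresholds and overload index from mmdiag --deadlock text."""
    # Collapse whitespace so lines wrapped by a mail client still match.
    flat = " ".join(text.split())
    report = {
        "threshold": None,
        "short_waiter_threshold": None,
        "overload_index": None,
        "overloaded": "is overloaded" in flat,
    }
    for _node, seconds, short in THRESHOLD.findall(flat):
        key = "short_waiter_threshold" if short else "threshold"
        report[key] = int(seconds)
    match = OVERLOAD.search(flat)
    if match:
        report["overload_index"] = float(match.group(2))
    return report


# The quoted sample output, with the mail quoting and emphasis removed.
SAMPLE = """\
=== mmdiag: deadlock ===

Effective deadlock detection threshold on c69bc2xn01 is 1800 seconds
Effective deadlock detection threshold on c69bc2xn01 is 360 seconds for
short waiters

Cluster c69bc2xn01.gpfs.net is overloaded. The overload index on
c69bc2xn01 is 0.01812
"""

if __name__ == "__main__":
    print(parse_deadlock_report(SAMPLE))
```

Fields that do not appear in the output stay None, so the result is safe to
inspect on older releases that print no overload line.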

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776


