[gpfsug-discuss] mmhealth - where is the info hiding?
Norbert Schuld
NSCHULD at de.ibm.com
Tue Jul 24 08:45:03 BST 2018
Hi,
that message is still in memory. "mmhealth node eventlog --clear" deletes all old events, but events that are currently active are not affected.

I suspect this is related to multiple collector nodes; I will dig deeper into that code to find out whether an issue lurks there.

As a stop-gap measure you could run "mmsysmoncontrol restart" on the affected node(s): this restarts the monitoring process, which clears the stale event held in memory.

The data used for the event comes from mmlspool (which should be close or identical to mmdf).
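The stop-gap described above amounts to the following sketch (commands as named in this thread; run on each node still showing the stale event):

```shell
# Restart the system health monitor on the affected node.
# This stops and restarts the monitoring process, dropping the
# stale event it was holding in memory.
mmsysmoncontrol restart

# Afterwards, check that the FILESYSTEM component no longer
# reports the old pool event.
mmhealth node show FILESYSTEM
```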
Kind regards
Norbert Schuld
From: valdis.kletnieks at vt.edu
To: gpfsug-discuss at spectrumscale.org
Date: 20/07/2018 00:15
Subject: [gpfsug-discuss] mmhealth - where is the info hiding?
Sent by: gpfsug-discuss-bounces at spectrumscale.org
So I'm trying to tidy up things like 'mmhealth' etc. Got most of it fixed, but I'm stuck on one thing.

Note: I already did a 'mmhealth node eventlog --clear -N all' yesterday, which cleaned out a bunch of long-past events that were "stuck" as failed / degraded even though they were corrected days or weeks ago - keep this in mind as you read on....
# mmhealth cluster show
Component     Total   Failed   Degraded   Healthy   Other
-------------------------------------------------------------------------------------
NODE             10        0          0        10       0
GPFS             10        0          0        10       0
NETWORK          10        0          0        10       0
FILESYSTEM        1        0          1         0       0
DISK            102        0          0       102       0
CES               4        0          0         4       0
GUI               1        0          0         1       0
PERFMON          10        0          0        10       0
THRESHOLD        10        0          0        10       0
Great. One hit for 'degraded' filesystem.
# mmhealth node show --unhealthy -N all
(skipping all the nodes that show healthy)
Node name: arnsd3-vtc.nis.internal
Node status: HEALTHY
Status Change: 21 hours ago
Component    Status   Status Change   Reasons
-----------------------------------------------------------------------------------
FILESYSTEM   FAILED   24 days ago     pool-data_high_error (archive/system)
(...)
Node name: arproto2-isb.nis.internal
Node status: HEALTHY
Status Change: 21 hours ago
Component    Status     Status Change   Reasons
----------------------------------------------------------------------------------
FILESYSTEM   DEGRADED   6 days ago      pool-data_high_warn (archive/system)
mmdf tells me:
nsd_isb_01          13103005696   1   No    Yes   1747905536 ( 13%)   111667200 ( 1%)
nsd_isb_02          13103005696   1   No    Yes   1748245504 ( 13%)   111724384 ( 1%)
(94 more LUNs all within 0.2% of these for usage - data is striped out pretty well)

There's also 6 SSD LUNs for metadata:
nsd_isb_flash_01     2956984320   1   Yes   No    2116091904 ( 72%)    26996992 ( 1%)
(again, evenly striped)
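For cross-checking pool events like pool-data_high_warn against mmdf, the "( 13%)" column is just free space over total per NSD. A hypothetical sketch (column layout and sample values taken from the output above; not an official parser) that recomputes it:

```python
# Hypothetical sketch: recompute the free-space percentage that mmdf
# prints per NSD, using the column order shown above (disk name, total,
# failure group, holds-metadata, holds-data, free blocks, free fragments).
# Sample values are copied from this post.
sample = """\
nsd_isb_01 13103005696 1 No Yes 1747905536 111667200
nsd_isb_02 13103005696 1 No Yes 1748245504 111724384
"""

def free_percent(total: int, free: int) -> int:
    """Percentage of free space, rounded like mmdf's '( 13%)' column."""
    return round(100 * free / total)

for line in sample.splitlines():
    name, total, _fg, _meta, _data, free_blocks, _frag = line.split()
    print(f"{name}: {free_percent(int(total), int(free_blocks))}% free")
```

Both NSDs come out at 13% free, matching what mmdf printed - so the degraded/failed pool events reflect a threshold on usage, not a discrepancy in the numbers themselves.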
So who is remembering that status, and how to clear it?
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss