[gpfsug-discuss] mmhealth - where is the info hiding?
Norbert Schuld
NSCHULD at de.ibm.com
Tue Jul 24 08:45:03 BST 2018
Hi,
that message is still in memory. "mmhealth node eventlog --clear" deletes all old events, but events that are currently active are not affected.

I suspect this is related to multiple collector nodes; I will dig deeper into that code to find out whether an issue lurks there.

As a stop-gap measure you could run "mmsysmoncontrol restart" on the affected node(s): this restarts the monitoring process, which clears the stale event held in memory.

The data used for the event comes from mmlspool (which should be close or identical to mmdf).
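The stop-gap described above amounts to the following sketch (commands as named in this thread; run on each node still showing the stale event):

```shell
# Restart the system health monitor on the affected node.
# This stops and restarts the monitoring process, dropping the
# stale event it was holding in memory.
mmsysmoncontrol restart

# Afterwards, check that the FILESYSTEM component no longer
# reports the old pool event.
mmhealth node show FILESYSTEM
```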
Kind regards
Norbert Schuld
From: valdis.kletnieks at vt.edu
To: gpfsug-discuss at spectrumscale.org
Date: 20/07/2018 00:15
Subject: [gpfsug-discuss] mmhealth - where is the info hiding?
Sent by: gpfsug-discuss-bounces at spectrumscale.org
So I'm trying to tidy up things like 'mmhealth' etc. Got most of it fixed, but I'm stuck on one thing.

Note: I already did a 'mmhealth node eventlog --clear -N all' yesterday, which cleaned out a bunch of long-past events that were "stuck" as failed / degraded even though they were corrected days or weeks ago - keep this in mind as you read on....
# mmhealth cluster show
Component     Total   Failed   Degraded   Healthy   Other
-------------------------------------------------------------------------------------
NODE             10        0          0        10       0
GPFS             10        0          0        10       0
NETWORK          10        0          0        10       0
FILESYSTEM        1        0          1         0       0
DISK            102        0          0       102       0
CES               4        0          0         4       0
GUI               1        0          0         1       0
PERFMON          10        0          0        10       0
THRESHOLD        10        0          0        10       0
Great. One hit for 'degraded' filesystem.
# mmhealth node show --unhealthy -N all
(skipping all the nodes that show healthy)
Node name: arnsd3-vtc.nis.internal
Node status: HEALTHY
Status Change: 21 hours ago
Component    Status   Status Change   Reasons
-----------------------------------------------------------------------------------
FILESYSTEM   FAILED   24 days ago     pool-data_high_error (archive/system)
(...)
Node name: arproto2-isb.nis.internal
Node status: HEALTHY
Status Change: 21 hours ago
Component    Status     Status Change   Reasons
----------------------------------------------------------------------------------
FILESYSTEM   DEGRADED   6 days ago      pool-data_high_warn (archive/system)
mmdf tells me:
nsd_isb_01          13103005696   1   No    Yes   1747905536 ( 13%)   111667200 ( 1%)
nsd_isb_02          13103005696   1   No    Yes   1748245504 ( 13%)   111724384 ( 1%)
(94 more LUNs all within 0.2% of these for usage - data is striped out pretty well)

There's also 6 SSD LUNs for metadata:
nsd_isb_flash_01     2956984320   1   Yes   No    2116091904 ( 72%)    26996992 ( 1%)
(again, evenly striped)
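For cross-checking pool events like pool-data_high_warn against mmdf, the "( 13%)" column is just free space over total per NSD. A hypothetical sketch (column layout and sample values taken from the output above; not an official parser) that recomputes it:

```python
# Hypothetical sketch: recompute the free-space percentage that mmdf
# prints per NSD, using the column order shown above (disk name, total,
# failure group, holds-metadata, holds-data, free blocks, free fragments).
# Sample values are copied from this post.
sample = """\
nsd_isb_01 13103005696 1 No Yes 1747905536 111667200
nsd_isb_02 13103005696 1 No Yes 1748245504 111724384
"""

def free_percent(total: int, free: int) -> int:
    """Percentage of free space, rounded like mmdf's '( 13%)' column."""
    return round(100 * free / total)

for line in sample.splitlines():
    name, total, _fg, _meta, _data, free_blocks, _frag = line.split()
    print(f"{name}: {free_percent(int(total), int(free_blocks))}% free")
```

Both NSDs come out at 13% free, matching what mmdf printed - so the degraded/failed pool events reflect a threshold on usage, not a discrepancy in the numbers themselves.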
So who is remembering that status, and how to clear it?
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss