[gpfsug-discuss] mmhealth - where is the info hiding?

valdis.kletnieks at vt.edu
Thu Jul 19 22:25:23 BST 2018


So I'm trying to tidy up things like 'mmhealth' and friends. I've got most of it
fixed, but I'm stuck on one thing...

Note: I already ran 'mmhealth node eventlog --clear -N all' yesterday, which
cleaned out a bunch of long-past events that were "stuck" as failed /
degraded even though they had been corrected days or weeks ago - keep this in
mind as you read on....

# mmhealth cluster show

Component           Total         Failed       Degraded        Healthy          Other
-------------------------------------------------------------------------------------
NODE                   10              0              0             10              0
GPFS                   10              0              0             10              0
NETWORK                10              0              0             10              0
FILESYSTEM              1              0              1              0              0
DISK                  102              0              0            102              0
CES                     4              0              0              4              0
GUI                     1              0              0              1              0
PERFMON                10              0              0             10              0
THRESHOLD              10              0              0             10              0

Great.  One hit for a 'degraded' filesystem.

# mmhealth node show --unhealthy -N all
(skipping all the nodes that show healthy)

Node name:      arnsd3-vtc.nis.internal
Node status:    HEALTHY
Status Change:  21 hours ago

Component      Status        Status Change     Reasons
-----------------------------------------------------------------------------------
FILESYSTEM     FAILED        24 days ago       pool-data_high_error(archive/system)
(...)
Node name:      arproto2-isb.nis.internal
Node status:    HEALTHY
Status Change:  21 hours ago

Component      Status        Status Change     Reasons
----------------------------------------------------------------------------------
FILESYSTEM     DEGRADED      6 days ago        pool-data_high_warn(archive/system)

mmdf tells me:
nsd_isb_01        13103005696        1 No       Yes      1747905536 ( 13%)     111667200 ( 1%)
nsd_isb_02        13103005696        1 No       Yes      1748245504 ( 13%)     111724384 ( 1%)
(94 more LUNs all within 0.2% of these for usage - data is striped out pretty well)

There are also 6 SSD LUNs for metadata:
nsd_isb_flash_01    2956984320        1 Yes      No       2116091904 ( 72%)      26996992 ( 1%)
(again, evenly striped)

So what is remembering that status, and how do I clear it?
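For anyone hitting the same thing: the pool-data_high_warn / pool-data_high_error
events come from mmhealth's threshold monitor, and some events are "manually
resolvable" and persist until explicitly resolved. A sketch of the commands I'd
try, assuming a current Spectrum Scale release - the exact identifier argument
for 'mmhealth event resolve' should be checked against your version's docs (I'm
guessing it matches what appears in the Reasons column):

```shell
# Show the full FILESYSTEM component state on the node still reporting
# the old event, with the verbose flag to see event details
mmhealth node show FILESYSTEM -v

# List the configured thresholds (pool fill levels etc.) that raise
# pool-data_high_warn / pool-data_high_error
mmhealth thresholds list

# For manually resolvable events, clear the stuck state once the
# underlying condition is gone; the identifier here (the pool shown in
# the Reasons column) is an assumption - adjust to your output
mmhealth event resolve pool-data_high_error archive/system
```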


More information about the gpfsug-discuss mailing list