[gpfsug-discuss] How to get rid of very old mmhealth events

Dorigo Alvise (PSI) alvise.dorigo at psi.ch
Thu Jun 28 09:02:07 BST 2018


Dear experts,
I have a GL2 IBM system running Spectrum Scale v4.2.3-6 (RHEL 7.3).
The system is working properly, but mmhealth reports a DEGRADED status for the NETWORK component:

[root@sf-gssio1 ~]# mmhealth node show

Node name:      sf-gssio1.psi.ch
Node status:    DEGRADED
Status Change:  23 min. ago

Component       Status        Status Change     Reasons
-------------------------------------------------------------------------------------------------------------------------------------------
GPFS            HEALTHY       22 min. ago       -
NETWORK         DEGRADED      145 days ago      ib_rdma_link_down(mlx5_0/2), ib_rdma_nic_down(mlx5_0/2), ib_rdma_nic_unrecognized(mlx5_0/2)
[...]

This event is clearly spurious, because the network, verbs and IB are all working correctly:

[root@sf-gssio1 ~]# mmfsadm test verbs status
VERBS RDMA status: started

[root@sf-gssio1 ~]# mmlsconfig verbsPorts | grep gssio1
verbsPorts mlx5_0/1 [sf-ems1,sf-gssio1,sf-gssio2]

[root@sf-gssio1 ~]# mmdiag --config | grep verbsPorts
 ! verbsPorts mlx5_0/1

[root@sf-gssio1 ~]# ibstat mlx5_0
CA 'mlx5_0'
    CA type: MT4113
    Number of ports: 2
    Firmware version: 10.16.1020
    Hardware version: 0
    Node GUID: 0xec0d9a03002b5db0
    System image GUID: 0xec0d9a03002b5db0
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 56
        Base lid: 42
        LMC: 0
        SM lid: 1
        Capability mask: 0x26516848
        Port GUID: 0xec0d9a03002b5db0
        Link layer: InfiniBand
    Port 2:
        State: Down
        Physical state: Disabled
        Rate: 10
        Base lid: 65535
        LMC: 0
        SM lid: 0
        Capability mask: 0x26516848
        Port GUID: 0xec0d9a03002b5db8
        Link layer: InfiniBand
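
One way to see which port the event refers to is to look up the port named in the mmhealth reasons (mlx5_0/2) in the ibstat output and compare it with the configured verbsPorts (mlx5_0/1). The following is only a minimal sketch under stated assumptions: the function names are hypothetical, the sample text is a trimmed copy of the output above, and the parser assumes ibstat keeps the "CA '<name>'" / "Port <n>:" layout shown there.

```python
import re

# Trimmed copy of the ibstat output shown above (only the fields used here).
IBSTAT_SAMPLE = """\
CA 'mlx5_0'
    CA type: MT4113
    Number of ports: 2
    Port 1:
        State: Active
        Physical state: LinkUp
        Port GUID: 0xec0d9a03002b5db0
    Port 2:
        State: Down
        Physical state: Disabled
        Port GUID: 0xec0d9a03002b5db8
"""


def parse_ibstat_ports(text):
    """Return {"<ca>/<port>": {field: value}} for every port in the text."""
    ports = {}
    ca = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '(.+)'$", line)
        if m:
            ca = m.group(1)
            current = None
            continue
        m = re.match(r"Port (\d+):$", line)
        if m and ca is not None:
            current = f"{ca}/{m.group(1)}"
            ports[current] = {}
            continue
        if current is not None and ":" in line:
            key, _, value = line.partition(":")
            ports[current][key.strip()] = value.strip()
    return ports


def ports_outside_config(flagged, verbs_ports):
    """Ports named in health events that are not listed in verbsPorts."""
    return sorted(set(flagged) - set(verbs_ports))


ports = parse_ibstat_ports(IBSTAT_SAMPLE)
flagged = ["mlx5_0/2"]      # port named in the mmhealth reasons above
configured = ["mlx5_0/1"]   # verbsPorts value reported by mmlsconfig above
for port in ports_outside_config(flagged, configured):
    info = ports.get(port, {})
    print(port, info.get("State"), info.get("Physical state"))
```

With the values from this post, the only flagged port is mlx5_0/2, which is not in verbsPorts and is reported by ibstat as Down/Disabled.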

That event has been there for 145 days, and it did not go away after a daemon restart (mmshutdown/mmstartup).
My question is: how can I get rid of this event and restore the mmhealth output to HEALTHY? This is important because I have Nagios sensors that periodically parse the "mmhealth -Y ..." output, and at the moment I have to disable their email notifications (which is not good if a real problem occurs).
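
For context, a sensor of this kind can be sketched roughly as follows. This is an illustration, not the actual Nagios plugin in use here. It assumes the colon-delimited "-Y" convention in which a HEADER row names the fields of the data rows that follow. The sample text is hypothetical and simplified (real output has more columns, and their names and order may differ), and the function names are made up for this example.

```python
# HYPOTHETICAL, simplified sample in the style of "mmhealth node show -Y".
# Real column names and order may differ; the parser relies only on HEADER rows.
SAMPLE_Y = """\
mmhealth:State:HEADER:version:reserved:reserved:node:component:status:
mmhealth:State:0:1:::sf-gssio1.psi.ch:GPFS:HEALTHY:
mmhealth:State:0:1:::sf-gssio1.psi.ch:NETWORK:DEGRADED:
"""

# Nagios plugin convention: 0 = OK, 1 = WARNING, 2 = CRITICAL.
SEVERITY = {"DEGRADED": 1, "FAILED": 2}


def parse_y_output(text):
    """Turn HEADER-described, colon-delimited rows into a list of dicts."""
    headers = {}
    rows = []
    for line in text.splitlines():
        fields = line.split(":")
        if len(fields) < 3:
            continue
        key = (fields[0], fields[1])      # e.g. ("mmhealth", "State")
        if fields[2] == "HEADER":
            headers[key] = fields
            continue
        names = headers.get(key)
        if names is None:
            continue                      # data row without a known header
        row = {n: v for n, v in zip(names[3:], fields[3:]) if n}
        row["_type"] = fields[1]
        rows.append(row)
    return rows


def bad_components(rows):
    """Map component name to status for every State row that is not healthy."""
    return {
        r["component"]: r["status"]
        for r in rows
        if r["_type"] == "State" and r.get("status") in SEVERITY
    }


def nagios_exit_code(rows):
    """Worst severity found across all components (0 if none are bad)."""
    return max((SEVERITY[s] for s in bad_components(rows).values()), default=0)


rows = parse_y_output(SAMPLE_Y)
print(bad_components(rows), nagios_exit_code(rows))
```

With a stale NETWORK event like the one above, such a check keeps returning WARNING on every run, which is why the notifications currently have to be disabled.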

Thanks,

  Alvise