[gpfsug-discuss] How to get rid of very old mmhealth events

Dorigo Alvise (PSI) alvise.dorigo at psi.ch
Thu Jun 28 10:39:35 BST 2018


Hi Andrew, thanks for the answer.
No, the port #2 (on all the nodes) is not cabled.

Regards,

   Alvise
________________________________
From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Andrew Beattie [abeattie at au1.ibm.com]
Sent: Thursday, June 28, 2018 10:15 AM
To: gpfsug-discuss at spectrumscale.org
Subject: Re: [gpfsug-discuss] How to get rid of very old mmhealth events

Do you know if there is actually a cable plugged into port 2?

The system will work fine as long as there is network connectivity, but you may have an issue with redundancy or loss of bandwidth if you do not have every port cabled and configured correctly.
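
A quick way to check that across all of the nodes is something like the following (a sketch: mmdsh ships with Scale, and the "all" node class is an assumption, so substitute your own node list if needed):

mmdsh -N all 'ibstat mlx5_0 2 | grep -E "State|Physical state"'

An uncabled but enabled IB port normally reports "Physical state: Polling"; "Disabled" tends to mean the port has been turned off rather than simply left unplugged.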

Regards
Andrew Beattie
Software Defined Storage  - IT Specialist
Phone: 614-2133-7927
E-mail: abeattie at au1.ibm.com


----- Original message -----
From: "Dorigo Alvise (PSI)" <alvise.dorigo at psi.ch>
Sent by: gpfsug-discuss-bounces at spectrumscale.org
To: "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
Cc:
Subject: [gpfsug-discuss] How to get rid of very old mmhealth events
Date: Thu, Jun 28, 2018 6:08 PM

Dear experts,
I have a GL2 IBM system running Spectrum Scale v4.2.3-6 (RHEL 7.3).
The system is working properly, but mmhealth reports a DEGRADED status for the NETWORK component:

[root at sf-gssio1 ~]# mmhealth node show

Node name:      sf-gssio1.psi.ch
Node status:    DEGRADED
Status Change:  23 min. ago

Component       Status        Status Change     Reasons
-------------------------------------------------------------------------------------------------------------------------------------------
GPFS            HEALTHY       22 min. ago       -
NETWORK         DEGRADED      145 days ago      ib_rdma_link_down(mlx5_0/2), ib_rdma_nic_down(mlx5_0/2), ib_rdma_nic_unrecognized(mlx5_0/2)
[...]
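
(For reference, the component can be inspected in more detail with the following; the options are taken from the 4.2.3 mmhealth documentation, so they may differ on other builds:

mmhealth node show NETWORK -v
mmhealth node eventlog
mmhealth event show ib_rdma_link_down

The first gives a verbose view of the NETWORK component, the second the history of raised and cleared events, and the third the description of a single event.)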

This event is clearly stale, because the network, verbs and IB are all working correctly:

[root at sf-gssio1 ~]# mmfsadm test verbs status
VERBS RDMA status: started

[root at sf-gssio1 ~]# mmlsconfig verbsPorts|grep gssio1
verbsPorts mlx5_0/1 [sf-ems1,sf-gssio1,sf-gssio2]

[root at sf-gssio1 ~]# mmdiag --config|grep verbsPorts
 ! verbsPorts mlx5_0/1

[root at sf-gssio1 ~]# ibstat  mlx5_0
CA 'mlx5_0'
    CA type: MT4113
    Number of ports: 2
    Firmware version: 10.16.1020
    Hardware version: 0
    Node GUID: 0xec0d9a03002b5db0
    System image GUID: 0xec0d9a03002b5db0
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 56
        Base lid: 42
        LMC: 0
        SM lid: 1
        Capability mask: 0x26516848
        Port GUID: 0xec0d9a03002b5db0
        Link layer: InfiniBand
    Port 2:
        State: Down
        Physical state: Disabled
        Rate: 10
        Base lid: 65535
        LMC: 0
        SM lid: 0
        Capability mask: 0x26516848
        Port GUID: 0xec0d9a03002b5db8
        Link layer: InfiniBand

That event has been there for 145 days, and it did not go away after a daemon restart (mmshutdown/mmstartup).
My question is: how can I get rid of this event and restore mmhealth's output to HEALTHY? This is important because I have Nagios sensors that periodically parse the "mmhealth -Y ..." output, and at the moment I have to disable their email notifications (which is not good if some real bad event happens).
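
A couple of things that may be worth trying (suggestions based on the documentation, not a confirmed fix). The health states are tracked by the system health monitor rather than by mmfsd itself, which would explain why mmshutdown/mmstartup didn't help; restarting the monitor forces it to re-evaluate its state:

mmsysmoncontrol restart

(mmsysmoncontrol only restarts the monitoring service and does not touch mmfsd. Newer releases also document a "mmhealth node show --resync" option for resynchronizing the node state, but whether a given 4.2.3 build supports it needs checking first.)

In the meantime the Nagios check could be made tolerant of the one known-stale component instead of having its notifications disabled entirely. A minimal sketch, assuming the usual GPFS -Y convention of colon-separated rows with a HEADER line naming the fields (the exact field names may differ between releases, so adjust to what your version actually prints):

#!/bin/bash
# Hypothetical Nagios plugin sketch: flag any component that is not
# HEALTHY, except NETWORK with its known-stale events. Field positions
# are looked up from the HEADER row instead of being hard-coded, so the
# script tolerates minor layout changes. Note that an aggregate node row,
# if your release reports one, stays DEGRADED while the stale event is
# present and may need the same exclusion.
/usr/lpp/mmfs/bin/mmhealth node show -Y | awk -F: '
    $2 == "State" && $3 == "HEADER" {      # header row: map field names to columns
        for (i = 1; i <= NF; i++) col[$i] = i
        next
    }
    $2 == "State" && col["status"] {       # data rows of the State section
        comp = $col["component"]; stat = $col["status"]
        if (stat != "" && stat != "HEALTHY" && comp != "NETWORK")
            bad = bad " " comp "=" stat
    }
    END {
        if (bad != "") { print "CRITICAL:" bad; exit 2 }
        print "OK: all monitored components HEALTHY"; exit 0
    }'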

Thanks,

  Alvise
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

