[gpfsug-discuss] How to get rid of very old mmhealth events
Dorigo Alvise (PSI)
alvise.dorigo at psi.ch
Thu Jul 5 09:28:51 BST 2018
Hello Daniel,
I solved my problem by disabling the check (I have GPFS v4.2.3-5). I put
ib_rdma_enable_monitoring=False
in the [network] section of the file /var/mmfs/mmsysmon/mmsysmonitor.conf and restarted mmsysmonitor.
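
For anyone who wants to script that edit, here is a rough Python sketch. It assumes the file is plain INI syntax that configparser can read, which has not been checked against a real mmsysmonitor.conf. The helper name disable_ib_rdma_monitoring and the sample text are made up for illustration, and the sketch works on a string rather than on the live file. Note that configparser drops comments when it writes, so keep a backup of the original.

```python
import configparser
import io


def disable_ib_rdma_monitoring(conf_text):
    """Return conf_text with ib_rdma_enable_monitoring=False set in [network]."""
    # interpolation=None leaves any '%' characters in existing values untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("network"):
        parser.add_section("network")
    parser.set("network", "ib_rdma_enable_monitoring", "False")
    out = io.StringIO()
    # Write key=value without spaces, like the line quoted in the mail.
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


# Hypothetical sample: the option names here are placeholders, not real settings.
SAMPLE = "[general]\nplaceholder_option=1\n\n[network]\nplaceholder_option=2\n"

if __name__ == "__main__":
    print(disable_ib_rdma_monitoring(SAMPLE))
```

The monitor still has to be restarted afterwards for the change to take effect, as described above.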
There was a thread in this group about this problem.
A
________________________________
From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Yaron Daniel [YARD at il.ibm.com]
Sent: Sunday, July 01, 2018 7:17 PM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] How to get rid of very old mmhealth events
Hi
There was an issue with a Scale 5.x GUI error: ib_rdma_nic_unrecognized(mlx5_0/2)
Check if you have the patch:
[root at gssio1 ~]# diff /usr/lpp/mmfs/lib/mmsysmon/NetworkService.py /tmp/NetworkService.py
229c229,230
< recognizedNICs = set(re.findall(r"verbsConnectPorts\[\d+\] +: (\w+/\d+)/\d+\n", mmfsadm))
---
> #recognizedNICs = set(re.findall(r"verbsConnectPorts\[\d+\] +: (\w+/\d+)/\d+\n", mmfsadm))
> recognizedNICs = set(re.findall(r"verbsConnectPorts\[\d+\] +: (\w+/\d+)/\d+/\d+\n", mmfsadm))
And restart the monitor: mmsysmoncontrol restart
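
To see what the patch changes, the two patterns from the diff can be tried side by side. This is only a sketch: the two sample lines are hypothetical, shaped by what each pattern accepts rather than copied from a real mmfsadm dump, and the helper name recognized_nics is invented here.

```python
import re

# Patterns copied from the diff above (original line 229 and its replacement).
OLD_PATTERN = r"verbsConnectPorts\[\d+\] +: (\w+/\d+)/\d+\n"
NEW_PATTERN = r"verbsConnectPorts\[\d+\] +: (\w+/\d+)/\d+/\d+\n"


def recognized_nics(pattern, dump_text):
    """Return the set of device/port names captured by the pattern."""
    return set(re.findall(pattern, dump_text))


# Hypothetical input lines, not real mmfsadm output.
THREE_FIELDS = "verbsConnectPorts[0] : mlx5_0/1/0\n"
FOUR_FIELDS = "verbsConnectPorts[0] : mlx5_0/1/0/0\n"

if __name__ == "__main__":
    for name, text in (("three fields", THREE_FIELDS), ("four fields", FOUR_FIELDS)):
        print(name, "old:", recognized_nics(OLD_PATTERN, text))
        print(name, "new:", recognized_nics(NEW_PATTERN, text))
```

The unpatched pattern finds nothing on a line that carries the extra numeric field, so no NIC would be recognized from such a line. The patched pattern accepts the extra field and still captures only the device/port part.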
Regards
________________________________
Yaron Daniel
Storage Architect – IL Lab Services (Storage)
IBM Global Markets, Systems HW Sales
94 Em Ha'Moshavot Rd, Petach Tiqva, 49527, Israel
Phone: +972-3-916-5672
Fax: +972-3-916-5672
Mobile: +972-52-8395593
e-mail: yard at il.ibm.com
IBM Israel<http://www.ibm.com/il/he/>
From: "Andrew Beattie" <abeattie at au1.ibm.com>
To: gpfsug-discuss at spectrumscale.org
Date: 06/28/2018 11:16 AM
Subject: Re: [gpfsug-discuss] How to get rid of very old mmhealth events
Sent by: gpfsug-discuss-bounces at spectrumscale.org
________________________________
Do you know if there is actually a cable plugged into port 2?
The system will work fine as long as there is network connectivity, but you may have an issue with redundancy or loss of bandwidth if you do not have every port cabled and configured correctly.
Regards
Andrew Beattie
Software Defined Storage - IT Specialist
Phone: 614-2133-7927
E-mail: abeattie at au1.ibm.com
----- Original message -----
From: "Dorigo Alvise (PSI)" <alvise.dorigo at psi.ch>
Sent by: gpfsug-discuss-bounces at spectrumscale.org
To: "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
Cc:
Subject: [gpfsug-discuss] How to get rid of very old mmhealth events
Date: Thu, Jun 28, 2018 6:08 PM
Dear experts,
I have a GL2 IBM system running Spectrum Scale v4.2.3-6 (RHEL 7.3).
The system is working properly, but mmhealth reports a DEGRADED status for the NETWORK component:
[root at sf-gssio1 ~]# mmhealth node show
Node name: sf-gssio1.psi.ch
Node status: DEGRADED
Status Change: 23 min. ago
Component      Status        Status Change     Reasons
-------------------------------------------------------------------------------------------------------------------------------------------
GPFS           HEALTHY       22 min. ago       -
NETWORK        DEGRADED      145 days ago      ib_rdma_link_down(mlx5_0/2), ib_rdma_nic_down(mlx5_0/2), ib_rdma_nic_unrecognized(mlx5_0/2)
[...]
This event is clearly an outlier, because the network, verbs and IB are all working correctly:
[root at sf-gssio1 ~]# mmfsadm test verbs status
VERBS RDMA status: started
[root at sf-gssio1 ~]# mmlsconfig verbsPorts|grep gssio1
verbsPorts mlx5_0/1 [sf-ems1,sf-gssio1,sf-gssio2]
[root at sf-gssio1 ~]# mmdiag --config|grep verbsPorts
! verbsPorts mlx5_0/1
[root at sf-gssio1 ~]# ibstat mlx5_0
CA 'mlx5_0'
CA type: MT4113
Number of ports: 2
Firmware version: 10.16.1020
Hardware version: 0
Node GUID: 0xec0d9a03002b5db0
System image GUID: 0xec0d9a03002b5db0
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 42
LMC: 0
SM lid: 1
Capability mask: 0x26516848
Port GUID: 0xec0d9a03002b5db0
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0x26516848
Port GUID: 0xec0d9a03002b5db8
Link layer: InfiniBand
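
The same comparison (which ports are down, and whether GPFS is configured to use them) can be sketched in Python. This is only an illustration: the function names are invented here, the ibstat excerpt is trimmed from the output above, and the parser relies only on the "Port N:" and "State:" lines shown there.

```python
import re


def port_states(ibstat_text):
    """Map port number to the value of its 'State:' line in ibstat output."""
    states = {}
    port = None
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        m = re.match(r"Port (\d+):$", line)
        if m:
            port = int(m.group(1))
            continue
        # 'Physical state:' starts with a different word, so it is not matched here.
        m = re.match(r"State:\s*(\S+)", line)
        if m and port is not None:
            states[port] = m.group(1)
    return states


def down_but_unconfigured(device, states, verbs_ports):
    """List ports that are not Active and are not named in verbsPorts."""
    result = []
    for port, state in sorted(states.items()):
        name = f"{device}/{port}"
        if state != "Active" and name not in verbs_ports:
            result.append(name)
    return result


# Trimmed from the ibstat output quoted above.
IBSTAT = """CA 'mlx5_0'
Number of ports: 2
Port 1:
State: Active
Physical state: LinkUp
Port GUID: 0xec0d9a03002b5db0
Port 2:
State: Down
Physical state: Disabled
Port GUID: 0xec0d9a03002b5db8
"""

if __name__ == "__main__":
    states = port_states(IBSTAT)
    # verbsPorts value taken from the mmlsconfig output above.
    print(down_but_unconfigured("mlx5_0", states, {"mlx5_0/1"}))
```

With the values from this node, the only port reported is mlx5_0/2, the one named in the events and the one that is absent from verbsPorts.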
That event has been there for 145 days, and it didn't go away after a daemon restart (mmshutdown/mmstartup).
My question is: how can I get rid of this event and restore mmhealth's output to HEALTHY? This is important because I have Nagios sensors that periodically parse the "mmhealth -Y ..." output, and at the moment I have to disable their email notifications (which is not good if a real problem occurs).
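
For context, the kind of check such a sensor performs might look like the sketch below. It assumes the -Y output is colon-delimited, with a HEADER row naming the fields of the rows that follow. The sample text is hypothetical and simplified (a real row has more columns, and values may be percent-encoded, which is not handled here). The function names are invented for illustration.

```python
def parse_y_output(text):
    """Turn colon-delimited rows into dicts, using HEADER rows for field names."""
    headers = {}
    records = []
    for line in text.splitlines():
        fields = line.split(":")
        if len(fields) < 3:
            continue
        key = (fields[0], fields[1])
        if fields[2] == "HEADER":
            headers[key] = fields
        elif key in headers:
            records.append(dict(zip(headers[key], fields)))
    return records


def unhealthy_components(records, ignore=()):
    """Return the components whose status is not HEALTHY, minus ignored ones."""
    return [
        r["component"]
        for r in records
        if "component" in r and "status" in r
        and r["status"] != "HEALTHY"
        and r["component"] not in ignore
    ]


# Hypothetical, simplified sample; not captured from a real system.
SAMPLE = (
    "mmhealth:State:HEADER:version:reserved:reserved:node:component:status:\n"
    "mmhealth:State:0:1:::sf-gssio1.psi.ch:GPFS:HEALTHY:\n"
    "mmhealth:State:0:1:::sf-gssio1.psi.ch:NETWORK:DEGRADED:\n"
)

if __name__ == "__main__":
    print(unhealthy_components(parse_y_output(SAMPLE)))
```

An ignore list like the one above would silence the whole NETWORK component, including real failures, which is why clearing the stale event itself is preferable.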
Thanks,
Alvise
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss