[gpfsug-discuss] mmsysmon.py revisited

John Hearns john.hearns at asml.com
Thu Jul 20 08:39:29 BST 2017


This is really interesting.
I know we can look at the interrupt rates of course, but is there a way we can quantify the effects of interrupts / OS jitter here?
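One rough way to put a number on OS jitter, rather than just reading interrupt counters, is a "fixed time quantum" or "selfish detour" style probe. The probe spins on a high-resolution clock and records every gap between consecutive readings that is much longer than the loop itself should take. The Python sketch below is not from this thread. The 10 microsecond threshold and 1 s run length are arbitrary assumptions, and Python's own loop overhead limits the resolution. The numbers are therefore best used comparatively, for example on the same pinned core with mmsysmon running and with it stopped, rather than as absolute measurements.

```python
import time
from typing import Dict, Iterable, Iterator, List, Tuple


def clock_readings(duration_s: float) -> Iterator[int]:
    """Yield back-to-back perf_counter_ns() readings for duration_s seconds."""
    end = time.perf_counter_ns() + int(duration_s * 1e9)
    now = time.perf_counter_ns()
    while now < end:
        yield now
        now = time.perf_counter_ns()


def find_detours(stamps: Iterable[int],
                 threshold_ns: int) -> Tuple[List[Tuple[int, int]], int]:
    """Return ([(start_ns, gap_ns), ...], observed_span_ns).

    Any gap between consecutive readings longer than threshold_ns is treated
    as time the loop lost to something else (an IRQ, a softirq, or another
    task being scheduled on this core).
    """
    detours: List[Tuple[int, int]] = []
    first = prev = None
    for t in stamps:
        if prev is not None and t - prev > threshold_ns:
            detours.append((prev, t - prev))
        if first is None:
            first = t
        prev = t
    span = prev - first if first is not None else 0
    return detours, span


def jitter_summary(stamps: Iterable[int], threshold_ns: int) -> Dict[str, float]:
    """Count detours and the fraction of the observed wall time they consumed."""
    detours, span = find_detours(stamps, threshold_ns)
    lost = sum(gap for _, gap in detours)
    return {
        "detours": len(detours),
        "lost_ns": lost,
        "lost_fraction": lost / span if span else 0.0,
    }


if __name__ == "__main__":
    summary = jitter_summary(clock_readings(1.0), threshold_ns=10_000)
    print(summary)
```

On Linux the probe could be pinned to a single core (for example with taskset, or os.sched_setaffinity from inside the script) and the "lost_fraction" compared across runs with and without the daemon active.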


-----Original Message-----
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Jonathon A Anderson
Sent: Wednesday, July 19, 2017 8:29 PM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] mmsysmon.py revisited

OPA behaves _significantly_ differently from Mellanox IB. OPA uses the host CPU for packet processing, whereas Mellanox IB uses a discrete ASIC on the HBA. As a result, in our experience OPA is much more sensitive to task placement and interrupts, because the host CPU load competes with the fabric IO processing load.
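Since OPA does its packet processing on host cores, a useful first step is to see which cores are actually taking the fabric interrupts. The sketch below parses /proc/interrupts, sums the per-CPU counts for rows whose device name matches a pattern, and samples twice to get a per-CPU rate. The patterns used here ("hfi1" for the Omni-Path host fabric interface driver, "mlx4"/"mlx5" for Mellanox HCAs) are assumptions; check which names actually appear in /proc/interrupts on your own nodes.

```python
import os
import re
import time
from typing import List


def per_cpu_irq_counts(text: str, pattern: str) -> List[int]:
    """Sum /proc/interrupts counts per CPU for rows whose description matches pattern."""
    lines = text.splitlines()
    if not lines:
        return []
    ncpu = len(lines[0].split())  # header row: "CPU0 CPU1 ..."
    totals = [0] * ncpu
    rx = re.compile(pattern)
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue
        # fields[0] is the IRQ label ("45:", "NMI:", ...); up to ncpu counters follow.
        counts = []
        for field in fields[1:1 + ncpu]:
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[1 + len(counts):])
        if rx.search(description):
            for cpu, count in enumerate(counts):
                totals[cpu] += count
    return totals


def per_cpu_irq_rate(pattern: str, interval_s: float = 1.0,
                     path: str = "/proc/interrupts") -> List[float]:
    """Interrupts per second on each CPU for matching rows, over interval_s."""
    with open(path) as f:
        before = per_cpu_irq_counts(f.read(), pattern)
    time.sleep(interval_s)
    with open(path) as f:
        after = per_cpu_irq_counts(f.read(), pattern)
    return [(a - b) / interval_s for a, b in zip(after, before)]


if __name__ == "__main__" and os.path.exists("/proc/interrupts"):
    for cpu, rate in enumerate(per_cpu_irq_rate(r"hfi1|mlx[45]")):
        if rate:
            print(f"CPU{cpu}: {rate:.0f} IRQ/s")
```

If the busiest fabric-interrupt cores turn out to be the same ones MPI ranks are pinned to, that would be consistent with the contention described here.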

~jonathon


On 7/19/17, 12:12 PM, "gpfsug-discuss-bounces at spectrumscale.org on behalf of david_johnson at brown.edu" <gpfsug-discuss-bounces at spectrumscale.org on behalf of david_johnson at brown.edu> wrote:

    We have an FDR14 Mellanox fabric, probably with a similar interrupt load to OPA.

      -- ddj
    Dave Johnson

    On Jul 19, 2017, at 1:52 PM, Jonathon A Anderson <jonathon.anderson at colorado.edu> wrote:

    >> It might be a problem specific to your system environment or a wrong configuration; therefore, please get in contact with IBM support to analyze the root cause of the high usage.
    >
    > I suspect it’s actually a result of frequent IO interrupts causing jitter in conflict with MPI on the shared Intel Omni-Path network, in our case.
    >
    > We’ve already tried pursuing support on this through our vendor, DDN, and got nowhere. Eventually we were the ones who tried killing mmsysmon, and that fixed our problem.
    >
    > The official company line of “we don't see significant CPU consumption by mmsysmon on our test systems” isn’t helping. Do you have a test system with OPA?
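To put a number on mmsysmon's own CPU consumption on a given node, one option is to read utime and stime from /proc/<pid>/stat for every process whose command line mentions mmsysmon, do it twice, and take the difference. This is a minimal Linux-only sketch; matching on the substring "mmsysmon" and the 60 s sampling window are assumptions. It measures only the daemon's own CPU time, not the indirect cost of it interrupting other work, which is the effect suspected in this thread.

```python
import os
import time
from typing import Dict, List, Tuple


def cpu_ticks_from_stat(stat_text: str) -> Tuple[int, int]:
    """Return (utime, stime) in clock ticks from a /proc/<pid>/stat line.

    Field 2 (comm) is wrapped in parentheses and may itself contain spaces or
    ')', so split only after the last ')'. utime and stime are fields 14 and
    15 of the full line, i.e. indices 11 and 12 of what follows comm.
    """
    rest = stat_text.rpartition(")")[2].split()
    return int(rest[11]), int(rest[12])


def find_pids(needle: str) -> List[int]:
    """PIDs whose command line contains needle (excluding this script)."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit() or int(entry) == os.getpid():
            continue
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited or is not readable
        if needle in cmdline:
            pids.append(int(entry))
    return pids


def cpu_seconds(pids: List[int]) -> Dict[int, float]:
    """Cumulative user+system CPU seconds for each PID that is still alive."""
    hz = os.sysconf("SC_CLK_TCK")
    result = {}
    for pid in pids:
        try:
            with open(f"/proc/{pid}/stat", errors="replace") as f:
                utime, stime = cpu_ticks_from_stat(f.read())
        except OSError:
            continue
        result[pid] = (utime + stime) / hz
    return result


if __name__ == "__main__" and os.path.isdir("/proc"):
    pids = find_pids("mmsysmon")
    if not pids:
        print("no mmsysmon process found")
    else:
        window_s = 60.0
        before = cpu_seconds(pids)
        time.sleep(window_s)
        after = cpu_seconds(pids)
        for pid in sorted(set(before) & set(after)):
            used = after[pid] - before[pid]
            print(f"pid {pid}: {used:.2f} CPU s in {window_s:.0f} s "
                  f"({100 * used / window_s:.1f}% of one core)")
```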
    >
    > ~jonathon
    >
    >
    > On 7/19/17, 7:05 AM, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Mathias Dietz" <gpfsug-discuss-bounces at spectrumscale.org on behalf of MDIETZ at de.ibm.com> wrote:
    >
    >    Thanks for the feedback.
    >
    >    Let me clarify what mmsysmon is doing.
    >    Since IBM Spectrum Scale 4.2.1, the mmsysmon process has been used for overall health monitoring and CES failover handling.
    >    Even without CES it is an essential part of the system because it monitors the individual components and provides health state information and error events.
    >
    >    This information is needed by other Spectrum Scale components (the mmhealth command, the IBM Spectrum Scale GUI, support tools, the Install Toolkit, ...), and therefore disabling mmsysmon will impact them.
    >
    >
    >> It’s a huge problem. I don’t understand why it hasn’t been given much credit by dev or support.
    >
    >    Over the last couple of months, the development team has put a strong focus on this topic.
    >
    >    In order to monitor the health of the individual components, mmsysmon listens for notifications/callbacks but also has to do some polling.
    >    We are continually working to reduce the polling overhead and to replace polling with notifications where possible.
    >
    >
    >    Several improvements have been added to 4.2.3, including the ability to configure the polling frequency to reduce the overhead (mmhealth config interval).
    >
    >    See https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1adm_mmhealth.htm
    >    In addition, a new option has been introduced to clock-align the monitoring threads in order to reduce CPU jitter.
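The thread does not say how the clock-alignment option works internally, but the general idea can be sketched. Instead of each monitor sleeping a fixed interval after it last ran (so wakeups drift and spread across the period), every check sleeps until the next wall-clock multiple of the interval. All checks then fire together in one short burst. On nodes whose clocks are synchronized (e.g. via NTP), the bursts also coincide across nodes, which is generally less harmful to tightly coupled MPI jobs than uncoordinated noise. The function names and the 1 s interval below are hypothetical; this is not mmsysmon's code.

```python
import math
import time
from typing import Callable, Iterable


def next_aligned(now: float, interval: float) -> float:
    """Smallest multiple of interval (in epoch seconds) strictly after now."""
    return (math.floor(now / interval) + 1) * interval


def run_aligned(interval: float, checks: Iterable[Callable[[], None]],
                iterations: int) -> None:
    """Run every check back to back at each aligned tick."""
    checks = list(checks)
    last = -math.inf
    for _ in range(iterations):
        # Never reuse a tick: if sleep() returns marginally early in wall-clock
        # terms, basing the next target on `last` prevents a double run.
        target = next_aligned(max(time.time(), last), interval)
        time.sleep(max(0.0, target - time.time()))
        for check in checks:
            check()
        last = target


if __name__ == "__main__":
    run_aligned(1.0, [lambda: print("tick", round(time.time(), 3))], iterations=3)
```

Compare this with the naive `while True: check(); time.sleep(interval)`, where each monitor's phase depends on when it started and on how long its checks take.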
    >
    >
    >    Nevertheless, we don't see significant CPU consumption by mmsysmon on our test systems.
    >
    >    It might be a problem specific to your system environment or a wrong configuration; therefore, please get in contact with IBM support to analyze the root cause of the high usage.
    >
    >    Kind regards
    >
    >    Mathias Dietz
    >
    >    IBM Spectrum Scale - Release Lead Architect and RAS Architect
    >
    >
    >
    >    gpfsug-discuss-bounces at spectrumscale.org wrote on 07/18/2017 07:51:21 PM:
    >
    >> From: Jonathon A Anderson <jonathon.anderson at colorado.edu>
    >> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
    >> Date: 07/18/2017 07:51 PM
    >> Subject: Re: [gpfsug-discuss] mmsysmon.py revisited
    >> Sent by: gpfsug-discuss-bounces at spectrumscale.org
    >>
    >> There’s no official way to cleanly disable it yet, as far as I know,
    >> but you can de facto disable it by deleting
    >> /var/mmfs/mmsysmon/mmsysmonitor.conf.
    >>
    >> It’s a huge problem. I don’t understand why it hasn’t been given
    >> much credit by dev or support.
    >>
    >> ~jonathon
    >>
    >>
    >> On 7/18/17, 11:21 AM, "gpfsug-discuss-bounces at spectrumscale.org on behalf of David Johnson" <gpfsug-discuss-bounces at spectrumscale.org on behalf of david_johnson at brown.edu> wrote:
    >>
    >>    We also noticed a fair amount of CPU time accumulated by mmsysmon.py on
    >>    our diskless compute nodes. I read the earlier query, where it was answered:
    >>
    >>    CES == Cluster Export Services; mmsysmon.py comes from mmcesmon. It is
    >>    used for managing the export services of GPFS. If it is killed, your
    >>    NFS/SMB etc. will stop working. Their overhead is small and they are
    >>    very important. Don't attempt to kill them.
    >>
    >>    Our question is this: we don’t run the latest “protocols”; our NFS is
    >>    CNFS, and our CIFS is clustered CIFS. I can understand it might be
    >>    needed with Ganesha, but on every node?
    >>
    >>    Why in the world would I be getting this daemon running on all client
    >>    nodes, when I didn’t install the “protocols” version of the
    >>    distribution? We have release 4.2.2 at the moment. How can we disable
    >>    this?
    >>
    >>    Thanks,
    >>     — ddj
    >>
    >>
    >> _______________________________________________
    >> gpfsug-discuss mailing list
    >> gpfsug-discuss at spectrumscale.org
    >> https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.hearns%40asml.com%7C63ff97a625e4499cb8ba08d4ced41754%7Caf73baa8f5944eb2a39d93e96cad61fc%7C1&sdata=9yatmHdcxwy%2FD%2FoZuLVDkPjpIeS9F7crTLl2MoUUIyo%3D&reserved=0

