[gpfsug-discuss] Experience with zimon database stability, and best practices for backup?

Keith Ball bipcuds at gmail.com
Sun Sep 24 19:04:59 BST 2017


Hello All,

In a recent Spectrum Scale performance study, we used zimon/mmperfmon to
gather metrics. Over a period of 2 months, we lost data from the zimon
database twice: once after the virtual disk serving both the OS files and
the zimon collector and DB storage was resized, and a second time after an
unknown event (the loss was discovered when plotting in Grafana only went
back to a certain date and time; likewise, mmperfmon query output only went
back to the same point).

Details:
- Spectrum Scale 4.2.1.1 (on NSD servers); 4.2.1.2 on the zimon collector
node and other clients
- Data retention in the "raw" stratum was set to 2 months; the "domains"
settings were as follows (note that we did not hit the raw domain's 60 GB
ceiling, i.e. 1 GB/file * 60 files):

domains = {
        # this is the raw domain
        aggregation = 0         # aggregation factor for the raw domain is always 0
        ram = "12g"             # amount of RAM to be used
        duration = "2m"         # keep highest-precision data for 2 months
        filesize = "1g"         # maximum file size
        files = 60              # number of files
},
{
        # first aggregation domain: aggregates to 10 seconds
        aggregation = 10
        ram = "800m"            # amount of RAM to be used
        duration = "6m"         # keep 10-second aggregates for 6 months
        filesize = "1g"         # maximum file size
        files = 10              # number of files
},
{
        # second aggregation domain: aggregates to 30*10 seconds == 5 minutes
        aggregation = 30
        ram = "800m"            # amount of RAM to be used
        duration = "1y"         # keep 5-minute averages for 1 year
        filesize = "1g"         # maximum file size
        files = 5               # number of files
},
{
        # third aggregation domain: aggregates to 24*30*10 seconds == 2 hours
        aggregation = 24
        ram = "800m"            # amount of RAM to be used
        duration = "2y"         # keep 2-hour averages for 2 years
        filesize = "1g"         # maximum file size
        files = 5               # number of files
}
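For reference, the on-disk ceiling of each domain is simply filesize * files. A minimal sketch of that arithmetic (the domain list mirrors the config above; the helper name is just illustrative):

```python
# Back-of-the-envelope check of each zimon domain's on-disk ceiling,
# based on the filesize and files settings in the config above.

def domain_ceiling_gb(filesize_gb: float, files: int) -> float:
    """Maximum on-disk footprint of one zimon domain, in GB."""
    return filesize_gb * files

# (name, filesize in GB, number of files) as configured above
domains = [
    ("raw",      1, 60),   # 1 GB/file * 60 files = 60 GB ceiling
    ("10s agg",  1, 10),
    ("5min agg", 1, 5),
    ("2h agg",   1, 5),
]

for name, size_gb, files in domains:
    print(f"{name}: ceiling {domain_ceiling_gb(size_gb, files)} GB")
```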


Questions:

1.) Has anyone had similar issues with losing data from zimon?

2.) Are there known circumstances where data could be lost, e.g. changing
the aggregation domain definitions, or even simply restarting the zimon
collector?

3.) Does anyone have any "best practices" for backing up the zimon
database? We were taking weekly "snapshots" by shutting down the collector
and making a tarball copy of the /opt/ibm/zimon directory, but the database
corruption/data loss still crept through for various reasons.
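For what it's worth, our weekly snapshot procedure amounted to roughly the sketch below. The helper name, backup path, and the systemd unit name (pmcollector) are assumptions about a typical install, not verified specifics:

```shell
#!/bin/sh
# Tarball snapshot of a zimon collector database directory.
# The collector should be stopped first so the DB files are quiescent.

snapshot_zimon() {
    src=$1    # e.g. /opt/ibm/zimon
    dest=$2   # e.g. /var/backups/zimon
    mkdir -p "$dest"
    # Archive the whole directory, timestamped, relative to its parent
    tar czf "$dest/zimon-$(date +%Y%m%d%H%M%S).tar.gz" \
        -C "$(dirname "$src")" "$(basename "$src")"
}

# In production this would be wrapped with a service stop/start, e.g.:
#   systemctl stop pmcollector
#   snapshot_zimon /opt/ibm/zimon /var/backups/zimon
#   systemctl start pmcollector
```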


In terms of debugging, we do not have Scale or zimon logs going back to the
suspected dates of data loss; we do have a gpfs.snap from about a month
after the last data loss - would it have any useful clues? Opening a PMR
could be tricky, as it was the customer who held the support entitlement,
and the environment (specifically the old cluster definition and the zimon
collector VM) has been torn down.


Many Thanks,
  Keith

-- 
Keith D. Ball, PhD
RedLine Performance Solutions, LLC
web:  http://www.redlineperf.com/
email: kball at redlineperf.com
cell: 540-557-7851

