[gpfsug-discuss] excessive lowDiskSpace events (how is threshold triggered?)

Tue Feb 25 21:29:43 GMT 2014

On 25/02/14 20:17, mark.bergman at uphs.upenn.edu wrote:
>
> I'm running GPFS 3.5.0.9 under Linux, and I'm seeing what seem to be an
> excessive number of lowDiskSpace events on the "system" pool.
>
> I've got an mmcallback set up, including a log report of which pool is
> triggering the lowDiskSpace callback.

Bear in mind that once you hit a lowDiskSpace event, your callback will
helpfully be called every two minutes until the condition is cleared. So
your callback needs to have locking; otherwise a fresh mmapplypolicy
gets launched on top of the running one every two minutes if it takes
more than two minutes to clear the lowDiskSpace event.
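
A minimal sketch of that kind of locking, in Python for illustration.
The lock file path and the run_exclusively helper are made-up names,
not anything GPFS provides, and it assumes the callback runs on Linux
where fcntl.flock is available:

```python
import fcntl
import os
import tempfile

# Hypothetical lock location; any path writable by the callback will do.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "lowdiskspace-callback.lock")


def run_exclusively(action, lock_path=LOCK_PATH):
    """Run action() unless another invocation already holds the lock.

    Returns True if action() ran, False if it was skipped.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fail at once instead of queueing.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False
        action()
        return True
    # Closing the file on leaving the with block releases the lock.
```

The real callback would pass in a function that launches mmapplypolicy;
an invocation that finds the lock held simply exits and leaves the
running migration alone.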

>
> The part that is confusing me is that the "system" pool doesn't seem to be
> above the policy thresholds.
>
> For example, 'mmdf' shows that there is about 26% free in the 'system' pool:
>
> -------------------------
> disk                disk size  failure holds    holds                 free free
> name                             group metadata data        in full blocks in fragments
> --------------- ------------- -------- -------- ----- -------------------- -------------------
> Disks in storage pool: system (Maximum disk size allowed is 33 TB)
> dx80_rg16_vol1           546G       -1 yes      yes          125.1G ( 23%) 23.96G ( 4%)
> dx80_rg4_vol1            546G        1 yes      yes          108.1G ( 20%) 33.84G ( 6%)
> dx80_rg13_vol1           546G        1 yes      yes            109G ( 20%) 32.78G ( 6%)
> dx80_rg6_vol1            546G        1 yes      yes          104.4G ( 19%) 35.61G ( 7%)
> dx80_rg3_vol1            546G        1 yes      yes          105.6G ( 19%) 35.29G ( 6%)
>                  -------------                         -------------------- -------------------
> (pool total)           2.666T                                552.1G ( 20%) 161.5G ( 6%)
> -------------------------

Bear in mind these are rounded numbers. You cannot add the two
percentages together and get a completely accurate picture: each one is
only shown as a whole percent, so the sum can easily be a point out.
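
A rough illustration of how much slack that leaves, as a Python sketch.
It assumes mmdf rounds each percentage to the nearest whole number,
which is a guess at its behaviour rather than something documented
here; the function name is made up:

```python
def used_range_from_rounded(free_block_pct, free_fragment_pct):
    """Return (lowest, highest) possible % used for two displayed free percentages.

    Assumes each displayed value was rounded to the nearest integer, so
    the true value lies within half a point either side of it.
    """
    free_low = (free_block_pct - 0.5) + (free_fragment_pct - 0.5)
    free_high = (free_block_pct + 0.5) + (free_fragment_pct + 0.5)
    return 100.0 - free_high, 100.0 - free_low


low, high = used_range_from_rounded(20, 6)
print(f"pool is between {low:.0f}% and {high:.0f}% used")
# prints "pool is between 73% and 75% used"
```

So the pool total line on its own cannot tell 74% full apart from 75%
full.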

[SNIP]

>
> /* next threshold: some free space, move middle-aged files */
> RULE 'move files that have not been changed in 7 days from the system pool to dx80_medium' MIGRATE FROM POOL 'system'
>          TO POOL 'dx80_medium'
>          THRESHOLD(75,65)
>          LIMIT(95)
>          WEIGHT(KB_ALLOCATED)
>          WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(CHANGE_TIME) > 7 )
>          AND KB_ALLOCATED >= 1024
> -------------------------
>
>
> As I understand it, none of those rules should trigger a lowDiskSpace event
> when the pool is 74% full, as it is now.

I would say 74% and 75% are very close, and you are not taking into
account that the 20% and 6% are rounded values. Adding them together
gives a result that is just wrong enough to matter: the pool may really
be full enough to trigger the lowDiskSpace event.

> Is the threshold in a file migration policy based on the %free (or used) in
> full blocks only, or in the sum of full blocks plus fragments?

What does mmdf without a --block-size option, or with --block-size 1K,
look like, and what does doing the accurate maths then reveal?
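
The maths in question can be sketched in Python as below, here fed with
the pool total figures quoted above. Those G and T values are
themselves rounded, and the sketch assumes they are binary units
(1T = 1024G). It works out both readings of the threshold question,
without claiming to know which one GPFS uses:

```python
GIB_PER_TIB = 1024  # assumption: mmdf's G and T are binary units


def percent_used(total, free):
    """Percentage of the pool in use, given total and free space in the same unit."""
    return 100.0 * (1.0 - free / total)


# Figures from the "(pool total)" line of the mmdf output quoted above.
pool_total_g = 2.666 * GIB_PER_TIB
free_full_blocks_g = 552.1
free_fragments_g = 161.5

# Reading 1: free fragments count as free space.
used_counting_fragments = percent_used(
    pool_total_g, free_full_blocks_g + free_fragments_g
)
# Reading 2: only free full blocks count as free space.
used_full_blocks_only = percent_used(pool_total_g, free_full_blocks_g)

print(f"used, fragments counted as free: {used_counting_fragments:.2f}%")
print(f"used, full blocks only:          {used_full_blocks_only:.2f}%")
```

Feeding it the 1K figures rather than these already-rounded ones gives
the number to compare against THRESHOLD(75,65).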

My guess is you are that tiny bit fuller than you think due to rounding
errors, and then you are getting hit by the callback being called every
two minutes until it clears.

JAB.

-- 
Jonathan A. Buzzard                 Email: jonathan (at) buzzard.me.uk
Fife, United Kingdom.