[gpfsug-discuss] [non-nasa source] Re: pagepool shrink doesn't release all memory

Aaron Knister aaron.s.knister at nasa.gov
Sun Feb 25 16:59:45 GMT 2018


Oh, and I think you're absolutely right about the rdma interaction. If I 
stop the infiniband service on a node and try the same exercise again, I 
can jump between 100G and 1G several times and the freed memory is
actually released.
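
For reference, the shape of the test was roughly this (the IB service name
varies by distro and OFED stack, so treat the exact commands as a sketch):

# mmlsconfig verbsRdma        # check whether GPFS has RDMA enabled at all
# service openibd stop        # or "systemctl stop rdma", depending on the stack
# tschpool 100G
# tschpool 1G
# free -g                     # this time the freed pagepool memory actually comes back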

-Aaron

On 2/25/18 11:54 AM, Aaron Knister wrote:
> Hi Stijn,
> 
> Thanks for sharing your experiences-- I'm glad I'm not the only one 
> who's had the idea (and come up empty-handed).
> 
> About the pagepool and numa awareness, I'd remembered seeing something 
> about that somewhere and I did some googling and found there's a 
> parameter called numaMemoryInterleave that "starts mmfsd with numactl 
> --interleave=all". Do you think that provides the kind of numa awareness 
> you're looking for?
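> 
> For the record, I *think* enabling it would look roughly like this
> (untested on my end, and as far as I can tell it needs a daemon restart
> to take effect; <nodeclass> is just a placeholder):
> 
> # mmchconfig numaMemoryInterleave=yes -N <nodeclass>
> # mmshutdown -N <nodeclass> && mmstartup -N <nodeclass>
> # numactl --hardware     # sanity-check the numa layout mmfsd should interleave across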
> 
> -Aaron
> 
> On 2/23/18 9:44 AM, Stijn De Weirdt wrote:
>> hi all,
>>
>> we had the same idea long ago, afaik the issue we had was due to the
>> pinned memory the pagepool uses when RDMA is enabled.
>>
>> at some point we restarted gpfs on the compute nodes for each job,
>> similar to the way we do swapoff/swapon; but in certain scenarios gpfs
>> really did not like it; so we gave up on it.
>>
>> the other issue that needs to be resolved is that the pagepool needs to
>> be numa aware, so the pagepool is nicely allocated across all numa
>> domains instead of just the first ones available; otherwise compute
>> jobs might end up doing only non-local (remote numa domain) memory
>> access.
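>>
>> a quick way to check where the pagepool actually lands (assuming
>> numastat from the numactl package is installed) is the per-node
>> breakdown of mmfsd's memory:
>>
>> # numastat -p mmfsd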
>>
>> stijn
>>
>> On 02/23/2018 03:35 PM, IBM Spectrum Scale wrote:
>>> AFAIK you can increase the pagepool size dynamically but you cannot
>>> shrink it dynamically.  To shrink it you must restart the GPFS daemon.  Also,
>>> could you please provide the actual pmap commands you executed?
>>>
>>> Regards, The Spectrum Scale (GPFS) team
>>>
>>> ------------------------------------------------------------------------------------------------------------------ 
>>>
>>> If you feel that your question can benefit other users of Spectrum Scale
>>> (GPFS), then please post it to the public IBM developerWorks Forum at
>>> https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479
>>>
>>> If your query concerns a potential software error in Spectrum Scale 
>>> (GPFS)
>>> and you have an IBM software maintenance contract please contact
>>> 1-800-237-5511 in the United States or your local IBM Service Center in
>>> other countries.
>>>
>>> The forum is informally monitored as time permits and should not be used
>>> for priority messages to the Spectrum Scale (GPFS) team.
>>>
>>>
>>>
>>> From:   Aaron Knister <aaron.s.knister at nasa.gov>
>>> To:     <gpfsug-discuss at spectrumscale.org>
>>> Date:   02/22/2018 10:30 PM
>>> Subject:        Re: [gpfsug-discuss] pagepool shrink doesn't release all
>>> memory
>>> Sent by:        gpfsug-discuss-bounces at spectrumscale.org
>>>
>>>
>>>
>>> This is also interesting (although I don't know what it really means).
>>> Looking at pmap output for mmfsd, I can see what happens after each step:
>>>
>>> # baseline
>>> 00007fffe4639000  59164K      0K      0K      0K      0K ---p [anon]
>>> 00007fffd837e000  61960K      0K      0K      0K      0K ---p [anon]
>>> 0000020000000000 1048576K 1048576K 1048576K 1048576K      0K rwxp [anon]
>>> Total:           1613580K 1191020K 1189650K 1171836K      0K
>>>
>>> # tschpool 64G
>>> 00007fffe4639000  59164K      0K      0K      0K      0K ---p [anon]
>>> 00007fffd837e000  61960K      0K      0K      0K      0K ---p [anon]
>>> 0000020000000000 67108864K 67108864K 67108864K 67108864K      0K rwxp [anon]
>>> Total:           67706636K 67284108K 67282625K 67264920K      0K
>>>
>>> # tschpool 1G
>>> 00007fffe4639000  59164K      0K      0K      0K      0K ---p [anon]
>>> 00007fffd837e000  61960K      0K      0K      0K      0K ---p [anon]
>>> 0000020001400000 139264K 139264K 139264K 139264K      0K rwxp [anon]
>>> 0000020fc9400000 897024K 897024K 897024K 897024K      0K rwxp [anon]
>>> 0000020009c00000 66052096K      0K      0K      0K      0K rwxp [anon]
>>> Total:           67706636K 1223820K 1222451K 1204632K      0K
>>>
>>> Even though mmfsd has that 64G chunk allocated there's none of it
>>> *used*. I wonder why Linux seems to be accounting it as allocated.
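>>>
>>> (Side note: the exact pmap invocation didn't make it into this mail;
>>> something along the lines of the below should show a similar
>>> per-mapping view, though the flags used above were likely different.)
>>>
>>> # pmap -x $(pidof mmfsd) | tail -n 5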
>>>
>>> -Aaron
>>>
>>> On 2/22/18 10:17 PM, Aaron Knister wrote:
>>>> I've been exploring the idea for a while of writing a SLURM SPANK plugin
>>>> to allow users to dynamically change the pagepool size on a node. Every
>>>> now and then we have some users who would benefit significantly from a
>>>> much larger pagepool on compute nodes but by default keep it on the
>>>> smaller side to make as much physmem available as possible to batch
>>>> work.
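>>>>
>>>> (The real thing would be a proper SPANK plugin in C, but the guts of
>>>> the idea is basically a prolog/epilog pair along these lines -- the
>>>> "pagepool=" comment convention and the sizes are invented for
>>>> illustration:)
>>>>
>>>> # prolog: grow the pagepool if the job asked for it, e.g. via --comment=pagepool=32G
>>>> REQ=$(scontrol show job "$SLURM_JOB_ID" | grep -oP 'pagepool=\K[0-9]+G' || true)
>>>> [ -n "$REQ" ] && /usr/lpp/mmfs/bin/tschpool "$REQ"
>>>>
>>>> # epilog: shrink back to the default small pagepool
>>>> /usr/lpp/mmfs/bin/tschpool 1G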
>>>>
>>>> In testing, though, it seems as though reducing the pagepool doesn't
>>>> quite release all of the memory. I don't really understand it because
>>>> I've never before seen memory that was previously resident become
>>>> un-resident but still maintain the virtual memory allocation.
>>>>
>>>> Here's what I mean. Let's take a node with 128G and a 1G pagepool.
>>>>
>>>> If I do the following to simulate what might happen as various jobs
>>>> tweak the pagepool:
>>>>
>>>> - tschpool 64G
>>>> - tschpool 1G
>>>> - tschpool 32G
>>>> - tschpool 1G
>>>> - tschpool 32G
>>>>
>>>> I end up with this:
>>>>
>>>> mmfsd thinks there's 32G resident but 64G virt
>>>> # ps -o vsz,rss,comm -p 24397
>>>>      VSZ   RSS COMMAND
>>>> 67589400 33723236 mmfsd
>>>>
>>>> however, linux thinks there's ~100G used
>>>>
>>>> # free -g
>>>>              total       used       free     shared    buffers     cached
>>>> Mem:           125        100         25          0          0          0
>>>> -/+ buffers/cache:         98         26
>>>> Swap:            7          0          7
>>>>
>>>> I can jump back and forth between 1G and 32G *after* allocating 64G
>>>> pagepool and the overall amount of memory in use doesn't balloon but I
>>>> can't seem to shed that original 64G.
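>>>>
>>>> For what it's worth, the whole exercise boils down to something like
>>>> this (watching free after every step):
>>>>
>>>> # for sz in 64G 1G 32G 1G 32G; do tschpool $sz; free -g | grep Mem; done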
>>>>
>>>> I don't understand what's going on... :) Any ideas? This is with Scale
>>>> 4.2.3.6.
>>>>
>>>> -Aaron
>>>>
>>>
>>>
>>>
> 

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776


