[gpfsug-discuss] reserving memory for GPFS process

Skylar Thompson skylar2 at u.washington.edu
Tue Dec 20 17:18:35 GMT 2016


When using m_mem_free on GE with cgroup=true, GE just depends on the kernel
OOM killer. The OOM killer is invoked per cgroup when a cgroup hits its memory
limit, so when a job goes off the rails, only that job's processes are
eligible for OOM killing.

I'm not sure how Slurm does it, but anything that uses cgroups should have
the above behavior.
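
For reference, the underlying mechanism looks roughly like this with the
cgroup v1 memory controller (the cgroup name, PID variable, and the 60G limit
below are invented for illustration; they are not the names GE itself uses):

    mkdir /sys/fs/cgroup/memory/job123
    echo 60G > /sys/fs/cgroup/memory/job123/memory.limit_in_bytes   # hard cap
    echo $JOB_PID > /sys/fs/cgroup/memory/job123/cgroup.procs       # confine the job

If the job's processes push past 60G, the OOM killer is invoked for job123
and picks its victim from that cgroup only, so the GPFS daemon (mmfsd) and
other system processes outside the cgroup are not candidates for that kill.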

On Tue, Dec 20, 2016 at 12:15:23PM -0500, Brian Marshall wrote:
> Skyrim equals Slurm.  Mobile shenanigans.
> 
> Brian
> 
> On Dec 20, 2016 12:07 PM, "Brian Marshall" <mimarsh2 at vt.edu> wrote:
> 
> > We use Adaptive's Moab / Torque right now but are thinking about going to
> > Skyrim
> >
> > Brian
> >
> > On Dec 20, 2016 11:38 AM, "Buterbaugh, Kevin L" <
> > Kevin.Buterbaugh at vanderbilt.edu> wrote:
> >
> >> Hi Brian,
> >>
> >> It would be helpful to know what scheduling software, if any, you use.
> >>
> >> We were a PBS / Moab shop for a number of years but switched to SLURM two
> >> years ago.  With both you can configure the maximum amount of memory
> >> available to all jobs on a node, so we simply "reserve" however much we
> >> need for GPFS and other "system" processes.
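
A minimal sketch of how that kind of reservation is typically expressed in
Slurm (the node names, core counts, and sizes below are placeholders, not
anyone's actual configuration): declare RealMemory smaller than the physical
RAM so the difference can never be handed out to jobs, and let task/cgroup
enforce each job's own request.

    # slurm.conf: a 128 GB node advertised as ~117 GB, keeping the rest
    # for GPFS and the OS
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory
    TaskPlugin=task/cgroup
    NodeName=node[001-100] CPUs=28 RealMemory=120000 State=UNKNOWN

    # cgroup.conf: contain jobs that exceed the memory they requested
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
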
> >>
> >> I can tell you that SLURM is *much* more efficient at killing processes
> >> as soon as they exceed the amount of memory they've requested than PBS /
> >> Moab ever dreamed of being.
> >>
> >> Kevin
> >>
> >> On Dec 20, 2016, at 10:27 AM, Skylar Thompson <skylar2 at u.washington.edu>
> >> wrote:
> >>
> >> We're a Grid Engine shop, and use cgroups (via the m_mem_free complex) to
> >> control user process memory usage. In the GE exec host configuration, we
> >> reserve 4GB for the OS (including GPFS) so jobs are not able to consume
> >> all the physical memory on the system.
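
Concretely, the relevant exec host entry looks something like the sketch
below (the hostname and the numbers are invented; edit yours with
"qconf -me <hostname>"). On a 64 GB node, advertising only 60 GB of
m_mem_free keeps the last 4 GB back for the OS and GPFS:

    hostname              node001
    complex_values        m_mem_free=60G

Jobs then request memory with "-l m_mem_free=...", and GE will not schedule
more than the 60 GB the host advertises.
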
> >>
> >> On Tue, Dec 20, 2016 at 11:25:04AM -0500, Brian Marshall wrote:
> >>
> >> All,
> >>
> >> What is your favorite method for stopping a user process from eating up
> >> all the system memory while leaving 1 GB (or more) for the GPFS / system
> >> processes?  We have always kicked around the idea of cgroups but never
> >> moved on it.
> >>
> >> The problem:  A user launches a job which uses all the memory on a node,
> >> which causes the node to be expelled, which causes brief filesystem
> >> slowness everywhere.
> >>
> >> I bet this problem has already been solved and I am just googling the
> >> wrong search terms.
> >>
> >>
> >> Thanks,
> >> Brian
> >>
> >>
> >> --
> >> Kevin Buterbaugh - Senior System Administrator
> >> Vanderbilt University - Advanced Computing Center for Research and
> >> Education
> >> Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633
> >>


-- 
-- Skylar Thompson (skylar2 at u.washington.edu)
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine


