[gpfsug-discuss] OOM Killer killing off GPFS 3.5

Yuri L Volobuev volobuev at us.ibm.com
Tue May 24 17:17:12 BST 2016


This problem is more complex than it may seem.  The thing is, mmfsd runs as
root, and thus already possesses a certain amount of natural immunity to the
OOM killer.  So adjusting mmfsd oom_score_adj doesn't radically change the
ranking of OOM killer victims, it only tweaks it.  The way things are
supposed to work is: a user process eats up a lot of memory, and once a
threshold is hit, the OOM killer picks off the memory hog, and the memory is
released.  Unprivileged processes inherently have a higher OOM score and
should be killed off first.  If that doesn't work, for some reason, the OOM
killer gets desperate and starts going after root processes.  Once things
get to this point, it's tough.

If you somehow manage to spare mmfsd per se, what's going to happen next?
The OOM killer still needs a victim.  What we've seen happen in such a
situation is semi-random privileged process killing: mmfsd stays alive, but
various other system processes are picked off, and pretty quickly the node
is a basket case.  A Linux node is not very resilient to random process
killing.  And it doesn't help that those other privileged processes usually
don't use much memory, so killing them doesn't release much, and the carnage
keeps on going.

The real problem is: why wasn't the non-privileged memory hog killed off
first, before root processes became fair game?  This is where things get
pretty complicated and depend heavily on the Linux version.  There is one
specific issue that did get diagnosed.  If a process is using mmap and has
page faults in flight that result in GPFS IO, on older versions of GPFS the
process would fail to error out after a SIGKILL, due to locking
complications spanning the Linux kernel VMM and GPFS mmap code.  This means
the OOM killer would attempt to kill a process, but that wouldn't produce
the desired result (the process would still be around), and the OOM killer
would keep moving down the list.  This problem has been fixed in the current
GPFS service levels.  It is possible that a similar problem exists that
prevents a memory hog process from erroring out.  I strongly encourage
opening a PMR to investigate such a situation, instead of trying to work
around it without understanding why mmfsd was targeted in the first place.
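
For reference, the inputs the kernel uses for that ranking can be read
straight out of /proc.  The snippet below is purely illustrative (the
"someuser" placeholder and the choice of processes are assumptions); it just
prints oom_score and oom_score_adj for mmfsd and one user process:

<pre>
#!/bin/bash
# Illustrative only: show the kernel's OOM ranking inputs for the GPFS
# daemon and for one arbitrary user process ("someuser" is a placeholder).
for pid in $(pgrep -x mmfsd) $(pgrep -n -u someuser); do
    printf 'pid=%s oom_score=%s oom_score_adj=%s\n' "$pid" \
        "$(cat /proc/$pid/oom_score)" \
        "$(cat /proc/$pid/oom_score_adj)"
done
</pre>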

This is the case of prevention being the best cure.  Where we've seen
success is customers using cgroups to prevent user processes from running a
node out of memory in the first place.  This has been shown to work well.
Dealing with the fallout from running out of memory is a much harder task.
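
As a rough illustration of that kind of cap (not a specific recommendation;
the group name and the 48G limit are just placeholders), the cgroup v1
memory controller can be driven with the libcgroup tools:

<pre>
# Sketch only: confine user workloads to a capped memory cgroup so the OOM
# killer fires inside the group instead of going after system daemons.
cgcreate -g memory:/userjobs
cgset -r memory.limit_in_bytes=48G memory:/userjobs
cgexec -g memory:/userjobs some_user_command
</pre>

Batch schedulers can typically achieve the equivalent on a per-job basis.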

The post-mmfsd-kill symptoms that are described in the original note are
not normal.  If an mmfsd process is killed, other nodes will become aware
of this fact fairly quickly, and the node is going to be expelled from the
cluster (yes, expels *can* be a good thing).  In the normal case, TCP/IP
sockets are closed as soon as mmfsd is killed, and other nodes immediately
receive TCP RST packets, and close their connection endpoints.  In the
worst case, if a node just becomes catatonic but RST is not sent out, the
troubled node is going to be expelled from the cluster after about 2
minutes of pinging (in a default configuration).  There should definitely
not be a permanent hang that necessitates a manual intervention.  Again,
older versions of GPFS had no protection against surprise OOM thread kills,
but in the current code some self-monitoring capabilities have been added,
and a single troubled node won't have a lasting impact on the cluster.  If
you aren't running with a reasonably current level of GPFS 3.5 service, I
strongly recommend upgrading.  If you see the symptoms originally described
with the current code, that's a bug that we need to fix, so please open a
PMR to address the issue.

yuri



From:	"Sanchez, Paul" <Paul.Sanchez at deshaw.com>
To:	gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>,
Date:	05/24/2016 07:33 AM
Subject:	Re: [gpfsug-discuss] OOM Killer killing off GPFS 3.5
Sent by:	gpfsug-discuss-bounces at spectrumscale.org



Hi Peter,

This is mentioned explicitly in the Spectrum Scale docs (
http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.pdg.doc/bl1pdg_kerncfg.htm?lang=en
) as a problem for the admin to consider, and many of us have been bitten
by this.  There are references going back at least to GPFS 3.1 in 2008 on
developerworks complaining about this situation.

While the answer you described below is essentially what we do as well, I
would argue that this is a problem which IBM should just own and fix for
everyone. I cannot think of a situation in which you would want GPFS to be
sacrificed on a node due to out-of-memory conditions, and I have seen
several terrible consequences of this, including loss of cached,
user-acknowledged writes.

I don't think there are any real gotchas.  But in addition, our own
implementation also:

  * uses "--event preStartup" instead of "startup", since it runs earlier
and reduces the risk of a race

  * reads the score back out and complains if it hasn't been set

  * includes "set -e" to ensure that errors will terminate the script and
return a non-zero exit code to the callback parent
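
Putting those points together with the script Peter posted below, a hardened
variant might look roughly like this (a sketch only, not our exact
production script):

<pre>
#!/bin/bash
# Sketch only: combines the points above with Peter's script below.
set -e    # any failure terminates the script with a non-zero exit code

for proc in $(pgrep mmfs); do
    echo -500 > /proc/$proc/oom_score_adj
    # Read the score back and complain if it didn't take effect.
    if [ "$(cat /proc/$proc/oom_score_adj)" != "-500" ]; then
        echo "oom_score_adj not set for pid $proc" >&2
        exit 1
    fi
done
</pre>

registered with "--event preStartup" rather than "--event startup" on the
mmaddcallback line.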

Thx
Paul

-----Original Message-----
From: gpfsug-discuss-bounces at spectrumscale.org [
mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter Childs
Sent: Tuesday, May 24, 2016 10:01 AM
To: gpfsug main discussion list
Subject: [gpfsug-discuss] OOM Killer killing off GPFS 3.5

Hi All,

We have an issue where the Linux kills off GPFS first when a computer runs
out of memory. We are running GPFS 3.5

We believe this happens when user processes have exhausted memory and swap
and the out of memory killer in Linux chooses to  kill the GPFS daemon as
the largest user of memory, due to its large pinned memory footprint.

This means that GPFS is killed and the whole cluster blocks for a minute
before it resumes operation. This is not ideal, and it causes issues across
most of the cluster.

What we see is users unable to login elsewhere on the cluster until we have
powered off the node. We believe this is because while the node is still
pingable, GPFS doesn't expel it from the cluster.

This issue mainly occurs on the login nodes of our HPC cluster but can
affect the rest of the cluster when it occurs.

I've seen others on list with this issue.

We've come up with a solution to adjust the OOM score of GPFS, so that it
is unlikely to be the first thing to be killed, and hopefully the OOM
killer picks a user process instead.

We've tested this and it seems to work. I'm asking here firstly to share
our knowledge and secondly to ask if there is anything we've missed with
this solution.

It's short, which is part of its beauty.

/usr/local/sbin/gpfs-oom_score_adj

<pre>
#!/bin/bash

for proc in $(pgrep mmfs); do
    echo -500 > /proc/$proc/oom_score_adj
done
</pre>

This can then be called automatically on GPFS startup with the following:

<pre>
mmaddcallback startupoomkiller --command /usr/local/sbin/gpfs-oom_score_adj \
  --event startup
</pre>

and either restart gpfs or just run the script on all nodes.
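
For the latter, something along these lines should work (assuming the script
is installed at the same path on every node and the usual remote shell setup
for GPFS admin commands is in place):

<pre>
mmdsh -N all /usr/local/sbin/gpfs-oom_score_adj
</pre>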

Peter Childs
ITS Research Infrastructure
Queen Mary, University of London
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


