[gpfsug-discuss] data interface and management interface.

Bob Oesterlin oester at gmail.com
Mon Jul 13 18:42:47 BST 2015


Some thoughts on node expels, based on the last 2-3 months of "expel hell"
here. We've spent a lot of time looking at this issue, across multiple
clusters. A big thanks to IBM for helping us home in on the right issues.
First, you need to understand whether the expels are due to an "expired
lease" message or to "communication issues". It sounds like you are talking
about the latter. In the case of nodes being expelled due to communication
issues, it's more likely the problem is related to network congestion. This
can occur at many levels - the node, the network, or the switch.
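
If it helps with that first step, here's a rough Python sketch of how you
could pull the expel- and lease-related lines out of the GPFS log to see
which case you're in. The log path is the usual default and the keywords are
just the ones mentioned above - adjust both for your install:

#!/usr/bin/env python3
# Rough triage sketch: print expel/lease-related lines from the GPFS log
# so you can see whether the expels mention an expired lease or not.
# /var/adm/ras/mmfs.log.latest is the usual default path - adjust if needed.
import re

LOG = "/var/adm/ras/mmfs.log.latest"

with open(LOG, errors="replace") as f:
    for line in f:
        if re.search(r"expel|lease", line, re.IGNORECASE):
            print(line.rstrip())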

When it's a communication issue, changing parameters like "missed ping
timeout" isn't going to help you. The problem for us ended up being that
GPFS wasn't getting a response to a periodic "keep alive" poll to the node,
and after 300 seconds it declared the node dead and expelled it. You can
tell if this is the issue by looking at the RPC waiters just before the
expel. If you see something like a "Waiting for poll on sock" RPC, that
means the node is waiting for that periodic poll to return and isn't seeing
it. The response is either lost in the network, sitting on the network
queue, or the node is too busy to send it. You may also see RPCs like
"waiting for exclusive use of connection" - this is another clear
indication of network congestion.
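
If you want to watch for those waiters as they show up, here's a rough
sketch of a watcher script. It assumes mmdiag lives in the usual
/usr/lpp/mmfs/bin location and that "mmdiag --waiters" is available on your
GPFS level - verify both before relying on it:

#!/usr/bin/env python3
# Rough sketch: watch GPFS waiters for the congestion-related RPCs above.
# Assumes mmdiag is in the usual install path and that "mmdiag --waiters"
# is available on your GPFS level - verify both.
import subprocess
import time

MMDIAG = "/usr/lpp/mmfs/bin/mmdiag"  # assumed default install path
PATTERNS = (
    "waiting for poll on sock",                 # keep-alive poll not returning
    "waiting for exclusive use of connection",  # send side backed up
)

def congested_waiters():
    out = subprocess.run([MMDIAG, "--waiters"],
                         capture_output=True, text=True).stdout
    return [w for w in out.splitlines()
            if any(p in w.lower() for p in PATTERNS)]

while True:
    for waiter in congested_waiters():
        print(time.strftime("%H:%M:%S"), waiter)
    time.sleep(5)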

Look at the GPFSUG presentations (http://www.gpfsug.org/presentations/) for
the one by Jason Hick (NERSC) - he also talks about these issues. You need
to take a look at net.ipv4.tcp_wmem and net.ipv4.tcp_rmem, especially if
you have client nodes that are on slower network interfaces.
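
A quick way to see what you're currently running with is to read those
sysctls straight out of /proc. The 16 MiB ceiling below is only a
placeholder for the comparison, not a recommendation - pick maximums that
make sense for your links:

#!/usr/bin/env python3
# Print the current TCP buffer sysctls (Linux). The ceiling used for the
# warning is a hypothetical placeholder, not a tuning recommendation.
SYSCTLS = ("net/ipv4/tcp_rmem", "net/ipv4/tcp_wmem")
PLACEHOLDER_MAX = 16 * 1024 * 1024  # hypothetical 16 MiB ceiling

for name in SYSCTLS:
    with open("/proc/sys/" + name) as f:
        lo, default, hi = (int(x) for x in f.read().split())
    note = "" if hi >= PLACEHOLDER_MAX else "  <-- max may be low for fast links"
    print("%s: min=%d default=%d max=%d%s"
          % (name.replace("/", "."), lo, default, hi, note))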

In our case, it was a combination of factors - adjusting these settings,
looking at congestion at the switch level, and some physical hardware
issues.

I would be happy to discuss this in more detail (offline) if you want.
There are no simple solutions. :-)


Bob Oesterlin, Sr Storage Engineer, Nuance Communications
robert.oesterlin at nuance.com


On Mon, Jul 13, 2015 at 11:45 AM, Scott D <sdenham at gmail.com> wrote:

> I spent a good deal of time exploring this topic when I was at IBM. I
> think there are two key aspects here; the congestion of the actual
> interfaces on the [cluster, FS, token] management nodes and competition for
> other resources like CPU cycles on those nodes.  When using a single
> Ethernet interface (or for that matter IB RDMA + IPoIB over the same
> interface), at some point the two kinds of traffic begin to conflict. The
> management traffic being much more time sensitive suffers as a result.  One
> solution is to separate the traffic.  For larger clusters though (1000s of
> nodes), a better solution, one that may avoid needing a 2nd interface on
> every client node, is to add dedicated nodes as managers and not rely on
> NSD servers for this.  It does cost you some modest servers and GPFS server
> licenses.  My previous client generally used previous-generation retired
> compute nodes for this job.
>