[gpfsug-discuss] GPFS heartbeat network specifications and resilience

Alex Chekholko chekh at stanford.edu
Fri Jul 22 17:25:49 BST 2016


Hi Ashish,

Can you describe in more detail what problem you are trying to solve?  And
what failure mode you are trying to avoid?

GPFS depends on uninterrupted network access between the cluster members 
(well, mainly between each cluster member and the current cluster 
manager node), but there are many ways to ensure that, and many ways to 
recover from interruptions.
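
A quick way to see which node currently holds the cluster manager role
is mmlsmgr.  A sketch (the node name and address here are made up, and
the exact output varies a bit by release):

   # mmlsmgr -c
   Cluster manager node: 10.10.0.1 (gpfs-nsd01)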

On the tuning side, we tend to set
minMissedPingTimeout 30
pingPeriod 5

Bump those up if the network or the systems get busy.  Performance and 
latency will still suffer while things are busy, but at least cluster 
members won't be expelled.
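
Those are applied with mmchconfig, something like the following (the -i
flag makes the change take effect immediately where the attribute
allows it; check the mmchconfig documentation for your release):

   # mmchconfig minMissedPingTimeout=30,pingPeriod=5 -i
   # mmlsconfig | grep -i ping    # verify the new values

Roughly speaking, pingPeriod is how often (in seconds) the cluster
manager pings a node whose disk lease is overdue, and
minMissedPingTimeout sets a floor on how long those pings can go
unanswered before the node becomes a candidate for expulsion.  So with
the values above, a node has to be unreachable on the order of 30
seconds before it gets expelled.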

Regards,
Alex

On 07/21/2016 03:26 AM, Ashish Thandavan wrote:
> Dear all,
>
> Could anyone point me at the specifications required for the GPFS
> heartbeat network? Are there any figures for latency, jitter, etc.
> that one should be aware of?
>
> I also have a related question about resilience. Our three GPFS NSD
> servers each use a single network port and communicate heartbeat
> traffic over a private VLAN. We are looking at improving the
> resilience of this setup by adding a second network link on each
> server (connected to a different member of a pair of stacked switches
> than the existing link) and running the heartbeat network over bonded
> interfaces on the three servers. Are there any recommendations as to
> which network bonding mode to use?
>
> Based on the name alone, Mode 1 (active-backup) appears to be the ideal
> choice, and I believe the switches do not need any special
> configuration. However, it has been suggested that Mode 4 (802.3ad,
> i.e. LACP) might be the way to go; this aggregates the two ports but
> does require the relevant switch ports to be configured for LACP.
> Is there a recommended bonding mode?
>
> If anyone here currently uses bonded interfaces for their GPFS heartbeat
> traffic, may I ask what type of bond you have configured? Have you had
> any problems with the setup? And, more importantly, has it helped keep
> the cluster up and running when one network link has gone down?
>
> Thank you,
>
> Regards,
> Ash
>
>
>
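
On the bonding question: for reference, a minimal active-backup (mode 1)
bond on a RHEL-style system looks roughly like the sketch below; the
interface names, the address, and the miimon value are illustrative
only, not a recommendation:

   # /etc/sysconfig/network-scripts/ifcfg-bond0
   DEVICE=bond0
   TYPE=Bond
   BONDING_MASTER=yes
   BONDING_OPTS="mode=active-backup miimon=100"
   BOOTPROTO=none
   IPADDR=10.10.0.1
   PREFIX=24
   ONBOOT=yes

   # /etc/sysconfig/network-scripts/ifcfg-eth0  (and similarly for eth1)
   DEVICE=eth0
   MASTER=bond0
   SLAVE=yes
   BOOTPROTO=none
   ONBOOT=yes

Mode 1 needs no switch-side configuration; mode 4 (802.3ad/LACP) would
additionally need the two switch ports configured as an LACP
port-channel, and with stacked switches that means the stack has to
support a channel spanning both members.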

-- 
Alex Chekholko chekh at stanford.edu



