[gpfsug-discuss] GPFS admin host name vs subnets

Sven Oehme oehmes at gmail.com
Tue Sep 2 15:11:03 BST 2014


Ed,

if you enable RDMA, GPFS will always use it as the preferred data transfer path.
if you have subnets configured, GPFS will prefer them for communication,
with higher priority than the default interface.
so the order is RDMA, subnets, default.
if RDMA fails for whatever reason, we will use the interface defined by the
subnets setting, and if that fails as well we will use the default interface.
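
to make the ordering concrete, here is a minimal sketch of that fallback in
Python. it is only an illustration of the preference order described above,
not how the daemon is implemented; the function name and the dictionary of
availability flags are made up for this example.

```python
from typing import Dict, Optional

# Preference order as described above: RDMA first, then an interface that
# matches the subnets setting, then the default (daemon) interface.
PREFERENCE = ("rdma", "subnets", "default")


def pick_transport(available: Dict[str, bool]) -> Optional[str]:
    """Return the first usable transport in preference order, or None.

    `available` is a hypothetical map of transport name -> usable right now.
    """
    for name in PREFERENCE:
        if available.get(name, False):
            return name
    return None


if __name__ == "__main__":
    # RDMA is down, so the subnets interface is chosen before the default one.
    print(pick_transport({"rdma": False, "subnets": True, "default": True}))
```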

the easiest way to see what is used is to run mmdiag --network (only
available on more recent versions of GPFS). it will tell you whether RDMA is
enabled between individual nodes as well as whether a subnet connection is
used or not:

[root@client05 ~]# mmdiag --network

=== mmdiag: network ===

Pending messages:
  (none)
Inter-node communication configuration:
  tscTcpPort      1191
  my address      192.167.13.5/16 (eth0) <c0n6>
  my addr list    192.1.13.5/16 (ib1)  192.0.13.5/16 (ib0)/
client04.clientad.almaden.ibm.com  192.167.13.5/16 (eth0)
  my node number  17
TCP Connections between nodes:
  Device ib0:
    hostname                            node     destination     status     err  sock  sent(MB)  recvd(MB)  ostype
    client04n1                           <c1n0>   192.0.4.1       connected  0    69    0         37         Linux/L
    client04n2                           <c1n1>   192.0.4.2       connected  0    70    0         37         Linux/L
    client04n3                           <c1n2>   192.0.4.3       connected  0    68    0         0          Linux/L
  Device ib1:
    hostname                            node     destination     status     err  sock  sent(MB)  recvd(MB)  ostype
    clientcl21                           <c0n0>   192.1.201.21    connected  0    65    0         0          Linux/L
    clientcl25                           <c0n3>   192.1.201.25    connected  0    66    0         0          Linux/L
    clientcl26                           <c0n4>   192.1.201.26    connected  0    67    0         0          Linux/L
    clientcl21                           <c1n3>   192.1.201.21    connected  0    71    0         0          Linux/L
    clientcl22                           <c1n4>   192.1.201.22    connected  0    63    0         0          Linux/L
    client10                            <c1n5>   192.1.13.10     connected  0    73    0         0          Linux/L
    client08                            <c1n7>   192.1.13.8      connected  0    72    0         0          Linux/L
RDMA Connections between nodes:
  Fabric 1 - Device mlx4_0 Port 1 Width 4x Speed FDR lid 13
    hostname                            idx CM state VS buff RDMA_CT(ERR) RDMA_RCV_MB RDMA_SND_MB VS_CT(ERR) VS_SND_MB VS_RCV_MB WAIT_CON_SLOT WAIT_NODE_SLOT
    clientcl21                           0   N  RTS   (Y)903  0      (0  ) 0           0           192  (0  ) 0         0         0             0
    client04n1                           0   N  RTS   (Y)477  0      (0  ) 0           0           12367404(0  ) 107905    594       0             0
    client04n1                           1   N  RTS   (Y)477  0      (0  ) 0           0           12367404(0  ) 107901    593       0             0
    client04n2                           0   N  RTS   (Y)477  0      (0  ) 0           0           12371352(0  ) 107911    594       0             0
    client04n2                           2   N  RTS   (Y)477  0      (0  ) 0           0           12371352(0  ) 107902    594       0             0
    clientcl21                           0   N  RTS   (Y)880  0      (0  ) 0           0           11   (0  ) 0         0         0             0
    client04n3                           0   N  RTS   (Y)969  0      (0  ) 0           0           5    (0  ) 0         0         0             0
    clientcl26                           0   N  RTS   (Y)702  0      (0  ) 0           0           35   (0  ) 0         0         0             0
    client08                            0   N  RTS   (Y)637  0      (0  ) 0           0           16   (0  ) 0         0         0             0
    clientcl25                           0   N  RTS   (Y)574  0      (0  ) 0           0           14   (0  ) 0         0         0             0
    clientcl22                           0   N  RTS   (Y)507  0      (0  ) 0           0           2    (0  ) 0         0         0             0
    client10                            0   N  RTS   (Y)568  0      (0  ) 0           0           121  (0  ) 0         0         0             0
  Fabric 2 - Device mlx4_0 Port 2 Width 4x Speed FDR lid 65
    hostname                            idx CM state VS buff RDMA_CT(ERR) RDMA_RCV_MB RDMA_SND_MB VS_CT(ERR) VS_SND_MB VS_RCV_MB WAIT_CON_SLOT WAIT_NODE_SLOT
    clientcl21                           1   N  RTS   (Y)904  0      (0  ) 0           0           192  (0  ) 0         0         0             0
    client04n1                           2   N  RTS   (Y)477  0      (0  ) 0           0           12367404(0  ) 107897    593       0             0
    client04n2                           1   N  RTS   (Y)477  0      (0  ) 0           0           12371352(0  ) 107903    594       0             0
    clientcl21                           1   N  RTS   (Y)881  0      (0  ) 0           0           10   (0  ) 0         0         0             0
    clientcl26                           1   N  RTS   (Y)701  0      (0  ) 0           0           35   (0  ) 0         0         0             0
    client08                            1   N  RTS   (Y)637  0      (0  ) 0           0           16   (0  ) 0         0         0             0
    clientcl25                           1   N  RTS   (Y)574  0      (0  ) 0           0           14   (0  ) 0         0         0             0
    clientcl22                           1   N  RTS   (Y)507  0      (0  ) 0           0           2    (0  ) 0         0         0             0
    client10                            1   N  RTS   (Y)568  0      (0  ) 0           0           121  (0  ) 0         0         0             0

in this example you can see that my client (client05) has multiple subnets
configured as well as RDMA.
it is connected via the various TCP devices (ib0 and ib1) to different
cluster nodes and also has RDMA connections to a different set of nodes.
as you can see there is basically no traffic on the TCP devices, as all the
traffic uses the 2 defined RDMA fabrics.
there is not a single connection using the daemon interface (eth0) as all
nodes are either connected via subnets or via RDMA.
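
if you want to check this without reading the table by eye, a small script
can total the sent/recvd columns per TCP device. the sketch below is only an
illustration: it assumes the column layout shown in the output above
(hostname, node, destination, status, err, sock, sent(MB), recvd(MB),
ostype), which may differ between GPFS versions, and the function name is
made up. it works on whitespace-separated tokens, so it does not matter
whether the rows are wrapped. SAMPLE is a short excerpt of the output above.

```python
import re
from collections import defaultdict

# Node identifiers look like <c1n0> in the output shown above.
NODE_RE = re.compile(r"^<c\d+n\d+>$")

# Short excerpt of the mmdiag --network output quoted in this message.
SAMPLE = """\
TCP Connections between nodes:
  Device ib0:
    hostname                            node     destination     status
err  sock  sent(MB)  recvd(MB)  ostype
    client04n1                           <c1n0>   192.0.4.1       connected
 0    69    0         37         Linux/L
    client04n2                           <c1n1>   192.0.4.2       connected
 0    70    0         37         Linux/L
  Device ib1:
    clientcl21                           <c0n0>   192.1.201.21    connected
 0    65    0         0          Linux/L
RDMA Connections between nodes:
"""


def tcp_traffic_by_device(report):
    """Sum sent/recvd MB per device in the 'TCP Connections' section."""
    start = report.find("TCP Connections between nodes:")
    if start == -1:
        return {}
    end = report.find("RDMA Connections between nodes:", start)
    section = report[start:] if end == -1 else report[start:end]
    tokens = section.split()

    totals = defaultdict(
        lambda: {"connections": 0, "sent_mb": 0.0, "recvd_mb": 0.0}
    )
    device = None
    i = 0
    while i < len(tokens):
        if tokens[i] == "Device" and i + 1 < len(tokens):
            device = tokens[i + 1].rstrip(":")
            i += 2
            continue
        # Assumed layout after the node token:
        # destination, status, err, sock, sent(MB), recvd(MB), ostype
        if device is not None and NODE_RE.match(tokens[i]) and i + 6 < len(tokens):
            try:
                sent = float(tokens[i + 5])
                recvd = float(tokens[i + 6])
            except ValueError:
                i += 1  # layout differs from the assumption; skip this token
                continue
            entry = totals[device]
            entry["connections"] += 1
            entry["sent_mb"] += sent
            entry["recvd_mb"] += recvd
            i += 7
            continue
        i += 1
    return dict(totals)


if __name__ == "__main__":
    for dev, stats in tcp_traffic_by_device(SAMPLE).items():
        print(dev, stats)
```

on a real node you would feed it the text printed by mmdiag --network
instead of SAMPLE. low totals on ib0/ib1 next to large VS_SND_MB values in
the RDMA section are what you would expect when RDMA carries the data.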

hope this helps. Sven



On Tue, Sep 2, 2014 at 6:44 AM, Ed Wahl <ewahl at osc.edu> wrote:

> Seems like you are on the correct track.  This is similar to my setup:
> subnetted daemon 10GbE, 1GbE with main being QDR RDMA, admin 1GbE.  To
> my mind the most important part is setting "privateSubnetOverride" to 1.
> This allows both your 1GbE and your 40GbE to be on a private subnet.
> Serving block over public IPs just seems wrong on SO many levels. Whether
> truly private/internal or not.  And how many people use public IPs
> internally? Wait, maybe I don't want to know...
>
>    Using 'verbsRdma enable' for your FDR seems to override Daemon node
> name for block, at least in my experience.  I love the fallback to 10GbE
> and then 1GbE in case of disaster when using IB.  Lately we seem to be
> generating bugs in OpenSM at a frightening rate so that has been
> _extremely_ helpful. Now if we could just monitor when it happens more
> easily than running mmfsadm test verbs conn, say by logging a failure of
> RDMA?
>
>
> Ed
> OSC
>
> ________________________________________
> From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org]
> on behalf of Simon Thompson (Research Computing - IT Services) [
> S.J.Thompson at bham.ac.uk]
> Sent: Monday, September 01, 2014 3:44 PM
> To: gpfsug main discussion list
> Subject: [gpfsug-discuss] GPFS admin host name vs subnets
>
> I was just reading through the docs at:
>
>
> https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/GPFS+Network+Communication+Overview
>
> And was wondering about using admin host name vs using subnets. My reading
> of the page is that if say I have a 1GbE network and a 40GbE network, I
> could have an admin host name on the 1GbE network. But equally from the
> docs, it looks like I could also use subnets to achieve the same whilst
> allowing the admin network to be a fall back for data if necessary.
>
> For example, create the cluster using the primary name on the 1GbE
> network, then use the subnets property to set the network on the 40GbE
> network as the first and the network on the 1GbE network as the second in
> the list, thus GPFS data will pass over the 40GbE network in preference and
> the 1GbE network will, by default only be used for admin traffic as the
> admin host name will just be the name of the host on the 1GbE network.
>
> Is my reading of the docs correct? Or do I really want to be creating the
> cluster using the 40GbE network hostnames and set the admin node name to
> the name of the 1GbE network interface?
>
> (there's actually also an FDR switch in there somewhere for verbs as well)
>
> Thanks
>
> Simon
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss