[gpfsug-discuss] mmdiag output questions

Luke Raimbach luke.raimbach at oerc.ox.ac.uk
Tue Sep 9 11:23:47 BST 2014


Hi All,

While chasing down a problem recently (which turned out to be a NIC failure), mmdiag proved useful for identifying broken cluster connections. I have some questions about the output of mmdiag with the --network switch:

Occasionally I see nodes in the same cluster grouped, but in no readily identifiable way. For example, the first output below has three separate "Device bond0:" headings, each listing some of the nodes, but the nodes within a group don't appear to share anything in common such as status, err, or ostype.

Also, is anyone able to explain what might be seen under the err column? Do these values correspond to GPFS error codes as one might see in mmfs.log.latest? What is the sock column displaying - the number of open sockets, or the socket state? Lastly, the sent/recvd figures seem very low. Is there a rolling time window within which these statistics are kept in some internal mmfsd buffer? (Two rough sketches for poking at this output follow the sample outputs below.)

Cheers.

=== mmdiag: network ===

Pending messages:
  (none)
Inter-node communication configuration:
  tscTcpPort      1191
  my address      10.100.10.51/22 (eth0) <c0n8>
  my addr list    10.200.21.1/16 (bond0)/cpdn.oerc.local  10.100.10.51/22 (eth0)
  my node number  9
TCP Connections between nodes:
  Device bond0:
    hostname                            node     destination     status     err  sock  sent(MB)  recvd(MB)  ostype
    gpfs01                              <c0n0>   10.200.1.1      connected  0    32    110       110        Linux/L
    gpfs02                              <c0n1>   10.200.2.1      connected  0    36    104       104        Linux/L
    linux                               <c0n2>   10.200.101.1    connected  0    37    0         0          Linux/L
    jupiter                             <c0n3>   10.200.102.1    connected  0    35    0         0          Windows/L
    cnfs0                               <c0n4>   10.200.10.10    connected  0    39    0         0          Linux/L
    cnfs1                               <c0n5>   10.200.10.11    init       0    -1    0         0          Linux/L
  Device bond0:
    hostname                            node     destination     status     err  sock  sent(MB)  recvd(MB)  ostype
    cnfs2                               <c0n6>   10.200.10.12    connected  0    33    5         5          Linux/L
    cnfs3                               <c0n7>   10.200.10.13    init       0    -1    0         0          Linux/L
    cpdn-ppc02                          <c0n9>   10.200.61.1     init       0    -1    0         0          Linux/L
    cpdn-ppc03                          <c0n10>  10.200.62.1     init       0    -1    0         0          Linux/L
  Device bond0:
    hostname                            node     destination     status     err  sock  sent(MB)  recvd(MB)  ostype
    cpdn-ppc01                          <c0n11>  10.200.60.1     connected  0    38    0         0          Linux/L
diag verbs: VERBS RDMA class not initialized
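
A minimal sketch for picking the not-connected entries out of this output, assuming the column layout shown above holds (the exact format may well differ between GPFS releases, so adjust the field parsing to suit), would be something like:

#!/usr/bin/env python3
"""Parse 'mmdiag --network' output and flag nodes whose TCP
connection is not in the 'connected' state.

Assumes the column layout shown in the output above; the exact
format may differ between GPFS releases.
"""
import subprocess
import sys

def parse_connections(text):
    """Yield (hostname, node, destination, status, err, sock) tuples."""
    in_table = False
    for line in text.splitlines():
        if line.lstrip().startswith("hostname"):
            in_table = True          # header row of a connection table
            continue
        if in_table:
            fields = line.split()
            # Data rows have 9 columns (hostname .. ostype) and a <cNnN> tag
            if len(fields) >= 9 and fields[1].startswith("<"):
                yield fields[:6]
            else:
                in_table = False     # end of this 'Device' section

def main():
    # Read from a saved file if given, otherwise run mmdiag directly.
    if len(sys.argv) > 1:
        text = open(sys.argv[1]).read()
    else:
        text = subprocess.run(["mmdiag", "--network"],
                              capture_output=True, text=True).stdout
    for host, node, dest, status, err, sock in parse_connections(text):
        if status != "connected":
            print(f"{host} ({node}, {dest}): status={status} err={err} sock={sock}")

if __name__ == "__main__":
    main()

Run against the first output above, that would report the cnfs1, cnfs3, cpdn-ppc02 and cpdn-ppc03 entries still sitting in init.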


Conversely, the output of mmdiag --network on the file system manager node for the same cluster looks like this:

=== mmdiag: network ===

Pending messages:
  (none)
Inter-node communication configuration:
  tscTcpPort      1191
  my address      10.100.10.21/22 (eth0) <c0n0>
  my addr list    10.200.1.1/16 (bond0)/cpdn.oerc.local  10.100.10.21/22 (eth0)
  my node number  1
TCP Connections between nodes:
  Device bond0:
    hostname                            node     destination     status     err  sock  sent(MB)  recvd(MB)  ostype
    gpfs02                              <c0n1>   10.200.2.1      connected  0    73    219       219        Linux/L
    linux                               <c0n2>   10.200.101.1    connected  0    49    180       181        Linux/L
    jupiter                             <c0n3>   10.200.102.1    connected  0    33    3         3          Windows/L
    cnfs0                               <c0n4>   10.200.10.10    connected  0    61    3         3          Linux/L
    cnfs1                               <c0n5>   10.200.10.11    connected  0    81    0         0          Linux/L
    cnfs2                               <c0n6>   10.200.10.12    connected  0    64    23        23         Linux/L
    cnfs3                               <c0n7>   10.200.10.13    connected  0    60    2         2          Linux/L
    tsm01                               <c0n8>   10.200.21.1     connected  0    50    110       110        Linux/L
    cpdn-ppc02                          <c0n9>   10.200.61.1     connected  0    63    0         0          Linux/L
    cpdn-ppc03                          <c0n10>  10.200.62.1     connected  0    65    0         0          Linux/L
    cpdn-ppc01                          <c0n11>  10.200.60.1     connected  0    62    94        94         Linux/L
diag verbs: VERBS RDMA class not initialized


All neatly connected!
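
One way to test the rolling-window theory would be to sample the sent/recvd counters twice and diff them. A minimal sketch follows, assuming the same column layout as above and that the counters are cumulative per-connection MB totals (the interval and the cumulative-counter assumption are mine, not anything documented):

#!/usr/bin/env python3
"""Sample the sent/recvd counters from 'mmdiag --network' twice and
print the per-node delta, to see whether they behave as cumulative
totals or reset within some rolling window.

Assumes the column layout shown in the outputs above."""
import subprocess
import time

def sample():
    """Return {hostname: (sent_MB, recvd_MB)} from one mmdiag run."""
    out = subprocess.run(["mmdiag", "--network"],
                         capture_output=True, text=True).stdout
    counters = {}
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 9 and fields[1].startswith("<"):
            # hostname node destination status err sock sent recvd ostype
            counters[fields[0]] = (int(fields[6]), int(fields[7]))
    return counters

before = sample()
time.sleep(60)          # sampling interval; adjust to taste
after = sample()

for host, (sent1, recvd1) in sorted(after.items()):
    sent0, recvd0 = before.get(host, (0, 0))
    # A negative delta would suggest the counters reset or wrap.
    print(f"{host}: sent {sent1 - sent0:+d} MB, recvd {recvd1 - recvd0:+d} MB")

If the deltas only ever grow, the counters are presumably cumulative since connection establishment; a negative delta would point at a reset or a rolling buffer.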


--

Luke Raimbach
IT Manager
Oxford e-Research Centre
7 Keble Road,
Oxford,
OX1 3QG

+44(0)1865 610639


