[gpfsug-discuss] CES doesn't assign addresses to nodes
Jonathon A Anderson
jonathon.anderson at colorado.edu
Tue Jan 24 19:48:02 GMT 2017
I think I'm having the same issue described here:
http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html
Any advice or further troubleshooting steps would be much appreciated. Full disclosure: I also have a DDN case open. (78804)
We've got a four-node (snsd{1..4}) DDN GRIDScaler system. I'm trying to add two CES protocol nodes (sgate{1,2}) to serve NFS.
Here are the steps I took:
---
mmcrnodeclass protocol -N sgate1-opa,sgate2-opa
mmcrnodeclass nfs -N sgate1-opa,sgate2-opa
mmchconfig cesSharedRoot=/gpfs/summit/ces
mmchcluster --ccr-enable
mmchnode --ces-enable -N protocol
mmces service enable NFS
mmces service start NFS -N nfs
mmces address add --ces-ip 10.225.71.104,10.225.71.105
mmces address policy even-coverage
mmces address move --rebalance
---
This worked the very first time I ran it, but the CES addresses weren't redistributed after restarting GPFS or rebooting a node.
Things I've tried:
* disabling ces on the sgate nodes and re-running the above procedure
* moving the cluster and filesystem managers to different snsd nodes
* deleting and re-creating the cesSharedRoot directory
Meanwhile, the following log entry appears in mmfs.log.latest every ~30s:
---
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.105
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with lock held: assignIP 10.225.71.104_0-_+,10.225.71.105_0-_+ 1
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Assigning addresses: 10.225.71.104_0-_+,10.225.71.105_0-_+
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: moveCesIPs: 10.225.71.104_0-_+,10.225.71.105_0-_+
---
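To see whether the monitor is still retrying, and for which addresses, the "Found unassigned address" lines can be tallied per address. This is only a minimal sketch: the function name count_unassigned is made up for illustration, and the log path in the usage comment is the usual GPFS default, assumed here rather than taken from the output above.

```shell
# Hypothetical helper: count "Found unassigned address" entries per CES IP.
# Reads the files given as arguments, or standard input if none are given.
count_unassigned() {
    awk '/mmcesnetworkmonitor: Found unassigned address/ { n[$NF]++ }
         END { for (ip in n) print ip, n[ip] }' "$@" | sort
}

# Example usage (path is an assumption; adjust to your installation):
#   count_unassigned /var/adm/ras/mmfs.log.latest
```

A count that keeps growing between two runs suggests the assignment attempt is being repeated without ever succeeding.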
Also notable: whenever I add or remove addresses now, I see this in mmsysmonitor.log (among a lot of other entries):
---
2017-01-23T20:40:56.363 sgate1 D ET_cesnetwork Entity state without requireUnique: ces_network_ips_down WARNING No CES relevant NICs detected - Service.calculateAndUpdateState:275
2017-01-23T20:40:11.364 sgate1 D ET_cesnetwork Update multiple entities at once {'p2p2': 1, 'bond0': 1, 'p2p1': 1} - Service.setLocalState:333
---
For the record, here's the interface I expect to get the address on sgate1:
---
11: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP
    link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff
    inet 10.225.71.107/20 brd 10.225.79.255 scope global bond0
       valid_lft forever preferred_lft forever
    inet6 fe80::3efd:feff:fe08:a7c0/64 scope link
       valid_lft forever preferred_lft forever
---
which is a bond of p2p1 and p2p2.
---
6: p2p1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP qlen 1000
    link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff
7: p2p2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP qlen 1000
    link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff
---
A similar bond0 exists on sgate2.
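One thing that can be checked offline is whether the CES addresses fall inside the subnet already configured on bond0. As far as I understand it, CES only places an address on an interface that already carries a static address in the same subnet, so this rules one possible cause in or out. A minimal sketch in plain POSIX shell follows; the helper names ip_to_int and in_subnet are made up for illustration.

```shell
# Hypothetical helper: convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address on dots into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Hypothetical helper: succeed if address $1 is in the network of CIDR $2.
in_subnet() {
    addr=$1
    base=${2%/*}         # address part of the CIDR
    prefix=${2#*/}       # prefix length
    mask=$(( (4294967295 << (32 - $prefix)) & 4294967295 ))
    [ $(( $(ip_to_int "$addr") & $mask )) -eq $(( $(ip_to_int "$base") & $mask )) ]
}

# The two CES addresses against the bond0 address shown above.
for ip in 10.225.71.104 10.225.71.105; do
    if in_subnet "$ip" 10.225.71.107/20; then
        echo "$ip is inside the bond0 subnet"
    else
        echo "$ip is outside the bond0 subnet"
    fi
done
```

Both addresses come out inside 10.225.71.107/20 (network 10.225.64.0, broadcast 10.225.79.255), so a subnet mismatch does not appear to explain the "No CES relevant NICs detected" warning.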
I crawled around in /usr/lpp/mmfs/lib/mmsysmon/CESNetworkService.py for a while trying to figure it out, but have been unsuccessful so far.