[gpfsug-discuss] CES doesn't assign addresses to nodes

Jonathon A Anderson jonathon.anderson at colorado.edu
Tue Jan 24 19:48:02 GMT 2017


I think I'm having the same issue described here:

http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html

Any advice or further troubleshooting steps would be much appreciated. Full disclosure: I also have a DDN case open (78804).

We've got a four-node (snsd{1..4}) DDN GRIDScaler system. I'm trying to add two CES protocol nodes (sgate{1,2}) to serve NFS.

Here are the steps I took:

--- 
mmcrnodeclass protocol -N sgate1-opa,sgate2-opa 
mmcrnodeclass nfs -N sgate1-opa,sgate2-opa 
mmchconfig cesSharedRoot=/gpfs/summit/ces 
mmchcluster --ccr-enable 
mmchnode --ces-enable -N protocol 
mmces service enable NFS 
mmces service start NFS -N nfs 
mmces address add --ces-ip 10.225.71.104,10.225.71.105 
mmces address policy even-coverage 
mmces address move --rebalance 
--- 

This worked the very first time I ran it, but the CES addresses weren't redistributed after restarting GPFS or rebooting a node.

Things I've tried: 

* disabling CES on the sgate nodes and re-running the above procedure
* moving the cluster and filesystem managers to different snsd nodes 
* deleting and re-creating the cesSharedRoot directory 

Meanwhile, the following log entry appears in mmfs.log.latest every ~30s: 

--- 
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104 
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.105 
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with lock held: assignIP 10.225.71.104_0-_+,10.225.71.105_0-_+ 1 
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Assigning addresses: 10.225.71.104_0-_+,10.225.71.105_0-_+ 
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: moveCesIPs: 10.225.71.104_0-_+,10.225.71.105_0-_+ 
--- 
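
To confirm how often this repeats and which addresses are affected, the "Found unassigned address" lines can be pulled out of the log with a short script. This is only a sketch that assumes the line format shown above; the function names are hypothetical, the timezone abbreviation is ignored, and month names are parsed with the default English locale.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
# Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104
UNASSIGNED = re.compile(
    r"^\w{3} (?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) \w+ (?P<year>\d{4}): "
    r"mmcesnetworkmonitor: Found unassigned address (?P<ip>\S+)\s*$"
)


def unassigned_events(lines):
    """Yield (timestamp, ip) for each 'Found unassigned address' line."""
    for line in lines:
        m = UNASSIGNED.match(line)
        if m:
            # Timestamps are naive: the timezone abbreviation is skipped.
            ts = datetime.strptime(f"{m['ts']} {m['year']}", "%b %d %H:%M:%S %Y")
            yield ts, m["ip"]


def unassigned_counts(lines):
    """Count how many times each address was reported as unassigned."""
    return Counter(ip for _, ip in unassigned_events(lines))
```

Passing the open log file (for example `unassigned_counts(open(path))`, where `path` points at mmfs.log.latest) gives a per-address count, and the timestamps from `unassigned_events` can be compared to check the ~30s interval.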

Also notable: whenever I add or remove addresses now, I see this in mmsysmonitor.log (among many other entries):

--- 
2017-01-23T20:40:56.363 sgate1 D ET_cesnetwork Entity state without requireUnique: ces_network_ips_down WARNING No CES relevant NICs detected - Service.calculateAndUpdateState:275 
2017-01-23T20:40:11.364 sgate1 D ET_cesnetwork Update multiple entities at once {'p2p2': 1, 'bond0': 1, 'p2p1': 1} - Service.setLocalState:333 
--- 

For the record, here's the interface on sgate1 that I expect to receive the address:

--- 
11: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP
    link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff
    inet 10.225.71.107/20 brd 10.225.79.255 scope global bond0
       valid_lft forever preferred_lft forever
    inet6 fe80::3efd:feff:fe08:a7c0/64 scope link
       valid_lft forever preferred_lft forever
--- 

bond0 is a bond of p2p1 and p2p2:

--- 
6: p2p1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP qlen 1000
    link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff
7: p2p2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP qlen 1000
    link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff
--- 

A similar bond0 exists on sgate2. 
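
As a sanity check, the CES addresses can be compared against the subnet already configured on bond0. This rests on an assumption not confirmed anywhere in this thread: that CES only aliases an address onto an interface whose existing subnet contains it. A minimal Python sketch, with a hypothetical function name:

```python
import ipaddress


def ces_ips_outside_subnet(interface_cidr, ces_ips):
    """Return the CES IPs that do not fall inside the interface's subnet."""
    network = ipaddress.ip_interface(interface_cidr).network
    return [ip for ip in ces_ips if ipaddress.ip_address(ip) not in network]


# Values taken from the bond0 output and the mmces address add command above.
outside = ces_ips_outside_subnet(
    "10.225.71.107/20", ["10.225.71.104", "10.225.71.105"]
)
print(outside)
```

Both CES addresses fall inside 10.225.64.0/20, the network that 10.225.71.107/20 belongs to, so a subnet mismatch on bond0 does not appear to explain the "No CES relevant NICs detected" warning.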

I crawled around in /usr/lpp/mmfs/lib/mmsysmon/CESNetworkService.py for a while trying to figure it out, but have been unsuccessful so far.


