[gpfsug-discuss] filesystem manager crashes every time mmdelsnapshot (from either the filesystem manager or some other nsd/client) is called

Sabuj Pattanayek sabujp at gmail.com
Thu May 15 13:48:00 BST 2014


Hi all,

We're running 3.5.0.17 now, and it looks like the filesystem manager
automatically reboots (and sometimes fails to reboot) after
mmdelsnapshot is called, either from the filesystem manager itself or from
some other NSD server/node. It didn't start happening immediately after we
updated to 3.5.0.17, but we never had this issue when we were at 3.5.0.11. At
some point mmdelsnapshot throws this error:

Lost connection to file system daemon.
mmdelsnapshot: An internode connection between GPFS nodes was disrupted.
mmdelsnapshot: Command failed.  Examine previous error messages to determine cause.

It also causes an mmfs generic error and/or a kernel soft lockup:

kernel: BUG: soft lockup - CPU#15 stuck for 67s! [mmfsd:39266]

After the mmfs generic error the system reboots itself, but after the soft
lockup it does not, which is actually worse.
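If you want to count how often mmfsd trips the soft-lockup detector across nodes, a small log parser is enough. This is a minimal Python sketch, assuming the kernel message follows the exact format quoted above; the function names are hypothetical, and it only parses log text you have already collected (for example from a syslog file), it does not talk to GPFS.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Matches the kernel soft-lockup message quoted in the post, e.g.
# "BUG: soft lockup - CPU#15 stuck for 67s! [mmfsd:39266]"
# [^\]]* keeps the command-name group from running past the closing bracket.
SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]*):(?P<pid>\d+)\]"
)


class SoftLockup(NamedTuple):
    cpu: int
    seconds: int
    comm: str
    pid: int


def parse_soft_lockup(line: str) -> Optional[SoftLockup]:
    """Return the parsed fields if the line is a soft-lockup report, else None."""
    m = SOFT_LOCKUP_RE.search(line)
    if m is None:
        return None
    return SoftLockup(int(m["cpu"]), int(m["secs"]), m["comm"], int(m["pid"]))


def mmfsd_lockups(lines: Iterable[str]) -> List[SoftLockup]:
    """Collect only the soft lockups attributed to the mmfsd daemon."""
    found = []
    for line in lines:
        hit = parse_soft_lockup(line)
        if hit is not None and hit.comm == "mmfsd":
            found.append(hit)
    return found


if __name__ == "__main__":
    sample = "kernel: BUG: soft lockup - CPU#15 stuck for 67s! [mmfsd:39266]"
    print(parse_soft_lockup(sample))
```

Non-matching lines (such as the sm-notify errors) simply return None, so the helper can be fed a whole log file line by line.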


It also causes some havoc with CNFS file locking, even after the
filesystem manager has rebooted and come back up:


May 15 07:10:12 mako-nsd1 sm-notify[19387]: Failed to bind RPC socket: Address already in use
May 15 07:21:03 mako-nsd1 sm-notify[11052]: Invalid bind address or port for RPC socket: Name or service not known


I saw some snapshot-related fixes in 3.5.0.18. Has anyone seen this
behavior, or does anyone know whether it's fixed in 18?


Thanks,

Sabuj

