[gpfsug-discuss] SAN problem ... multipathd ... mmunlinkfileset ... ???

Buterbaugh, Kevin L Kevin.Buterbaugh at Vanderbilt.Edu
Thu Jun 15 21:00:47 BST 2017


Hi All,

I’ve got some very weird problems going on here (and I do have a PMR open with IBM).  On Monday I attempted to unlink a fileset, something that I’ve done many times with no issues.  This time, however, it hung up the filesystem.  I was able to clear things up by shutting down GPFS on the filesystem manager for that filesystem and restarting it.

The very next morning we awoke to problems with GPFS.  I noticed that the messages file on all of my NSD servers contained entries like:

Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: rejecting I/O to offline device
Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: [sdab] Write Protect is off
Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: rejecting I/O to offline device
Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: [sdab] Asking for cache data failed
Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: [sdab] Assuming drive cache: write through
Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: rejecting I/O to offline device
Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: [sdab] Attached SCSI disk
Jun 12 22:03:32 nsd32 multipathd: sdab: add path (uevent)
Jun 12 22:03:32 nsd32 multipathd: sdab: failed to get path uid
Jun 12 22:03:32 nsd32 multipathd: uevent trigger error
Jun 12 22:03:42 nsd32 kernel: rport-0:0-4: blocked FC remote port time out: removing target and saving binding
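As an aside for anyone sifting through similar logs: here’s a quick, hypothetical Python sketch (my own, not a tool from this thread; the function name and regex are assumptions based only on the excerpt above) that tallies the “rejecting I/O to offline device” events per SCSI target across a messages file:

```python
import re
from collections import Counter

# Matches kernel lines like:
#   Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: rejecting I/O to offline device
REJECT_RE = re.compile(r"kernel: sd ([\d:]+): rejecting I/O to offline device")

def count_offline_rejects(lines):
    """Count 'rejecting I/O to offline device' events per SCSI target (H:C:T:L)."""
    counts = Counter()
    for line in lines:
        m = REJECT_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

Feeding it the whole messages file (one string per line) gives a per-target count, which makes it easy to see whether one array or many are affected.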

Since we use an FC SAN with Linux multipathing, I suspected some sort of problem with the switches.  And indeed, on the switches I see messages like:

 [114][Thu Jun 15 19:02:05.411 UTC 2017][I][8600.0020][Port][Port: 9][SYNC_LOSS]
  [115][Thu Jun 15 19:03:49.988 UTC 2017][I][8600.001F][Port][Port: 9][SYNC_ACQ]

These (while not shown in this particular example) do correlate time-wise with the multipath messages on the servers.  So it’s not a GPFS problem and I shouldn’t be bugging this list about it, EXCEPT…
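For what it’s worth, that kind of time-wise correlation can be checked mechanically.  A hypothetical Python sketch (mine, not from this thread; the timestamp formats are taken from the log excerpts above, the 30-second window is an arbitrary assumption, and time-zone offsets between the switch’s UTC clock and syslog’s local time are deliberately ignored):

```python
import re
from datetime import datetime, timedelta

# Switch log stamps look like: [Thu Jun 15 19:02:05.411 UTC 2017]
SWITCH_RE = re.compile(r"\[\w{3} (\w{3} +\d+ \d{2}:\d{2}:\d{2})\.\d+ UTC (\d{4})\]")

def parse_switch_time(line):
    m = SWITCH_RE.search(line)
    if m is None:
        return None
    return datetime.strptime(f"{m.group(1)} {m.group(2)}", "%b %d %H:%M:%S %Y")

def parse_syslog_time(line, year=2017):
    # Syslog lines omit the year, so one must be assumed.
    return datetime.strptime(f"{line[:15]} {year}", "%b %d %H:%M:%S %Y")

def correlated(switch_lines, syslog_lines, window=timedelta(seconds=30)):
    """Yield (switch_line, syslog_line) pairs whose timestamps fall within `window`."""
    syslog = [(parse_syslog_time(l), l) for l in syslog_lines]
    for sw in switch_lines:
        t = parse_switch_time(sw)
        if t is None:
            continue
        for ts, line in syslog:
            if abs(t - ts) <= window:
                yield sw, line
```

Pointing this at the switch event log and the NSD servers’ messages files would at least turn the “they line up” impression into a list of matched pairs.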

These issues only started on Monday after I ran the mmunlinkfileset command.  That’s right … NO such errors prior to then.  And literally NOTHING changed on Monday in my SAN environment (nothing had changed there for months, actually).  Nothing was added to or removed from the SAN.  No changes at all until today when, in an attempt to solve this issue, I updated the switch firmware on all switches, one at a time.  I also ran yum update to bring the multipath packages up to the latest RHEL 7 versions.

I’ve been Googling and haven’t found anything useful on those SYNC_LOSS messages on the QLogic SANbox 5800 switches.  Does anybody out there happen to have any knowledge of them and what could be causing them?  Oh, and I’m investigating this now … but it’s not all ports that are throwing the errors.  And the ports that are throwing them seem to be random and don’t have any one specific type of hardware plugged in … i.e. some ports have NSD servers plugged in, others have storage arrays.
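To pin down which ports are actually throwing the errors, one could tally the switch events per port.  Again, a hypothetical sketch based only on the log format shown above, not anything from the thread:

```python
import re
from collections import Counter

# Matches the trailing fields of switch log lines like:
#   [114][Thu Jun 15 19:02:05.411 UTC 2017][I][8600.0020][Port][Port: 9][SYNC_LOSS]
EVENT_RE = re.compile(r"\[Port: (\d+)\]\[(SYNC_LOSS|SYNC_ACQ)\]")

def port_event_counts(lines):
    """Tally SYNC_LOSS / SYNC_ACQ events per switch port."""
    counts = Counter()
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            counts[(int(m.group(1)), m.group(2))] += 1
    return counts
```

A sorted dump of those counts would show at a glance whether the losses cluster on a few ports (suggesting cabling/SFP trouble) or really are spread across everything.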

I understand that it makes no sense that mmunlinkfileset hanging would cause problems with my SAN … but I also don’t believe in coincidences!

I’m running GPFS 4.2.2.3.  Any help / suggestions appreciated!

—
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu<mailto:Kevin.Buterbaugh at vanderbilt.edu> - (615)875-9633




