[gpfsug-discuss] SAN problem ... multipathd ... mmunlinkfileset ... ???

Edward Wahl ewahl at osc.edu
Thu Jun 15 21:50:10 BST 2017


On Thu, 15 Jun 2017 20:00:47 +0000
"Buterbaugh, Kevin L" <Kevin.Buterbaugh at Vanderbilt.Edu> wrote:

> Hi All,
> 
> I’ve got some very weird problems going on here (and I do have a PMR open
> with IBM).  On Monday I attempted to unlink a fileset, something that I’ve
> done many times with no issues.  This time, however, it hung up the
> filesystem.  I was able to clear things up by shutting down GPFS on the
> filesystem manager for that filesystem and restarting it.
> 
> The very next morning we awoke to problems with GPFS.  I noticed in my
> messages file on all my NSD servers I had messages like:
> 
> Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: rejecting I/O to offline device
> Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: [sdab] Write Protect is off
> Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: rejecting I/O to offline device
> Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: [sdab] Asking for cache data failed
> Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: [sdab] Assuming drive cache: write through
> Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: rejecting I/O to offline device
> Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: [sdab] Attached SCSI disk
> Jun 12 22:03:32 nsd32 multipathd: sdab: add path (uevent)
> Jun 12 22:03:32 nsd32 multipathd: sdab: failed to get path uid
> Jun 12 22:03:32 nsd32 multipathd: uevent trigger error
> Jun 12 22:03:42 nsd32 kernel: rport-0:0-4: blocked FC remote port time out: removing target and saving binding
> 
> Since we use an FC SAN and Linux multipathing, I suspected some sort of
> problem with the switches.  And on the switches I do indeed see messages like:
> 
> [114][Thu Jun 15 19:02:05.411 UTC 2017][I][8600.0020][Port][Port: 9][SYNC_LOSS]
> [115][Thu Jun 15 19:03:49.988 UTC 2017][I][8600.001F][Port][Port: 9][SYNC_ACQ]
> 
> These messages (though not the ones in this particular example) do correlate
> time-wise with the multipath messages on the servers.  So it’s not a GPFS
> problem and I shouldn’t be bugging this list about this EXCEPT…
> 
> These issues only started on Monday after I ran the mmunlinkfileset command.
> That’s right … NO such errors prior to then.  And literally NOTHING changed
> on Monday with my SAN environment (nothing had changed there for months
> actually).  Nothing added to nor removed from the SAN.  No changes until
> today when, in an attempt to solve this issue, I updated the switch firmware
> on all switches one at a time.  I also yum updated to the latest RHEL 7
> version of the multipathd packages.
> 
> I’ve been Googling and haven’t found anything useful on those SYNC_LOSS
> messages on the QLogic SANbox 5800 switches.  Does anybody out there happen to
> have any knowledge of them and what could be causing them?  Oh, I’m
> investigating this now … but it’s not all ports that are throwing the
> errors.  And the ports that are throwing them seem random and don’t have one
> specific type of hardware plugged in … i.e. some ports have NSD servers
> plugged in, others have storage arrays.

I have a half dozen of the SANbox 5802 switches, but no GPFS devices going
through them any longer.  Used to, though.  We do see those exact same messages
when the FC interface on a device goes bad (SFP, HCA, etc.) or when someone is
moving cables.  It happens when the device cannot properly join the loop with
its login.  I've NEVER seen them occur randomly, though, nor has it ever been a
bad-cable type of error.  I don't recall why, but I froze our SANboxes at
V7.4.0.16.0; I'm sure I have notes on it somewhere.

 I've got one right now, in fact, with a bad ancient LTO4 drive.  
[8124][Thu Jun 15 12:46:00.190 EDT 2017][I][8600.001F][Port][Port: 4][SYNC_ACQ]
[8125][Thu Jun 15 12:49:20.920 EDT 2017][I][8600.0020][Port][Port: 4][SYNC_LOSS]


Sounds like the SANbox itself is having an issue, perhaps?  Is "show alarm"
clean on the SANbox?  Does the array have a bad HCA?  Are the "show port 9"
error counters reasonable?  All power supplies working?
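
Roughly the checks I mean, straight from the switch CLI (syntax from memory, so
double-check it against the CLI guide for your firmware level; port 9 is just
the one from your log):

    show alarm       <- anything active or recent on the chassis?
    show port 9      <- per-port error counters; watch the loss-of-sync and CRC counts

If the counters on port 9 keep climbing, I'd start with the SFPs and cables on
that link before blaming the switch itself.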



> I understand that it makes no sense that mmunlinkfileset hanging would cause
> problems with my SAN … but I also don’t believe in coincidences!
> 
> I’m running GPFS 4.2.2.3.  Any help / suggestions appreciated!

This does seem like QUITE the coincidence.  Maybe increased traffic on the
device triggered a failure?  (The fear of all RAID users!)  Is multipath working
properly, though?  It sounds like mmlsdisk would have shown devices not in
'ready'.  We mysteriously lost an MD disk during a recent downtime, and it kept
an mmfsck from running properly until we figured it out.  That was on 4.2.2.3 as
well.  MD replication is NOT helpful in that case.
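
If it helps, this is roughly what I'd run on each NSD server to see whether
multipath and GPFS agree that things are healthy.  The file system device name
is a placeholder, obviously, and check the man pages on your level for the
exact options:

    multipath -ll | grep -i -B 2 failed    # any paths flagged failed/faulty?
    mmlsdisk gpfs0 -e                      # should list only disks that are not 'up'/'ready' (gpfs0 = your fs device)
    dmesg | grep -i 'rejecting I/O'        # kernel still rejecting I/O to offline devices?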

Ed



> 
> Kevin Buterbaugh - Senior System Administrator
> Vanderbilt University - Advanced Computing Center for Research and Education
> Kevin.Buterbaugh at vanderbilt.edu<mailto:Kevin.Buterbaugh at vanderbilt.edu> -
> (615)875-9633
> 
> 
> 



-- 

Ed Wahl
Ohio Supercomputer Center
614-292-9302


