[gpfsug-discuss] SAN problem ... multipathd ... mmunlinkfileset ... ???

Buterbaugh, Kevin L Kevin.Buterbaugh at Vanderbilt.Edu
Thu Jun 15 22:14:33 BST 2017


Hi Ed, others,

I have spent the intervening time since sending my original e-mail taking the logs from the SAN switches and putting them into text files where they can be sorted and grep’d … and something potentially interesting has come to light … 

While a number of ports on all the switches have one or two SYNC_LOSS errors, on two of the switches port 9 has dozens of them (that wasn’t obvious in the raw logs, with so many other messages interspersed).  It turns out that one particular dual-controller storage array is plugged into those two ports and - in a stroke of good luck which usually manages to avoid me - that array is no longer in use!  It, and a few others that are still in use, are older and about to be life-cycled.  Since it’s no longer in use, I have unplugged it from the SAN and am monitoring to see if my problems now go away.
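
In case it helps anyone doing the same kind of digging, the tally boils down to something like the sketch below (the log file names are placeholders, and it assumes the switch log lines look like the SYNC_LOSS excerpts quoted further down in this message):

#!/usr/bin/env python3
# Count SYNC_LOSS events per switch/port from the exported switch logs.
# File names are hypothetical; the regex assumes lines of the form
#   [114][Thu Jun 15 19:02:05.411 UTC 2017][I][8600.0020][Port][Port: 9][SYNC_LOSS]
import re
from collections import Counter

switch_logs = ["switch1.log", "switch2.log", "switch3.log"]   # placeholder names
pattern = re.compile(r"\[Port:\s*(\d+)\]\[SYNC_LOSS\]")

counts = Counter()
for log in switch_logs:
    with open(log) as fh:
        for line in fh:
            m = pattern.search(line)
            if m:
                counts[(log, int(m.group(1)))] += 1

# Busiest switch/port pairs first
for (switch, port), n in counts.most_common():
    print(f"{switch}  port {port:>2}  SYNC_LOSS x {n}")

Sorting by count is what makes a noisy port stand out instead of being buried among all the other messages.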

Yes, correlation is not causation.  And sometimes coincidences do happen.  I’ll monitor to see if this is one of those occasions.  Thanks…

Kevin
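
PS: For anyone who wants to line the switch events up against the server-side errors, the time-wise correlation mentioned in the quoted message below is easy enough to check programmatically.  The following is only a sketch - the file names are placeholders, the patterns assume the log formats quoted below, and the switch timestamps are UTC while syslog is local time, so adjust for that:

#!/usr/bin/env python3
# Pair switch SYNC_LOSS events with server-side path errors that occur
# within a few minutes of them.  File names are placeholders.
import re
from datetime import datetime, timedelta

YEAR = 2017                       # syslog lines carry no year
WINDOW = timedelta(minutes=5)     # how close in time counts as "correlated"

def switch_events(path):
    # e.g. [114][Thu Jun 15 19:02:05.411 UTC 2017][I][8600.0020][Port][Port: 9][SYNC_LOSS]
    pat = re.compile(r"\[\w{3} (\w{3} +\d+ [\d:]+)\.\d+ \w+ (\d{4})\].*SYNC_LOSS")
    with open(path) as fh:
        for line in fh:
            m = pat.search(line)
            if m:
                yield datetime.strptime(f"{m.group(1)} {m.group(2)}",
                                        "%b %d %H:%M:%S %Y"), line.strip()

def server_events(path):
    # e.g. Jun 12 22:03:32 nsd32 multipathd: sdab: failed to get path uid
    pat = re.compile(r"^(\w{3} +\d+ [\d:]+) .*"
                     r"(rejecting I/O to offline device|failed to get path uid)")
    with open(path) as fh:
        for line in fh:
            m = pat.match(line)
            if m:
                yield datetime.strptime(f"{m.group(1)} {YEAR}",
                                        "%b %d %H:%M:%S %Y"), line.strip()

switch = list(switch_events("switch1.log"))         # exported switch log
server = list(server_events("nsd32-messages.txt"))  # copy of /var/log/messages

for t_sw, sw_line in switch:
    for t_srv, srv_line in server:
        if abs(t_sw - t_srv) <= WINDOW:
            print(sw_line)
            print("    ~", srv_line)
            print()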

> On Jun 15, 2017, at 3:50 PM, Edward Wahl <ewahl at osc.edu> wrote:
> 
> On Thu, 15 Jun 2017 20:00:47 +0000
> "Buterbaugh, Kevin L" <Kevin.Buterbaugh at Vanderbilt.Edu> wrote:
> 
>> Hi All,
>> 
>> I’ve got some very weird problems going on here (and I do have a PMR open
>> with IBM).  On Monday I attempted to unlink a fileset, something that I’ve
>> done many times with no issues.  This time, however, it hung up the
>> filesystem.  I was able to clear things up by shutting down GPFS on the
>> filesystem manager for that filesystem and restarting it.
>> 
>> The very next morning we awoke to problems with GPFS.  I noticed in my
>> messages file on all my NSD servers I had messages like:
>> 
>> Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: rejecting I/O to offline device
>> Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: [sdab] Write Protect is off
>> Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: rejecting I/O to offline device
>> Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: [sdab] Asking for cache data failed
>> Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: [sdab] Assuming drive cache: write through
>> Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: rejecting I/O to offline device
>> Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: [sdab] Attached SCSI disk
>> Jun 12 22:03:32 nsd32 multipathd: sdab: add path (uevent)
>> Jun 12 22:03:32 nsd32 multipathd: sdab: failed to get path uid
>> Jun 12 22:03:32 nsd32 multipathd: uevent trigger error
>> Jun 12 22:03:42 nsd32 kernel: rport-0:0-4: blocked FC remote port time out: removing target and saving binding
>> 
>> Since we use an FC SAN and Linux multipathing, I was expecting some sort of
>> problem with the switches.  Now on the switches I see messages like:
>> 
>> [114][Thu Jun 15 19:02:05.411 UTC 2017][I][8600.0020][Port][Port: 9][SYNC_LOSS]
>> [115][Thu Jun 15 19:03:49.988 UTC 2017][I][8600.001F][Port][Port: 9][SYNC_ACQ]
>> 
>> These (while not shown in this example) do correlate time-wise with the multipath
>> messages on the servers.  So it’s not a GPFS problem and I shouldn’t be
>> bugging this list about this EXCEPT…
>> 
>> These issues only started on Monday after I ran the mmunlinkfileset command.
>> That’s right … NO such errors prior to then.  And literally NOTHING changed
>> on Monday with my SAN environment (nothing had changed there for months
>> actually).  Nothing added to nor removed from the SAN.  No changes until
>> today when, in an attempt to solve this issue, I updated the switch firmware
>> on all switches one at a time.  I also yum updated to the latest RHEL 7
>> version of the multipathd packages.
>> 
>> I’ve been Googling and haven’t found anything useful on those SYNC_LOSS
>> messages on the QLogic SANbox 5800 switches.  Anybody out there happen to
>> have any knowledge of them and what could be causing them?  Oh, I’m
>> investigating this now … but it’s not all ports that are throwing the
>> errors.  And the ports that are throwing them seem to be random and don’t
>> have one specific type of hardware plugged in … i.e. some ports have NSD
>> servers plugged in, others have storage arrays.
> 
> I have a half dozen of the SANbox 5802 switches, but no GPFS devices going
> through them any longer.  Used to, though.  We do see those exact same messages
> when the FC interface on a device goes bad (SFP, HCA, etc.) or when someone is
> moving cables.  This happens when the device cannot properly join the loop with
> its login.  I've NEVER seen them randomly, though, nor has this been a bad-cable
> type of error.  I don't recall why, but I froze our SANboxes at V7.4.0.16.0; I'm
> sure I have notes on it somewhere.
> 
> I've got one right now, in fact, with a bad ancient LTO4 drive.  
> [8124][Thu Jun 15 12:46:00.190 EDT 2017][I][8600.001F][Port][Port: 4][SYNC_ACQ]
> [8125][Thu Jun 15 12:49:20.920 EDT 2017][I][8600.0020][Port][Port: 4][SYNC_LOSS]
> 
> 
> Sounds like the sanbox itself is having an issue perhaps?  "Show alarm" clean on
> the sanbox?   Array has a bad HCA?  'Show port 9' errors not crazy?  All power
> supplies working? 
> 
> 
> 
>> I understand that it makes no sense that mmunlinkfileset hanging would cause
>> problems with my SAN … but I also don’t believe in coincidences!
>> 
>> I’m running GPFS 4.2.2.3.  Any help / suggestions appreciated!
> 
> This does seem like QUITE the coincidence.  Increased traffic on the
> device triggered a failure? (The fear of all RAID users!)  Multipath is working
> properly though? Sounds like mmlsdisk would have shown devices not in 'ready'.
> We mysteriously lost an MD disk during a recent downtime and it caused an MMFSCK
> to not run properly until we figured it out.  4.2.2.3 as well.  MD Replication
> is NOT helpful in that case. 
> 
> Ed
> 
> 
> 
>> 
>> Kevin Buterbaugh - Senior System Administrator
>> Vanderbilt University - Advanced Computing Center for Research and Education
>> Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633
>> 
>> 
>> 
> 
> 
> 
> -- 
> 
> Ed Wahl
> Ohio Supercomputer Center
> 614-292-9302


