[gpfsug-discuss] How to simulate an NSD failure?

Alex Chekholko alex at calicolabs.com
Fri Oct 13 17:53:40 BST 2017


John,

I think a "philosophical" difference between GPFS and newer filesystems
written in the age of "commodity hardware" is that GPFS expects the
underlying hardware to be very reliable.  So "disks" are typically RAID
arrays available via multiple paths, network links are expected to be
error-free and highly reliable, and so on.  GPFS does not detect these
failures well because it does not expect them to happen.

That's why you see some discussions around "improving network diagnostics"
and "improving troubleshooting tools" and things like that.

Having a failed NSD is highly unusual for a GPFS system and you should
design your system so that situation does not happen.

In your example here, if data is striped across two NSDs and one of them
becomes inaccessible, a client that tries to write should get an I/O
error, and GPFS may even unmount the filesystem (depending on where your
metadata lives).
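
If you want a less destructive way to simulate the failure, and to answer
your question about finding which files have blocks on the dead NSD,
something along these lines should work. This is only a rough sketch:
"gpfs0" and "nsd2" are placeholder names, and it is worth double-checking
the exact mmfileid disk-descriptor syntax against the Scale docs for your
release.

  # check the default data/metadata replication factors
  mmlsfs gpfs0 -r -m

  # stop the NSD so GPFS marks it down, instead of yanking it at the OS level
  mmchdisk gpfs0 stop -d nsd2

  # disks that are not up/ready should now show up here
  mmlsdisk gpfs0 -e

  # list the files that have blocks on the stopped disk
  mmfileid gpfs0 -d nsd2

  # bring the disk back when you are done
  mmchdisk gpfs0 start -d nsd2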

Regards,
Alex

On Fri, Oct 13, 2017 at 5:56 AM, John Hearns <john.hearns at asml.com> wrote:

> I have set up a small testbed, consisting of three nodes. Two of the nodes
> have a disk which is being used as an NSD.
>
> This is being done for some preparation for fun and games with some whizzy
> new servers. The testbed has spinning drives.
>
> I have created two NSDs and have set the data replication to 1 (this is
> deliberate).
>
> I am trying to fail an NSD and find which files have parts on the failed
> NSD.
>
> A first test with ‘mmdeldisk’ didn’t have much effect, as Spectrum Scale
> is smart enough to copy the data off the drive first.
>
>
>
> I now take the drive offline and delete it by
>
> echo offline > /sys/block/sda/device/state
>
> echo 1 > /sys/block/sda/delete
>
>
>
> Short of going to the data centre and physically pulling the drive that’s
> a pretty final way of stopping access to a drive.
>
> I then wrote 100 files to the filesystem; the node with the NSD did log
> “rejecting I/O to offline device”.
>
> However, mmlsdisk <filesystem> still says that this disk has status ‘ready’.
>
>
>
> I am going to stop that NSD and run an mmdeldisk – at which point I do
> expect things to go south rapidly.
>
> I just am not understanding at what point a failed write would be
> detected. Or, once a write fails, are all subsequent writes routed off to
> the active NSD(s)?
>
>
>
> Sorry if I am asking an idiot question.
>
>
>
> Inspector.clouseau at surete.fr