[gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09)

Ben De Luca bdeluca at gmail.com
Tue Oct 10 19:51:28 BST 2017


Does this corrupt the entire filesystem, or just the open files that are
being written to?

One is horrific and the other is just mildly bad.

On 10 October 2017 at 17:09, IBM Spectrum Scale <scale at us.ibm.com> wrote:

> Bob,
>
> The problem may occur when the TCP connection is broken between two nodes.
> While in the vast majority of the cases when data stops flowing through the
> connection, the result is one of the nodes getting expelled, there are
> cases where the TCP connection simply breaks -- that is relatively rare but
> happens on occasion. There is logic in the mmfsd daemon to detect the
> disconnection and attempt to reconnect to the destination in question. If
> the reconnect is successful then steps are taken to recover the state kept
> by the daemons, and that includes resending some RPCs that were in flight
> when the disconnection took place.
>
> As the flash describes, a problem in the logic to resend some RPCs was
> causing one of the RPC headers to be omitted, resulting in the RPC data
> being interpreted as the (missing) header. Normally the result is an assert
> on the receiving end, like the logAssertFailed: !"Request and queue size
> mismatch" assert described in the flash. However, it's at least
> conceivable (though expected to be very rare) that the content of the RPC
> data could be interpreted as a valid RPC header. In the case of an RPC which
> involves data transfer between an NSD client and NSD server, that might
> result in incorrect data being written to some NSD device.
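
To make the failure mode described above concrete, here is a toy sketch in Python. It is not the GPFS wire format: the 8-byte header, the MAGIC value, and the frame/parse helpers are all invented for illustration. It only shows why a dropped header is usually caught by a sanity check, and how an unlucky payload could slip through.

```python
import struct

# Hypothetical toy framing, NOT the real GPFS RPC layout:
# an 8-byte header (4-byte magic, 4-byte payload length) followed by the payload.
MAGIC = 0x47504653  # arbitrary marker chosen for this sketch
HEADER = struct.Struct(">II")


def frame(payload: bytes) -> bytes:
    """Build a well-formed message: header followed by payload."""
    return HEADER.pack(MAGIC, len(payload)) + payload


def parse(stream: bytes) -> bytes:
    """Parse one message, rejecting it if the header fails its sanity checks."""
    if len(stream) < HEADER.size:
        raise ValueError("short message")
    magic, length = HEADER.unpack_from(stream)
    body = stream[HEADER.size:]
    if magic != MAGIC or length != len(body):
        # Stand-in for the receiver-side assert mentioned in the flash.
        raise ValueError("header and payload size mismatch")
    return body


# Normal case: header present, payload recovered intact.
good = frame(b"file data block")

# Buggy resend: header omitted, so the first payload bytes are read as a header.
# With ordinary data the sanity check fails and the message is rejected.
resent_without_header = b"file data block"

# Rare case: the payload itself happens to start with bytes that look like a
# valid header, so the receiver accepts it and hands back the wrong data.
unlucky_payload = HEADER.pack(MAGIC, 4) + b"junk"
```

In this sketch parse(good) returns the original payload, parse(resent_without_header) raises, and parse(unlucky_payload) quietly returns b"junk", which corresponds to the "conceivable but rare" case in the explanation.
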
>
> Disconnect/reconnect scenarios appear to be uncommon. An entry like
>
> [N] Reconnected to xxx.xxx.xxx.xxx nodename <c0n0>
>
> in mmfs.log would be an indication that a reconnect has occurred. By
> itself, a reconnect does not imply that data or the file system was
> corrupted, since that depends on which RPCs were pending when the
> disconnection happened. If the assert above is hit, no corruption is
> expected, since the daemon will go down before incorrect data gets written.
>
> Reconnects involving an NSD server are those which present the highest
> risk, given that NSD-related RPCs are used to write data into NSDs.
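
For anyone who wants to check their own logs, a minimal sketch in Python of scanning for the reconnect entry quoted above and flagging the ones that involve an NSD server. The regular expression is modeled only on the sample line in this message; real entries may carry a timestamp prefix or differ between releases. The function name, the Reconnect tuple, and the nsd_servers set are hypothetical and need to be filled in for your own cluster.

```python
import re
from typing import Iterable, List, NamedTuple

# Modeled on the sample entry "[N] Reconnected to xxx.xxx.xxx.xxx nodename <c0n0>".
# Assumption: address, node name and node id are whitespace-separated as shown.
RECONNECT_RE = re.compile(
    r"\[N\] Reconnected to (?P<addr>\S+) (?P<node>\S+) <(?P<id>[^>]+)>"
)


class Reconnect(NamedTuple):
    addr: str
    node: str
    node_id: str
    nsd_server: bool  # True if the peer is in the caller-supplied NSD server set


def find_reconnects(lines: Iterable[str],
                    nsd_servers: frozenset = frozenset()) -> List[Reconnect]:
    """Return one Reconnect per matching log line, in file order."""
    hits = []
    for line in lines:
        m = RECONNECT_RE.search(line)
        if m:
            hits.append(Reconnect(m["addr"], m["node"], m["id"],
                                  m["node"] in nsd_servers))
    return hits


# Made-up sample lines; the timestamp format here is an assumption.
sample = [
    "2017-10-10_17:09:00 [N] Reconnected to 10.0.0.5 nsd01 <c0n3>",
    "2017-10-10_17:09:01 [I] some unrelated entry",
]
hits = find_reconnects(sample, frozenset({"nsd01"}))
```

To run it against a real node, pass an open file handle for that node's mmfs.log instead of the sample list (the log is commonly under /var/adm/ras, but check your installation). A hit only says a reconnect happened; as explained above, it does not by itself mean anything was corrupted.
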
>
> Even on clusters that have not been subjected to disconnects/reconnects
> before, such events might still happen in the future in case of network
> glitches. It is therefore recommended that an efix for the problem be
> applied in a timely fashion.
>
>
> Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668
>
>
>
> Regards, The Spectrum Scale (GPFS) team
>
> ------------------------------------------------------------
> If you feel that your question can benefit other users of Spectrum Scale
> (GPFS), then please post it to the public IBM developerWorks Forum at
> https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479.
>
> If your query concerns a potential software error in Spectrum Scale (GPFS)
> and you have an IBM software maintenance contract please contact
>  1-800-237-5511 in the United States or your local IBM Service Center in
> other countries.
>
> The forum is informally monitored as time permits and should not be used
> for priority messages to the Spectrum Scale (GPFS) team.
>
>
>
> From:        "Oesterlin, Robert" <Robert.Oesterlin at nuance.com>
> To:        gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Date:        10/09/2017 10:38 AM
> Subject:        [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale
> (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file
> system corruption or undetected file data corruption (2017.10.09)
> Sent by:        gpfsug-discuss-bounces at spectrumscale.org
> ------------------------------
>
>
>
> Can anyone from the Scale team comment?
>
> Anytime I see “may result in file system corruption or undetected file
> data corruption” it gets my attention.
>
> Bob Oesterlin
> Sr Principal Storage Engineer, Nuance
>
> IBM Spectrum Scale: *IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network
> reconnect function may result in file system corruption or undetected file
> data corruption*
> <http://www.ibm.com/support/docview.wss?uid=ssg1S1010668&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E>
> IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2
> levels, in which resending an NSD RPC after a network reconnect function
> may result in file system corruption or undetected file data corruption.
>
>  _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>
>