[gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09)

Tomasz.Wolski at ts.fujitsu.com Tomasz.Wolski at ts.fujitsu.com
Wed Oct 25 10:42:02 BST 2017


Thank you for the information. I checked the changelog for GPFS 4.2.3.5 Standard Edition, but it does not mention APAR IJ00398:

https://www-01.ibm.com/support/docview.wss?rs=0&uid=isg400003555
This update addresses the following APARs: IJ00031 IJ00094 IJ00397 IV99611 IV99675 IV99676 IV99677 IV99678 IV99679 IV99680 IV99709 IV99796.

On the other hand, the Flash notification advises upgrading to GPFS 4.2.3.5:
http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E

Could you please verify if that version contains the fix?

Best regards,
Tomasz Wolski

From: Felipe Knop [mailto:knop at us.ibm.com] On Behalf Of IBM Spectrum Scale
Sent: Wednesday, October 11, 2017 2:31 PM
To: gpfsug main discussion list
Cc: gpfsug-discuss-bounces at spectrumscale.org; Wolski, Tomasz
Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09)

Tomasz,

Though the error in the NSD RPC has been the most visible manifestation of the problem, this issue could end up affecting other RPCs as well, so the fix should be applied even for SAN configurations.



Regards, The Spectrum Scale (GPFS) team

------------------------------------------------------------------------------------------------------------------
If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWorks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479.

If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract, please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries.

The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team.



From:        Tomasz.Wolski at ts.fujitsu.com
To:        gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:        10/11/2017 02:09 AM
Subject:        Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09)
Sent by:        gpfsug-discuss-bounces at spectrumscale.org
________________________________



Hi, from what I understand, this does not affect FC SAN cluster configurations but mostly NSD I/O communication. Is that correct?

Best regards,
Tomasz Wolski

From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ben De Luca
Sent: Wednesday, October 11, 2017 6:40 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09)

Lyle, thanks for the update. Has this issue always existed, or only in V4.1 and 4.2?

It seems that the likelihood of this event is very low, but of course you encourage people to update ASAP.



On 11 October 2017 at 00:15, Uwe Falke <UWEFALKE at de.ibm.com> wrote:
Hi, as I understand it, the failure requires that the payload of an RPC
resent without its actual header be mistaken for a valid RPC header. The
resend mechanism probably does not consider the actual content or target
of the RPC.
So, in principle, the RPC could be updating a data block or a metadata
block, so it may hit just a single data file or corrupt your entire file
system.
However, I think the likelihood that RPC content passes as a valid RPC
header is very low.


Mit freundlichen Grüßen / Kind regards


Dr. Uwe Falke

IT Specialist
High Performance Computing Services / Integrated Technology Services /
Data Center Services
-------------------------------------------------------------------------------------------------------------------------------------------
IBM Deutschland
Rathausstr. 7
09111 Chemnitz
Phone: +49 371 6978 2165
Mobile: +49 175 575 2877
E-Mail: uwefalke at de.ibm.com
-------------------------------------------------------------------------------------------------------------------------------------------
IBM Deutschland Business & Technology Services GmbH / Geschäftsführung:
Thomas Wolter, Sven Schooß
Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart,
HRB 17122




From:   Ben De Luca <bdeluca at gmail.com>
To:     gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Cc:     gpfsug-discuss-bounces at spectrumscale.org
Date:   10/10/2017 08:52 PM
Subject:        Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum
Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in
file system corruption or undetected file data corruption (2017.10.09)
Sent by:        gpfsug-discuss-bounces at spectrumscale.org


Does this corrupt the entire file system or just the open files that are
being written to?

One is horrific and the other is just mildly bad.

On 10 October 2017 at 17:09, IBM Spectrum Scale <scale at us.ibm.com> wrote:
Bob,

The problem may occur when the TCP connection is broken between two nodes.
In the vast majority of cases where data stops flowing through the
connection, the result is one of the nodes getting expelled. There are,
however, cases where the TCP connection simply breaks -- that is relatively
rare but happens on occasion. There is logic in the mmfsd daemon to detect
the disconnection and attempt to reconnect to the destination in question.
If the reconnect is successful, then steps are taken to recover the state
kept by the daemons, and that includes resending some RPCs that were in
flight when the disconnection took place.

As the flash describes, a problem in the logic to resend some RPCs was
causing one of the RPC headers to be omitted, resulting in the RPC data
being interpreted as the (missing) header. Normally the result is an assert
on the receiving end, like the logAssertFailed: !"Request and queue size
mismatch" assert described in the flash. However, it's at least
conceivable (though expected to be very rare) that the content of the RPC
data could be interpreted as a valid RPC header. In the case of an RPC
which involves data transfer between an NSD client and NSD server, that
might result in incorrect data being written to some NSD device.
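
The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the 8-byte header layout, the MAGIC value, and the function names frame, buggy_resend, and receive are all hypothetical and are not taken from the GPFS wire format or from mmfsd.

```python
import struct

# Hypothetical wire format for this sketch only: 4-byte magic + 4-byte length.
HEADER = struct.Struct(">II")
MAGIC = 0x52504331  # arbitrary marker, not taken from GPFS


def frame(payload: bytes) -> bytes:
    """Correct send path: header followed by payload."""
    return HEADER.pack(MAGIC, len(payload)) + payload


def buggy_resend(payload: bytes) -> bytes:
    """Models the defect: the header is omitted on resend."""
    return payload


def receive(message: bytes) -> bytes:
    """Treat the first 8 bytes as a header and validate it."""
    if len(message) < HEADER.size:
        raise ValueError("message shorter than a header")
    magic, length = HEADER.unpack_from(message)
    body = message[HEADER.size:]
    if magic != MAGIC or length != len(body):
        # Common outcome: the payload does not look like a header at all.
        raise ValueError("header check failed")
    return body
```

In this model, receive(frame(data)) returns the data unchanged. receive(buggy_resend(data)) normally raises, which plays the role of the assert on the receiving end. Only a payload whose leading bytes happen to form a plausible header is accepted, and then the wrong bytes are treated as the body.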

Disconnect/reconnect scenarios appear to be uncommon. An entry like

[N] Reconnected to xxx.xxx.xxx.xxx nodename <c0n0>

in mmfs.log would be an indication that a reconnect has occurred. By
itself, the reconnect does not imply that data or the file system was
corrupted, since that depends on which RPCs were pending when the
disconnection happened. If the assert above is hit, no corruption is
expected, since the daemon will go down before incorrect data gets
written.
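
A minimal sketch for spotting such entries, assuming log lines resemble the sample quoted above. The regular expression is modeled only on that one sample, so real entries may differ between releases; the helper names find_reconnects and scan_file are hypothetical, and the log path is left to the caller.

```python
import re
from typing import Iterable, List, Tuple

# Modeled on the sample entry in this thread; unanchored so that a leading
# timestamp or other prefix (an assumption) does not prevent a match.
RECONNECT_RE = re.compile(r"\[N\] Reconnected to (\S+) (\S+) <(\w+)>")


def find_reconnects(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (address, node name, node id) for each reconnect entry."""
    hits = []
    for line in lines:
        match = RECONNECT_RE.search(line)
        if match:
            hits.append(match.groups())
    return hits


def scan_file(path: str) -> List[Tuple[str, str, str]]:
    """Scan a log file; the caller supplies the mmfs.log location."""
    with open(path, errors="replace") as handle:
        return find_reconnects(handle)
```

An empty result only means no matching entry was found in that file. As noted above, a hit does not by itself indicate corruption.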

Reconnects involving an NSD server are those which present the highest
risk, given that NSD-related RPCs are used to write data into NSDs.

Even on clusters that have not been subjected to disconnects/reconnects
before, such events might still happen in the future in case of network
glitches. It is therefore recommended that an efix for the problem be
applied in a timely fashion.


Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668



Regards, The Spectrum Scale (GPFS) team




From:        "Oesterlin, Robert" <Robert.Oesterlin at nuance.com>
To:        gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:        10/09/2017 10:38 AM
Subject:        [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale
(GPFS) V4.1 and 4.2 levels: network reconnect function may result in file
system corruption or undetected file data corruption (2017.10.09)
Sent by:        gpfsug-discuss-bounces at spectrumscale.org



Can anyone from the Scale team comment?

Anytime I see "may result in file system corruption or undetected file
data corruption" it gets my attention.

Bob Oesterlin
Sr Principal Storage Engineer, Nuance





















IBM Spectrum Scale



IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect
function may result in file system corruption or undetected file data
corruption



IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2
levels, in which resending an NSD RPC after a network reconnect function
may result in file system corruption or undetected file data corruption.

















_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss




