[gpfsug-discuss] Again! Using IBM Spectrum Scale could lead to data loss

Felipe Knop knop at us.ibm.com
Wed Aug 23 05:40:19 BST 2017


Aaron,

IBM's policy is to issue a flash when such a data corruption/loss problem 
has been identified, even if the problem has never been encountered by any 
customer. In fact, most of the flashes have been the result of internal 
test activity, even though the discovery took place after the affected 
versions/PTFs had already been released.  This is the case for two of the 
recent flashes:

http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010293

http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487

The flashes normally do not indicate the risk level that a given problem 
has of being hit, since there are just too many variables at play, given 
that clusters and workloads vary significantly.

The first issue above appears to be uncommon (and potentially rare).  The 
second issue seems to have a higher probability of occurring -- and as 
described in the flash, the problem is triggered by failures being 
encountered while running one of the commands listed in the "Users 
Affected" section of the writeup.

I don't think precise recommendations could be given on whether the bugs 
fall in the category of "drop everything and patch *now*" or "this is a 
theoretically nasty bug but we've yet to see it in the wild", since 
different clusters, configurations, or workloads may drastically affect 
the likelihood of hitting the problem.  On the other hand, when coming up 
with the text for the flash, the team attempts to provide as much 
information as possible/available on the known triggers and mitigating 
circumstances.

  Felipe

----
Felipe Knop                                     knop at us.ibm.com
GPFS Development and Security
IBM Systems
IBM Building 008
2455 South Rd, Poughkeepsie, NY 12601
(845) 433-9314  T/L 293-9314





From:   Aaron Knister <aaron.knister at gmail.com>
To:     gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:   08/22/2017 10:37 AM
Subject:        Re: [gpfsug-discuss] Again! Using IBM Spectrum Scale could 
lead to data loss
Sent by:        gpfsug-discuss-bounces at spectrumscale.org



Hi Jochen,

I share your concern about data loss bugs and I too have found it 
troubling especially since the 4.2 stream is in my immediate future 
(although I would have rather stayed on 4.1 due to my perception of 
stability/integrity issues in 4.2). By and large 4.1 has been *extremely* 
stable for me.

While not directly related to the stability concerns, I'm curious as to 
why your customer sites are requiring downtime to do the upgrades. While, 
of course, individual servers need to be taken offline to update GPFS, the 
cluster as a whole should be able to stay up. Perhaps your customer 
environments just don't lend themselves to that. 

It occurs to me that some of these bugs sound serious (and indeed I 
believe this one is). I recently found myself jumping prematurely into an 
update for the metanode filesize corruption bug which, as it turns out, 
while very scary sounding, is not necessarily a particularly common bug (if 
I understand correctly). Perhaps it would be helpful if IBM could clarify 
the believed risk of these bugs or give us some indication of whether they 
fall in the category of "drop everything and patch *now*" or "this is a 
theoretically nasty bug but we've yet to see it in the wild". I could 
imagine IBM legal wanting to avoid a situation where IBM indicates 
something is low risk but someone hits it and it eats data. Many 
companies do this with security patches, though, so perhaps it's a non-issue.

From my perspective I don't think existing customers are being 
"forgotten". I think IBM is pushing hard to help Spectrum Scale adapt to 
an ever-changing world and I think these features are necessary and 
useful. Perhaps Scale would benefit from more resources being dedicated to 
QA/Testing which isn't a particularly sexy thing-- it doesn't result in 
any new shiny features for customers (although "not eating your data" is a 
feature I find really attractive).

Anyway, I hope IBM can find a way to minimize the frequency of these bugs. 
Personally speaking, I'm pretty convinced it's not for lack of capability 
or dedication on the part of the great folks actually writing the code.

-Aaron

On Tue, Aug 22, 2017 at 7:09 AM, Zeller, Jochen <Jochen.Zeller at sva.de> 
wrote:
Dear community,
 
this morning I started in a good mood, until I checked my mailbox. 
Again a reported bug in Spectrum Scale that could lead to data loss. 
During the last year I was looking for a stable Scale version, and each 
time I thought: “Yes, this one is stable and without serious data loss 
bugs” - a few days later, IBM announced a new APAR with possible data loss 
for this version. 
 
I am supporting many clients in central Europe. They store databases, 
backup data, life science data, video data, results of technical 
computing, do HPC on the file systems, etc. Some of them had to change 
their Scale version nearly monthly during the last year to avoid running 
into one of the serious data loss bugs in Scale. From my perspective, it was 
and is a shame to inform clients about newly reported bugs right after the 
last update. From the client perspective, it was and is a lot of work and 
planning to get a new downtime for updates. And their internal 
customers are not satisfied with so many downtimes of the clusters and 
applications.
 
To me, it seems that Scale development is working on features for a 
specific project or client, to achieve special requirements. But they 
have forgotten the existing clients, who use Scale for storing important 
data or running important workloads on it.
 
To make us more visible, I’ve used the IBM recommended way to notify about 
mandatory enhancements, the less favored RFE:
 
http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=109334
 
If you like, vote for more reliability in Scale.
 
I hope this is a good way to show development and responsible persons that we 
have trouble and are not satisfied with the quality of the releases.
 
 
Regards,
 
Jochen 
 
 
 
 
 
 
 

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss




