[gpfsug-discuss] data integrity documentation

Stijn De Weirdt stijn.deweirdt at ugent.be
Wed Aug 2 21:38:29 BST 2017


hi ed,

On 08/02/2017 10:11 PM, Edward Wahl wrote:
> What version of GPFS?  Are you generating a patch file?
4.2.3 series; we now run 4.2.3.3

to be clear: right now we use mmfsck to trigger the checksum issue, hoping
we can find the actual "hardware" issue.

we know by elimination which HCAs to avoid, so we no longer get the
checksum errors. but before we consider that a fix, we need to know whether
the data written by the clients can be trusted at all, given that these hw
errors are otherwise silent.
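
fwiw, to double check which HCAs misbehave we also look at the infiniband
port error counters with the standard infiniband-diags tools (assuming they
are installed on the nodes); a rough sketch:

# report fabric ports whose error counters exceed the default thresholds
ibqueryerrors
# or dump the (extended) counters of the local HCA port directly
perfquery -x

this obviously only catches errors the HCA itself counts; truly silent
corruption may not show up there at all.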

> 
> Try using this before your mmfsck:
> 
> mmdsh -N <nsdnodes|all> mmfsadm test fsck usePatchQueue 0
mmchmgr somefs nsdXYZ
mmfsck somefs -Vn -m -N nsdXYZ -t /var/tmp/

the idea is to force everything as much as possible onto one node, so that
access to the other failure group is forced over the network.
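
for completeness, the whole sequence we run looks roughly like this (somefs
and nsdXYZ are of course placeholders for the filesystem name and the chosen
nsd server):

# disable the mmfsck patch queue on the nsd servers, as ed suggested
# (could also be run against all nodes)
mmdsh -N nsdnodes mmfsadm test fsck usePatchQueue 0
# move the fs manager to the node that should do all the work
mmchmgr somefs nsdXYZ
# verbose, check-only mmfsck restricted to that node, so the replicas in the
# other failure group have to be fetched over the network
mmfsck somefs -Vn -m -N nsdXYZ -t /var/tmp/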

> 
> my notes say all, but I would have only had NSD nodes up at the time.
> Supposedly the mmfsck mess in 4.1 and 4.2.x was fixed in 4.2.2.3. 
we recently had the "pleasure" of mmfsck segfaulting while we were trying
to recover a filesystem; at least that was certainly fixed ;)


stijn

> I won't know for sure until late August.
> 
> Ed
> 
> 
> On Wed, 2 Aug 2017 21:20:14 +0200
> Stijn De Weirdt <stijn.deweirdt at ugent.be> wrote:
> 
>> hi sven,
>>
>> the data is not corrupted. mmfsck compares the 2 replicas of an inode,
>> says they don't match, but checking the data with tsdbfs reveals they are
>> equal. (one replica has to be fetched over the network; the nsds cannot
>> access all disks)
>>
>> with some nsdChksum... settings enabled, we see a lot of "Encountered XYZ
>> checksum errors on network I/O to NSD Client disk" messages during this mmfsck.
>>
>> ibm support says these are hardware issues, but that with respect to
>> mmfsck they are false positives.
>>
>> anyway, our current question is: if these are hardware issues, is there
>> anything in the gpfs client->nsd path (on the network side) that would
>> detect such errors? i.e. can we trust the data (and metadata)?
>> i was under the impression that client-to-disk is not covered, but i
>> assumed that at least client-to-nsd (the network part) was checksummed.
>>
>> stijn
>>
>>
>> On 08/02/2017 09:10 PM, Sven Oehme wrote:
>>> ok, i think i understand now: the data was already corrupted. the config
>>> change i proposed only prevents a known potential on-the-wire corruption
>>> in the future; it will not fix something that already made it to disk.
>>>
>>> Sven
>>>
>>>
>>>
>>> On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt <stijn.deweirdt at ugent.be>
>>> wrote:
>>>   
>>>> yes ;)
>>>>
>>>> the system is in preproduction, so there is nothing that can't be
>>>> stopped/started in a few minutes (the current setup has only 4 nsds, and
>>>> no clients). mmfsck triggers the errors very early, during the inode
>>>> replica compare.
>>>>
>>>>
>>>> stijn
>>>>
>>>> On 08/02/2017 08:47 PM, Sven Oehme wrote:  
>>>>> How can you reproduce this so quickly?
>>>>> Did you restart all daemons after that ?
>>>>>
>>>>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt <stijn.deweirdt at ugent.be>
>>>>> wrote:
>>>>>  
>>>>>> hi sven,
>>>>>>
>>>>>>  
>>>>>>> the very first thing you should check is whether you have this
>>>>>>> setting set:
>>>>>> maybe the very first thing to check should be the faq/wiki that has this
>>>>>> documented?
>>>>>>  
>>>>>>>
>>>>>>> mmlsconfig envVar
>>>>>>>
>>>>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1
>>>>>>> MLX5_USE_MUTEX 1
>>>>>>>
>>>>>>> if that doesn't come back as above, you need to set it:
>>>>>>>
>>>>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1
>>>>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1"  
>>>>>> i just set this (it wasn't set before), but the problem is still present.
>>>>>>  
>>>>>>>
>>>>>>> there was a problem in the Mellanox FW in various versions that was
>>>>>>> never completely addressed (bugs were found and fixed, but it was
>>>>>>> never fully proven to be addressed). the above environment variables
>>>>>>> turn on code in the mellanox driver that prevents this potential code
>>>>>>> path from being used to begin with.
>>>>>>>
>>>>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in
>>>>>>> Scale so that even if you don't set these variables the problem can't
>>>>>>> happen anymore; until then the only choice you have is the envVar
>>>>>>> above (which btw ships as default on all ESS systems).
>>>>>>>
>>>>>>> you also should be on the latest available Mellanox FW & drivers, as
>>>>>>> not all versions even have the code that is activated by the
>>>>>>> environment variables above. i think at a minimum you need to be at
>>>>>>> 3.4, but i don't remember the exact version. there have been multiple
>>>>>>> defects opened around this area; the last one i remember was:
>>>>>> we run mlnx ofed 4.1; the fw is not the latest, as we have edr cards
>>>>>> from dell and their fw is a bit behind. i'm trying to convince dell to
>>>>>> make a new one. mellanox used to allow you to make your own, but they
>>>>>> don't anymore.
>>>>>>  
>>>>>>>
>>>>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on
>>>>>>> pthread_spin_lock
>>>>>>>
>>>>>>> you may ask your mellanox representative if they can get you access
>>>>>>> to this defect. while it was found on ESS, i.e. on PPC64 and with
>>>>>>> ConnectX-3 cards, it is a general issue that affects all cards, on
>>>>>>> intel as well as on Power.
>>>>>> ok, thanks for this. maybe such a reference is enough for dell to update
>>>>>> their firmware.
>>>>>>
>>>>>> stijn
>>>>>>  
>>>>>>>
>>>>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt
>>>>>>> <stijn.deweirdt at ugent.be> wrote:
>>>>>>>  
>>>>>>>> hi all,
>>>>>>>>
>>>>>>>> is there any documentation wrt data integrity in spectrum scale:
>>>>>>>> assuming a crappy network, does gpfs somehow guarantee that data
>>>>>>>> written by a client ends up safe in the nsd gpfs daemon, and
>>>>>>>> similarly from the nsd gpfs daemon to disk?
>>>>>>>>
>>>>>>>> and wrt a crappy network, what about rdma on a crappy network? is it
>>>>>>>> the same?
>>>>>>>>
>>>>>>>> (we are hunting down a crappy infiniband issue; ibm support says
>>>>>>>> it's a network issue, and we see no errors anywhere...)
>>>>>>>>
>>>>>>>> thanks a lot,
>>>>>>>>
>>>>>>>> stijn