[gpfsug-discuss] Odd behavior with cat followed by grep.

Wed Feb 14 17:51:04 GMT 2018

Just speculating here (also known as making things up) but I wonder if
grep is somehow using the file's size in its determination of binary
status. I also see mmap in the strace so maybe there's some issue with
mmap where some internal GPFS buffer is getting truncated
inappropriately but leaving a bunch of null values which get returned
to grep.
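
One quick way to test that guess is to look for NUL bytes in what the
filesystem actually hands back, since the GNU grep manual lists null input
bytes as one trigger for binary treatment. This is only a rough sketch:
the function names count_nul and check_size are made up for illustration,
and it assumes GNU coreutils (stat -c) on Linux.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from any GPFS tooling).
# Assumes GNU coreutils on Linux for 'stat -c %s'.

# Count the NUL bytes in the data actually returned when reading the file.
count_nul() {
    tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Compare the size reported by stat() with the number of bytes read.
# Reading through a pipe forces wc to consume the data instead of
# relying on the file's metadata.
check_size() {
    reported=$(stat -c %s "$1")
    actual=$(cat "$1" | wc -c | tr -d ' ')
    if [ "$reported" -eq "$actual" ]; then
        echo "size ok: $reported"
    else
        echo "size mismatch: stat=$reported read=$actual"
    fi
}

# When run with file arguments, report on each file in turn.
for f in "$@"; do
    echo "$f: $(check_size "$f"), NUL bytes: $(count_nul "$f")"
done
```

Running it against the temp file right after the cat, in place of the
first grep, and then again a few seconds later would show whether any
NULs appear only on the first read.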

-Aaron

On 2/14/18 10:21 AM, John Hanks wrote:
> Hi Valdis,
> 
> I tried with the grep replaced with 'ls -ls' and 'md5sum'. I don't think
> this is a data integrity issue, thankfully:
> 
> $ ./pipetestls.sh 
> 256 -rw-r--r-- 1 39073 3001 530721 Feb 14 07:16 /srv/gsfs0/projects/pipetest.tmp.txt
> 0 -rw-r--r-- 1 39073 3953 530721 Feb 14 07:16 /home/griznog/pipetest.tmp.txt
> 
> $ ./pipetestmd5.sh 
> 15cb81a85c9e450bdac8230309453a0a  /srv/gsfs0/projects/pipetest.tmp.txt
> 15cb81a85c9e450bdac8230309453a0a  /home/griznog/pipetest.tmp.txt
> 
> And replacing grep with 'file' even properly sees the files as ASCII:
> $ ./pipetestfile.sh 
> /srv/gsfs0/projects/pipetest.tmp.txt: ASCII text, with very long lines
> /home/griznog/pipetest.tmp.txt: ASCII text, with very long lines
> 
> I'll poke a little harder at grep next and see what the difference in
> strace of each reveals.
> 
> Thanks,
> 
> jbh
> 
> 
> 
> 
> On Wed, Feb 14, 2018 at 7:08 AM, <valdis.kletnieks at vt.edu> wrote:
> 
>     On Wed, 14 Feb 2018 06:20:32 -0800, John Hanks said:
> 
>     > #  ls -aln /srv/gsfs0/projects/pipetest.tmp.txt $HOME/pipetest.tmp.txt
>     > -rw-r--r-- 1 39073 3953 530721 Feb 14 06:10 /home/griznog/pipetest.tmp.txt
>     > -rw-r--r-- 1 39073 3001 530721 Feb 14 06:10 /srv/gsfs0/projects/pipetest.tmp.txt
>     >
>     > We can "fix" the user case that exposed this by not using a temp file or
>     > inserting a sleep, but I'd still like to know why GPFS is behaving this way
>     > and make it stop.
> 
>     May be related to replication, or other behind-the-scenes behavior.
> 
>     Consider this example - 4.2.3.6, data and metadata replication both
>     set to 2, 2 sites 95 cable miles apart, each is 3 Dell servers with
>     a full fiberchannel mesh to 3 Dell MD34something arrays.
> 
>     % dd if=/dev/zero bs=1k count=4096 of=sync.test; ls -ls sync.test; sleep 5; ls -ls sync.test; sleep 5; ls -ls sync.test
>     4096+0 records in
>     4096+0 records out
>     4194304 bytes (4.2 MB) copied, 0.0342852 s, 122 MB/s
>     2048 -rw-r--r-- 1 root root 4194304 Feb 14 09:35 sync.test
>     8192 -rw-r--r-- 1 root root 4194304 Feb 14 09:35 sync.test
>     8192 -rw-r--r-- 1 root root 4194304 Feb 14 09:35 sync.test
> 
>     Notice that the first /bin/ls shouldn't be starting until after the dd
>     has completed - at which point it's only allocated half the blocks
>     needed to hold the 4M of data at one site.  5 seconds later, it's
>     allocated the blocks at both sites and thus shows the full 8M needed
>     for 2 copies.
> 
>     I've also seen (but haven't replicated it as I write this) a small file
>     (4-8K or so) showing first one full-sized block, then a second
>     full-sized block, and then dropping back to what's needed for 2 1/32nd
>     fragments.  That had me scratching my head.
> 
>     Having said that, that's all metadata fun and games, while your case
>     appears to have some problems with data integrity (which is a whole lot
>     scarier).  It would be *really* nice if we understood the problem here.
> 
>     The scariest part is:
> 
>     > The first grep | wc -l returns 1, because grep outputs "Binary file
>     > /path/to/gpfs/mount/test matches"
> 
>     which seems to be implying that we're failing on semantic consistency.
>     Basically, your 'cat' command is completing and closing the file, but
>     then a temporally later open of the same file is reading something
>     other than only the just-written data.  My first guess is that it's a
>     race condition similar to the following: The cat command is causing a
>     write on one NSD server, and the first grep results in a read from a
>     *different* NSD server, returning the data that *used* to be in the
>     block because the read actually happens before the first NSD server
>     actually completes the write.
> 
>     It may be interesting to replace the greps with pairs of 'ls -ls / dd'
>     commands to grab the raw data and its size, and check the following:
> 
>     1) does the size (both blocks allocated and logical length) reported by
>     ls match the amount of data actually read by the dd?
> 
>     2) Is the file length as actually read equal to the written length, or
>     does it overshoot and read all the way to the next block boundary?
> 
>     3) If the length is correct, what's wrong with the data that's telling
>     grep that it's a binary file?  ( od -cx is your friend here).
> 
>     4) If it overshoots, is the remainder all-zeros (good) or does it return
>     semi-random "what used to be there" data (bad, due to data exposure
>     issues)?
> 
>     (It's certainly not the most perplexing data consistency issue I've hit
>     in 4 decades - the winner *has* to be an intermittent data read
>     corruption on a GPFS 3.5 cluster that had us, IBM, SGI, DDN, and at
>     least one vendor of networking gear all chasing our tails for 18 months
>     before we finally tracked it down. :)
> 
>     _______________________________________________
>     gpfsug-discuss mailing list
>     gpfsug-discuss at spectrumscale.org <http://spectrumscale.org>
>     http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>     <http://gpfsug.org/mailman/listinfo/gpfsug-discuss>
> 
> 
> 
> 
> 

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776