Hi Matevz,
I've seen two different aspects of this in the past:
1) If IP checksums are computed on the card, or if a switch in your network
is compressing the data because it knows the other switch can decompress
it, then certain sequences of bits may cause the wrong checksum to be
computed or may, in fact, flip bits. The cases I've seen involved long runs
of a single bit value (1 or 0) being transmitted. This is hard to pin down,
and you would need the manufacturer to work with you on it (at the very
least, they can tell you whether they have seen this problem).
2) We have indeed seen that some sites hold corrupted files and,
unfortunately, if you happen to read from them on certain days, you get
corrupted data. This is easier to figure out (trivial via brute force).
While I wish that were the case here, I am a bit skeptical, since the error
mode is always the same and presumably occurs on a random set of files. But
at least this one is easy to rule out.
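
The brute-force check in (2) could look roughly like the sketch below. It
assumes the expected checksums are Adler-32 values taken from some catalogue
and that the source copies are readable as local paths; the function names
and the `expected` mapping are hypothetical, not part of xrootd.

```python
import zlib


def adler32_file(path, chunk_size=1 << 20):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def find_mismatches(expected):
    """expected: dict mapping path -> expected hex checksum (hypothetical input).

    Returns the sorted list of paths whose current checksum differs.
    """
    return sorted(
        path for path, want in expected.items()
        if adler32_file(path) != want.lower()
    )
```

Running this against the source copies at each site would show whether the
corruption already exists upstream or only appears in the proxy's cache.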
Andy
-----Original Message-----
From: Matevz Tadel
Sent: Wednesday, October 08, 2014 5:21 PM
To: xrootd-dev
Cc: Jeff Dost
Subject: Bit flip of a pair of bits in caching proxy
Hi,
This does not seem to be an xrootd-related problem (other than that we
observe it on a caching proxy), but we figured people here might know how to
go about figuring out what's happening.
We just noticed that checksums on some files in the healing caching proxy
don't match up. Looking at the bit level, it turns out it is always bit 2
(2^2 = 4) being flipped twice, in opposite directions, at offsets 16 bytes
apart. Strange, right? It also seems to bypass the IP checksum protection.
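
For illustration, here is a minimal sketch of how such a pattern can be
located and why it would pass unnoticed. It assumes a known-good copy and the
cached copy are both available as byte strings; the function names and the
64-byte test buffer are made up for the example. The checksum routine is the
16-bit ones'-complement sum described in RFC 1071, which TCP and IP use. Two
flips of the same bit in opposite directions, an even number of bytes apart,
add and subtract the same amount from that sum, so it stays unchanged. This
only shows the checksum cannot see the pattern; it says nothing about where
the flips happen.

```python
def flipped_bits(good, bad):
    """List (offset, bit, direction) for every bit that differs.

    direction is '0->1' or '1->0', going from good to bad.
    """
    found = []
    for offset, (g, b) in enumerate(zip(good, bad)):
        diff = g ^ b
        for bit in range(8):
            if diff & (1 << bit):
                direction = "1->0" if g & (1 << bit) else "0->1"
                found.append((offset, bit, direction))
    return found


def internet_checksum(data):
    """16-bit ones'-complement checksum as described in RFC 1071."""
    data = bytes(data)  # copy, so the caller's buffer is never modified
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Made-up example: bit 2 is clear at offset 4 and set at offset 20.
good = bytearray(64)
good[20] = 0x04
bad = bytearray(good)
bad[4] ^= 0x04   # bit 2 flips 0 -> 1
bad[20] ^= 0x04  # bit 2 flips 1 -> 0, 16 bytes later

print(flipped_bits(good, bad))
print(internet_checksum(good) == internet_checksum(bad))
```

In this example the two buffers differ in exactly two bits, yet their
checksums compare equal, whereas a single flip would change the checksum.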
To us it seems like a hardware error ... but we are unsure how to pin it
down. The errors are time-correlated, i.e., most of them occur on a few
selected days. This could also mean that jobs running at UCSD on those days
were asking for data from some dataset and we always got redirected to the
same site; we will cross-check this. My gut feeling is that it must be the
network ... but there is really no basis for this, as I've never had to deal
with a thing like that. Could it be RAM (see ECC info below)?
Has anybody seen anything like this? Any ideas?
The machine where we run the proxy is rather oldish, with two 1 Gbps ports
(one going outside and one into our T2). The disk where it happens is a
logical volume composed of 4 real partitions from 4 different disks (not
RAID).
Cheers,
Matevz
[1720] root@xrootd-proxy ~# dmidecode --type 16
# dmidecode 2.12
SMBIOS 2.4 present.
Handle 0x0022, DMI type 16, 15 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Single-bit ECC
Maximum Capacity: 48 GB
Error Information Handle: Not Provided
Number Of Devices: 12
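
One way to follow up on the RAM question might be to look at the kernel's
ECC error counters. The sketch below assumes a Linux host with an EDAC driver
loaded for the memory controller, in which case the counters appear under
/sys/devices/system/edac/mc; the function name is hypothetical. Corrected
error counts that grow over time could hint at marginal memory, though a zero
count does not prove the RAM is fine.

```python
import glob
import os


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {'ce_count': n, 'ue_count': n}} from EDAC sysfs.

    A value is None if the file is missing or unreadable. The result is
    empty if no EDAC memory controllers are exposed under `base`.
    """
    counts = {}
    for mc in sorted(glob.glob(os.path.join(base, "mc*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):  # corrected / uncorrected
            try:
                with open(os.path.join(mc, name)) as f:
                    entry[name] = int(f.read().strip())
            except (OSError, ValueError):
                entry[name] = None
        counts[os.path.basename(mc)] = entry
    return counts


if __name__ == "__main__":
    print(read_edac_counts())
```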
########################################################################
Use REPLY-ALL to reply to list
To unsubscribe from the XROOTD-DEV list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1
########################################################################