Thanks Andy,
Indeed, 2 would be the best outcome; I'll keep you posted. I did not know about
routers being able to do compression. If we identify a specific site as the
culprit and the problem is not consistently reproducible, it will probably hint
at a network issue on the route from that site to UCSD.
Matevz
On 10/08/14 19:01, Andrew Hanushevsky wrote:
> Hi Matevz,
>
> I've seen two different aspects of this in the past...
>
> 1) If IP checksums are computed on the card or if a switch in your network is
> compressing the data because it knows the other switch knows how to decompress
> it, then it's possible that certain sequences of bits may cause the wrong
> checksum to be computed or may, in fact, flip bits. The cases I've seen are
> where long runs of a particular bit (1 or 0) are transmitted. Hard to pin down
> and you would need the manufacturer to work with you on this (well, at least
> they can tell you if they have seen this problem).
>
> 2) Indeed, we have seen that some sites have files that are corrupted and,
> unfortunately, you get to pick from them on certain days and get corrupted data.
> This is easier to figure out (trivial via brute force). While I wish that were
> the case here, I am a bit skeptical since the error mode is always the same and
> presumably on a random set of files. But, at least this one is easy to rule out.
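
The brute-force check in (2) could be sketched roughly as follows in Python. This assumes the same file has already been fetched from each candidate site into a local copy (how it is fetched is left out), and that the majority of copies is correct. The function names and the site-to-path mapping are hypothetical, and Adler-32 is used only as an example checksum.

```python
import zlib
from collections import Counter


def adler32_of(path, chunk_size=1 << 20):
    """Adler-32 of a file, read in chunks so large files fit in memory."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF


def odd_sites_out(copies):
    """copies: dict of site name -> local path of the copy fetched from it.

    Returns the sites whose checksum differs from the most common one.
    Assumption: the majority of the copies is the good version.
    """
    sums = {site: adler32_of(path) for site, path in copies.items()}
    majority, _ = Counter(sums.values()).most_common(1)[0]
    return sorted(site for site, s in sums.items() if s != majority)
```

If one site keeps showing up in the result for many files, that would point to corrupted files at the source rather than to the proxy or the network.
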
>
> Andy
>
> -----Original Message----- From: Matevz Tadel
> Sent: Wednesday, October 08, 2014 5:21 PM
> To: xrootd-dev
> Cc: Jeff Dost
> Subject: Bit flip of a pair of bits in caching proxy
>
> Hi,
>
> This does not seem to be an xrootd related problem (other than that we observe
> it on a caching proxy) but we figured people here might know how to go about
> figuring out what's happening.
>
> We just noticed checksums on some files in healing caching proxy don't match up.
> Looking at the bit level, it turns out it is always bit 2 (2^2 = 4) that is
> flipped twice, in opposite directions, at offsets 16 bytes apart. Strange,
> right? It also seems to bypass the IP checksum protection.
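
One possible reason such a pair can pass the network checksum: the 16-bit one's-complement checksum used by TCP/IP (RFC 1071) is a plain sum of 16-bit words, so setting a bit in one word and clearing the same bit in another word leaves the sum unchanged. Offsets 16 bytes apart land on the same byte position within a 16-bit word. A minimal Python sketch, assuming both flips fall inside the same checksummed segment; the payload is made up and the helper names are hypothetical:

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def find_bit_flips(good: bytes, bad: bytes):
    """Return (offset, bit, direction) for every bit that differs."""
    flips = []
    for offset, (g, b) in enumerate(zip(good, bad)):
        diff = g ^ b
        for bit in range(8):
            if diff & (1 << bit):
                direction = "0->1" if b & (1 << bit) else "1->0"
                flips.append((offset, bit, direction))
    return flips


# Made-up payload: bit 2 is clear in byte 8 and set in byte 24.
good = bytearray(range(64))
good[8] = 0x10
good[24] = 0x14
bad = bytearray(good)
bad[8] ^= 0x04   # bit 2 flips 0 -> 1
bad[24] ^= 0x04  # bit 2 flips 1 -> 0, 16 bytes later

flips = find_bit_flips(bytes(good), bytes(bad))
same_checksum = internet_checksum(bytes(good)) == internet_checksum(bytes(bad))
print(flips)
print("checksum unchanged:", same_checksum)
```

The same XOR comparison can be run on a good and a bad copy of a real file to confirm the offsets and directions of the flips.
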
>
> To us it seems like a hardware error ... but we are unsure as to how to pin it
> down. The errors are time correlated, i.e., most of the errors occur on a few
> selected days. This could also mean jobs that were running at UCSD on those days
> were asking for data from some dataset and we always got redirected to the same
> site -- we will cross check this. My gut feeling is it must be network ... but
> there really is no basis for this as I've never had to deal with a thing like
> that. Could it be RAM (see ECC info below)?
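
To look into the RAM question: on Linux, when an EDAC driver for the memory controller is loaded, corrected and uncorrected ECC error counts are exposed under /sys/devices/system/edac/mc/. A small Python sketch for reading them, assuming that sysfs layout is present on the proxy machine (it may not be on an older kernel or without an EDAC driver); the function name is hypothetical:

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": n, "ue_count": m}} from EDAC sysfs.

    ce_count = corrected errors, ue_count = uncorrected errors.
    Returns an empty dict if no EDAC counters are found under `base`.
    """
    counts = {}
    for mc in sorted(Path(base).glob("mc*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        if entry:
            counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    print(read_edac_counts() or "no EDAC counters found")
```

Non-zero counts would make memory a more likely suspect; zero counts do not rule it out.
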
>
> Has anybody seen anything like this? Any ideas?
>
> The machine where we run the proxy is rather oldish, with two 1 Gbps ports (one
> going outside and one into our T2). The disk where it happens is a logical
> volume composed of 4 real partitions from 4 different disks (not RAID).
>
> Cheers,
> Matevz
>
>
> [1720] root@xrootd-proxy ~# dmidecode --type 16
> # dmidecode 2.12
> SMBIOS 2.4 present.
>
> Handle 0x0022, DMI type 16, 15 bytes
> Physical Memory Array
> Location: System Board Or Motherboard
> Use: System Memory
> Error Correction Type: Single-bit ECC
> Maximum Capacity: 48 GB
> Error Information Handle: Not Provided
> Number Of Devices: 12
>
> ########################################################################
> Use REPLY-ALL to reply to list
>
> To unsubscribe from the XROOTD-DEV list, click the following link:
> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1
> ########################################################################