Hi,
This does not seem to be an xrootd-related problem (other than that we observe
it on a caching proxy), but we figured people here might know how to go about
figuring out what's happening.
We just noticed that checksums on some files in the healing caching proxy don't
match up. Looking at the bit level, it turns out it is always bit 2 (2^2 = 4)
that is flipped twice, in opposite directions, at offsets 16 bytes apart.
Strange, right? It also seems to slip past the TCP/IP checksum protection.
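A minimal sketch of why such a pair of flips could go unnoticed, assuming the
data travels over TCP, whose payload checksum is a 16-bit ones' complement sum
in the style of RFC 1071. Two opposite flips of the same bit at offsets 16
bytes apart land on the same bit of two different 16-bit words, so one adds
exactly what the other subtracts. The buffer contents and offsets (10 and 26)
are made up for illustration, and the helper names are hypothetical. This only
shows the pattern is consistent with passing the checksum; it does not show
where the corruption happens.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data = data + b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def flipped_bits(good: bytes, bad: bytes):
    """List (offset, bit, direction) for every bit that differs."""
    flips = []
    for offset, (g, b) in enumerate(zip(good, bad)):
        diff = g ^ b
        for bit in range(8):
            if diff & (1 << bit):
                direction = "0->1" if b & (1 << bit) else "1->0"
                flips.append((offset, bit, direction))
    return flips


# Made-up reference data: bit 2 clear at offset 10, set at offset 26.
good = bytearray(range(64))
good[10] &= 0xFB
good[26] |= 0x04

# Corrupted copy: bit 2 flipped at both offsets, in opposite directions.
bad = bytearray(good)
bad[10] ^= 0x04
bad[26] ^= 0x04

# Copy with only one of the two flips, for comparison.
single = bytearray(good)
single[10] ^= 0x04

print(flipped_bits(good, bad))
print("pair of flips detected:", internet_checksum(good) != internet_checksum(bad))
print("single flip detected:  ", internet_checksum(good) != internet_checksum(single))
```

The paired flips leave the checksum unchanged while the single flip changes it.
Running `flipped_bits` over a good and a bad copy of a real file would list the
affected offsets in the same form.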
To us it seems like a hardware error ... but we are unsure how to pin it
down. The errors are time correlated, i.e., most of the errors occur on a few
selected days. This could also mean jobs that were running at UCSD on those days
were asking for data from some dataset and we always got redirected to the same
site -- we will cross check this. My gut feeling is it must be network ... but
there really is no basis for this as I've never had to deal with a thing like
that. Could it be RAM (see ECC info below)?
Has anybody seen anything like this? Any ideas?
The machine where we run the proxy is rather old, with two 1 Gbps ports (one
going outside and one into our T2). The disk where it happens is a logical
volume composed of 4 real partitions from 4 different disks (not RAID).
Cheers,
Matevz
[1720] root@xrootd-proxy ~# dmidecode --type 16
# dmidecode 2.12
SMBIOS 2.4 present.
Handle 0x0022, DMI type 16, 15 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Single-bit ECC
Maximum Capacity: 48 GB
Error Information Handle: Not Provided
Number Of Devices: 12
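
The dmidecode output above only reports the configured ECC type, not whether
any errors were actually seen. A minimal sketch for reading error counters,
assuming a Linux kernel with an EDAC driver loaded for this memory controller
(not confirmed for this machine), which exposes `ce_count` (corrected) and
`ue_count` (uncorrected) under `/sys/devices/system/edac/mc`. The function name
is hypothetical.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    An empty dict means no EDAC memory controllers were found under
    edac_root (for example, the driver is not loaded).
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("no EDAC memory controllers found")
    for controller, entry in result.items():
        print(controller, entry)
```

Non-zero corrected counts that grow on the days the bad files were written
would point towards memory; zero counts would not rule it out, since a
two-bit pattern may sit outside what single-bit ECC reports.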