Hi,

This does not seem to be an xrootd-related problem (other than that we observe it
on a caching proxy), but we figured people here might know how to go about
figuring out what's happening.

We just noticed that checksums on some files in the healing caching proxy don't
match up. Looking at the bit level, it turns out it is always bit 2 (2^2 = 4)
that is flipped, twice, in opposite directions, at offsets 16 bytes apart.
Strange, right? It also seems to bypass the TCP/IP checksum protection.
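
A minimal sketch of why this pattern could slip past the checksum, assuming both flips land in the same TCP segment: the Internet checksum (RFC 1071) is a 16-bit ones' complement sum, so a 0->1 flip and a 1->0 flip of the same bit at offsets 16 bytes apart (same position inside a 16-bit word) cancel out in the sum. The buffer contents and the helper names `bit_flips` and `internet_checksum` are made up for illustration; they are not taken from xrootd.

```python
def bit_flips(good: bytes, bad: bytes):
    """Return (offset, bit, direction) for every bit that differs."""
    flips = []
    for offset, (g, b) in enumerate(zip(good, bad)):
        diff = g ^ b
        for bit in range(8):
            mask = 1 << bit
            if diff & mask:
                # Direction is described from the good copy to the bad copy.
                direction = "0->1" if b & mask else "1->0"
                flips.append((offset, bit, direction))
    return flips


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Hypothetical 64-byte block: bit 2 is clear at offset 8 and set at offset 24.
good = bytearray(64)
good[24] = 0x04

# Corrupt it the way we observe: bit 2 flipped at two offsets 16 bytes apart.
bad = bytearray(good)
bad[8] ^= 0x04   # 0 -> 1
bad[24] ^= 0x04  # 1 -> 0

flips = bit_flips(bytes(good), bytes(bad))
same_checksum = internet_checksum(bytes(good)) == internet_checksum(bytes(bad))
print(flips, same_checksum)
```

Running `bit_flips` on a good and a bad copy of a real file is also a quick way to check whether every corruption really follows the same bit-2, 16-byte pattern.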

To us it seems like a hardware error ... but we are unsure how to pin it down.
The errors are time-correlated, i.e., most of the errors occur on a few
selected days. This could also mean that jobs running at UCSD on those days
were asking for data from one dataset and we always got redirected to the same
site -- we will cross-check this. My gut feeling is that it must be the network
... but there really is no basis for this, as I've never had to deal with
anything like that. Could it be RAM (see ECC info below)?

Has anybody seen anything like this? Any ideas?

The machine where we run the proxy is rather oldish, with two 1 Gbps ports (one
going outside and one into our T2). The disk where it happens is a logical
volume composed of 4 real partitions from 4 different disks (not RAID).

Cheers,
Matevz


[1720] root@xrootd-proxy ~# dmidecode --type 16
# dmidecode 2.12
SMBIOS 2.4 present.

Handle 0x0022, DMI type 16, 15 bytes
Physical Memory Array
         Location: System Board Or Motherboard
         Use: System Memory
         Error Correction Type: Single-bit ECC
         Maximum Capacity: 48 GB
         Error Information Handle: Not Provided
         Number Of Devices: 12
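
One way to follow up on the RAM question, sketched under the assumption that the kernel has an EDAC driver loaded for this memory controller: Linux then exposes corrected and uncorrected error counters as `ce_count` and `ue_count` under `/sys/devices/system/edac/mc/mc*/`. The function name `read_edac_counts` is made up for this example, and on a machine without EDAC support it simply returns an empty dict.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Collect corrected (ce) and uncorrected (ue) error counts per controller."""
    counts = {}
    for mc in sorted(Path(base).glob("mc*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        if entry:
            counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for controller, entry in read_edac_counts().items():
        print(controller, entry)
```

Nonzero or growing counts would point towards memory; all zeros would not rule it out, since the two flips we see are 16 bytes apart and need not pass through the same ECC word.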
