Hi,

While testing an xcache (5.1.0) deployment, a cache client received corrupted data during a read. Investigation showed that the corresponding (partially) cached file on the xcache had a corrupted cache block (512KiB). The corruption had a particular form: where the cached block should have consisted of gooddata[0:524288], it actually contained gooddata[4096:524288] | gooddata[520192:524288], that is, the content was shifted forward by one 4096-byte page and the final page was repeated. The deployment involved an upstream source (also 5.1.0) with a custom XrdOss, and the xcache was running a development build of XrdOssCsi, so it was not clear where the fault lay. The rate of corruption was presumably low: a previous installation with 5.0.3 had shown no corruption with over 1PB of data through the cache, while the 5.1.0 install showed it after about 150TB.
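To make the reported pattern concrete, here is a minimal Python sketch that builds a block with the same shape of corruption and checks whether an observed block matches it. The function names are hypothetical and only illustrate the slicing described above; they are not part of xcache or XrdPfc.

```python
PAGE = 4096          # page size implied by the offsets in the report
BLOCK = 512 * 1024   # cache block size (512KiB = 524288 bytes)

def corrupt_like_report(good: bytes) -> bytes:
    """Return good[4096:524288] | good[520192:524288], as in the report."""
    if len(good) != BLOCK:
        raise ValueError("expected a full 512KiB block")
    # Pages 1..127 moved up by one page, then page 127 repeated at the end.
    return good[PAGE:BLOCK] + good[BLOCK - PAGE:BLOCK]

def matches_reported_pattern(observed: bytes, good: bytes) -> bool:
    """True if `observed` is exactly the shifted-by-one-page form of `good`."""
    return len(observed) == BLOCK and observed == corrupt_like_report(good)

# Usage: fill every page with its own page index so the shift is visible.
good = b"".join(bytes([i]) * PAGE for i in range(BLOCK // PAGE))
bad = corrupt_like_report(good)
print(bad[0], bad[-1], matches_reported_pattern(bad, good))
```

Comparing a suspect cache block against the origin copy this way would distinguish this specific one-page shift from other kinds of damage.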

To simplify the setup, XrdOssCsi was removed from the xcache and "XRD_LOGLEVEL=Info" was added.
Subsequently, the xcache frequently (more than once per minute, with a file read rate of ~2 per second at ~2.5GB/s) logged Info entries from XrdCl of the form:

[2021-03-04 12:20:42.948488 +0100][Info   ][File              ] [0xabfc44e0@root:[log in to unmask]:1094//mockdata/EVNT.13260262._001673.pool.root.1_199709383_20?xrdcl.requuid=0fb317da-4cc8-4854-8908-375b552d24dd] Received corrupted page, will retry page #0.
[...]
[2021-03-04 12:20:42.950979 +0100][Info   ][File              ] [0xabfc44e0@root:[log in to unmask]:1094//mockdata/EVNT.13260262._001673.pool.root.1_199709383_20?xrdcl.requuid=0fb317da-4cc8-4854-8908-375b552d24dd] Received corrupted page, will retry page #127.
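For anyone wanting to quantify these notices in their own logs, a rough Python sketch along the following lines could tally retries per file context. The regular expression is derived only from the two sample lines above, so the exact field layout is an assumption, and the function names are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the sample lines: [timestamp][level][topic] [context] message
RETRY_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]"
    r"\[(?P<level>\w+)\s*\]"
    r"\[(?P<topic>\w+)\s*\] "
    r"\[(?P<ctx>.*)\] "
    r"Received corrupted page, will retry page #(?P<page>\d+)\.\s*$"
)

def parse_retry(line):
    """Return (timestamp, context, page number) or None if the line does not match."""
    m = RETRY_RE.match(line)
    if m is None:
        return None
    return m.group("ts"), m.group("ctx"), int(m.group("page"))

def retries_per_file(lines):
    """Count 'Received corrupted page' notices per file context string."""
    counts = Counter()
    for line in lines:
        hit = parse_retry(line)
        if hit is not None:
            counts[hit[1]] += 1
    return counts

# Usage with the two lines quoted in this message.
SAMPLE = [
    "[2021-03-04 12:20:42.948488 +0100][Info   ][File              ] [0xabfc44e0@root:[log in to unmask]:1094//mockdata/EVNT.13260262._001673.pool.root.1_199709383_20?xrdcl.requuid=0fb317da-4cc8-4854-8908-375b552d24dd] Received corrupted page, will retry page #0.",
    "[2021-03-04 12:20:42.950979 +0100][Info   ][File              ] [0xabfc44e0@root:[log in to unmask]:1094//mockdata/EVNT.13260262._001673.pool.root.1_199709383_20?xrdcl.requuid=0fb317da-4cc8-4854-8908-375b552d24dd] Received corrupted page, will retry page #127.",
]
print(retries_per_file(SAMPLE).most_common(1))
```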

However, the data delivered to XrdPfc were correct (I have not reproduced the original corruption pattern). Some investigation pointed to a desynchronisation within XrdCl while reading a kXR_pgread response from the origin server: expecting ... the processing was sometimes seen to treat part of the data page as the crc, etc.
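The effect of such a desynchronisation can be illustrated with a toy model. The sketch below assumes a response body in which each page of up to 4096 bytes is preceded by a 4-byte CRC32C in network byte order; that layout is my assumption for illustration, not a statement of the exact wire format, and none of the names below come from XrdCl. It shows that a reader which starts a few bytes late interprets page data as the crc and flags every page as corrupted.

```python
import struct

PAGE = 4096

def _make_table():
    # Table for CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
    table = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ 0x82F63B78 if c & 1 else c >> 1
        table.append(c)
    return table

_TABLE = _make_table()

def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

def encode(pages):
    """Toy stream: crc | page | crc | page ... (assumed layout)."""
    return b"".join(struct.pack(">I", crc32c(p)) + p for p in pages)

def bad_pages(stream: bytes, skew: int = 0):
    """Parse crc|page units starting `skew` bytes late; return failing page indices."""
    bad, pos, idx = [], skew, 0
    while pos + 4 <= len(stream):
        (crc,) = struct.unpack_from(">I", stream, pos)
        page = stream[pos + 4 : pos + 4 + PAGE]
        if crc32c(page) != crc:
            bad.append(idx)
        pos += 4 + PAGE
        idx += 1
    return bad

# Usage: an aligned reader sees no errors; a reader 4 bytes out of step sees all pages fail.
stream = encode([bytes([i]) * PAGE for i in range(4)])
print(bad_pages(stream, skew=0), bad_pages(stream, skew=4))
```

In this model the underlying data are intact in both cases; only the reader's framing is wrong, which is consistent with retries succeeding and correct data reaching XrdPfc.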

A possible fix is PR #1419. With it, the "Received corrupted page" Info notices are no longer seen (at least during the period inspected). I suppose this does not definitively show that the original corruption was due to this, but I think it is explicable by the desynchronisation, assuming specific circumstances etc.



View this issue on GitHub: https://github.com/xrootd/xrootd/issues/1420
