We are observing that large transfers (>>10 GB) through an XCache are consistently failing.

The symptoms are that, somewhere in the transfer, pgread fails with a complaint about a checksum error. This triggers a retry of all pages in the block (strange, as I'd assume that only a single page would be corrupted, not all 128 each time). The retries result in either minutes-long stalls (causing a timeout in the client accessing XCache) or, because XCache reopens the file internally as part of the recovery, a failure due to an expired token. Basically, we were never able to get a 200 GB file to transfer completely (though I think it would eventually have succeeded after a few more hours of the retry loop, as it at least made it further on each attempt).
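
For illustration, here is a minimal sketch of why one would expect a single corrupted page to fail only its own checksum. It assumes 4 KiB pages, each checked with its own CRC32C, and 128 pages per block as described above. The helper names (`crc32c`, `page_checksums`, `bad_pages`) are hypothetical and this is not XRootD's implementation.

```python
PAGE_SIZE = 4096        # assumed page size in bytes
PAGES_PER_BLOCK = 128   # pages per block, as reported above


def _make_crc32c_table():
    """Build the lookup table for CRC32C (reflected Castagnoli polynomial)."""
    poly = 0x82F63B78
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    """Plain table-driven CRC32C, written out to avoid third-party packages."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def page_checksums(block: bytes) -> list:
    """One checksum per PAGE_SIZE slice of the block."""
    return [crc32c(block[i:i + PAGE_SIZE]) for i in range(0, len(block), PAGE_SIZE)]


def bad_pages(block: bytes, expected: list) -> list:
    """Indices of the pages whose checksum differs from the expected value."""
    got = page_checksums(block)
    return [i for i, (g, e) in enumerate(zip(got, expected)) if g != e]


# Build one synthetic block, then flip a single bit inside page 37.
block = bytes(range(256)) * (PAGES_PER_BLOCK * PAGE_SIZE // 256)
expected = page_checksums(block)
corrupted = bytearray(block)
corrupted[37 * PAGE_SIZE + 5] ^= 0x01

print(bad_pages(bytes(corrupted), expected))  # [37]
```

Under these assumptions only page 37 needs to be fetched again, which is why a retry of all 128 pages looks surprising.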

This changed, however, when I switched the origin to require TLS. Instead of a corrupted page + recovery, every few minutes I'd get a disconnect along the lines of:

[2023-02-02 19:20:07.214508 +0000][Error  ][PostMaster        ] [[log in to unmask]:1095] Forcing error on disconnect: [ERROR] Operation interrupted.

(The underlying error is unclear to me here.) XCache recovered cleanly from this error and I've not had any failures since enabling TLS (compared to no successes without TLS).

At this point, I said "ah-ha! I have discovered a problematic network and we are seeing TCP packet corruptions!"

Unfortunately, this also doesn't appear to be true. The TCP rates are reasonable (60 MB/s) for a transfer going over 2,000 miles, suggesting the TCP corruption rate can't be that bad. Further, none of the corruptions seem to occur when I use read instead of pgread or if I do an HTTPS-based transfer. It seems unlikely that there's a TCP issue only when pgread is used!
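
As a rough back-of-envelope check on the scale involved, the sketch below works out how long a clean transfer would take and how many checksummed pages it covers. It assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes) and 4 KiB pages grouped 128 to a block; those unit and page-size choices are assumptions, not measurements.

```python
FILE_BYTES = 200 * 10**9      # ~200 GB file, decimal units assumed
RATE_BYTES_PER_S = 60 * 10**6 # observed ~60 MB/s
PAGE_SIZE = 4096              # assumed page size in bytes
PAGES_PER_BLOCK = 128         # pages retried together, as reported above

# Time for one uninterrupted pass over the file at the observed rate.
transfer_seconds = FILE_BYTES / RATE_BYTES_PER_S

# Ceiling divisions: number of pages and blocks needed to cover the file.
pages = -(-FILE_BYTES // PAGE_SIZE)
blocks = -(-pages // PAGES_PER_BLOCK)

print(f"clean transfer time: {transfer_seconds / 60:.1f} minutes")
print(f"pages to verify:     {pages:,}")
print(f"blocks:              {blocks:,}")
```

With these numbers a clean pass takes a little under an hour and covers tens of millions of pages, so even a very rare per-page failure would be enough to hit every attempt at a file of this size.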

This is with XRootD 5.5.1 on both the origin and cache.

I'm stumped on what could be going on here and would appreciate it if someone else could try a similar setup to see if it trivially reproduces. I think I've got a reasonably simple configuration; the only somewhat unique thing here is the size of the file (~200 GB).


