Just a bit of wider context:
Spurious xrootd-internal retries have been blamed for data loss at CERN (truncated/empty files) since at least last summer, with LHC experiments struggling to turn them off. I understand that the internal retry mechanism (i.e. drop the TCP connection, open a new connection, resend) is generic, and that "just" turning it off for writes or open+truncate only is currently not an option.
I would advocate disabling this mechanism by default for now, and perhaps re-enabling it once it can be safely limited to a subset of idempotent operations (read?).
All LHC experiments have mechanisms for higher-level retry in case "xrdcp" returns an error. What they cannot really cope with is a silently corrupted file combined with a "successful" exit code.
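
To illustrate what "limited to a subset of idempotent operations" could look like, here is a minimal sketch in plain Python. It is not xrootd code: `Op`, `IDEMPOTENT`, `TransportError` and `send_with_retry` are hypothetical names, and treating only reads as safe to resend is an assumption taken from the "(read?)" above.

```python
from enum import Enum, auto


class Op(Enum):
    """Hypothetical request kinds, for illustration only."""
    READ = auto()
    WRITE = auto()
    OPEN_TRUNCATE = auto()


# Assumption: only reads are safe to resend after a dropped connection.
IDEMPOTENT = frozenset({Op.READ})


class TransportError(Exception):
    """Hypothetical error standing in for a dropped TCP connection."""


def send_with_retry(op, send, max_retries=2):
    """Call send(); resend on TransportError only if op is idempotent.

    Non-idempotent operations get exactly one attempt, so the error
    reaches the caller instead of being hidden by a resend.
    """
    attempts = 1 + (max_retries if op in IDEMPOTENT else 0)
    for attempt in range(attempts):
        try:
            return send()
        except TransportError:
            if attempt == attempts - 1:
                raise  # out of attempts: report the failure upstream
```

With something along these lines, a failed write or open+truncate surfaces as an error that the experiments' higher-level retry can handle, rather than being resent internally.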


