Dear xrootd experts,

We have an issue that LHCb has seen at the RAL Tier-1 over the last year or so, and we are trying to understand and solve it. When an LHCb job streams data off the RAL ECHO storage, it fails every once in a while, leading to eventual job failure rates of ~20% or more, depending on the number of input files being processed. This is a particular problem for user jobs.

Following a controlled study (which we can tweak as needed; feel free to suggest changes!), we see that about 80% of the failures contain an "Operation expired" error, as below. We set XRD_LOGLEVEL=Dump to get the output below; in this particular case the input file is about 5 GB in size (root://xrootd.echo.stfc.ac.uk:1094/lhcb:prod/lhcb/LHCb/Collision15em/ALLTURBO.DST/00046157/0000/00046157_00001735_1.allturbo.dst).

Any suggestions on how to understand or fix the problem would be greatly appreciated! So far we have tried setting XRD_STREAMTIMEOUT=600, but that has not helped.

[2020-07-22 06:51:04.479831 +0000][Dump   ][TaskMgr           ] Will rerun task "FileTimer task" at [2020-07-22 06:51:19 +0000]
[2020-07-22 06:51:15.046797 +0000][Dump   ][XRootDTransport   ] [xrootd.echo.stfc.ac.uk:1094 #0.0] Stream inactive since 392 seconds, TTL: 300, allocated SIDs: 4, open files: 4
[2020-07-22 06:51:15.046842 +0000][Dump   ][XRootDTransport   ] [xrootd.echo.stfc.ac.uk:1094 #0.0] Stream inactive since 392 seconds, stream timeout: 600, allocated SIDs: 4, wait barrier: 2020-07-22 06:44:43 +0000
[2020-07-22 06:51:18.481068 +0000][Dump   ][TaskMgr           ] Running task: "TickGeneratorTask for: xrootd.echo.stfc.ac.uk:1094"
[2020-07-22 06:51:18.481119 +0000][Dump   ][XRootD            ] [xrootd.echo.stfc.ac.uk:1094] Stream event reported for msg kXR_readv (handle: 0x03000000, chunks: 1024, total size: 1709338)
[2020-07-22 06:51:18.481135 +0000][Debug  ][XRootD            ] [xrootd.echo.stfc.ac.uk:1094] Handling error while processing kXR_readv (handle: 0x03000000, chunks: 1024, total size: 1709338): [ERROR] Operation expired.
[2020-07-22 06:51:18.481145 +0000][Error  ][XRootD            ] [xrootd.echo.stfc.ac.uk:1094] Unable to get the response to request kXR_readv (handle: 0x03000000, chunks: 1024, total size: 1709338)
[2020-07-22 06:51:18.481171 +0000][Dump   ][XRootD            ] [xrootd.echo.stfc.ac.uk:1094] Stream event reported for msg kXR_readv (handle: 0x03000000, chunks: 230, total size: 394117)
[2020-07-22 06:51:18.481183 +0000][Debug  ][XRootD            ] [xrootd.echo.stfc.ac.uk:1094] Handling error while processing kXR_readv (handle: 0x03000000, chunks: 230, total size: 394117): [ERROR] Operation expired.

I also note that this issue is fairly random.
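
To check how random the failures really are across many jobs, the Dump logs could be summarised with something like the minimal sketch below. It assumes one log record per line in the `[timestamp][level][topic] message` layout seen in the excerpt above; the names `summarise`, `RECORD`, `INACTIVE`, `EXPIRED` and `SAMPLE` are hypothetical helpers for illustration and are not part of XRootD.

```python
import re

# Assumed record layout, inferred from the log excerpt: [timestamp][level][topic] message
RECORD = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<level>\w+)\s*\]\[(?P<topic>\w+)\s*\]\s*(?P<msg>.*)$"
)
# "Stream inactive since N seconds, stream timeout: T, ..."
INACTIVE = re.compile(r"Stream inactive since (\d+) seconds, stream timeout: (\d+)")
# "Handling error while processing kXR_xxx (..., chunks: C, total size: S): [ERROR] Operation expired."
EXPIRED = re.compile(
    r"Handling error while processing (kXR_\w+) "
    r"\(.*?chunks: (\d+), total size: (\d+)\).*Operation expired"
)


def summarise(lines):
    """Collect stream-inactivity reports and expired requests from log lines."""
    inactive = []  # (timestamp, idle_seconds, stream_timeout)
    expired = []   # (timestamp, request, chunks, total_size)
    for line in lines:
        record = RECORD.match(line.strip())
        if record is None:
            continue  # not a log record in the assumed layout
        ts, msg = record.group("ts"), record.group("msg")
        hit = INACTIVE.search(msg)
        if hit:
            inactive.append((ts, int(hit.group(1)), int(hit.group(2))))
        hit = EXPIRED.search(msg)
        if hit:
            expired.append((ts, hit.group(1), int(hit.group(2)), int(hit.group(3))))
    return {"inactive": inactive, "expired": expired}


# Two records taken from the excerpt above, with the mail line wrapping undone.
SAMPLE = [
    "[2020-07-22 06:51:15.046842 +0000][Dump   ][XRootDTransport   ] "
    "[xrootd.echo.stfc.ac.uk:1094 #0.0] Stream inactive since 392 seconds, "
    "stream timeout: 600, allocated SIDs: 4, wait barrier: 2020-07-22 06:44:43 +0000",
    "[2020-07-22 06:51:18.481135 +0000][Debug  ][XRootD            ] "
    "[xrootd.echo.stfc.ac.uk:1094] Handling error while processing kXR_readv "
    "(handle: 0x03000000, chunks: 1024, total size: 1709338): [ERROR] Operation expired.",
]

summary = summarise(SAMPLE)
for ts, idle, timeout in summary["inactive"]:
    print(f"{ts}: stream idle {idle}s (stream timeout {timeout}s)")
for ts, request, chunks, size in summary["expired"]:
    print(f"{ts}: {request} expired ({chunks} chunks, {size} bytes)")
```

Running this over the logs of failed and successful jobs would give, per job, the idle time reported just before each expired request, which may help show whether the expirations correlate with file, chunk count, or time of day.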

