Dear @bbockelm and @wyang007,

Many thanks for your responses. To start with, I am the RAL contact looking at this, with support from the ECHO team. My expertise is mainly in LHCb software, but I will be happy to feed suggestions back or bring the relevant RAL administrators into this discussion as needed. Unfortunately there is not much xrootd expertise within RAL, so we need your help in solving this issue!

To answer @wyang007: my jobs are running within RAL on dedicated worker nodes which replicate the production setup, but these machines run only test jobs, so we can tweak the settings here to see whether the issue gets fixed. The cache is a few hundred GB in size. We have this test system because LHCb have observed this issue on a large scale and we are trying to solve it.

@bbockelm For my understanding: why doesn't xrootd reopen the file? We see quite a few "socket errors" in general, but these are never (apparently) fatal: the socket closes and then reopens. What happens in the cases where this process terminates instead? For example, we see:

[2020-07-22 20:42:20.822408 +0000][Error  ][AsyncSock         ] [xrootd.echo.stfc.ac.uk:1094 #0.0] Socket error while handshaking: [ERROR] Socket timeout
[2020-07-22 20:42:20.822418 +0000][Debug  ][AsyncSock         ] [xrootd.echo.stfc.ac.uk:1094 #0.0] Closing the socket
[2020-07-22 20:42:20.822431 +0000][Debug  ][Poller            ] <[::ffff:172.28.5.33]:60726><--><[::ffff:172.28.1.1]:1094> Removing socket from the poller
[2020-07-22 20:42:20.822510 +0000][Error  ][PostMaster        ] [xrootd.echo.stfc.ac.uk:1094 #0] elapsed = 108, pConnectionWindow = 120 seconds.
[2020-07-22 20:42:20.822532 +0000][Info   ][PostMaster        ] [xrootd.echo.stfc.ac.uk:1094 #0] Attempting reconnection in 12 seconds.
[2020-07-22 20:42:20.822546 +0000][Debug  ][TaskMgr           ] Registering task: "StreamConnectorTask for xrootd.echo.stfc.ac.uk:1094 #0" to be run at: [2020-07-22 20:42:32 +0000]
[2020-07-22 20:42:26.822614 +0000][Dump   ][TaskMgr           ] Running task: "FileTimer task"
[2020-07-22 20:42:26.822672 +0000][Dump   ][File              ] [0x2ce62820@root://xrootd.echo.stfc.ac.uk:1094/lhcb:prod/lhcb/LHCb/Collision16/FULLTURBO.DST/00052099/0000/00052099_00001653_2.fullturbo.dst] Got a timer event
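
On our side, if the problem is simply that the reconnection does not get enough time, one thing we could try is lengthening the connection window and retry count that show up in the log above. Below is a minimal sketch, assuming the standard XrdCl client environment variables are the right knobs here; the values are illustrative only, not a recommendation:

import os

# XrdCl picks these up when the client environment is initialised,
# so they must be set before importing the XRootD module.
os.environ["XRD_CONNECTIONWINDOW"] = "240"  # seconds per connection attempt; the log shows the default of 120
os.environ["XRD_CONNECTIONRETRY"] = "10"    # number of connection attempts before giving up

from XRootD import client
from XRootD.client.flags import OpenFlags

f = client.File()
status, _ = f.open(
    "root://xrootd.echo.stfc.ac.uk:1094/lhcb:prod/lhcb/LHCb/Collision16/"
    "FULLTURBO.DST/00052099/0000/00052099_00001653_2.fullturbo.dst",
    OpenFlags.READ,
)
print(status.ok, status.message)
f.close()

Would raising these be expected to help, or has the client already given up on the file by the point shown in the log?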

The file reads are in general not random, I suppose: in normal operations we would read from the beginning to the end of the file, looping over the events, and my test jobs do exactly that.
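
For concreteness, my test jobs do something equivalent to the sketch below (using the XRootD Python bindings; the 1 MB chunk size is arbitrary and purely for illustration), scanning the file strictly front to back:

from XRootD import client
from XRootD.client.flags import OpenFlags

URL = ("root://xrootd.echo.stfc.ac.uk:1094/lhcb:prod/lhcb/LHCb/Collision16/"
       "FULLTURBO.DST/00052099/0000/00052099_00001653_2.fullturbo.dst")
CHUNK = 1024 * 1024  # 1 MB per read, purely illustrative

f = client.File()
status, _ = f.open(URL, OpenFlags.READ)
assert status.ok, status.message

status, info = f.stat()
assert status.ok, status.message

# Strictly sequential scan: consecutive reads from the start to the end
# of the file, mimicking an event loop that never seeks backwards.
offset = 0
while offset < info.size:
    status, data = f.read(offset, min(CHUNK, info.size - offset))
    assert status.ok, status.message
    offset += len(data)

f.close()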

I am not sure I understand why streaming should essentially copy the whole file locally in one go. Doesn't this negate the whole point of streaming? Apologies if this has been discussed before, but I am not fully aware of the detailed rationale behind the design choices, except in a general sense.

It would be ideal if this problem could be made to "go away" by fine-tuning the caching layer. For this, it would be nice to have a feel for which variables affect the caching layer; it would be great if you could point us in that direction. Of course we will be happy to consider other solutions if that is the best way forward. It is clear, however, that the current situation cannot remain the status quo.
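
For example, if the relevant layer on the gateways is the standard XrdPfc proxy file cache (an assumption on my part; the ECHO caching may well be something else entirely), I would guess the knobs are configuration directives along the following lines, with purely illustrative values:

pfc.blocksize 1M          # size of the blocks fetched from the origin and cached
pfc.ram 4g                # RAM the cache may use for in-flight blocks
pfc.prefetch 10           # maximum number of prefetched blocks per file; 0 disables prefetching
pfc.diskusage 0.90 0.95   # low/high watermarks for purging the cache disk

If this is the wrong layer altogether, a pointer to the right set of parameters would be much appreciated.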

Many thanks and Cheers,
Raja.

