Print

Print


Trying to summarise some (perhaps) relevant extra information:

Over several hours before the situation where the server was dropping connections, there were a number of log entries mentioning 'bad file descriptor', such as:

e.g. grepping for ":1387@" gave this sequence:

[...]
230216 22:36:41 time=1676583401.392132 func=_close                   level=INFO  logid=unknown [log in to unmask]:1095 tid=00007f14d91fe700 source=XrdFstOfsFile:2251             tident=daemon.75437:3834@st-096-gg500b5c sec=      uid=1 gid=1 name=nobody geo="" msg="done close" rc=0 errc=0
230216 22:36:41 282023 XrootdXeq: daemon.75437:3834@st-096-gg500b5c disc 0:00:02
230216 22:36:48 25875 XrootdXeq: 3.22020:3834@p06636710m46495 pub IP64 login as daemon
230216 22:36:48 25875 XrootdXeq: 3.22020:3834@p06636710m46495 disc 0:00:07
230216 22:36:48 178174 XrdLink: Unable to close 3.22020:3834@p06636710m46495; bad file descriptor
230216 22:36:58 292432 XrootdXeq: daemon.189977:3834@st-096-ff809e4a pub IP64 login as daemon
[...]

The first line is EOS specific and relates to closing a file, next line the close of connection by xrootd. Then a new connection on fd 3834 is logged (22:36:48), which is then closed very shortly afterwards (also logged at 22:36:48). The connection time indicates that connection started 7 seconds before; it is possible the machine had some loading or other issue that delayed everything for a few seconds at this point (indeed this may be a contributing factor). However a short time later, still at 22:36:48 there is a close error logged from a different thread than the "disc" message. The final line is a new connection on this descriptor 10 seconds later, which is unremarkable.

About 24 hours after the issue there are examples of a few log messages such as:

230218 02:25:22 57131 XrdLinkScan: Link cmsplt01.15591:2640@b7g29p5493 is disabled and idle.

(however this message is not unique to this period).

My interpretation is that there is a the possibility of two XrdLinkXeq::Close on a link in succession. The first call of Close() calls Recycle() the Protocol (generating the 'disc' log line) and then closes the network FD (without message). A second XrdLinkXeq::Close() would then not Recycle() the Protocol (as it has been set to 0), but the close(fd) returns as error (which is logged as the "bad file descriptor" line).

Such a possibility seems to exist (and I have reproduced approximately this by adding a couple of sleep() calls), e.g. in the following situation:

(1) A connection is Accepted() in "thread 1", assigned to a Link and the Link successfully matched to a protocol in XrdProtLoad::Process, consuming 20 bytes from the connection. The link has Activate() called successfully, (i.e. the FD is added to a poller's epoll set, but with (0) for the selected events).
(2) The connection is closed (e.g. client process is terminated).
(3) Execution of "thread 1" continues at the bottom of XrdProtLoad::Process, where it directly calls the protocol's Process() method which returns -1 (since it read the end of stream during Recv), and execution returns to XrdLinkXeq::DoIt(). DoIt() calls XrdLinkXeq::Close().
(4) The poller's poll loop does an epoll_wait (lets say in "thread 2"). The connection FD is already in the poll set, with 0 events selected. But the epoll can still return error events EPOLLERR and EPOLLHUP. It returns one (EPOLLHUP in this case, I believe).
(5) "Thread 1" continues in Close(). It calls Recycle() on Protocol; detaches the poller (removes FD from poll set), sets PollInfo.Poller =0, PollInfo.FD=-1, and the Link is Unhook() and available to be Allocated again; and closes the (fd).
(6) In "Thread 2" resumes in XrdPollE::Start() and evaluates the condition if (!(pInfo->isEnabled) && pInfo->FD >= 0), which is false since pInfo->FD = -1. The false branch causes it to Schedule the Link.
(7) A "thread 3" then executes XrdLinkXeq::DoIt() of the Link. Protocol is zero, and rc is evaluated uninitialised, but unless it happens to equal -EINPROGRESS, Close() will be executed again.

The idea is that between (5)&(6) a new conneciton can come in, be assigned to FD and assuming that its Link is assigned to a different poller it can go on and could even become fully established and idle. The step (7) would then close its FD, but leaving the Link busy (i.e. LinkBat[fd] = XRDLINK_USED, leading to situation described in this issue). The Link may be cleared up by the idleScan if configured. (Indeed on the EOS node it is set with 'xrd.timeout idle 86400').


Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.Message ID: <xrootd/xrootd/issues/1928/1440284237@github.com>

[ { "@context": "http://schema.org", "@type": "EmailMessage", "potentialAction": { "@type": "ViewAction", "target": "https://github.com/xrootd/xrootd/issues/1928#issuecomment-1440284237", "url": "https://github.com/xrootd/xrootd/issues/1928#issuecomment-1440284237", "name": "View Issue" }, "description": "View this Issue on GitHub", "publisher": { "@type": "Organization", "name": "GitHub", "url": "https://github.com" } } ]

Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-DEV list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1