LISTSERV 16.5 - XROOTD-L Archives

Hi Alvise,

> suppose that I do a brutal  ethernet cable unplug during a data recv at
> the client side (this can simulate a serious kernel crash in which the
> TCP stack "disappears").
> What I see is that the xrootd server doesn't realize the client
> disconnection... Indeed, the client didn't disconnect at all.
This varies by operating system. In general, however, the server side rarely
recognizes that a connection has just "dropped" unless there is actual activity
on the connection. You can try forcing this by specifying the "keepalive"
option in the "xrd.network" directive. However, it may still take a couple
of hours before the connection is actually closed (again, determined by the
implementation in the kernel).
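
To illustrate what keepalive means at the socket level, here is a minimal
Python sketch. It is not xrootd code; the helper name enable_keepalive and the
timer values are made up for illustration. SO_KEEPALIVE only turns probing on.
The timers are kernel defaults unless tuned per socket (on Linux the default
idle time is commonly 7200 seconds, which is where "a couple of hours" comes
from), and the per-socket tuning options are platform specific.

```python
import socket

def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive probes for sock; tune timers where the OS allows."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The timer options below use the Linux names and are platform specific,
    # so each one is applied only if this Python build exposes it.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    # Non-zero means the kernel will probe an idle peer on this socket.
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

With probing enabled, a peer that vanished without sending anything is
eventually detected even when the application itself is idle on the socket.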

> But as we
> know the architecture defines a fault tolerance even for socket
> read/write timeouts generated by serious cataclysm like this. And my
> client does exactly that closing the physical connection that timed out
> and creating a new one. When I plug back the ethernet cable in the
> computer it seems that xrootd doesn't detect, for a while, that the old
> physical connection is actually closed (it seems that the TCP closure
> handshake do not occur anymore...), while the new physical connection
> successfully connects to xrootd.
Quite correct, that's part of the socket specification. The only way the old
connection will be automatically closed is if you manage to use the *same*
source port number. The TCP specification clearly states that the
destination side must close the "old" connection when this happens. The
circumstances are pretty rare in practice.

>
> Then when the client tries to re-open the file in "UPDATE" mode it
> receives a "kXR_FileLocked" error. It is right and expected to me,
Yes, this is why there is a "force" option on the open to tell xrootd to
ignore the lock.

> Then I did think that I could resolve this by sending an explicit close
> command (kXR_close); but xrootd refuses to execute the command saying:
> "close does not refer to an open file" and I'm sure that command is
> trying to close the right filehandle (I made many cross-check with the
> client and server log files). Please read the log in the following:
Doesn't matter. xrootd assigns file handles by socket number. So, one socket
can't "steal" a file handle from another socket.
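
A toy Python model of that behaviour, as a sketch only: ToyServer and its
methods are hypothetical and are not xrootd's implementation. It assumes
nothing more than what is described here, namely that handles are looked up
per connection, that a file opened for update is locked, and that "force"
tells the server to ignore the lock.

```python
class ToyServer:
    """Hypothetical model: every connection owns its own file-handle table."""

    def __init__(self):
        self.handles = {}  # connection id -> {file handle -> path}
        self.locked = {}   # path -> connection id that opened it for update

    def open_update(self, conn, path, force=False):
        owner = self.locked.get(path)
        if owner is not None and owner != conn and not force:
            return "kXR_FileLocked"
        table = self.handles.setdefault(conn, {})
        fh = max(table, default=-1) + 1  # next free handle on this connection
        table[fh] = path
        self.locked[path] = conn
        return fh

    def close(self, conn, fh):
        # The handle is looked up only in the caller's own table.
        table = self.handles.get(conn, {})
        if fh not in table:
            return "close does not refer to an open file"
        path = table.pop(fh)
        if self.locked.get(path) == conn:
            del self.locked[path]
        return "ok"

if __name__ == "__main__":
    srv = ToyServer()
    fh = srv.open_update("old-socket", "/data/f")
    print(srv.open_update("new-socket", "/data/f"))              # lock held
    print(srv.close("new-socket", fh))                           # wrong table
    print(srv.open_update("new-socket", "/data/f", force=True))  # lock ignored
```

In this model the second connection can never close the first connection's
handle, whatever handle number it sends; only a forced open gets it past the
stale lock.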

> Now I think this is not a bug in the code of course, it is something
> related to the architecture and I would like to hear some comment from
> you...
It is architecture, all right; but the architecture is determined by the TCP
specification and the socket implementation in the kernel. There is really
very little I can do about that. One could devise special circumstances
where you could manually check whether the connection closed, but right now
there is no such check. We never put a reverse "ping" in the protocol. Perhaps
we should, to avoid these kinds of bizarre end conditions.

> [ I could do a workaround by remembering old physical connections that
> timed out and retry to close them before starting any other
> communication but after a new physical tcp channel successfully
> connected to xrootd (i.e. after I'm sure the ethernet link is UP); but
> it sounds too much tricky to me... ]
Actually, sounds quite impossible in most circumstances. However, you should
*always* close the previous connection that timed out. You can do that at
the time the timeout occurs. Not that it would change a lot, because the
kernel still won't be able to send the "synclose" request to the server. But
it's cleaner that way from the client's side.
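
As a rough client-side sketch in Python (the helper name recv_or_drop is made
up for illustration, and this is not the xrootd client code), closing at the
moment the read times out could look like this:

```python
import socket

def recv_or_drop(sock, nbytes, timeout_s):
    """Return received bytes, or None after closing sock on a read timeout."""
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # Release the local descriptor right away. If the link is down the
        # peer may never hear about it, but the client side stays clean.
        sock.close()
        return None

if __name__ == "__main__":
    a, b = socket.socketpair()
    # b never sends anything, so the read on a times out and a gets closed.
    print(recv_or_drop(a, 16, 0.1))
    b.close()
```

The caller treats None as "this connection is gone" and opens a new one.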

Again I think you should try specifying "keepalive" on the xrd.network
directive.

Andy