Hi Alvise,

> suppose that I do a brutal ethernet cable unplug during a data recv at
> the client side (this can simulate a serious kernel crash in which the
> TCP stack "disappears").
> What I see is that the xrootd server doesn't realize the client
> disconnection...

Indeed, the client didn't disconnect at all. This varies by operating
system. However, in general, the server side rarely recognizes that a
connection just "dropped" unless there is actual activity on the
connection. You can try forcing this by specifying the "keepalive"
option in the "xrd.network" directive. However, it may still take a
couple of hours before the connection is actually closed (again,
determined by the implementation in the kernel).

> But as we
> know the architecture defines a fault tolerance even for socket
> read/write timeouts generated by serious cataclysm like this. And my
> client does exactly that closing the physical connection that timed out
> and creating a new one. When I plug back the ethernet cable in the
> computer it seems that xrootd doesn't detect, for a while, that the old
> physical connection is actually closed (it seems that the TCP closure
> handshake do not occur anymore...), while the new physical connection
> successfully connects to xrootd.

Quite correct, that's part of the socket specification. The only way
the old connection will be automatically closed is if you manage to use
the *same* source port number. The TCP specification clearly states
that the destination side must close the "old" connection when this
happens. The circumstances are pretty rare in practice.

>
> Then when the client tries to re-open the file in "UPDATE" mode it
> receives a "kXR_FileLocked" error. It is right and expected to me,

Yes, this is why there is a "force" option on the open to tell xrootd to
ignore the lock.
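
On the keepalive point above, here is a minimal sketch of what keepalive
means at the socket level, written in Python rather than taken from the
xrootd code. The helper name enable_keepalive and the 60/10/5 values are
illustrative assumptions. Whether the "keepalive" option of xrd.network
maps onto SO_KEEPALIVE in exactly this way is also an assumption. The
"couple of hours" delay corresponds to the usual kernel default (7200
seconds of idle time on Linux) before the first probe is sent.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive probes for an existing socket (illustrative helper)."""
    # SO_KEEPALIVE is the portable switch; with only this set, the
    # kernel-wide defaults decide how long a dead peer goes unnoticed.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket tuning is platform-specific (these are the Linux names).
    # Options the platform does not expose are skipped.
    tuning = (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", count),       # unanswered probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

Each side has to enable this on its own socket. With the values shown, a
dead peer would be noticed after roughly idle + interval * count seconds
(about 110) on platforms that honour the per-socket settings.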

> Then I did think that I could resolve this by sending an explicit close
> command (kXR_close); but xrootd refuses to execute the command saying:
> "close does not refer to an open file" and I'm sure that command is
> trying to close the right filehandle (I made many cross-check with the
> client and server log files). Please read the log in the following:

Doesn't matter. xrootd assigns file handles by socket number. So, one
socket can't "steal" a file handle from another socket.

> Now I think this is not a bug in the code of course, it is something
> related to the architecture and I would like to hear some comment from
> you...

It is architecture, all right; but the architecture is determined by
the TCP specification and the socket implementation in the kernel.
There is really very little I can do about that. One could devise
special circumstances where you could manually check if the connection
closed, but right now there is no such check. We never put a reverse
"ping" in the protocol. Perhaps we should, to avoid these kinds of
bizarre end conditions.

> [ I could do a workaround by remembering old physical connections that
> timed out and retry to close them before starting any other
> communication but after a new physical tcp channel successfully
> connected to xrootd (i.e. after I'm sure the ethernet link is UP); but
> it sounds too much tricky to me... ]

Actually, sounds quite impossible in most circumstances. However, you
should *always* close the previous connection that timed out. You can
do that at the time the timeout occurs. Not that it would change a lot,
because the kernel still won't be able to send the "synclose" request
to the server. But it's cleaner that way from the client's side.

Again, I think you should try specifying "keepalive" on the xrd.network
directive.

Andy
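
A minimal sketch of closing a connection at the moment the timeout
occurs, using plain Python sockets rather than the xrootd client. The
function name recv_or_close and the 5-second default are hypothetical,
chosen only for illustration.

```python
import socket


def recv_or_close(sock, nbytes, timeout=5.0):
    """Read up to nbytes; on a read timeout, close the socket and return None."""
    sock.settimeout(timeout)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # Closing here frees the local descriptor immediately. If the link
        # is down, the close notification cannot reach the server, so the
        # server side of the old connection stays open for the time being.
        sock.close()
        return None
```

A caller that gets None back would then open a fresh connection instead
of reusing the old socket.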