LISTSERV 16.5 - XROOTD-L Archives

Hi Andy;
It seems that opening with the FORCE flag on is the only, and the best,
solution. The 'keepalive' parameter for xrd.network does not solve the problem.

thanks

    Alvise

Andrew Hanushevsky wrote:

>Hi Alvise,
>
>  
>
>>Suppose that I do a brutal ethernet cable unplug during a data recv at
>>the client side (this can simulate a serious kernel crash in which the
>>TCP stack "disappears").
>>What I see is that the xrootd server doesn't realize the client
>>disconnection... Indeed, the client didn't disconnect at all.
>>    
>>
>This varies by operating system. However, in general, the server-side rarely
>recognizes that a connection just "dropped" unless there is actual activity
>on the connection. You can try forcing this by specifying the "keepalive"
>option in the "xrd.network" directive. However, it may still take a couple
>of hours before the connection is actually closed (again, determined by the
>implementation in the kernel).
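As a rough illustration of what "keepalive" means at the socket level, here is a minimal Python sketch. It is not xrootd code; the helper name enable_keepalive and its default timings are made up for the example. SO_KEEPALIVE only asks the kernel to probe an idle connection. When the first probe is sent is a kernel setting (on Linux the default idle time is 7200 seconds, which matches the "couple of hours" above), and the per-socket tuning options are platform-specific, so the sketch applies them only where they exist.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive and tune probe timing where the platform allows.

    Hypothetical helper for illustration; the default values are arbitrary.
    """
    # Ask the kernel to probe the peer when the connection is idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # These tuning constants are platform-specific, so each one is guarded.
    tuning = (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # failed probes before the kernel gives up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

Without per-socket tuning, detection of a dead peer depends entirely on the system-wide kernel defaults.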
>
>>But as we know, the architecture defines a fault tolerance even for socket
>>read/write timeouts generated by a serious cataclysm like this. And my
>>client does exactly that, closing the physical connection that timed out
>>and creating a new one. When I plug the ethernet cable back into the
>>computer it seems that xrootd doesn't detect, for a while, that the old
>>physical connection is actually closed (it seems that the TCP closure
>>handshake does not occur anymore...), while the new physical connection
>>successfully connects to xrootd.
>>    
>>
>Quite correct, that's part of the socket specification. The only way the old
>connection will be automatically closed is if you managed to use the *same*
>source port number. The TCP specification clearly states that the
>destination side must close the "old" connection when this happens. The
>circumstances are pretty rare in practice.
>
>  
>
>>Then when the client tries to re-open the file in "UPDATE" mode it
>>receives a "kXR_FileLocked" error. It is right and expected to me,
>>    
>>
>Yes, this is why there is a "force" option on the open to tell xrootd to
>ignore the lock.
>
>  
>
>>Then I did think that I could resolve this by sending an explicit close
>>command (kXR_close); but xrootd refuses to execute the command saying:
>>"close does not refer to an open file" and I'm sure that command is
>>trying to close the right filehandle (I made many cross-checks with the
>>client and server log files). Please read the log in the following:
>>    
>>
>Doesn't matter. xrootd assigns file handles by socket number. So, one socket
>can't "steal" a file handle from another socket.
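The behaviour described above can be pictured with a toy model. This is a sketch under assumptions, not xrootd's implementation: the ToyServer class, its methods and the connection names are all hypothetical, and only the two error strings are taken from this thread. The model keys each open file by (connection, handle), so a handle number from the old connection means nothing on the new one, and only an open with force set gets past the lock held by the stale connection.

```python
from collections import defaultdict
from itertools import count


class ToyServer:
    """Hypothetical model: file handles are scoped to the connection that opened them."""

    def __init__(self):
        self.open_files = {}                 # (conn_id, handle) -> path
        self.counters = defaultdict(count)   # conn_id -> next handle number

    def open_update(self, conn_id, path, force=False):
        """Return a new handle (int), or an error string if the file is locked."""
        locked = path in self.open_files.values()
        if locked and not force:
            return "kXR_FileLocked"
        handle = next(self.counters[conn_id])
        self.open_files[(conn_id, handle)] = path
        return handle

    def close(self, conn_id, handle):
        """Close a handle; it must belong to the calling connection."""
        if (conn_id, handle) not in self.open_files:
            return "close does not refer to an open file"
        del self.open_files[(conn_id, handle)]
        return "ok"


server = ToyServer()
old = server.open_update("conn-A", "/data/f")              # cable is unplugged after this
locked = server.open_update("conn-B", "/data/f")           # new connection, same file
stolen = server.close("conn-B", old)                       # try to close conn-A's handle
forced = server.open_update("conn-B", "/data/f", force=True)
```

In this model the stale entry for conn-A stays in the table until the server notices that the old connection is gone, which is the situation the keepalive setting is meant to shorten.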
>
>  
>
>>Now I think this is not a bug in the code of course, it is something
>>related to the architecture and I would like to hear some comment from
>>you...
>>    
>>
>It's architecture, all right; but the architecture is determined by the TCP
>specification and the socket implementation in the kernel. There is really
>very little I can do about that. One could devise special circumstances
>where you could manually check whether the connection closed, but right now
>there is no such check. We never put a reverse "ping" in the protocol. Perhaps
>we should, to avoid these kinds of bizarre end conditions.
>
>  
>
>>[ I could do a workaround by remembering old physical connections that
>>timed out and retrying to close them before starting any other
>>communication but after a new physical TCP channel has successfully
>>connected to xrootd (i.e. after I'm sure the ethernet link is UP); but
>>it sounds too tricky to me... ]
>>    
>>
>Actually, sounds quite impossible in most circumstances. However, you should
>*always* close the previous connection that timed out. You can do that at
>the time the timeout occurs. Not that it would change a lot because the
>kernel still won't be able to send the "synclose" request to the server. But
>it's cleaner that way from the client's side.
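A minimal client-side sketch of "close the connection at the time the timeout occurs", using plain Python sockets. The helper name recv_or_drop is hypothetical and this is not the xrootd client API. As noted above, closing only releases the local socket; with the cable unplugged the server will not hear about it.

```python
import socket


def recv_or_drop(sock, nbytes, timeout):
    """Return received bytes, or None after closing a connection that timed out."""
    sock.settimeout(timeout)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # Release the local descriptor right away; the caller is expected
        # to open a fresh connection rather than reuse this one.
        sock.close()
        return None
```

A caller would treat a None result as the signal to reconnect and re-open the file on the new connection.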
>
>Again I think you should try specifying "keepalive" on the xrd.network
>directive.
>
>Andy
>  
>