Print

Print


Hi Andy;
I'm doing many writing tests with xrootd (and my java client) and
I realized there's a strange behaviour of the server that I do not fully 
understand:
suppose that I do a brutal  ethernet cable unplug during a data recv at 
the client side (this can simulate a serious kernel crash in which the 
TCP stack "disappears").
What I see is that the xrootd server doesn't realize the client 
disconnection... Indeed, the client didn't disconnect at all. But as we 
know the architecture defines a fault tolerance even for socket 
read/write timeouts generated by serious cataclysm like this. And my 
client does exactly that closing the physical connection that timed out 
and creating a new one. When I plug back the ethernet cable in the 
computer it seems that xrootd doesn't detect, for a while, that the old 
physical connection is actually closed (it seems that the TCP closure 
handshake do not occur anymore...), while the new physical connection 
succesfully connects to xrootd.

Then when the client tries to re-open the file in "UPDATE" mode it 
receives a "kXR_FileLocked" error. It is right and expected to me, 
because the missing "disconnection" notification to the server didn't 
trigger any closure of the remote file.
In fact after a "normal" client crash xrootd detects the client  
disconnection and logs it with the well-known message

041216 17:29:11 20099 XrdLink: anon.0:11@nbgrid01 disconnected after 
0:00:02 (link read error)
041216 17:29:11 20099 anon.0:11@nbgrid01 XrootdFile: closing r 
/data/dorigoa/testfile.tar
041216 17:29:11 20099 anon.0:11@nbgrid01 XrootdFile: closing w 
/data/dorigoa/testfile_copy.tar.1103214553901

In the case described above a message like this message isn't fired out, 
and files remain open.

Then I did think that I could resolve this by sending an explicit close 
command (kXR_close); but xrootd refuses to execute the command saying: 
"close does not refer to an open file" and I'm sure that command is 
trying to close the right filehandle (I made many cross-check with the 
client and server log files). Please read the log in the following:

041216 16:04:10 19906 anon.0:11@nbgrid01 XrootdProtocol: 0001 fh=1 write 
2000000@2000000
041216 16:04:10 19906 anon.0:11@nbgrid01 XrootdResponse: 0001 sending OK
041216 16:04:15 19906 anon.0:11@nbgrid01 XrootdProtocol: 0001 request 
timeout; read 0 of 24 bytes
041216 16:04:15 19906 anon.0:11@nbgrid01 XrdPoll: sending poller 0 
enable for link 11
041216 16:04:15 19906 XrdPoll: Poller 0 enabled fd 11 entry 1 now at 2
041216 16:04:32 19906 XrdInet: Accepted connection from nbgrid01
041216 16:04:32 19906 XrdSched: running 4@nbgrid01 inq=0
041216 16:04:32 19906 XrdProtocol: matched protocol daemon
041216 16:04:32 19906 ?:14@nbgrid01 XrdPoll: FD 14 attached to poller 1; 
num=1
041216 16:04:32 19906 ?:14@nbgrid01 XrootdProtocol: 0000 req=3007 dlen=0
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdResponse: 0000 sending OK
041216 16:04:32 19906 XrootdXeq: User logged in as anon.0:14@nbgrid01
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdProtocol: 0000 req=3003 
dlen=0
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdResponse: 0000 sending 
err 3004: close does not refer to an open file
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdProtocol: 0000 req=3010 
dlen=26
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdProtocol: 0000 open r 
/data/dorigoa/testfile.tar fh=0
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdResponse: 0000 sending 4 
data bytes; rc=0
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdProtocol: 0000 req=3013 
dlen=0
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdProtocol: 0000 fh=0 read 0@0
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdResponse: 0000 sending OK
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdProtocol: 0001 req=3003 
dlen=0
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdResponse: 0001 sending 
err 3004: close does not refer to an open file
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdProtocol: 0001 req=3010 
dlen=45
041216 16:04:32 19906 XrootdXeq: Output file 
/data/dorigoa/testfile_copy.tar.1103209452144 is already opened by 1 
writer; open denied.


As you can see there's a time gap between 16:04:15 and 16:04:32; in fact 
at 16:04:15 I unplugged the ethernet cable and plugged back in at 
16:04:32. Then you can see xrootd accepts a new connection but do not 
detect any disconnection; and the file 
/data/dorigoa/testfile_copy.tar.1103209452144 remains open.

Now I think this is not a bug in the code of course, it is something 
related to the architecture and I would like to hear some comment from 
you...

[ I could do a workaround by remembering old physical connections that 
timed out and retry to close them before starting any other 
communication but after a new physical tcp channel succesfully 
conntected to xrootd (i.e. after I'm sure the ethernet link is UP); but 
it sounds too much tricky to me... ]

thanks and sorry for the long email... ;)

    Alvise