Hi Andy;
I'm doing many writing tests with xrootd (and my java client) and
I realized there's a strange behaviour of the server that I do not fully
understand:
suppose that I do a brutal ethernet cable unplug during a data recv at
the client side (this can simulate a serious kernel crash in which the
TCP stack "disappears").
What I see is that the xrootd server doesn't realize the client
disconnection... Indeed, the client didn't disconnect at all. But as we
know the architecture defines a fault tolerance even for socket
read/write timeouts generated by serious cataclysm like this. And my
client does exactly that closing the physical connection that timed out
and creating a new one. When I plug back the ethernet cable in the
computer it seems that xrootd doesn't detect, for a while, that the old
physical connection is actually closed (it seems that the TCP closure
handshake do not occur anymore...), while the new physical connection
succesfully connects to xrootd.
Then when the client tries to re-open the file in "UPDATE" mode it
receives a "kXR_FileLocked" error. It is right and expected to me,
because the missing "disconnection" notification to the server didn't
trigger any closure of the remote file.
In fact after a "normal" client crash xrootd detects the client
disconnection and logs it with the well-known message
041216 17:29:11 20099 XrdLink: anon.0:11@nbgrid01 disconnected after
0:00:02 (link read error)
041216 17:29:11 20099 anon.0:11@nbgrid01 XrootdFile: closing r
/data/dorigoa/testfile.tar
041216 17:29:11 20099 anon.0:11@nbgrid01 XrootdFile: closing w
/data/dorigoa/testfile_copy.tar.1103214553901
In the case described above a message like this message isn't fired out,
and files remain open.
Then I did think that I could resolve this by sending an explicit close
command (kXR_close); but xrootd refuses to execute the command saying:
"close does not refer to an open file" and I'm sure that command is
trying to close the right filehandle (I made many cross-check with the
client and server log files). Please read the log in the following:
041216 16:04:10 19906 anon.0:11@nbgrid01 XrootdProtocol: 0001 fh=1 write
2000000@2000000
041216 16:04:10 19906 anon.0:11@nbgrid01 XrootdResponse: 0001 sending OK
041216 16:04:15 19906 anon.0:11@nbgrid01 XrootdProtocol: 0001 request
timeout; read 0 of 24 bytes
041216 16:04:15 19906 anon.0:11@nbgrid01 XrdPoll: sending poller 0
enable for link 11
041216 16:04:15 19906 XrdPoll: Poller 0 enabled fd 11 entry 1 now at 2
041216 16:04:32 19906 XrdInet: Accepted connection from nbgrid01
041216 16:04:32 19906 XrdSched: running 4@nbgrid01 inq=0
041216 16:04:32 19906 XrdProtocol: matched protocol daemon
041216 16:04:32 19906 ?:14@nbgrid01 XrdPoll: FD 14 attached to poller 1;
num=1
041216 16:04:32 19906 ?:14@nbgrid01 XrootdProtocol: 0000 req=3007 dlen=0
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdResponse: 0000 sending OK
041216 16:04:32 19906 XrootdXeq: User logged in as anon.0:14@nbgrid01
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdProtocol: 0000 req=3003
dlen=0
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdResponse: 0000 sending
err 3004: close does not refer to an open file
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdProtocol: 0000 req=3010
dlen=26
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdProtocol: 0000 open r
/data/dorigoa/testfile.tar fh=0
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdResponse: 0000 sending 4
data bytes; rc=0
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdProtocol: 0000 req=3013
dlen=0
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdProtocol: 0000 fh=0 read 0@0
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdResponse: 0000 sending OK
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdProtocol: 0001 req=3003
dlen=0
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdResponse: 0001 sending
err 3004: close does not refer to an open file
041216 16:04:32 19906 anon.0:14@nbgrid01 XrootdProtocol: 0001 req=3010
dlen=45
041216 16:04:32 19906 XrootdXeq: Output file
/data/dorigoa/testfile_copy.tar.1103209452144 is already opened by 1
writer; open denied.
As you can see there's a time gap between 16:04:15 and 16:04:32; in fact
at 16:04:15 I unplugged the ethernet cable and plugged back in at
16:04:32. Then you can see xrootd accepts a new connection but do not
detect any disconnection; and the file
/data/dorigoa/testfile_copy.tar.1103209452144 remains open.
Now I think this is not a bug in the code of course, it is something
related to the architecture and I would like to hear some comment from
you...
[ I could do a workaround by remembering old physical connections that
timed out and retry to close them before starting any other
communication but after a new physical tcp channel succesfully
conntected to xrootd (i.e. after I'm sure the ethernet link is UP); but
it sounds too much tricky to me... ]
thanks and sorry for the long email... ;)
Alvise
|