LISTSERV 16.5 - XROOTD-DEV Archives

Follow-up Comment #10, sr #119348 (project xrootd):

There are two issues here:

* ROOT doesn't handle the error code properly. That needs to be fixed, no
question about it.

* The router breaks the connection in such a way that the client has no clue
that anything is wrong. The socket is in a valid state, so the client writes
the request data to it and the operation succeeds (the write syscall returns
success), but the data stays stuck in the OS TCP send queue because it is
never ACKed. When the request timeout passes, the client simply writes the
request to the socket one more time; that write succeeds as well, and the
request data is again stuck in the send queue.
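
To make the second point concrete, here is a minimal sketch of how one could
look at the send queue of a connected TCP socket on Linux. It assumes Linux,
where the SIOCOUTQ ioctl has the same value as TIOCOUTQ and, for TCP, reports
the bytes that have been written but not yet ACKed. The helper name
unacked_bytes is made up for this illustration and is not part of the xrootd
client.

```python
import fcntl
import socket
import struct
import termios


def unacked_bytes(sock: socket.socket) -> int:
    """Return the number of bytes sitting in the TCP send queue (Linux only).

    Hypothetical helper for illustration. On Linux, SIOCOUTQ is defined as
    TIOCOUTQ, which Python exposes through the termios module.
    """
    # The kernel writes a C int into the buffer we pass in.
    raw = fcntl.ioctl(sock.fileno(), termios.TIOCOUTQ, struct.pack("i", 0))
    return struct.unpack("i", raw)[0]


if __name__ == "__main__":
    # Loopback demo: a listener and a client connected to it.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()

    client.sendall(b"request")  # the write succeeds either way
    print("bytes not yet ACKed:", unacked_bytes(client))

    for s in (client, server_side, listener):
        s.close()
```

On a healthy connection the value drops back to zero once the peer ACKs the
data. A value that stays above zero for the whole request timeout would be
consistent with the situation described above, although it cannot by itself
tell a dead path from a slow one.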

Of course, on every request timeout I could assume that the connection is
broken, even though the socket is in a valid state, and reconnect. No problem
with that. But consider the implications for clients that need long-lived
connections, bearing in mind that the default request timeout is on the
order of five minutes: every request sent some seconds after the previous
one would take five minutes to complete.

Yes, I could make this timeout shorter, but that would mean reconnecting
(and hence reauthenticating) every couple of seconds if the particular use
case demanded such an access pattern. I don't believe that this is an
acceptable solution either.

The real solution is to send probes over the wire to check that the
connection is alive. On Linux you can tweak the TCP stack to do that for you
transparently, in a way that fits the particular needs of every use case,
but other operating systems are clearly inferior in this respect. So I think
the question really is: do we want to support other operating systems as
well as we support Linux, or not?
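
As an illustration of the Linux side, here is a minimal sketch of per-socket
probing, assuming that the mechanism meant above is TCP keepalive. The
function name enable_keepalive and the timing values are assumptions chosen
for the example, not what the xrootd client does. The three TCP_KEEP*
options are guarded because not every platform exposes them, which is
exactly the portability gap in question.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 60,
                     interval: int = 10, count: int = 5) -> None:
    """Ask the kernel to probe an idle TCP connection (illustrative sketch).

    idle:     seconds of inactivity before the first probe is sent
    interval: seconds between unanswered probes
    count:    unanswered probes before the connection is declared dead
    The default values are hypothetical, picked only for this example.
    """
    # SO_KEEPALIVE itself is widely available.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The per-socket tuning knobs exist on Linux; skip any that this
    # platform's socket module does not define.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s, idle=30, interval=5, count=3)
    print("keepalive on:",
          s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

With settings like these the kernel would notice a silently dropped
connection after roughly idle + interval * count seconds and fail the socket,
without the client having to wait for the request timeout.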

    _______________________________________________________

Reply to this item at:

  <http://savannah.cern.ch/support/?119348>

_______________________________________________
  Message sent via/by LCG Savannah
  http://savannah.cern.ch/