Hello

While testing xrootd I was using the Perl client admin lib to obtain the
checksum of a file. Many clients were accessing the data server (no
redirector was used) in parallel, and each client was looping to obtain
a checksum.

Loop:
    XrdInitialize
    XrdGetChecksum
    XrdTerminate


Very rarely, a client prints:

050912 14:49:53 001 Xrd: ReadPartialAnswer: Error reading msg from connmgr (server [datadevsol04.slac.stanford.edu:2094]).
050912 14:49:53 001 Xrd: HandleServerError: Communication error with server [datadevsol04.slac.stanford.edu:2094]. Rebouncing here.
050912 14:49:53 001 Xrd: XrdClientConn::Endsess: Server [datadevsol04.slac.stanford.edu:2094] did not return OK message for last request.
050912 14:49:53 001 Xrd: SendGenCommand: Server declared error 3006:session is active

I believe that the client still receives the correct checksum, but this is
hard to verify as the problem is very rare.


The data server log file shows (lines belonging to other clients removed):

050912 14:44:55 039 wilko.9456:87@kama XrootdProtocol: 1b00 req=3001 dlen=21

050912 14:49:53 001 XrdInet: Accepted connection from kama.slac.stanford.edu
050912 14:49:53 023 XrdSched: running ?:47@kama inq=0
050912 14:49:53 023 XrdProtocol: matched protocol xrootd
050912 14:49:53 023 ?:47@kama XrdPoll: FD 47 attached to poller 1; num=14
050912 14:49:53 023 ?:47@kama XrootdProtocol: 1b00 req=3007 dlen=0
050912 14:49:53 023 wilko.9456:47@kama XrootdResponse: 1b00 sending 16 data bytes; status=0
050912 14:49:53 023 XrootdXeq: wilko.9456:47@kama login

050912 14:50:10 039 XrootdXeq: wilko.9456:87@kama disc 0:14:12
050912 14:50:10 039 wilko.9456:87@kama XrdPoll: FD 87 detached from poller 2; num=12


The first line (14:44:55) is a client checksum request, but the server is
not returning the answer, as there is no corresponding line:
 ... XrootdResponse: 1a00 sending 16 data bytes; status=0
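
To look for this pattern in a longer log, something like the following
Python sketch could help. It pairs each XrootdProtocol request line with a
later XrootdResponse line on the same connection and stream id, and lists
the requests that never got a response. The line format is assumed purely
from the excerpt above, and the names LINE_RE and unanswered_requests are
made up for this illustration; they are not part of xrootd.

```python
import re

# Assumed line layout, taken only from the log excerpt in this post:
#   <date> <time> <tid> <user>:<fd>@<host> <component>: <streamid> <rest>
LINE_RE = re.compile(
    r"^(\d{6} \d\d:\d\d:\d\d) \d+ \S+:(\d+@\S+) "
    r"(XrootdProtocol|XrootdResponse): ([0-9a-f]{4}) (.*)$"
)


def unanswered_requests(lines):
    """Return (timestamp, connection, streamid, request) for every request
    that has no later response on the same connection and stream id."""
    pending = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a request/response line (login, poller, ...)
        stamp, conn, kind, sid, rest = m.groups()
        # Key on fd@host rather than the user name: before login the
        # user part is logged as "?", afterwards as e.g. "wilko.9456".
        key = (conn, sid)
        if kind == "XrootdProtocol":
            pending[key] = (stamp, conn, sid, rest)
        else:
            pending.pop(key, None)
    return sorted(pending.values())


if __name__ == "__main__":
    sample = [
        "050912 14:44:55 039 wilko.9456:87@kama XrootdProtocol: 1b00 req=3001 dlen=21",
        "050912 14:49:53 023 ?:47@kama XrootdProtocol: 1b00 req=3007 dlen=0",
        "050912 14:49:53 023 wilko.9456:47@kama XrootdResponse: 1b00 sending 16 data bytes; status=0",
    ]
    for entry in unanswered_requests(sample):
        print(*entry)
```

This is only a rough filter: it treats the first response as closing a
request, so a request answered in several parts or a reused stream id
would need more care.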

It looks like the client then establishes a new connection,
wilko.9456:47 (the old one was wilko.9456:87).


As I said, I can't easily reproduce this problem, but during skimming at
GridK the same message was observed. In GridK's case the same
message repeats every 5 minutes until, after about 50 minutes, the
client aborts because of too many communication errors.
This problem has been reported in:
http://babar-hn.slac.stanford.edu:5090/HyperNews/get/SkimSOS/1867.html

GridK is using xrootd version 20050623-0016, whereas my tests were done
with the xrootd HEAD as of Sep. 9th. In both cases the data server was
heavily loaded.

Any ideas?

Cheers,
   wilko