Hello Andy


I forgot to mention that Andreas (GridK) saw this problem with xrdcp, but
I guess that if a data server is very busy and its response takes longer
than the timeout, one would get the same error.

Do you or Fabrizio know what the timeout is and if it can be modified
(for xrdcp and the perl client)?

Cheers,
   wilko
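
To make the failure mode concrete, here is a toy Python sketch of the
timeout-and-retry behaviour described in Andy's reply quoted below. It is
not xrootd code: ToyServer, call_with_timeout and get_checksum are
hypothetical names, the checksum value is a placeholder, and the real
client/server exchange (login, endsess) is reduced to a single call. Only
the error text "3006: session is active" is taken from the log.

```python
import threading
import time


class ToyServer:
    """Hypothetical stand-in for a data server; not the xrootd implementation."""

    def __init__(self, work_seconds):
        self.work_seconds = work_seconds
        self.active = set()  # session ids that still have a command running
        self.lock = threading.Lock()

    def checksum(self, session_id):
        with self.lock:
            if session_id in self.active:
                # Mirrors the error text seen in the client log.
                raise RuntimeError("3006: session is active")
            self.active.add(session_id)
        try:
            time.sleep(self.work_seconds)  # stands in for a long-running cksum
            return "deadbeef"  # placeholder checksum value
        finally:
            with self.lock:
                self.active.discard(session_id)


def call_with_timeout(server, session_id, timeout):
    """Run one request in a thread and give up waiting after `timeout` seconds."""
    result = {}

    def run():
        try:
            result["value"] = server.checksum(session_id)
        except RuntimeError as exc:
            result["error"] = str(exc)

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        return "timeout"  # the server keeps working on the old request
    return result.get("value") or result["error"]


def get_checksum(server, session_id, timeout, retries=1):
    """Retry after a timeout, as the client does; return every outcome seen."""
    outcomes = []
    for _ in range(retries + 1):
        outcome = call_with_timeout(server, session_id, timeout)
        outcomes.append(outcome)
        if outcome != "timeout":
            break
    return outcomes
```

With a slow server and a short timeout, e.g.
get_checksum(ToyServer(0.5), "s1", timeout=0.1), the first attempt times
out and the retry is rejected because the first command is still running.
With a timeout longer than the work time, the first attempt simply
returns the checksum, which is why raising the timeout hides the problem.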


On Tue, 13 Sep 2005, Andrew Hanushevsky wrote:

> Hi Wilko,
>
> This is the standard problem that we noticed with long-running commands.
> The client times out and retries. However, it cannot establish a new
> session because the previous session is still running the cksum command.
> Short of doing some call-back scheme or idling the client while the command
> is running, we don't have an immediate solution other than increasing the
> timeout.
>
> Andy
>
>
> On Tue, 13 Sep 2005, Wilko Kroeger wrote:
>
> >
> > Hello
> >
> > While testing xrootd I was using the perl client admin lib to obtain the
> > checksum of a file. Many clients were accessing the data server (no
> > redirector was used) in parallel, and each client was looping to obtain
> > a checksum.
> >
> > Loop:
> >     XrdInitialize
> >     XrdGetChecksum
> >     XrdTerminate
> >
> >
> > Very rarely, I see a case where a client prints:
> >
> > 050912 14:49:53 001 Xrd: ReadPartialAnswer: Error reading msg from connmgr (server [datadevsol04.slac.stanford.edu:2094]).
> > 050912 14:49:53 001 Xrd: HandleServerError: Communication error with server [datadevsol04.slac.stanford.edu:2094]. Rebouncing here.
> > 050912 14:49:53 001 Xrd: XrdClientConn::Endsess: Server [datadevsol04.slac.stanford.edu:2094] did not return OK message for last request.
> > 050912 14:49:53 001 Xrd: SendGenCommand: Server declared error 3006:session is active
> >
> > I believe that the client still receives the correct checksum, but this
> > is hard to test as the problem is very rare.
> >
> >
> > The data server log file shows (I cut out lines that belong to other clients):
> >
> > 050912 14:44:55 039 wilko.9456:87@kama XrootdProtocol: 1b00 req=3001 dlen=21
> >
> > 050912 14:49:53 001 XrdInet: Accepted connection from kama.slac.stanford.edu
> > 050912 14:49:53 023 XrdSched: running ?:47@kama inq=0
> > 050912 14:49:53 023 XrdProtocol: matched protocol xrootd
> > 050912 14:49:53 023 ?:47@kama XrdPoll: FD 47 attached to poller 1; num=14
> > 050912 14:49:53 023 ?:47@kama XrootdProtocol: 1b00 req=3007 dlen=0
> > 050912 14:49:53 023 wilko.9456:47@kama XrootdResponse: 1b00 sending 16 data bytes; status=0
> > 050912 14:49:53 023 XrootdXeq: wilko.9456:47@kama login
> >
> > 050912 14:50:10 039 XrootdXeq: wilko.9456:87@kama disc 0:14:12
> > 050912 14:50:10 039 wilko.9456:87@kama XrdPoll: FD 87 detached from poller 2; num=12
> >
> >
> > The first line (14:44:55) is a client checksum request, but the server is
> > not returning the answer, as there is no corresponding line:
> >  ... XrootdResponse: 1a00 sending 16 data bytes; status=0
> >
> > It looks as if the client then establishes a new connection,
> > wilko.9456:47 (the old one was wilko.9456:87).
> >
> >
> > As I said, I can't easily reproduce this problem, but during skimming at
> > GridK the same message was observed. In GridK's case the same
> > message repeats every 5 minutes until, after about 50 minutes, the
> > client aborts because of too many communication errors.
> > This problem has been reported in:
> > http://babar-hn.slac.stanford.edu:5090/HyperNews/get/SkimSOS/1867.html
> >
> > GridK is using xrootd version 20050623-0016, whereas my tests were done
> > with the xrootd HEAD as of Sep. 9th. In both cases the data server was
> > heavily loaded.
> >
> > Any ideas?
> >
> > Cheers,
> >    wilko
> >
> >
> >
> >
>