Hi,

 Yes, I agree, this is an old story pioneered by Remi. You can easily thrash a 
data server by overloading it with checksum requests. I am surprised that the 
event is so rare in that case. Are you requesting the checksum of the same 
file N times, Wilko? 
 Anyway, the session-ending (endsess) request is only a hint from the client 
to the server. The client should retry regardless, even if the endsess request 
fails. The problem is that, after a timeout/retry, your client will request a 
new checksum while the first one is still running (with no client waiting for 
it).

 In my opinion the definitive solution is to extend the chksum mechanism in 
order to:

- be able to kxr_wait the clients (e.g. pending chksums < X), so that we can 
support things like checksums that take a whole day
- not launch too many checksum processes at once (e.g. N < X)
    - possibly making the client abort (e.g. if pending chksums > X)

 Doing so, we could avoid overloading the server side while making the chksum 
mechanism predictable.
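
 To make the idea concrete, here is a rough sketch of the admission logic in 
Python. It is only an illustration under my own assumptions, not xrootd code: 
the names ChecksumGate, Verdict, max_running and max_pending are invented for 
this example, and the two limits stand in for the "X" thresholds above. The 
sketch also assumes that a retry for a file already being checksummed is told 
to wait rather than start a second process.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    START = "start"  # launch a checksum process for this request
    WAIT = "wait"    # ask the client to wait and retry later (kxr_wait-like)
    ABORT = "abort"  # too many pending requests; the client should give up


@dataclass
class ChecksumGate:
    """Hypothetical admission control; not part of xrootd."""
    max_running: int  # cap on concurrent checksum processes
    max_pending: int  # cap on distinct files whose clients are waiting
    running: set = field(default_factory=set)
    pending: set = field(default_factory=set)

    def request(self, path: str) -> Verdict:
        # A retry for a file already being checksummed just waits,
        # instead of starting a second process for the same file.
        if path in self.running:
            return Verdict.WAIT
        # A free slot: start now (and clear any earlier pending mark).
        if len(self.running) < self.max_running:
            self.pending.discard(path)
            self.running.add(path)
            return Verdict.START
        # No free slot: keep the client waiting if the queue has room.
        if path in self.pending or len(self.pending) < self.max_pending:
            self.pending.add(path)
            return Verdict.WAIT
        return Verdict.ABORT

    def finish(self, path: str) -> None:
        # Called when the checksum process for `path` ends.
        self.running.discard(path)


if __name__ == "__main__":
    gate = ChecksumGate(max_running=1, max_pending=1)
    for path in ["/data/a", "/data/a", "/data/b", "/data/c"]:
        print(path, gate.request(path))
    gate.finish("/data/a")
    print("/data/b", gate.request("/data/b"))
```

 The sketch makes no attempt at fairness (when a slot frees up, whichever 
client asks first gets it) and does not cache finished results, so both would 
need more thought in a real implementation.
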
 If you agree, I can start working on it around the time of the ROOT workshop 
(so we can talk about it there...)


Fabrizio

On Tuesday 13 September 2005 10:25 pm, Andrew Hanushevsky wrote:
> Hi Wilko,
>
> This is the standard problem that we noticed with long-running commands.
> The client times out and retries. However, it cannot establish a new
> session because the previous session is still running the cksum command.
> Short of doing some call-back scheme or idling the client while the command
> is running, we don't have an immediate solution other than increase the
> timeout.
>
> Andy
>
> On Tue, 13 Sep 2005, Wilko Kroeger wrote:
> > Hello
> >
> > While testing xrootd I was using the perl client admin lib to obtain the
> > checksum of a file. Many clients were accessing the data server (no
> > redirector was used) in parallel, and each client was looping to obtain
> > a checksum.
> >
> > Loop:
> >     XrdInitialize
> >     XrdGetChecksum
> >     XrdTerminate
> >
> >
> > Very seldom I see a case where the client prints:
> >
> > 050912 14:49:53 001 Xrd: ReadPartialAnswer: Error reading msg from connmgr (server [datadevsol04.slac.stanford.edu:2094]).
> > 050912 14:49:53 001 Xrd: HandleServerError: Communication error with server [datadevsol04.slac.stanford.edu:2094]. Rebouncing here.
> > 050912 14:49:53 001 Xrd: XrdClientConn::Endsess: Server [datadevsol04.slac.stanford.edu:2094] did not return OK message for last request.
> > 050912 14:49:53 001 Xrd: SendGenCommand: Server declared error 3006:session is active
> >
> > I believe that the client still receives the correct checksum, but it is
> > hard to test as the problem is very rare.
> >
> >
> > The data server log file shows (cut out lines that belong to different
> > clients):
> >
> > 050912 14:44:55 039 wilko.9456:87@kama XrootdProtocol: 1b00 req=3001 dlen=21
> >
> > 050912 14:49:53 001 XrdInet: Accepted connection from kama.slac.stanford.edu
> > 050912 14:49:53 023 XrdSched: running ?:47@kama inq=0
> > 050912 14:49:53 023 XrdProtocol: matched protocol xrootd
> > 050912 14:49:53 023 ?:47@kama XrdPoll: FD 47 attached to poller 1; num=14
> > 050912 14:49:53 023 ?:47@kama XrootdProtocol: 1b00 req=3007 dlen=0
> > 050912 14:49:53 023 wilko.9456:47@kama XrootdResponse: 1b00 sending 16 data bytes; status=0
> > 050912 14:49:53 023 XrootdXeq: wilko.9456:47@kama login
> >
> > 050912 14:50:10 039 XrootdXeq: wilko.9456:87@kama disc 0:14:12
> > 050912 14:50:10 039 wilko.9456:87@kama XrdPoll: FD 87 detached from poller 2; num=12
> >
> >
> > The first line (14:44:55) is a client checksum request, but the server is
> > not returning the answer, as there is no corresponding line:
> >  ... XrootdResponse: 1a00 sending 16 data bytes; status=0
> >
> > It looks like the client is then establishing a new connection,
> > wilko.9456:47 (the old one was wilko.9456:87).
> >
> >
> > As I said, I can't easily reproduce this problem, but during skimming at
> > GridK the same message was observed. In GridK's case the same
> > message repeats every 5 mins until, after about 50 mins, the
> > client aborts because of too many communication errors.
> > This problem has been reported in:
> > http://babar-hn.slac.stanford.edu:5090/HyperNews/get/SkimSOS/1867.html
> >
> > GridK is using xrootd version 20050623-0016, whereas my tests were done
> > with the xrootd HEAD as of Sep. 9th. In both cases the data server was
> > heavily loaded.
> >
> > Any ideas?
> >
> > Cheers,
> >    wilko