Hi,

yes, I agree this is an old story pioneered by Remi. You can easily
thrash a data server by overloading it with checksum requests. I am
surprised that the event is very rare in that case. Are you requesting
N times the checksum of the same file, Wilko?

Anyway, the session ending request is only a hint from the client to the
server. The client should retry anyway, even if the endsess request went
bad. The problem is that, after a timeout/retry, your client will request
a new checksum while the first is still running (with no clients waiting
for it).

In my opinion the definitive solution is to extend the chksum mechanism
in order to:

 - be able to kxr_wait the clients (e.g. pending chksums < X), so that
   we can support things like checksums lasting a day
 - not launch too many checksum processes at once (e.g. N < X)
 - eventually make the client abort (e.g. if pending chksums > X)

Doing so, we could avoid overloading the server side, while giving
predictability to the chksum mechanism.

If you agree, I can start working on it around the days of the root ws
(so we can speak about it...)

Fabrizio

On Tuesday 13 September 2005 10:25 pm, Andrew Hanushevsky wrote:
> Hi Wilko,
>
> This is the standard problem that we noticed with long-running commands.
> The client times out and retries. However, it cannot establish a new
> session because the previous session is still running the cksum command.
> Short of doing some call-back scheme or idling the client while the command
> is running, we don't have an immediate solution other than increasing the
> timeout.
>
> Andy
>
> On Tue, 13 Sep 2005, Wilko Kroeger wrote:
> > Hello
> >
> > While testing xrootd I was using the perl client admin lib to obtain the
> > checksum of a file. Many clients were accessing the data server (no
> > redirector was used) in parallel, and each client was looping to obtain
> > a checksum.
> >
> > Loop:
> >   XrdInitialize
> >   XrdGetChecksum
> >   XrdTerminate
> >
> >
> > Very seldom I see the case where a client prints:
> >
> > 050912 14:49:53 001 Xrd: ReadPartialAnswer: Error reading msg from connmgr (server [datadevsol04.slac.stanford.edu:2094]).
> > 050912 14:49:53 001 Xrd: HandleServerError: Communication error with server [datadevsol04.slac.stanford.edu:2094]. Rebouncing here.
> > 050912 14:49:53 001 Xrd: XrdClientConn::Endsess: Server [datadevsol04.slac.stanford.edu:2094] did not return OK message for last request.
> > 050912 14:49:53 001 Xrd: SendGenCommand: Server declared error 3006:session is active
> >
> > I believe that the client still receives the correct checksum, but it is
> > hard to test as the problem is very rare.
> >
> >
> > The data server log file shows (lines that belong to different clients
> > have been cut out):
> >
> > 050912 14:44:55 039 wilko.9456:87@kama XrootdProtocol: 1b00 req=3001 dlen=21
> >
> > 050912 14:49:53 001 XrdInet: Accepted connection from kama.slac.stanford.edu
> > 050912 14:49:53 023 XrdSched: running ?:47@kama inq=0
> > 050912 14:49:53 023 XrdProtocol: matched protocol xrootd
> > 050912 14:49:53 023 ?:47@kama XrdPoll: FD 47 attached to poller 1; num=14
> > 050912 14:49:53 023 ?:47@kama XrootdProtocol: 1b00 req=3007 dlen=0
> > 050912 14:49:53 023 wilko.9456:47@kama XrootdResponse: 1b00 sending 16 data bytes; status=0
> > 050912 14:49:53 023 XrootdXeq: wilko.9456:47@kama login
> >
> > 050912 14:50:10 039 XrootdXeq: wilko.9456:87@kama disc 0:14:12
> > 050912 14:50:10 039 wilko.9456:87@kama XrdPoll: FD 87 detached from poller 2; num=12
> >
> >
> > The first line (14:44:55) is a client checksum request, but the server is
> > not returning the answer, as there is no corresponding line:
> > ... XrootdResponse: 1a00 sending 16 data bytes; status=0
> >
> > It looks like the client is then establishing a new connection,
> > wilko.9456:47 (the old one was wilko.9456:87).
> >
> >
> > As I said, I can't easily reproduce this problem, but during skimming at
> > GridK the same message was observed. In GridK's case the same
> > message repeats every 5 mins until, after about 50 mins, the
> > client aborts because of too many communication errors.
> > This problem has been reported in:
> > http://babar-hn.slac.stanford.edu:5090/HyperNews/get/SkimSOS/1867.html
> >
> > GridK is using xrootd version 20050623-0016, whereas my tests were done
> > with the xrootd HEAD as of Sep. 9th. In both cases the data server was
> > heavily loaded.
> >
> > Any ideas?
> >
> > Cheers,
> > wilko
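
For reference, the three-point proposal at the top of this thread (make
clients wait, cap the number of checksum processes, abort when too much is
pending) could be sketched roughly as below. This is only an illustration
of the policy, not xrootd code: the class, its method names and the two
limits (max_running, max_pending) are hypothetical, and treating a retry
for a file that is already being checksummed as a "wait" is an assumption
drawn from the timeout/retry problem described above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    RUN = "run"      # start a checksum process now
    WAIT = "wait"    # ask the client to wait and retry later (kxr_wait-style)
    ABORT = "abort"  # too much work pending: make the client give up


@dataclass
class ChecksumAdmission:
    """Hypothetical admission policy for checksum requests."""
    max_running: int                       # assumed cap on concurrent checksum processes
    max_pending: int                       # assumed cap on files whose clients were told to wait
    running: set = field(default_factory=set)
    pending: set = field(default_factory=set)

    def request(self, path: str) -> Decision:
        # A retry for a file that is already being checksummed must not
        # start a second process; the client just waits for the first one.
        if path in self.running:
            return Decision.WAIT
        # Free slot: start the checksum.
        if len(self.running) < self.max_running:
            self.pending.discard(path)
            self.running.add(path)
            return Decision.RUN
        # No free slot: queue the file if the pending list has room
        # (or if it is already queued), otherwise refuse.
        if path in self.pending or len(self.pending) < self.max_pending:
            self.pending.add(path)
            return Decision.WAIT
        return Decision.ABORT

    def finished(self, path: str) -> None:
        # Called when a checksum process ends; frees its slot.
        self.running.discard(path)
```

With max_running=1 and max_pending=1, a first request for /a runs, a retry
for /a and a new request for /b are both told to wait, and a request for /c
is aborted until /a finishes and a slot frees up.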