Hi Jerome,

What does the debug log on the server indicate?

Andy

----- Original Message -----
From: "Fabrizio Furano" <[log in to unmask]>
To: "Jerome LAURET" <[log in to unmask]>
Cc: "Xrootd Mailing List" <[log in to unmask]>
Sent: Tuesday, December 13, 2005 4:21 PM
Subject: Re: HandShake problem in Xoortd

> Hi,
>
> I just checked the code in ROOT 4.04-02, just to realize that the
> retry code is there. It should work.
>
> In fact, even from your plain error log, it seems that the client
> already tries more than once to connect/handshake/disconnect.
>
> A log with level 2 would be more descriptive. However, all you can do
> is to play with the parameter:
>
> XNet.TryConnectServersList
>
> This is the max number of "first connect" failures before TXNetFile
> gives up.
>
> What I don't understand is that you should already see about 240
> retries, since 240 is the default value (about 2400 seconds). Do you see
> so many messages before the failure?
>
> Note: Making the client retry for hours or days will not solve the
> basic problem, i.e. the unresponsive server. So you cannot be guaranteed
> that you are not simply moving the problem. I don't know if this is
> preferable to having an aborted job.
>
> Fabrizio
>
> Jerome LAURET wrote:
>>
>> Yes, it may be an unresponsive server but the end
>> product is a user job crashing according to our users. Some
>> of those connections (later retried) would lead to success.
>> We can switch to level 3 and see as soon as we can design a
>> large scale test: we also have flukes with authentication
>> in general. I have not reported this yet (I believe it to be
>> LDAP related, as changing the LDAP setup changed failure rates
>> from 50% to 3% ... but I cannot drop below the 3% failures).
>>
>> In general, could someone indicate what is the most
>> reliable setting for Xrootd to retry upon failures (regardless
>> of delays this may cause) ??
>>
>> Thanks,
>>
>> Fabrizio Furano wrote:
>>
>>> Hi Jerome.
>>>
>>> This almost never happens, and makes me think about an unresponsive
>>> server. But it may be caused by weird connection troubles. Do you get
>>> this immediately or in the middle of a communication?
>>>
>>> Since it's a very strange situation, I'd suggest you set the
>>> client side debug level to 3 and send everything to me. Also the
>>> server side log (after having enabled it, of course) will be useful.
>>>
>>> The version included in root4 is rather old, but well known and
>>> tested. I am looking forward hoping that everybody will be switching
>>> to the newer one asap.... at least I will be no more dealing with N
>>> versions of the same code....
>>>
>>> Fabrizio
>>>
>>> Jerome LAURET wrote:
>>>
>>>> Has anyone experienced this kind of issue and if so, what
>>>> to do to resolve it ??
>>>>
>>>>> Error in <DoHandShake>: Error reading 4 bytes from the server
>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>> Info in <GetAccessToSrv>: HandShake failed with server
>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>> Error in <TXNetFile::CreateTXNf>: Access to server failed (0)
>>>>> Error in <Disconnect>: Destroying nonexistent logconnid 0.
>>>>> Error in <DoHandShake>: Error reading 4 bytes from the server
>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>> Info in <GetAccessToSrv>: HandShake failed with server
>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>> Error in <TXNetFile::CreateTXNf>: Access to server failed (0)
>>>>> Error in <Disconnect>: Destroying nonexistent logconnid 0.
>>>>
>>>> Thank you,
>>>>
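
For a sense of scale: the figures quoted in the thread (a default of 240 "first connect" attempts, about 2400 seconds) imply roughly 10 seconds per failed attempt. The sketch below is plain Python arithmetic, not ROOT or xrootd code. The 10-second figure is inferred from those two numbers rather than taken from the client source, and both function names are hypothetical. It only estimates how long a client would keep retrying for a given value of XNet.TryConnectServersList, and what value would be needed to cover a given window.

```python
# Assumption: ~10 s per failed attempt, inferred from the thread's
# "240 retries ... (about 2400 seconds)". Not read from the client code.
SECONDS_PER_ATTEMPT = 10

# Default value of XNet.TryConnectServersList as stated in the thread.
DEFAULT_RETRIES = 240


def retry_window_seconds(retries, seconds_per_attempt=SECONDS_PER_ATTEMPT):
    """Approximate time spent retrying before the client gives up."""
    if retries < 0:
        raise ValueError("retries must be non-negative")
    return retries * seconds_per_attempt


def retries_for_window(window_seconds, seconds_per_attempt=SECONDS_PER_ATTEMPT):
    """Smallest retry count whose retry window covers window_seconds."""
    if window_seconds < 0:
        raise ValueError("window_seconds must be non-negative")
    # Ceiling division using integer arithmetic only.
    return -(-window_seconds // seconds_per_attempt)


if __name__ == "__main__":
    # Retry window for the default setting, in seconds.
    print(retry_window_seconds(DEFAULT_RETRIES))
    # Retry count needed to keep trying for one full day.
    print(retries_for_window(24 * 3600))
```

As noted in the thread, a larger value only lengthens the retry window; it does not address an unresponsive server.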