I will enable logging, retest, and report back on this question.
At a 3% failure level, the problem is frequent enough to be disruptive
but still hard to catch, especially with our number of clients and
data servers (I will make sure I have plenty of space for logs) ...
So, level 2 or above ... I will answer within the next 24 hours
(hopefully).

Thanks for your attention,

Andrew Hanushevsky wrote:
> Hi Jerome,
>
> What does the debug log on the server indicate?
>
> Andy
>
> ----- Original Message ----- From: "Fabrizio Furano" <[log in to unmask]>
> To: "Jerome LAURET" <[log in to unmask]>
> Cc: "Xrootd Mailing List" <[log in to unmask]>
> Sent: Tuesday, December 13, 2005 4:21 PM
> Subject: Re: HandShake problem in Xoortd
>
>
>> Hi,
>>
>> I just checked the code in ROOT 4.04-02, just to realize that the
>> retry code is there. It should work.
>>
>> In fact, even from your plain error log, it seems that the client
>> already tries more than once to connect/handshake/disconnect.
>>
>> A log with level 2 would be more descriptive. However, all you can do
>> is to play with the parameter:
>>
>> XNet.TryConnectServersList
>>
>> This is the max number of "first connect" failures before TXNetFile
>> gives up.
>>
>> What I don't understand is that you should already see about 240
>> retries, since 240 is the default value (about 2400 seconds). Do you
>> see so many messages before the failure?
>>
>> Note: Making the client retry for hours or days will not solve the
>> basic problem, i.e. the unresponsive server. So you cannot be
>> guaranteed that you are not simply moving the problem. I don't know if
>> this is preferable to having an aborted job.
>>
>> Fabrizio
>>
>> Jerome LAURET wrote:
>>
>>>
>>> Yes, it may be an unresponsive server but the end
>>> product is a user job crashing according to our users. Some
>>> of those connections (later retried) would lead to success.
>>> We can switch to level 3 and see as soon as we can design a
>>> large scale test: we also have flukes with authentication
>>> in general, which I have not reported yet (I believe it to be
>>> LDAP related, as changing the LDAP setup changed failure rates
>>> from 50% to 3% ... but I cannot drop below the 3% failures).
>>>
>>> In general, could someone indicate what is the most
>>> reliable setting for Xrootd to retry upon failures (regardless
>>> of delays this may cause) ??
>>>
>>> Thanks,
>>>
>>> Fabrizio Furano wrote:
>>>
>>>> Hi Jerome.
>>>>
>>>> This almost never happens, and makes me think about an unresponsive
>>>> server. But it may be caused by weird connection troubles. Do you get
>>>> this immediately or in the middle of a communication?
>>>>
>>>> Since it's a very strange situation, I'd suggest you set the
>>>> client-side debug level to 3 and send everything to me. Also the
>>>> server-side log (after having enabled it, of course) will be useful.
>>>>
>>>> The version included in root4 is rather old, but well known and
>>>> tested. I am looking forward to everybody switching
>>>> to the newer one asap.... at least I will no longer be dealing with N
>>>> versions of the same code....
>>>>
>>>> Fabrizio
>>>>
>>>>
>>>>
>>>> Jerome LAURET wrote:
>>>>
>>>>> Has anyone experienced this kind of issue and if so, what
>>>>> to do to resolve it ??
>>>>>
>>>>>> Error in <DoHandShake>: Error reading 4 bytes from the server
>>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>>> Info in <GetAccessToSrv>: HandShake failed with server
>>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>>> Error in <TXNetFile::CreateTXNf>: Access to server failed (0)
>>>>>> Error in <Disconnect>: Destroying nonexistent logconnid 0.
>>>>>> Error in <DoHandShake>: Error reading 4 bytes from the server
>>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>>> Info in <GetAccessToSrv>: HandShake failed with server
>>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>>> Error in <TXNetFile::CreateTXNf>: Access to server failed (0)
>>>>>> Error in <Disconnect>: Destroying nonexistent logconnid 0.
>>>>>
>>>>> Thank you,
>>>>>
>>>
>>

--
             ,,,,,
            ( o o )
         --m---U---m--
             Jerome