	I will enable logging, retest and report back on this
question. At a 3% failure rate, the problem is frequent enough
to be disruptive but still hard to catch (I will make sure I
have plenty of space for logs), especially with our number
of clients and dataservers ...

	So, level 2 or above ... Will answer within the next
24 hours (hopefully).

	Thanks for your attention,

Andrew Hanushevsky wrote:
> Hi Jerome,
> 
> What does the debug log on the server indicate?
> 
> Andy
> 
> ----- Original Message ----- From: "Fabrizio Furano" <[log in to unmask]>
> To: "Jerome LAURET" <[log in to unmask]>
> Cc: "Xrootd Mailing List" <[log in to unmask]>
> Sent: Tuesday, December 13, 2005 4:21 PM
> Subject: Re: HandShake problem in Xoortd
> 
> 
>> Hi,
>>
>>  I just checked the code in ROOT 4.04-02 and confirmed that the 
>> retry code is there. It should work.
>>
>>  In fact, even from your plain error log, it seems that the client 
>> already tries more than once to connect/handshake/disconnect.
>>
>>  A log with level 2 would be more descriptive. However, all you can do 
>> is play with the parameter:
>>
>> XNet.TryConnectServersList
>>
>>  This is the max number of "first connect" failures before TXNetFile 
>> gives up.
>>
>>  What I don't understand is that you should already see about 240 
>> retries, since 240 is the default value (about 2400 seconds). Do you 
>> see so many messages before the failure?
>>
>>  Note: Making the client retry for hours or days will not solve the 
>> basic problem, i.e. the unresponsive server. So you cannot be 
>> guaranteed that you are not simply moving the problem. I don't know if 
>> this is preferable to having an aborted job.
>>
>> Fabrizio
>>
>> Jerome LAURET wrote:
>>
>>>
>>>     Yes, it may be an unresponsive server, but the end
>>> result is a user job crashing, according to our users. Some
>>> of those connections, if retried later, would succeed.
>>> We can switch to level 3 and look as soon as we can design a
>>> large-scale test. We also have flukes with authentication
>>> in general, which I have not reported yet (I believe they are
>>> LDAP related, as changing the LDAP setup changed failure rates
>>> from 50% to 3% ... but I cannot get below the 3% failures).
>>>
>>>     In general, could someone indicate what the most
>>> reliable setting is for Xrootd to retry upon failure (regardless
>>> of the delays this may cause) ??
>>>
>>>     Thanks,
>>>
>>> Fabrizio Furano wrote:
>>>
>>>> Hi Jerome.
>>>>
>>>>  This almost never happens, and it makes me think of an unresponsive 
>>>> server. But it may also be caused by weird connection troubles. Do you 
>>>> get this immediately or in the middle of a communication?
>>>>
>>>>  Since it's a very strange situation, I'd suggest you set the 
>>>> client side debug level to 3 and send everything to me. Also the 
>>>> server side log (after having enabled it, of course) will be useful.
>>>>
>>>>  The version included in root4 is rather old, but well known and 
>>>> tested. I am hoping that everybody will switch to the newer one 
>>>> asap ... at least I will no longer be dealing with N versions of 
>>>> the same code ...
>>>>
>>>> Fabrizio
>>>>
>>>>
>>>>
>>>> Jerome LAURET wrote:
>>>>
>>>>>     Has anyone experienced this kind of issue and, if so, what
>>>>> can be done to resolve it ??
>>>>>
>>>>>> Error in <DoHandShake>: Error reading 4 bytes from the server 
>>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>>> Info in <GetAccessToSrv>: HandShake failed with server 
>>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>>> Error in <TXNetFile::CreateTXNf>: Access to server failed (0)
>>>>>> Error in <Disconnect>: Destroying nonexistent logconnid 0.
>>>>>> Error in <DoHandShake>: Error reading 4 bytes from the server 
>>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>>> Info in <GetAccessToSrv>: HandShake failed with server 
>>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>>> Error in <TXNetFile::CreateTXNf>: Access to server failed (0)
>>>>>> Error in <Disconnect>: Destroying nonexistent logconnid 0. 
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>     Thank you,
>>>>>
>>>
>>

-- 
              ,,,,,
             ( o o )
          --m---U---m--
              Jerome