Hi Jerome,

What does the debug log on the server indicate?

Andy

----- Original Message ----- 
From: "Fabrizio Furano" <[log in to unmask]>
To: "Jerome LAURET" <[log in to unmask]>
Cc: "Xrootd Mailing List" <[log in to unmask]>
Sent: Tuesday, December 13, 2005 4:21 PM
Subject: Re: HandShake problem in Xrootd


> Hi,
> 
>  I just checked the code in ROOT 4.04-02 and confirmed that the 
> retry code is there. It should work.
> 
>  In fact, even from your plain error log, it seems that the client 
> already tries more than once to connect/handshake/disconnect.
> 
>  A log with level 2 would be more descriptive. However, the only thing 
> you can tune is the parameter:
> 
> XNet.TryConnectServersList
> 
>  This is the max number of "first connect" failures before TXNetFile 
> gives up.
> 
>  What I don't understand is that you should already see about 240 
> retries, since 240 is the default value (about 2400 seconds). Do you see 
> that many messages before the failure?
> 
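The figures quoted above (240 attempts, about 2400 seconds) imply roughly 10 seconds per failed "first connect" attempt. The sketch below only does that arithmetic. The helper names are hypothetical, and the 10-second cost per attempt is an assumption derived from those two numbers, not a value read from the client code.

```python
# Rough retry-budget arithmetic for XNet.TryConnectServersList.
# Assumption: every failed "first connect" attempt costs about the same time.
DEFAULT_ATTEMPTS = 240       # default value quoted in the message above
SECONDS_PER_ATTEMPT = 10     # implied by "240 retries, about 2400 seconds"


def retry_budget_seconds(attempts, seconds_per_attempt=SECONDS_PER_ATTEMPT):
    """Approximate wall-clock time before the client gives up."""
    return attempts * seconds_per_attempt


def attempts_for_budget(budget_seconds, seconds_per_attempt=SECONDS_PER_ATTEMPT):
    """Smallest attempt count whose retry budget covers budget_seconds."""
    return -(-budget_seconds // seconds_per_attempt)  # ceiling division


if __name__ == "__main__":
    print(retry_budget_seconds(DEFAULT_ATTEMPTS))
    print(attempts_for_budget(3600))
```

Under the same assumption, a larger value of the parameter only stretches this budget. It does not make an unresponsive server answer.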
>  Note: Making the client retry for hours or days will not solve the 
> basic problem, i.e. the unresponsive server. So you cannot be guaranteed 
> that you are not simply moving the problem. I don't know if this is 
> preferable to having an aborted job.
> 
> Fabrizio
> 
> Jerome LAURET wrote:
>> 
>>     Yes, it may be an unresponsive server, but the end
>> result is a user job crashing, according to our users. Some
>> of those connections (when retried later) would lead to success.
>> We can switch to level 3 as soon as we can design a
>> large-scale test. We also have flukes with authentication
>> in general, which I have not reported yet (I believe them to be
>> LDAP related, as changing the LDAP setup changed failure rates
>> from 50% to 3% ... but I cannot drop below the 3% failures).
>> 
>>     In general, could someone indicate the most
>> reliable setting for Xrootd to retry upon failures (regardless
>> of the delays this may cause)?
>> 
>>     Thanks,
>> 
>> Fabrizio Furano wrote:
>> 
>>> Hi Jerome.
>>>
>>>  This almost never happens, and it makes me think of an unresponsive 
>>> server, but it may also be caused by odd connection troubles. Do you 
>>> get this immediately or in the middle of a communication?
>>>
>>>  Since it's a very strange situation, I'd suggest setting the 
>>> client-side debug level to 3 and sending everything to me. The 
>>> server-side log (after enabling it, of course) will also be useful.
>>>
>>>  The version included in root4 is rather old, but well known and 
>>> tested. I hope that everybody will switch to the newer one asap; 
>>> at least then I will no longer be dealing with N versions of the 
>>> same code.
>>>
>>> Fabrizio
>>>
>>>
>>>
>>> Jerome LAURET wrote:
>>>
>>>>     Has anyone experienced this kind of issue and, if so, what
>>>> can be done to resolve it?
>>>>
>>>>> Error in <DoHandShake>: Error reading 4 bytes from the server 
>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>> Info in <GetAccessToSrv>: HandShake failed with server 
>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>> Error in <TXNetFile::CreateTXNf>: Access to server failed (0)
>>>>> Error in <Disconnect>: Destroying nonexistent logconnid 0.
>>>>> Error in <DoHandShake>: Error reading 4 bytes from the server 
>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>> Info in <GetAccessToSrv>: HandShake failed with server 
>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>> Error in <TXNetFile::CreateTXNf>: Access to server failed (0)
>>>>> Error in <Disconnect>: Destroying nonexistent logconnid 0. 
>>>>
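To check how many handshake attempts the client made before giving up, one option is to count the "HandShake failed" lines in the client log. The sketch below assumes the message format shown in the excerpt above. The function name is hypothetical, and the pattern tolerates a line break before the bracketed server address, since mail clients wrap these lines.

```python
import re
from collections import Counter

# Matches the "HandShake failed with server [host:port]" message from the
# excerpt; \s* allows the address to sit on a wrapped continuation line.
HANDSHAKE_FAILED = re.compile(r"HandShake failed with server\s*\[([^\]]+)\]")


def count_handshake_failures(log_text):
    """Return a Counter mapping 'host:port' to the number of failed handshakes."""
    return Counter(HANDSHAKE_FAILED.findall(log_text))


# The excerpt from the original report, with the line wrapping undone.
sample = """\
Error in <DoHandShake>: Error reading 4 bytes from the server [rcas6132.rcf.bnl.gov:1095].
Info in <GetAccessToSrv>: HandShake failed with server [rcas6132.rcf.bnl.gov:1095].
Error in <TXNetFile::CreateTXNf>: Access to server failed (0)
Error in <Disconnect>: Destroying nonexistent logconnid 0.
Error in <DoHandShake>: Error reading 4 bytes from the server [rcas6132.rcf.bnl.gov:1095].
Info in <GetAccessToSrv>: HandShake failed with server [rcas6132.rcf.bnl.gov:1095].
Error in <TXNetFile::CreateTXNf>: Access to server failed (0)
Error in <Disconnect>: Destroying nonexistent logconnid 0.
"""

if __name__ == "__main__":
    for server, failures in count_handshake_failures(sample).items():
        print(server, failures)
```

Comparing the count per server with the configured number of attempts shows whether the client really exhausted its retries or stopped earlier.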
>>>>
>>>>
>>>>
>>>>     Thank you,
>>>>
>> 
>