Hi,

  I just checked the code in ROOT 4.04-02 and confirmed that the 
retry code is there. It should work.

  In fact, even from your plain error log, it seems that the client 
already tries more than once to connect/handshake/disconnect.

  A log at debug level 2 would be more descriptive. Beyond that, the 
only thing you can tune is the parameter:

XNet.TryConnectServersList

  This is the max number of "first connect" failures before TXNetFile 
gives up.

  What I don't understand is this: you should already see about 240 
retries, since 240 is the default value (about 2400 seconds in total). 
Do you see that many messages before the failure?
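
  To make the arithmetic concrete, here is a minimal sketch of a bounded 
"first connect" retry loop. It is not the TXNetFile code: the function 
names are hypothetical, and the 10 second delay per failed attempt is 
only an assumption, inferred from 240 retries taking about 2400 seconds.

```python
# Hypothetical sketch of a bounded "first connect" retry loop.
# NOT the TXNetFile implementation: names and the 10 s delay are assumptions.

def connect_with_retries(try_connect, max_failures=240, delay_s=10.0, sleep=None):
    """Call try_connect() until it succeeds or max_failures is reached.

    Returns (connected, failures, seconds_waited).
    Pass sleep=time.sleep to really wait between attempts.
    """
    failures = 0
    waited = 0.0
    while failures < max_failures:
        if try_connect():
            return True, failures, waited
        failures += 1
        if sleep is not None:
            sleep(delay_s)  # pause before the next attempt
        waited += delay_s   # account for the delay even in a dry run
    return False, failures, waited


def worst_case_wait(max_failures=240, delay_s=10.0):
    """Total time spent before giving up if every attempt fails."""
    return max_failures * delay_s
```

  Under these assumptions the default of 240 corresponds to 
worst_case_wait() = 2400 seconds, so raising the limit only lengthens 
the wait against a server that never answers.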

  Note: Making the client retry for hours or days will not solve the 
basic problem, i.e. the unresponsive server. So there is no guarantee 
that you are not simply moving the problem elsewhere. I don't know if 
this is preferable to having an aborted job.

Fabrizio

Jerome LAURET wrote:
> 
>     Yes, it may be an unresponsive server, but the end
> product is a user job crashing, according to our users. Some
> of those connections (if retried later) would lead to success.
> We can switch to level 3 and look as soon as we can design a
> large-scale test. We also have flukes with authentication
> in general, which I have not reported yet (I believe them to be
> LDAP related, as changing the LDAP setup changed failure rates
> from 50% to 3% ... but I cannot drop below the 3% failures).
> 
>     In general, could someone indicate the most
> reliable setting for Xrootd to retry upon failures (regardless
> of the delays this may cause)?
> 
>     Thanks,
> 
> Fabrizio Furano wrote:
> 
>> Hi Jerome.
>>
>>  This almost never happens, and it makes me suspect an unresponsive 
>> server. But it may also be caused by weird connection troubles. Do 
>> you get this immediately or in the middle of a communication?
>>
>>  Since it's a very strange situation, I'd suggest you set the 
>> client-side debug level to 3 and send everything to me. The 
>> server-side log (after enabling it, of course) will also be useful.
>>
>>  The version included in root4 is rather old, but well known and 
>> tested. I am hoping that everybody will switch to the newer one 
>> asap ... at least I will no longer be dealing with N versions of 
>> the same code ...
>>
>> Fabrizio
>>
>>
>>
>> Jerome LAURET wrote:
>>
>>>     Has anyone experienced this kind of issue and, if so, what
>>> can be done to resolve it?
>>>
>>>> Error in <DoHandShake>: Error reading 4 bytes from the server 
>>>> [rcas6132.rcf.bnl.gov:1095].
>>>> Info in <GetAccessToSrv>: HandShake failed with server 
>>>> [rcas6132.rcf.bnl.gov:1095].
>>>> Error in <TXNetFile::CreateTXNf>: Access to server failed (0)
>>>> Error in <Disconnect>: Destroying nonexistent logconnid 0.
>>>> Error in <DoHandShake>: Error reading 4 bytes from the server 
>>>> [rcas6132.rcf.bnl.gov:1095].
>>>> Info in <GetAccessToSrv>: HandShake failed with server 
>>>> [rcas6132.rcf.bnl.gov:1095].
>>>> Error in <TXNetFile::CreateTXNf>: Access to server failed (0)
>>>> Error in <Disconnect>: Destroying nonexistent logconnid 0. 
>>>
>>>
>>>
>>>
>>>     Thank you,
>>>
>