Let's address this from a different angle: in
which version did you "fix" those issues, or know those
"issues" to have been fixed (the send/receive issue as a
start) ??
- xrd Client: is the version compatible with ROOT 4.04.02
production version ??
- Server: which version is it? We can re-deploy (STAR
patches for LFN and PFN support included) and
re-test faster than we can trace infrastructure not under
our control ...
Thank you,
Fabrizio Furano wrote:
> Hi,
>
> Jerome LAURET wrote:
>
>>
>> Yes, it may be an unresponsive server but the end
>> product is a user job crashing according to our users. Some
>> of those connections (later retried) would lead to success.
>> We can switch to level 3 and see, as soon as we can design a
>> large scale test. We also have flukes with authentication
>> in general; I have not reported this yet (I believe it to be
>> LDAP related, as changing the LDAP setup changed failure rates
>> from 50% to 3% ... but I cannot drop below the 3% failures).
>>
>> In general, could someone indicate what is the most
>> reliable setting for Xrootd to retry upon failures (regardless
>> of delays this may cause) ??
>>
>
> That kind of failure has been treated as exceptional and fatal for a
> long time. So the policy was to abort. Newer versions are supposed to
> extend the retry mechanisms also to that circumstance. But I believe
> that it's not your case.
>
> From your answer I understand that the problem does not happen every
> time. The retry mechanism could patch the problem, but the main issue is
> that your server machine seems unable to receive/send the very first
> bytes after the establishment of a connection. The initial idea was that
> a machine which just accepted a connection was supposed to be able to
> handle a transfer of 10-20 bytes, but we were wrong, and fixed that in a
> later release, although this is very hard to reproduce/debug.
>
> The log could help, but since you get the error only a few times, you
> will get an enormous amount of log to document the trouble. If this is
> the case, put the log level to 2 instead of 3.
>
> Personally I am not aware of interactions with LDAP which can interfere
> with the handshake.
>
> Fabrizio
>
>> Thanks,
>>
>> Fabrizio Furano wrote:
>>
>>> Hi Jerome.
>>>
>>> This almost never happens, and makes me think about an unresponsive
>>> server. But it may be caused by weird connection troubles. Do you get
>>> this immediately or in the middle of a communication?
>>>
>>> Since it's a very strange situation, I'd suggest you put the
>>> client side debug level to 3 and send everything to me. Also the
>>> server side log (after having enabled it, of course) will be useful.
>>>
>>> The version included in root4 is rather old, but well known and
>>> tested. I am hoping that everybody will be switching
>>> to the newer one asap.... at least I will no longer be dealing with N
>>> versions of the same code....
>>>
>>> Fabrizio
>>>
>>> Jerome LAURET wrote:
>>>
>>>> Has anyone experienced this kind of issue and, if so, what
>>>> can be done to resolve it ??
>>>>
>>>>> Error in <DoHandShake>: Error reading 4 bytes from the server
>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>> Info in <GetAccessToSrv>: HandShake failed with server
>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>> Error in <TXNetFile::CreateTXNf>: Access to server failed (0)
>>>>> Error in <Disconnect>: Destroying nonexistent logconnid 0.
>>>>> Error in <DoHandShake>: Error reading 4 bytes from the server
>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>> Info in <GetAccessToSrv>: HandShake failed with server
>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>> Error in <TXNetFile::CreateTXNf>: Access to server failed (0)
>>>>> Error in <Disconnect>: Destroying nonexistent logconnid 0.
>>>>
>>>> Thank you,
>>>>
>>
--
,,,,,
( o o )
--m---U---m--
Jerome