Let's address this from a different angle: in which version did you
"fix" those issues, or know those "issues" to have been fixed (the
send/receive issue as a start) ??

- xrd client: is the version compatible with the ROOT 4.04.02
  production version ??
- Server: which version is it ??
- We can re-deploy (STAR patches for LFN and PFN support included)
  and re-test faster than we can trace infrastructure not under our
  control ...

Thank you,

Fabrizio Furano wrote:
> Hi,
>
> Jerome LAURET wrote:
>
>> Yes, it may be an unresponsive server, but the end
>> product is a user job crashing, according to our users. Some
>> of those connections (later retried) would lead to success.
>> We can switch to level 3 and see as soon as we can design a
>> large scale test: we also have flukes with authentication
>> in general. I have not reported this yet (I believe it to be
>> LDAP related, as changing the LDAP setup changed failure rates
>> from 50% to 3% ... but I cannot drop below the 3% failures).
>>
>> In general, could someone indicate what is the most
>> reliable setting for Xrootd to retry upon failures (regardless
>> of delays this may cause) ??
>>
>
> That kind of failure has been treated as exceptional and fatal for a
> long time. So the policy was to abort. Newer versions are supposed to
> extend the retry mechanisms also to that circumstance. But I believe
> that it's not your case.
>
> From your answer I understand that the problem does not happen every
> time. The retry mechanism could patch the problem, but the main issue is
> that your server machine seems unable to receive/send the very first
> bytes after the establishment of a connection. The initial idea was that
> a machine which just accepted a connection was supposed to be able to
> handle a transfer of 10-20 bytes, but we were wrong, and fixed that in a
> later release. This is very hard to reproduce/debug, though.
>
> The log could help, but since you get the error only a few times, you
> will get an enormous amount of log to document the trouble. If this is
> the case, put the log level to 2 instead of 3.
>
> Personally I am not aware of interactions with LDAP which can interfere
> with the handshake.
>
> Fabrizio
>
>> Thanks,
>>
>> Fabrizio Furano wrote:
>>
>>> Hi Jerome.
>>>
>>> This almost never happens, and makes me think about an unresponsive
>>> server. But it may be caused by weird connection troubles. Do you get
>>> this immediately or in the middle of a communication?
>>>
>>> Since it's a very strange situation, I'd suggest you put the
>>> client side debug level to 3 and send everything to me. Also the
>>> server side log (after having enabled it, of course) will be useful.
>>>
>>> The version included in root4 is rather old, but well known and
>>> tested. I am hoping that everybody will be switching
>>> to the newer one asap.... at least I will no longer be dealing with N
>>> versions of the same code....
>>>
>>> Fabrizio
>>>
>>> Jerome LAURET wrote:
>>>
>>>> Has anyone experienced this kind of issue and, if so, what
>>>> to do to resolve it ??
>>>>
>>>>> Error in <DoHandShake>: Error reading 4 bytes from the server
>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>> Info in <GetAccessToSrv>: HandShake failed with server
>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>> Error in <TXNetFile::CreateTXNf>: Access to server failed (0)
>>>>> Error in <Disconnect>: Destroying nonexistent logconnid 0.
>>>>> Error in <DoHandShake>: Error reading 4 bytes from the server
>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>> Info in <GetAccessToSrv>: HandShake failed with server
>>>>> [rcas6132.rcf.bnl.gov:1095].
>>>>> Error in <TXNetFile::CreateTXNf>: Access to server failed (0)
>>>>> Error in <Disconnect>: Destroying nonexistent logconnid 0.
>>>>
>>>> Thank you,
>>>>
>> --
              ,,,,,
             ( o o )
          --m---U---m--
              Jerome