Hi,

Unfortunately, the error causing most of our grief in trying to support ATLAS is that connections to the HTTP side of xrootd fail because of an internal authentication issue.
The authentication error seems to happen when a worker within xrootd attempts to reuse an HTTP connection that was previously authenticated using an external x509 cert.

From our xrootd logs:
  step1) An incoming HTTP connection requests keep-alive and is authenticated with an external x509 cert by a worker within xrootd (this completes according to the logs).
  step2) xrootd does other things on other workers.
  step3) A connection is attempted using the worker from step1, which should be authenticated with http.secretkey. This then fails due to an authentication problem.

Using debug logging, I've verified that the rejection occurs within the xrootd code and not in the DPM plugins. This is why tracking the problem down from a DPM perspective took so long: requests from one DPM component never reach the other, which leads to strange errors.

From what I can gather, step3 seems to fail because it tries to use external (untrusted in this context) x509 credentials to perform an action that only trusts internal connections authenticated by http.secretkey.
(Otherwise there is some undiagnosed bug in the authentication within the http(s) handling, and I have no idea how to dig deeply enough to diagnose it.)
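
To make the suspected sequence concrete, here is a toy model of the hypothesis in Python. It is only a sketch of what I think might be happening, not xrootd code: every name in it (Connection, authenticate, internal_action_allowed) is made up for illustration, and the assumption that the credential cached on a kept-alive connection wins over the one presented later is exactly the part I have not been able to confirm.

```python
# Toy model of the hypothesised failure. None of these names exist in xrootd.
from dataclasses import dataclass
from typing import Optional

EXTERNAL_X509 = "external-x509"            # credential from an outside client
INTERNAL_SECRETKEY = "internal-secretkey"  # stands in for http.secretkey auth


@dataclass
class Connection:
    """A kept-alive HTTP connection and the credential cached on it."""
    cached_auth: Optional[str] = None


def authenticate(conn: Connection, presented: str) -> str:
    """Return the credential in effect for this request.

    Assumed bug: once a credential is cached on the connection,
    a credential presented later on the same connection is ignored.
    """
    if conn.cached_auth is None:
        conn.cached_auth = presented
    return conn.cached_auth


def internal_action_allowed(conn: Connection) -> bool:
    """Internal actions trust only secret-key authenticated connections."""
    return conn.cached_auth == INTERNAL_SECRETKEY


# step1: external client authenticates with x509 and asks for keep-alive.
reused = Connection()
authenticate(reused, EXTERNAL_X509)

# step3: the same connection is reused for an internal request.
authenticate(reused, INTERNAL_SECRETKEY)
reused_allowed = internal_action_allowed(reused)

# Control case: a fresh connection authenticated with the secret key.
fresh = Connection()
authenticate(fresh, INTERNAL_SECRETKEY)
fresh_allowed = internal_action_allowed(fresh)

print("reused connection allowed:", reused_allowed)
print("fresh connection allowed:", fresh_allowed)
```

In this model the reused connection is rejected while a fresh one is accepted, which matches the pattern in our logs, but the real cause inside xrootd could of course be something else.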

The above accounts for some 90%+ of the deletion problems at our site.

The 2nd issue (in the same GGUS ticket) that seems to be impacting us is a random issue to do with file descriptors. Since it sits at the level of connection handling, diagnosing it accurately unfortunately goes beyond the manpower/expertise available to us.
It has been suggested that this is due to the glibc malloc, but we observe the same intermittent problem on our headnode, which uses tcmalloc, unless this is a problem that both introduce.
Happy to open an issue about this, but there is no more manpower at the site to track down a rare random bug in something as fundamental as the connection handling.

Both of these errors seem to be strongly mitigated by restarting the service very frequently, even though this is a last resort.

-- 
You are receiving this because you commented.
Reply to this email directly or view it on GitHub:
https://github.com/xrootd/xrootd/issues/1251#issuecomment-666495017
