Hi Rob,

I don't understand step 1, as DPM works differently. External https/davs clients go to Apache on port 443. The Apache plugin then contacts dome (an xrootd daemon with plugins) to get metadata, but this is just internal gymnastics.

Could you please clarify the workflow of a failing client? Is someone sending DAV clients to port 1094? That would be wrong.

Thanks,
Fabrizio

On 30/07/20 18:10, Robert Currie wrote:
> Hi,
>
> Unfortunately the actual error causing most of the grief for us in
> trying to support ATLAS is that connections to the http side of xrootd
> fail due to an internal authentication issue.
> The authn error seems to happen when a worker within xrootd attempts to
> re-use an http connection which has previously been authenticated using
> an external x509 cert.
>
> From our xrootd logs:
> step1) Incoming http connection requests a keep-alive and is
> authenticated using external x509 by a worker within xroot (this
> completes according to the logs)
> step2) xroot does other things on other workers
> step3) Connection is attempted using the worker from step1, which should be
> authenticated with http.secretkey. This then fails due to an authn problem.
>
> Using debug logging I've verified that the rejection is occurring within
> the xrootd code and not the dpm plugins, which is why tracking this down
> from a DPM perspective took so long: requests from one DPM component
> never reach the other, leading to strange errors.
>
> From what I can gather, step3 seems to be failing due to trying to use
> external (untrusted in this context) x509 credentials to perform an
> action that only trusts internal connections authenticated by
> http.secretkey.
> (Otherwise there is some undiagnosed bug in the authentication within
> http(s) handling and I've no idea how to dig deeply enough to diagnose
> this).
>
> The above accounts for some 90%+ of the deletion problems at our site.
>
> The 2nd issue (in the same GGUS ticket) which seems to be impacting us is
> also a random issue to do with file descriptors which, given that it is at
> the level of connection handling, unfortunately goes beyond the
> manpower/expertise available to diagnose it accurately.
> It has been suggested this is due to the glibc malloc, but we observe the
> same intermittent problem on our headnode using tcmalloc, unless this is
> potentially a problem introduced by both.
> Happy to open an issue about this, but when it comes to diagnosing a rare
> random bug in the connection handling there is no more manpower at the
> site to track down something so fundamental.
>
> Both of these errors seem to be strongly mitigated by restarting the
> service very frequently, even though this is an extreme last resort.

--
Reply to this email directly or view it on GitHub:
https://github.com/xrootd/xrootd/issues/1251#issuecomment-666507556
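
One quick way to check the question raised above, whether any DAV client is being sent to port 1094, is to classify the endpoint URLs that clients are configured with. The following is a minimal sketch only: `classify_endpoint`, its return labels and the example hostname are hypothetical, and the port roles (443 for Apache, 1094 for the xrootd daemon) are taken from this thread rather than from any DPM or xrootd API.

```python
from urllib.parse import urlsplit

# Assumptions taken from this thread, not from any DPM/xrootd API:
# Apache serves external https/davs clients on 443, and 1094 is the
# xrootd port that DAV clients should not be pointed at.
APACHE_PORT = 443
XROOTD_PORT = 1094
TLS_HTTP_SCHEMES = {"https", "davs"}
HTTP_SCHEMES = TLS_HTTP_SCHEMES | {"http", "dav"}


def classify_endpoint(url):
    """Return a hypothetical label saying where an HTTP/DAV URL points."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme not in HTTP_SCHEMES:
        return "not-http"
    port = parts.port
    # With no explicit port, assume https/davs clients default to 443.
    if port is None and scheme in TLS_HTTP_SCHEMES:
        port = APACHE_PORT
    if port == XROOTD_PORT:
        return "xrootd-port"
    if port == APACHE_PORT:
        return "apache"
    return "other"


if __name__ == "__main__":
    for url in (
        "davs://dpm.example.org/dpm/example.org/home/atlas/file",
        "https://dpm.example.org:1094/dpm/example.org/home/atlas/file",
    ):
        print(url, "->", classify_endpoint(url))
```

Any URL labelled `xrootd-port` would be a client of the kind asked about above; everything labelled `apache` follows the expected DPM path.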