Sorry for the similarly late response.
I believe this is a 4.8.4 client with a 4.9.1 server. I had disregarded that line because we see it at the start of every transfer, even successful ones, so I assumed it was a warning
that could be ignored. I may be completely wrong on this, however!
I’m no longer particularly convinced these failures are indicative of a problem with the XRootD server. As Andy pointed out to me, a broken pipe send failure in the logs can be caused
by something as simple as the client being killed during the transfer.
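To illustrate that point, here is a minimal sketch (not XRootD code) of how a sender ends up with a broken pipe once its peer has gone away. It assumes a Unix-like system and uses a local socket pair; closing one end stands in for the client process being killed, and the function name is purely illustrative.

```python
import socket


def send_after_peer_exit() -> str:
    """Send on a connected socket whose peer has already gone away."""
    # A connected pair of local sockets: 'server' plays the sending side,
    # 'client' plays the job that is reading the file.
    server, client = socket.socketpair()
    try:
        # Closing the client end stands in for the client being killed.
        client.close()
        try:
            server.sendall(b"x" * 65536)
        except BrokenPipeError:
            # EPIPE: the same condition a server would log as "broken pipe".
            return "broken pipe"
        except ConnectionResetError:
            # Some platforms report the vanished peer as a reset instead.
            return "connection reset"
        return "sent"
    finally:
        server.close()


if __name__ == "__main__":
    print(send_after_peer_exit())
```

Nothing is wrong on the sending side in this sketch; the error only reflects that the receiver disappeared mid-transfer.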
Unfortunately the failing jobs are user analysis jobs, so the failures surface as somewhat cryptic (to me anyway!) experiment framework errors, and because they are entirely transient
we haven’t been able to reproduce them. We’re working with the VO to try to isolate the issue better.
Hopefully I’ll be able to update you all with the resolution, or more useful error messages as we get a better handle on the issue.
Sorry for the late response. From the logs it looks like a post-4.9.x client is trying to delegate
a credential to a pre-4.9.x server. Could you check what are the versions of the server and
[log in to unmask] [[log in to unmask]] on
behalf of Thomas Byrne - UKRI STFC [[log in to unmask]]
Sent: 25 July 2019 12:02
To: [log in to unmask]
Subject: Help understanding xrootd broken pipe failures
Sorry for the vague nature of this email. I’m having significant trouble understanding and tracking down the root cause of this problem, so any help or advice would be appreciated!
We are experiencing a large number of ‘broken pipe’ events causing job failures, mainly on files opened for streaming. For example:
190724 14:04:36 25445 secgsi_ServerDoCert: no signed DH parameters from client:tlhcb005.201:[log in to unmask] : will not delegate x509 proxy to it
190724 14:04:36 25445 XrootdXeq: tlhcb005.201:[log in to unmask] pvt IPv4 login as lhcbuser
190724 14:07:12 29851 XrdLink: Unable to send to tlhcb005.201:[log in to unmask]; broken pipe
190724 14:07:12 29851 XrootdXeq: tlhcb005.201:[log in to unmask] disc 0:02:37 (send failure)
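As a rough aid for tracking how often this happens, the sketch below pulls the "send failure" disconnects out of a log and reports how long each session lasted. The line layout (date, time, thread id, source, message) and the `disc H:MM:SS (reason)` form are inferred only from the four lines above, so treat the patterns and the function name as assumptions rather than a description of the XRootD log format.

```python
import re
from datetime import timedelta

# Layout assumed from the excerpt: "YYMMDD HH:MM:SS <tid> <source>: <message>"
LINE = re.compile(
    r"^(?P<date>\d{6}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<tid>\d+) "
    r"(?P<src>\w+): (?P<msg>.*)$"
)
# Disconnect message assumed to be "<client> disc H:MM:SS (<reason>)";
# the client id may contain spaces (e.g. a masked address).
DISC = re.compile(
    r"^(?P<who>.+?) disc (?P<h>\d+):(?P<m>\d{2}):(?P<s>\d{2})"
    r"(?: \((?P<reason>[^)]*)\))?$"
)


def send_failures(lines):
    """Return (date, time, client, session_length) for each send-failure disconnect."""
    found = []
    for line in lines:
        m = LINE.match(line.strip())
        if not m or m["src"] != "XrootdXeq":
            continue
        d = DISC.match(m["msg"])
        if d and d["reason"] == "send failure":
            length = timedelta(
                hours=int(d["h"]), minutes=int(d["m"]), seconds=int(d["s"])
            )
            found.append((m["date"], m["time"], d["who"], length))
    return found


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py < server.log
    for date, time, who, length in send_failures(sys.stdin):
        print(date, time, who, int(length.total_seconds()), "s")
```

Comparing the session lengths and timestamps against the job start and kill times on the batch side may help show whether the disconnects line up with clients being terminated.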
We are seeing this on XRootD server containers on our worker nodes, with jobs running on the same worker node. The file may be coming from the local disk cache, or from another local XRootD server container which
is talking to a Ceph cluster via libxrdceph.
Has anyone seen this before? Given the specific error message I am assuming that there is some communication issue, but I was wondering if anyone could shed some light on what exactly is breaking here, or how I
can get more detail about the issue (debugging settings etc.).
Storage System Administrator
Scientific Computing Department
Science and Technology Facilities Council
Rutherford Appleton Laboratory