Dear experts,

We are observing issues between xrootd (both redirector and disk node) and FTS servers like these:

220712 09:41:26 10649 XrdLink: Unable to send to 7dc42260.5721:[log in to unmask]; connection timed out
220712 09:41:26 10649 XrootdXeq: 7dc42260.5721:[log in to unmask] disc 2:12:21 (socket error)

which cause most of our transfers to fail (failures outnumber successes by a factor of 10 to 100).

On the FTS side we see this:

INFO    Tue, 12 Jul 2022 10:14:02 +0200; [1657613642182] DEST http_plugin	CHECKSUM:ENTER	
WARNING Tue, 12 Jul 2022 10:29:52 +0200; Timeout stopped
ERR     Tue, 12 Jul 2022 10:29:52 +0200; Recoverable error: [112] DESTINATION CHECKSUM (Neon): Could not read status line: Connection timed out
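
As a quick check on how long the checksum phase ran before the timeout, here is a minimal Python sketch that subtracts the two timestamps shown in the FTS log lines above. The variable names are illustrative only; the timestamps are copied verbatim from the log.

```python
from datetime import datetime

# Timestamp layout used by the FTS log lines above.
FMT = "%a, %d %b %Y %H:%M:%S %z"

# Copied from the CHECKSUM:ENTER line and the "Timeout stopped" line.
checksum_enter = datetime.strptime("Tue, 12 Jul 2022 10:14:02 +0200", FMT)
timeout_hit = datetime.strptime("Tue, 12 Jul 2022 10:29:52 +0200", FMT)

elapsed = timeout_hit - checksum_enter
minutes, seconds = divmod(int(elapsed.total_seconds()), 60)
print(f"checksum phase lasted {minutes} min {seconds} s")
```

For this log excerpt the gap is 15 minutes and 50 seconds.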

In short: the transfer succeeds (we can see the files on disk), but the checksum step always times out.
We've tried many things to make it work, including a custom plugin for the checksum calculation.
We do not observe huge delays in the checksum calculation - nothing that would explain 🟥 15 minute delays 🟥!
Both source and destination use davs:// for copy → HTTPS via XROOTD path on our side.
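
For anyone wanting to cross-check the checksum timing independently of xrootd, the following is a rough Python sketch that times a checksum over a file. It rests on assumptions not stated in this report: that the checksum type is Adler-32, and that the transferred file is readable through a local path (for example via a mount of the storage). The function name `adler32_of_file` and the chunk size are hypothetical choices, not part of xrootd or xrootd-hdfs.

```python
import time
import zlib


def adler32_of_file(path, chunk_size=4 * 1024 * 1024):
    """Return (hex checksum, seconds taken) for the file at `path`.

    Assumption: Adler-32 is the checksum type being requested.
    """
    value = 1  # Adler-32 starting value
    start = time.monotonic()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        while chunk := handle.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}", time.monotonic() - start
```

Calling `adler32_of_file(path)` on one of the transferred files gives a baseline duration to compare against the timeout seen on the FTS side.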

Logs

FTS log example:

Full xrootd.log:

Versions, config and operations

We are running both redirector (xrootd.phy.bris.ac.uk:1094) and disk server (io-37-02.acrc.bris.ac.uk:1194) via Docker with --net=host.
FTS servers are reachable from within the containers (and host) via IPv4 and IPv6.

Our config can be found on https://github.com/BristolComputing/xrootd-se/tree/main/etc/xrootd (clustered + config.d).

Installed xrootd versions and plugins:

| libmacaroons.x86_64       | 0.3.0-2.el7           | epel        |
| scitokens-cpp.x86_64      | 0.7.1-1.el7           | epel        |
| voms.x86_64               | 2.1.0-0.24.rc2.el7    | epel        |
| xrootd.x86_64             | 1:5.4.3-1.1.osg36.el7 | osg         |
| xrootd-client.x86_64      | 1:5.4.3-1.1.osg36.el7 | osg         |
| xrootd-client-libs.x86_64 | 1:5.4.3-1.1.osg36.el7 | osg         |
| xrootd-cmstfc.x86_64      | 1.5.2-6.osg36.el7     | osg-contrib |
| xrootd-lcmaps.x86_64      | 99-1.osg36.el7        | osg         |
| xrootd-libs.x86_64        | 1:5.4.3-1.1.osg36.el7 | osg         |
| xrootd-scitokens.x86_64   | 1:5.4.3-1.1.osg36.el7 | osg         |
| xrootd-selinux.noarch     | 1:5.4.3-1.1.osg36.el7 | osg         |
| xrootd-server.x86_64      | 1:5.4.3-1.1.osg36.el7 | osg         |
| xrootd-server-libs.x86_64 | 1:5.4.3-1.1.osg36.el7 | osg         |
| xrootd-voms.x86_64        | 1:5.4.3-1.1.osg36.el7 | osg         |

xrootd-hdfs

Hadoop 3.3.1
Source code repository https://github.com/apache/hadoop.git -r a3b9c37a397ad4188041dd80621bdeefc46885f2
Compiled by ubuntu on 2021-06-15T05:13Z
Compiled with protoc 3.7.1
From source with checksum 88a4ddb2299aca054416d6b7f81ca55
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-3.3.1.jar

xrootd-hdfs 2.2.0
Source code repository https://github.com/uobdic/xrootd-hdfs.git -b kreczko-checksum-debug -r 66d7c97
Compiled by CentOS Linux release 7.9.2009 (Core) on 2022-07-08T13:59Z

Other monitoring

Failures

image

Successes

image

99% of the failures are due to the mentioned timeout. Please note the different scales for the y-axis.


GitHub issue: https://github.com/xrootd/xrootd/issues/1736
