Print

Print


While running XrdTpc in production in EOSLHCB we've noticed some strange behavior during HTTP TPC transfers. We are running the following configuration, as an example:

[root@st-048-e99ba873 ~]# uname -a
Linux st-048-e99ba873.cern.ch 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@st-048-e99ba873 ~]# rpm -qa | grep "curl\|nss\|sssd"
collectd-curl-5.8.1-1.el7.cern8.x86_64
libcurl-7.29.0-59.el7_9.1.x86_64
libsss_nss_idmap-1.16.5-10.el7_9.7.x86_64
sssd-common-pac-1.16.5-10.el7_9.7.x86_64
nss-tools-3.53.1-3.el7_9.x86_64
curl-7.29.0-59.el7_9.1.x86_64
nss-softokn-3.53.1-6.el7_9.x86_64
sssd-ldap-1.16.5-10.el7_9.7.x86_64
sssd-proxy-1.16.5-10.el7_9.7.x86_64
sssd-1.16.5-10.el7_9.7.x86_64
nss-pem-1.0.3-7.el7.x86_64
nss-softokn-freebl-3.53.1-6.el7_9.x86_64
nss-util-3.53.1-1.el7_9.x86_64
sssd-common-1.16.5-10.el7_9.7.x86_64
sssd-ipa-1.16.5-10.el7_9.7.x86_64
openssh-7.4p1-21.el7.x86_64
python-sssdconfig-1.16.5-10.el7_9.7.noarch
jansson-2.10-1.el7.x86_64
sssd-client-1.16.5-10.el7_9.7.x86_64
sssd-ad-1.16.5-10.el7_9.7.x86_64
openssh-server-7.4p1-21.el7.x86_64
nss-3.53.1-3.el7_9.x86_64
openssh-clients-7.4p1-21.el7.x86_64
openssl-libs-1.0.2k-21.el7_9.x86_64
nss-sysinit-3.53.1-3.el7_9.x86_64
sssd-krb5-1.16.5-10.el7_9.7.x86_64
python-pycurl-7.19.0-19.el7.x86_64
openssl-1.0.2k-21.el7_9.x86_64
sssd-krb5-common-1.16.5-10.el7_9.7.x86_64

So this is the latest CentOS7 with all the packages up to date.
We see that sometimes, curl fails to connect throwing the following error:

[root@st-048-e99ba873 ~]# curl -v --capath /etc/grid-security/certificates/ https://eoslhcb.cern.ch
* About to connect() to eoslhcb.cern.ch port 443 (#0)
*   Trying 2001:1458:301:62::100:5...
* Connected to eoslhcb.cern.ch (2001:1458:301:62::100:5) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: /etc/grid-security/certificates/
* NSS error -5938 (PR_END_OF_FILE_ERROR)
* Encountered end of file
* Closing connection 0
curl: (35) Encountered end of file

This is not something 100% reproducible, sometimes it works, sometimes it doesn't. I suspect it is related to the load on the machine and the pressure on the nss subsystem.

The exact same command ran a few seconds later works as expected:

[root@st-048-e99ba873 ~]# curl -v --capath /etc/grid-security/certificates/ https://eoslhcb.cern.ch
* About to connect() to eoslhcb.cern.ch port 443 (#0)
*   Trying 2001:1458:301:62::100:5...
* Connected to eoslhcb.cern.ch (2001:1458:301:62::100:5) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: /etc/grid-security/certificates/
* NSS: client certificate not found (nickname not specified)
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
* 	subject: CN=eoslhcb-qdb-52bedb7c30.cern.ch,OU=computers,DC=cern,DC=ch
* 	start date: Mar 13 10:48:02 2021 GMT
* 	expire date: Apr 17 10:48:02 2022 GMT
* 	common name: eoslhcb-qdb-52bedb7c30.cern.ch
* 	issuer: CN=CERN Grid Certification Authority,DC=cern,DC=ch
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: eoslhcb.cern.ch:443
> Accept: */*
>
< HTTP/1.1 403 token authorization failed
< Connection: Keep-Alive
< Content-Length: 26
<

* Connection #0 to host eoslhcb.cern.ch left intact

This shows up the FTS logs as the following error:

ERR     Tue, 16 Mar 2021 15:19:25 +0100; /builddir/build/BUILD/fts-3.10.1/src/url-copy/UrlCopyProcess.cpp; run; 599; Recoverable error: [5] [gfalt_copy_file][perform_copy] TRANSFER  [gfal_http_copy] ERROR: Copy failed with mode 3rd pull, with error: [davix2gliberr] Transfer failed: failure: SSL connect error

and in the XRootD logs as the following:

210316 15:19:25 10981 TPC_PullRequest: event=TRANSFER_FAIL, local=/eos/lhcb/grid/user/lhcb/user/c/chaen/finalCTA/CERN-CTA-TEST_CERN-USER_40.TPC.test, remote=https://eosctalhcb.cern.ch:8444/eos/ctalhcb/archivetest/lhcb/user/c/chaen/finalCTA/CERN-CTA-TEST_CERN-USER_40.TPC.test, user=(anonymous), bytes_transferred=0; HTTP library failure: SSL connect error

I read a bit about this type of error coming from NSS but there seems to be no good solution for this. Most of the time an upgrade of both curl and sssd are suggest but also this is not a bullet proof solution, apparently. I tried various things like clearing the sssd cache, restarting the daemon but nothing seems to help.
Has anyone been confronted with this? This is starting to affect 5-10% of the transfers on busy disk nodes in our instances.

Thanks,
Elvin


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or unsubscribe.

[ { "@context": "http://schema.org", "@type": "EmailMessage", "potentialAction": { "@type": "ViewAction", "target": "https://github.com/xrootd/xrootd/issues/1428", "url": "https://github.com/xrootd/xrootd/issues/1428", "name": "View Issue" }, "description": "View this Issue on GitHub", "publisher": { "@type": "Organization", "name": "GitHub", "url": "https://github.com" } } ]

Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-DEV list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1