As per the discussion on #1251, there is a connection bug in xrootd impacting storage at ECDF.

GGUS: https://ggus.eu/index.php?mode=ticket_info&ticket_id=146771

The key symptom of the bug is that xrootd reaches a state in which a large and growing percentage of connections fail with a connection reset.

We observe that this bug occurs far less often since we deactivated an external TCP probe of the xrootd service.
(Others run functional tests against the service; as a site admin I only care whether xrootd has crashed.)

The TCP probe in question runs on an external host and is the tcp probe within blackbox_exporter: https://github.com/prometheus/blackbox_exporter
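
For anyone trying to reproduce this without blackbox_exporter, the following is a minimal sketch of a connect-and-close TCP check in Python. It is an approximation of what a plain TCP probe does and is not blackbox_exporter's actual implementation; the function name `tcp_probe` and the default timeout are assumptions made for illustration.

```python
import socket


def tcp_probe(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened.

    The connection is closed straight away and no data is sent.
    This only approximates a plain TCP probe; it is not the
    blackbox_exporter code.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Calling `tcp_probe("localhost", 1095)` in a loop against a test instance may help to check whether repeated bare connects are enough to trigger the problem.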

(Aside:
The failing xrootd service is configured as part of DPM and has both xrootd and HTTP enabled at the same time. The xrootd protocol is used to serve data, and the HTTP side parses/sends DPM-DOME commands.)

After some period of time the problem presents itself such that a curl against the service receives a connection reset or an empty reply rather than a valid response:

[root@pool7 ~]# curl -vvvk localhost:1095
* About to connect() to localhost port 1095 (#0)
* Trying ::1...
* Connected to localhost (::1) port 1095 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost:1095
> Accept: */*
>
* Recv failure: Connection reset by peer
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer
[root@pool7 ~]# curl -vvvk localhost:1095
* About to connect() to localhost port 1095 (#0)
* Trying ::1...
* Connected to localhost (::1) port 1095 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost:1095
> Accept: */*
>
* Empty reply from server
* Connection #0 to host localhost left intact
curl: (52) Empty reply from server
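
The same check can be scripted so that the outcomes are counted over time rather than read by eye. The sketch below sends a bare HTTP GET, as curl does above, and classifies the result; the function name `check_http` and the outcome labels are assumptions chosen for illustration, not part of xrootd or DPM.

```python
import socket


def check_http(host, port, timeout=5.0):
    """Send a minimal HTTP GET and classify what comes back.

    Returns one of: "reply", "empty", "reset", "refused",
    "timeout", "error". The labels are illustrative only.
    """
    request = (
        f"GET / HTTP/1.1\r\n"
        f"Host: {host}:{port}\r\n"
        f"Accept: */*\r\n"
        f"Connection: close\r\n\r\n"
    ).encode("ascii")
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            data = sock.recv(4096)
    except ConnectionResetError:
        return "reset"      # curl reports this as error 56
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"
    # An orderly close with no bytes is what curl reports as error 52.
    return "reply" if data else "empty"
```

Running `check_http("localhost", 1095)` periodically and logging the result would show how the share of "reset" and "empty" outcomes changes as the service degrades.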

In the xrootd logs the error appears as follows:

200727 18:38:30 4359 XrdAccept: Unable to allocate new link for srm.glite.ecdf.ed.ac.uk; cannot allocate memory
200727 18:38:47 8213 XrdLink: attempt to reuse active link
200727 18:38:47 8213 XrdAccept: Unable to allocate new link for srm.glite.ecdf.ed.ac.uk; cannot allocate memory
200727 18:39:10 8225 cms_Finder: Waiting for cms path /var/spool/xrootd/dpmdisk/.olb/olbd.admin
200727 18:39:31 8213 XrdLink: attempt to reuse active link
200727 18:39:31 8213 XrdAccept: Unable to allocate new link for srm.glite.ecdf.ed.ac.uk; cannot allocate memory
200727 18:39:48 8213 XrdLink: attempt to reuse active link
200727 18:39:48 8213 XrdAccept: Unable to allocate new link for srm.glite.ecdf.ed.ac.uk; cannot allocate memory
200727 18:40:20 8225 cms_Finder: Waiting for cms path /var/spool/xrootd/dpmdisk/.olb/olbd.admin
200727 18:40:49 8813 XrdLink: attempt to reuse active link
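
To see how often these two messages occur, the log lines can be tallied with a few lines of Python. This is a rough sketch that matches only the two message strings shown above; the function name `count_link_errors` and the counter keys are assumptions made for illustration.

```python
from collections import Counter

# Substrings taken from the log excerpt above; the keys are illustrative.
PATTERNS = {
    "alloc_failure": "Unable to allocate new link",
    "link_reuse": "attempt to reuse active link",
}


def count_link_errors(lines):
    """Count occurrences of each known error message in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for name, needle in PATTERNS.items():
            if needle in line:
                counts[name] += 1
    return counts
```

Passing an open log file to `count_link_errors` returns the totals for each message. The log file location depends on the local xrootd configuration.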

It has been suggested that this could be a malloc-related problem. However, both machines have several GB of free RAM and no anomalously high load that would suggest RAM being constantly allocated and deallocated.

This has been observed on a system using glibc malloc as well as on a system running xrootd with tcmalloc. Switching to jemalloc might fix it, but that has not been tested. The problem has been observed on 3 of our newer CentOS7 storage deployments.

We've mitigated this for the time being by disabling the TCP probe described above. Since then the xrootd services on our storage nodes have not needed a restart for a few days.
