Hi,

We're seeing this problem on a DPM system, but only with the xrootd protocol on one particular server, so I think this list can offer more help.

One (and only one) of our disk pool nodes is currently in a very upset state, with transfers to xrootd clients dying fairly predictably, and rapidly, after transferring only a small amount of data. The load on the disk server (in terms of attempted connections) is quite high - but throttling transfers and other methods of shaping traffic seem not to affect the failures. (We can max out the network bandwidth and/or the HBA backplane bandwidth - but throttling to prevent this does not prevent the errors.)

The observed phenomena are:

1) on the clients - xrdcopy times out after sitting for a long time (most of which, when straced, is spent waiting for data from the server, with intermittent transfers of small amounts of data)

2) on the server - which has xrootd 4.6.1 installed (other disk servers running this release are perfectly happy right at this moment)

  - the log shows (e.g.):
171026 14:09:01 118522 XrdLink: Unable to send to pilatlas.6607:[log in to unmask]; broken pipe
171026 14:09:01 118522 XrootdXeq: pilatlas.6607:[log in to unmask] disc 0:02:01 (send failure)

 - on stracing a particular xrootd process (lots of context included here around the actual failure):
14:21:53.963641 poll([{fd=515, events=POLLIN|POLLRDNORM}], 1, 3000) = 1 ([{fd=515, revents=POLLIN|POLLRDNORM}])
14:21:53.963664 recvfrom(515, "/dpm/gla.scotgrid.ac.uk/home/atl"..., 349, 0, NULL, NULL) = 349
14:21:53.963704 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3661, ...}) = 0
14:21:53.963748 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3661, ...}) = 0
14:21:53.963807 fstat(32705, {st_mode=S_IFREG|0664, st_size=5416767812, ...}) = 0
14:21:53.963835 writev(515, [{"\3\0\0\0\0\0\0\4", 8}, {"\2\0\0\0", 4}], 2) = 12
14:21:53.963867 poll([{fd=515, events=POLLIN|POLLRDNORM}], 1, 3000) = 1 ([{fd=515, revents=POLLIN|POLLRDNORM}])
14:21:53.963994 recvfrom(515, "\4\0\v\305\2\0\0\0\0\0\0\0P\0\0\0\1\0\0\0\0\0\0\10", 24, 0, NULL, NULL) = 24
14:21:53.964019 poll([{fd=515, events=POLLIN|POLLRDNORM}], 1, 3000) = 1 ([{fd=515, revents=POLLIN|POLLRDNORM}])
14:21:53.964042 recvfrom(515, "\0\0\0\0\0\0\0\0", 8, 0, NULL, NULL) = 8
14:21:53.964065 setsockopt(515, SOL_TCP, TCP_CORK, [1], 4) = 0
14:21:53.964089 write(515, "\4\0\0\0\1\0\0\0", 8) = 8
14:21:53.964114 sendfile(515, 32705, [1342177280], 16777216) = 363504
14:23:55.002571 sendfile(515, 32705, [1342904288], 16413712) = -1 EPIPE (Broken pipe)
14:23:55.097654 gettid()                = 118522
14:23:55.097711 writev(2, [{"171026 14:23:55 118522 ", 23}, {"Xrd", 3}, {"Link", 4}, {": Unable to ", 12}, {"send file to", 12}, {" ", 1}, {"prdatlas.4349:[log in to unmask]"..., 41}, {"; ", 2}, {"broken pipe", 11}, {"\n", 1}], 10) = 110
14:23:55.097829 gettid()                = 118522
14:23:55.097856 writev(2, [{"171026 14:23:55 118522 ", 23}, {"Xrootd", 6}, {"Xeq", 3}, {": ", 2}, {"prdatlas.4349:[log in to unmask]"..., 41}, {" ", 1}, {"disc", 4}, {" ", 1}, {"0:02:02 (sendfile failure)", 26}, {"\n", 1}], 10) = 108
14:23:55.097962 close(515)              = 0
14:23:55.098082 futex(0x610878, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
14:23:57.660446 poll([{fd=170, events=POLLIN|POLLRDNORM}], 1, 3000) = 1 ([{fd=170, revents=POLLIN|POLLRDNORM}])
14:23:57.660517 recvfrom(170, "\10\0\v\305\0\0\0\0\0\0\0\0\1\360D\350\0\0\1\227\0\0\0\10", 24, 0, NULL, NULL) = 24
14:23:57.660556 poll([{fd=170, events=POLLIN|POLLRDNORM}], 1, 3000) = 1 ([{fd=170, revents=POLLIN|POLLRDNORM}])
14:23:57.660585 recvfrom(170, "\0\0\0\0\0\0\0\0", 8, 0, NULL, NULL) = 8
14:23:57.660621 pread(32644, "\0\0\1\227\3\354\0\0\4YZb\362L\0\203\0\33\0\0\0\0\1\360D\350\0\0\0\0\0\0"..., 407, 32523496) = 407
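
To illustrate what we think that strace is showing (the first sendfile() returns a short count, the next one blocks for roughly two minutes and then fails with EPIPE once the client has given up and closed its end), here is a minimal standalone sketch - not xrootd code, and the port and file sizes are purely illustrative assumptions - that reproduces the same sendfile()/broken-pipe pattern with a slow client:

import os
import socket
import threading
import time

DATA_PATH = "/tmp/sendfile_demo.dat"   # illustrative scratch file
PORT = 10940                           # illustrative port, not the real xrootd port
FILE_SIZE = 32 * 1024 * 1024

def server():
    # Write a source file large enough that sendfile() cannot finish in one call.
    with open(DATA_PATH, "wb") as f:
        f.write(b"\0" * FILE_SIZE)

    srv = socket.create_server(("127.0.0.1", PORT))
    conn, _ = srv.accept()
    fd = os.open(DATA_PATH, os.O_RDONLY)
    offset = 0
    try:
        while offset < FILE_SIZE:
            # Same pattern as the xrootd thread above: repeated sendfile() calls,
            # each asking for up to 16 MB and advancing the offset by what was sent.
            sent = os.sendfile(conn.fileno(), fd, offset, 16 * 1024 * 1024)
            if sent == 0:
                break
            offset += sent
    except (BrokenPipeError, ConnectionResetError) as err:
        # The analogue of "sendfile(...) = -1 EPIPE (Broken pipe)" and the
        # "Unable to send file to ...; broken pipe" line in the xrootd log.
        print("sendfile failed after", offset, "bytes:", err)
    finally:
        os.close(fd)
        conn.close()
        srv.close()

def slow_client():
    c = socket.create_connection(("127.0.0.1", PORT))
    c.recv(4096)    # read a little, like a stalled xrdcopy...
    time.sleep(2)   # ...then sit idle while the server blocks in sendfile()
    c.close()       # give up without draining the socket; the server sees a broken pipe

if __name__ == "__main__":
    t = threading.Thread(target=server)
    t.start()
    time.sleep(0.5)  # let the server reach accept()
    slow_client()
    t.join()
    os.unlink(DATA_PATH)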


Restarting xrootd services does not help. (We have also observed a large number of CLOSE_WAIT connections accumulating - up to 500+ - and what seemed to be a spontaneous restart of the xrootd service itself.)
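
For reference, one rough way to watch that CLOSE_WAIT build-up is to count the sockets in that state from /proc/net/tcp; the sketch below assumes the default xrootd data port 1094 (adjust if your site uses another port) and ignores IPv6 (/proc/net/tcp6 would need the same treatment):

# Rough sketch: count CLOSE_WAIT sockets on the assumed xrootd port by
# reading /proc/net/tcp (Linux). State 0x08 is TCP_CLOSE_WAIT.
XROOTD_PORT = 1094   # assumption - the default xrootd data port
CLOSE_WAIT = 0x08

def count_close_wait(port=XROOTD_PORT, proc_file="/proc/net/tcp"):
    count = 0
    with open(proc_file) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            local_addr, state = fields[1], int(fields[3], 16)
            local_port = int(local_addr.split(":")[1], 16)
            if local_port == port and state == CLOSE_WAIT:
                count += 1
    return count

if __name__ == "__main__":
    print("CLOSE_WAIT connections on port %d: %d" % (XROOTD_PORT, count_close_wait()))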

Any assistance would be appreciated.

Sam Skipsey
UKI-SCOTGRID-GLASGOW

