Hi Heiko,

There is nothing in xrootd or the client that introduces stalls. Usually, when such things happen, the cause is some external factor. I've seen such stalls in a VM environment when the hypervisor decides you have used up too many resources and just puts the VM to sleep for a while or migrates it to another, more idle host. In a Kubernetes setup, containers can get evicted for a while due to excessive resource usage. Given that your test is essentially a high-stress, all-out, resource-hungry test, I wouldn't be surprised if even the TCP stack throttles you because it sees too much packet loss. Finding what causes these stalls is really hard. The first step is to assess the actual environment you are using.

Andy

On Wed, 4 Apr 2018, Michal Kamil Simon wrote:

> Hi Heiko,
>
> After a quick look at the debug output, from the client perspective it
> looks OK: the client sent 4 parallel read requests (default configuration)
> and is waiting for the response from the server. If you had left it running
> a bit longer, the client would have hit a stream timeout and reset the
> connection.
>
> Cheers,
> Michal
> ________________________________________
> From: [log in to unmask] [[log in to unmask]] on behalf of Heiko Schröter [[log in to unmask]]
> Sent: 04 April 2018 15:01
> To: [log in to unmask]
> Subject: xrdcp 4.8.1 stalls
>
> Hello all,
>
> I'm new to xrootd and am evaluating a test setup in a 10 Gbit network.
>
> During a looped copy job of a 1 GB test file, xrootd 4.8.1 stalls
> every 4 to 7 copy jobs.
>
> The copying is done RAM-network-RAM to exclude disk I/O.
> Sometimes xrootd comes back, sometimes the copy job has to be killed.
> Restarting the xrootd job reverts everything to normal until the stall
> reappears.
>
> I would expect xrootd not to stall even under such circumstances. But I
> agree this is a somewhat artificial use case.
>
>
> Best
>
> Heiko
>
>
> xrootd.cf, Ver 4.8.1:
>
> all.export /xrootd
>
> set xrdr=REDIRECTOR
> set inventory=/var/log/xrootd/inventory
> all.manager $(xrdr):3121
>
> if $(xrdr) && named cns
> all.export $(inventory)
> xrd.port 1095
> else if $(xrdr)
> all.role manager
> ofs.forward 3way $(xrdr):1095 mv rm rmdir trunc
> xrd.port 1094
> else
> all.role server
> ofs.notify closew create mkdir mv rm rmdir trunc |
> /usr/bin/XrdCnsd -d -D 2 -i 90 -b $(xrdr):1095:$(inventory)
> ofs.notifymsg create $TID create $FMODE $LFN?$CGI
> ofs.notifymsg closew $TID closew $LFN $FSIZE
> fi
>
>
> The brute force test:
>
> for ((i=0;i<=100;i++));do rm -f /mnt/ramdisk/test.dat; xrdcp -d 3 -v
> root://REDIRECTOR//xrootd/test.dat /mnt/ramdisk/test.dat; rm -f
> /mnt/ramdisk/test.dat; sleep 1; done
>
>
> xrdcp debug output:
>
> [2018-04-04 14:42:59.221301 +0200][Debug ][File ]
> [0x24e27a0@file://localhost/mnt/ramdisk/test.dat?oss.asize=1073741824]
> Sending a write command for handle 0xb to localhost
> [2018-04-04 14:42:59.228163 +0200][Dump ][Utility ] URL:
> file://localhost/mnt/ramdisk/test.dat?oss.asize=1073741824
> [2018-04-04 14:42:59.228163 +0200][Dump ][Utility ]
> Protocol: file
> [2018-04-04 14:42:59.228163 +0200][Dump ][Utility ] User Name:
> [2018-04-04 14:42:59.228163 +0200][Dump ][Utility ] Password:
> [2018-04-04 14:42:59.228163 +0200][Dump ][Utility ] Host
> Name: localhost
> [2018-04-04 14:42:59.228163 +0200][Dump ][Utility ]
> Port: 1094
> [2018-04-04 14:42:59.228163 +0200][Dump ][Utility ]
> Path: /mnt/ramdisk/test.dat
> [2018-04-04 14:42:59.228229 +0200][Debug ][File ]
> [0x24dd0d0@root://REDIRECTOR:1094//xrootd/test.dat] Sending a read
> command for handle 0x0 to 192.168.16.120:1094
> [2018-04-04 14:42:59.228233 +0200][Dump ][File ]
> [0x24e27a0@file://localhost/mnt/ramdisk/test.dat?oss.asize=1073741824]
> Got state response for message kXR_write (handle: 0x0b000000, offset:
> 503316480, size: 16777216)
> [2018-04-04 14:42:59.228254 +0200][Dump ][XRootD ]
> [192.168.16.120:1094] Sending message kXR_read (handle: 0x00000000,
> offset: 570425344, size: 16777216)
> [2018-04-04 14:42:59.228272 +0200][Dump ][PostMaster ]
> [192.168.16.120:1094 #0] Sending message kXR_read (handle: 0x00000000,
> offset: 570425344, size: 16777216) (0x24dd9e0) through substream 0
> expecting answer at 0
> [2018-04-04 14:42:59.228305 +0200][Dump ][AsyncSock ]
> [192.168.16.120:1094 #0.0] Wrote a message: kXR_read (handle:
> 0x00000000, offset: 570425344, size: 16777216) (0x24dd9e0), 32 bytes
> [2018-04-04 14:42:59.228329 +0200][Dump ][AsyncSock ]
> [192.168.16.120:1094 #0.0] Successfully sent message: kXR_read (handle:
> 0x00000000, offset: 570425344, size: 16777216) (0x24dd9e0).
> [2018-04-04 14:42:59.228340 +0200][Dump ][XRootD ]
> [192.168.16.120:1094] Message kXR_read (handle: 0x00000000, offset:
> 570425344, size: 16777216) has been successfully sent.
> [2018-04-04 14:42:59.228353 +0200][Dump ][PostMaster ]
> [192.168.16.120:1094 #0.0] All messages consumed, disable uplink
> [2018-04-04 14:42:59.750894 +0200][Dump ][TaskMgr ] Running
> task: "FileTimer task"
> [2018-04-04 14:42:59.750934 +0200][Dump ][TaskMgr ] Will
> rerun task "FileTimer task" at [2018-04-04 14:43:14 +0200]
> [2018-04-04 14:43:13.464015 +0200][Dump ][XRootDTransport ]
> [REDIRECTOR:1094 #0.0] Stream inactive since 15 seconds, TTL: 1200,
> allocated SIDs: 0, open files: 0
> [2018-04-04 14:43:13.464039 +0200][Dump ][XRootDTransport ]
> [REDIRECTOR:1094 #0.0] Stream inactive since 15 seconds, stream timeout:
> 60, allocated SIDs: 0, wait barrier: 2018-04-04 14:42:58 +0200
> [2018-04-04 14:43:13.751694 +0200][Dump ][TaskMgr ] Running
> task: "TickGeneratorTask for: REDIRECTOR:1094"
> [2018-04-04 14:43:13.751737 +0200][Dump ][TaskMgr ] Will
> rerun task "TickGeneratorTask for: REDIRECTOR:1094" at [2018-04-04
> 14:43:28 +0200]
> [2018-04-04 14:43:13.751753 +0200][Dump ][TaskMgr ] Running
> task: "TickGeneratorTask for: 192.168.16.120:1094"
> [2018-04-04 14:43:13.751764 +0200][Dump ][TaskMgr ] Will
> rerun task "TickGeneratorTask for: 192.168.16.120:1094" at [2018-04-04
> 14:43:28 +0200]
> [2018-04-04 14:43:14.751830 +0200][Dump ][TaskMgr ] Running
> task: "FileTimer task"
> [2018-04-04 14:43:14.751849 +0200][Dump ][TaskMgr ] Will
> rerun task "FileTimer task" at [2018-04-04 14:43:29 +0200]
> [2018-04-04 14:43:28.752586 +0200][Dump ][TaskMgr ] Running
> task: "TickGeneratorTask for: REDIRECTOR:1094"
> [2018-04-04 14:43:28.752656 +0200][Dump ][TaskMgr ] Will
> rerun task "TickGeneratorTask for: REDIRECTOR:1094" at [2018-04-04
> 14:43:43 +0200]
> [2018-04-04 14:43:28.752691 +0200][Dump ][TaskMgr ] Running
> task: "TickGeneratorTask for: 192.168.16.120:1094"
> [2018-04-04 14:43:28.752727 +0200][Dump ][TaskMgr ] Will
> rerun task "TickGeneratorTask for: 192.168.16.120:1094" at [2018-04-04
> 14:43:43 +0200]
> [2018-04-04 14:43:28.785950 +0200][Dump ][XRootDTransport ]
> [REDIRECTOR:1094 #0.0] Stream inactive since 30 seconds, TTL: 1200,
> allocated SIDs: 0, open files: 0
> [2018-04-04 14:43:28.786026 +0200][Dump ][XRootDTransport ]
> [REDIRECTOR:1094 #0.0] Stream inactive since 30 seconds, stream timeout:
> 60, allocated SIDs: 0, wait barrier: 2018-04-04 14:42:58 +0200
> [2018-04-04 14:43:29.752822 +0200][Dump ][TaskMgr ] Running
> task: "FileTimer task"
> [2018-04-04 14:43:29.752892 +0200][Dump ][TaskMgr ] Will
> rerun task "FileTimer task" at [2018-04-04 14:43:44 +0200]
> [2018-04-04 14:43:40.846051 +0200][Dump ][XRootDTransport ]
> [192.168.16.120:1094 #0.0] Stream inactive since 15 seconds, TTL: 300,
> allocated SIDs: 4, open files: 1
> [2018-04-04 14:43:40.846125 +0200][Dump ][XRootDTransport ]
> [192.168.16.120:1094 #0.0] Stream inactive since 15 seconds, stream
> timeout: 60, allocated SIDs: 4, wait barrier: 2018-04-04 14:42:59 +0200
> [2018-04-04 14:43:43.753676 +0200][Dump ][TaskMgr ] Running
> task: "TickGeneratorTask for: REDIRECTOR:1094"
> [2018-04-04 14:43:43.753760 +0200][Dump ][TaskMgr ] Will
> rerun task "TickGeneratorTask for: REDIRECTOR:1094" at [2018-04-04
> 14:43:58 +0200]
> [2018-04-04 14:43:43.753775 +0200][Dump ][TaskMgr ] Running
> task: "TickGeneratorTask for: 192.168.16.120:1094"
> [2018-04-04 14:43:43.753786 +0200][Dump ][TaskMgr ] Will
> rerun task "TickGeneratorTask for: 192.168.16.120:1094" at [2018-04-04
> 14:43:58 +0200]
> [2018-04-04 14:43:43.854343 +0200][Dump ][XRootDTransport ]
> [REDIRECTOR:1094 #0.0] Stream inactive since 45 seconds, TTL: 1200,
> allocated SIDs: 0, open files: 0
> [2018-04-04 14:43:43.854399 +0200][Dump ][XRootDTransport ]
> [REDIRECTOR:1094 #0.0] Stream inactive since 45 seconds, stream timeout:
> 60, allocated SIDs: 0, wait barrier: 2018-04-04 14:42:58 +0200
> [2018-04-04 14:43:44.753880 +0200][Dump ][TaskMgr ] Running
> task: "FileTimer task"
> [2018-04-04 14:43:44.753958 +0200][Dump ][TaskMgr ] Will
> rerun task "FileTimer task" at [2018-04-04 14:43:59 +0200]
>
>
>
> --
> -----------------------------------------------------------------------
> Heiko Schröter
> Institute of Environmental Physics (IUP)   phone: ++49-(0)421-218-62092
> Institute of Remote Sensing (IFE)          fax:   ++49-(0)421-218-62070
> University of Bremen (FB1)
> P.O. Box 330440                            email: [log in to unmask]
> Otto-Hahn-Allee 1
> 28359 Bremen
> Germany
> -----------------------------------------------------------------------
>
> ########################################################################
> Use REPLY-ALL to reply to list
>
> To unsubscribe from the XROOTD-L list, click the following link:
> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
> ########################################################################
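
For anyone trying to pin down when such a stall begins, here is a minimal sketch that scans saved xrdcp debug output (as produced with -d 3 in the test above) for pauses between consecutive log timestamps. It is not part of xrootd: the function name find_gaps, the SAMPLE constant, and the 5-second threshold are assumptions made for illustration, and the two sample entries are taken from the debug output quoted above, rejoined onto single lines.

```python
import re
from datetime import datetime

# Leading timestamp of an xrdcp debug line, e.g.
# [2018-04-04 14:42:59.221301 +0200]
# Bracketed times without microseconds (the "Will rerun task ... at [...]"
# targets) are deliberately not matched.
TS_RE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6} [+-]\d{4})\]"
)


def find_gaps(log_text, threshold_s=5.0):
    """Return (previous, current, gap_seconds) for each pair of consecutive
    log timestamps that are more than threshold_s seconds apart."""
    stamps = [
        datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f %z")
        for m in TS_RE.finditer(log_text)
    ]
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        delta = (cur - prev).total_seconds()
        if delta > threshold_s:
            gaps.append((prev, cur, delta))
    return gaps


# Two entries from the debug output in this thread, joined onto one line each.
SAMPLE = (
    '[2018-04-04 14:42:59.750934 +0200][Dump ][TaskMgr ] Will rerun task '
    '"FileTimer task" at [2018-04-04 14:43:14 +0200]\n'
    '[2018-04-04 14:43:13.464015 +0200][Dump ][XRootDTransport ] '
    '[REDIRECTOR:1094 #0.0] Stream inactive since 15 seconds, TTL: 1200, '
    'allocated SIDs: 0, open files: 0\n'
)

if __name__ == "__main__":
    for prev, cur, delta in find_gaps(SAMPLE):
        print(f"{delta:.3f} s of silence between {prev} and {cur}")
```

A gap only shows when the client went quiet, not why. Comparing the reported times against hypervisor, container, or network logs from the same period would be the next step in assessing the environment.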