Hi Heiko,
There is nothing in xrootd or the client that introduces stalls. Usually,
when such things happen, it's introduced by some external factor. I've
seen such stalls in a VM environment when the hypervisor decides you used
up too much resource and just puts the VM to sleep for a while or
migrates it to another more idle place. In a Kubernetes setup containers
can get evicted for a while due to excessive resource usage. Given that
your test is essentially a high-stress all out resource hungry test, I
wouldn't be suprised that even the TCP stack may throttle you because it
sees too high of a packet loss. Finding what causes these stalls is really
hard. The first step is to assess the actual environment you are using.
Andy
On Wed, 4 Apr 2018, Michal Kamil Simon wrote:
> Hi Heiko,
>
> After a quick look at the debug output, from the client perspective it
> looks ok, the client sent 4 parallel read requests (default configuration)
> and is waiting for the response from the server. If would have left it running
> for bit more the client would hit a stream timeout and reset the connection.
>
> Cheers,
> Michal
> ________________________________________
> From: [log in to unmask] [[log in to unmask]] on behalf of Heiko Schröter [[log in to unmask]]
> Sent: 04 April 2018 15:01
> To: [log in to unmask]
> Subject: xrdcp 4.8.1 stalls
>
> Hello all,
>
> i'am new to xrootd and evaluate a test setup in a 10GBit network.
>
> During a looped copy job of a 1GB large test file xrootd 4.8.1 stalls
> every 4 to 7 copy jobs.
>
> The copying is done between RAM-Network-RAM to exclude disk i/o.
> Sometimes xrootd comes back, sometimes the copy job has to be killed.
> Restarting the xrootd job reverts everything to normal until the stall
> reappears.
>
> I would expect xrootd not to stall even under such circumstances. But i
> agree this is a somewhat artificial usecase.
>
>
> Best
>
> Heiko
>
>
> xrootd.cf, Ver 4.8.1:
>
> all.export /xrootd
>
> set xrdr=REDIRECTOR
> set inventory=/var/log/xrootd/inventory
> all.manager $(xrdr):3121
>
> if $(xrdr) && named cns
> all.export $(inventory)
> xrd.port 1095
> else if $(xrdr)
> all.role manager
> ofs.forward 3way $(xrdr):1095 mv rm rmdir trunc
> xrd.port 1094
> else
> all.role server
> ofs.notify closew create mkdir mv rm rmdir trunc |
> /usr/bin/XrdCnsd -d -D 2 -i 90 -b $(xrdr):1095:$(inventory)
> ofs.notifymsg create $TID create $FMODE $LFN?$CGI
> ofs.notifymsg closew $TID closew $LFN $FSIZE
> fi
>
>
> The brute force test:
>
> for ((i=0;i<=100;i++));do rm -f /mnt/ramdisk/test.dat; xrdcp -d 3 -v
> root://REDIRECTOR//xrootd/test.dat /mnt/ramdisk/test.dat; rm -f
> /mnt/ramdisk/test.dat; sleep 1; done
>
>
> xrdcp debug output:
>
> [2018-04-04 14:42:59.221301 +0200][Debug ][File ]
> [0x24e27a0@file://localhost/mnt/ramdisk/test.dat?oss.asize=1073741824]
> Sending a write command for handle 0xb to localhost
> [2018-04-04 14:42:59.228163 +0200][Dump ][Utility ] URL:
> file://localhost/mnt/ramdisk/test.dat?oss.asize=1073741824
> [2018-04-04 14:42:59.228163 +0200][Dump ][Utility ]
> Protocol: file
> [2018-04-04 14:42:59.228163 +0200][Dump ][Utility ] User Name:
> [2018-04-04 14:42:59.228163 +0200][Dump ][Utility ] Password:
> [2018-04-04 14:42:59.228163 +0200][Dump ][Utility ] Host
> Name: localhost
> [2018-04-04 14:42:59.228163 +0200][Dump ][Utility ]
> Port: 1094
> [2018-04-04 14:42:59.228163 +0200][Dump ][Utility ]
> Path: /mnt/ramdisk/test.dat
> [2018-04-04 14:42:59.228229 +0200][Debug ][File ]
> [0x24dd0d0@root://REDIRECTOR:1094//xrootd/test.dat] Sending a read
> command for handle 0x0 to 192.168.16.120:1094
> [2018-04-04 14:42:59.228233 +0200][Dump ][File ]
> [0x24e27a0@file://localhost/mnt/ramdisk/test.dat?oss.asize=1073741824]
> Got state response for message kXR_write (handle: 0x0b000000, offset:
> 503316480, size: 16777216)
> [2018-04-04 14:42:59.228254 +0200][Dump ][XRootD ]
> [192.168.16.120:1094] Sending message kXR_read (handle: 0x00000000,
> offset: 570425344, size: 16777216)
> [2018-04-04 14:42:59.228272 +0200][Dump ][PostMaster ]
> [192.168.16.120:1094 #0] Sending message kXR_read (handle: 0x00000000,
> offset: 570425344, size: 16777216) (0x24dd9e0) through substream 0
> expecting answer at 0
> [2018-04-04 14:42:59.228305 +0200][Dump ][AsyncSock ]
> [192.168.16.120:1094 #0.0] Wrote a message: kXR_read (handle:
> 0x00000000, offset: 570425344, size: 16777216) (0x24dd9e0), 32 bytes
> [2018-04-04 14:42:59.228329 +0200][Dump ][AsyncSock ]
> [192.168.16.120:1094 #0.0] Successfully sent message: kXR_read (handle:
> 0x00000000, offset: 570425344, size: 16777216) (0x24dd9e0).
> [2018-04-04 14:42:59.228340 +0200][Dump ][XRootD ]
> [192.168.16.120:1094] Message kXR_read (handle: 0x00000000, offset:
> 570425344, size: 16777216) has been successfully sent.
> [2018-04-04 14:42:59.228353 +0200][Dump ][PostMaster ]
> [192.168.16.120:1094 #0.0] All messages consumed, disable uplink
> [2018-04-04 14:42:59.750894 +0200][Dump ][TaskMgr ] Running
> task: "FileTimer task"
> [2018-04-04 14:42:59.750934 +0200][Dump ][TaskMgr ] Will
> rerun task "FileTimer task" at [2018-04-04 14:43:14 +0200]
> [2018-04-04 14:43:13.464015 +0200][Dump ][XRootDTransport ]
> [REDIRECTOR:1094 #0.0] Stream inactive since 15 seconds, TTL: 1200,
> allocated SIDs: 0, open files: 0
> [2018-04-04 14:43:13.464039 +0200][Dump ][XRootDTransport ]
> [REDIRECTOR:1094 #0.0] Stream inactive since 15 seconds, stream timeout:
> 60, allocated SIDs: 0, wait barrier: 2018-04-04 14:42:58 +0200
> [2018-04-04 14:43:13.751694 +0200][Dump ][TaskMgr ] Running
> task: "TickGeneratorTask for: REDIRECTOR:1094"
> [2018-04-04 14:43:13.751737 +0200][Dump ][TaskMgr ] Will
> rerun task "TickGeneratorTask for: REDIRECTOR:1094" at [2018-04-04
> 14:43:28 +0200]
> [2018-04-04 14:43:13.751753 +0200][Dump ][TaskMgr ] Running
> task: "TickGeneratorTask for: 192.168.16.120:1094"
> [2018-04-04 14:43:13.751764 +0200][Dump ][TaskMgr ] Will
> rerun task "TickGeneratorTask for: 192.168.16.120:1094" at [2018-04-04
> 14:43:28 +0200]
> [2018-04-04 14:43:14.751830 +0200][Dump ][TaskMgr ] Running
> task: "FileTimer task"
> [2018-04-04 14:43:14.751849 +0200][Dump ][TaskMgr ] Will
> rerun task "FileTimer task" at [2018-04-04 14:43:29 +0200]
> [2018-04-04 14:43:28.752586 +0200][Dump ][TaskMgr ] Running
> task: "TickGeneratorTask for: REDIRECTOR:1094"
> [2018-04-04 14:43:28.752656 +0200][Dump ][TaskMgr ] Will
> rerun task "TickGeneratorTask for: REDIRECTOR:1094" at [2018-04-04
> 14:43:43 +0200]
> [2018-04-04 14:43:28.752691 +0200][Dump ][TaskMgr ] Running
> task: "TickGeneratorTask for: 192.168.16.120:1094"
> [2018-04-04 14:43:28.752727 +0200][Dump ][TaskMgr ] Will
> rerun task "TickGeneratorTask for: 192.168.16.120:1094" at [2018-04-04
> 14:43:43 +0200]
> [2018-04-04 14:43:28.785950 +0200][Dump ][XRootDTransport ]
> [REDIRECTOR:1094 #0.0] Stream inactive since 30 seconds, TTL: 1200,
> allocated SIDs: 0, open files: 0
> [2018-04-04 14:43:28.786026 +0200][Dump ][XRootDTransport ]
> [REDIRECTOR:1094 #0.0] Stream inactive since 30 seconds, stream timeout:
> 60, allocated SIDs: 0, wait barrier: 2018-04-04 14:42:58 +0200
> [2018-04-04 14:43:29.752822 +0200][Dump ][TaskMgr ] Running
> task: "FileTimer task"
> [2018-04-04 14:43:29.752892 +0200][Dump ][TaskMgr ] Will
> rerun task "FileTimer task" at [2018-04-04 14:43:44 +0200]
> [2018-04-04 14:43:40.846051 +0200][Dump ][XRootDTransport ]
> [192.168.16.120:1094 #0.0] Stream inactive since 15 seconds, TTL: 300,
> allocated SIDs: 4, open files: 1
> [2018-04-04 14:43:40.846125 +0200][Dump ][XRootDTransport ]
> [192.168.16.120:1094 #0.0] Stream inactive since 15 seconds, stream
> timeout: 60, allocated SIDs: 4, wait barrier: 2018-04-04 14:42:59 +0200
> [2018-04-04 14:43:43.753676 +0200][Dump ][TaskMgr ] Running
> task: "TickGeneratorTask for: REDIRECTOR:1094"
> [2018-04-04 14:43:43.753760 +0200][Dump ][TaskMgr ] Will
> rerun task "TickGeneratorTask for: REDIRECTOR:1094" at [2018-04-04
> 14:43:58 +0200]
> [2018-04-04 14:43:43.753775 +0200][Dump ][TaskMgr ] Running
> task: "TickGeneratorTask for: 192.168.16.120:1094"
> [2018-04-04 14:43:43.753786 +0200][Dump ][TaskMgr ] Will
> rerun task "TickGeneratorTask for: 192.168.16.120:1094" at [2018-04-04
> 14:43:58 +0200]
> [2018-04-04 14:43:43.854343 +0200][Dump ][XRootDTransport ]
> [REDIRECTOR:1094 #0.0] Stream inactive since 45 seconds, TTL: 1200,
> allocated SIDs: 0, open files: 0
> [2018-04-04 14:43:43.854399 +0200][Dump ][XRootDTransport ]
> [REDIRECTOR:1094 #0.0] Stream inactive since 45 seconds, stream timeout:
> 60, allocated SIDs: 0, wait barrier: 2018-04-04 14:42:58 +0200
> [2018-04-04 14:43:44.753880 +0200][Dump ][TaskMgr ] Running
> task: "FileTimer task"
> [2018-04-04 14:43:44.753958 +0200][Dump ][TaskMgr ] Will
> rerun task "FileTimer task" at [2018-04-04 14:43:59 +0200]
>
>
>
> --
> -----------------------------------------------------------------------
> Heiko Schröter
> Institute of Environmental Physics (IUP) phone: ++49-(0)421-218-62092
> Institute of Remote Sensing (IFE) fax: ++49-(0)421-218-62070
> University of Bremen (FB1)
> P.O. Box 330440 email: [log in to unmask]
> Otto-Hahn-Allee 1
> 28359 Bremen
> Germany
> -----------------------------------------------------------------------
>
> ########################################################################
> Use REPLY-ALL to reply to list
>
> To unsubscribe from the XROOTD-L list, click the following link:
> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
>
> ########################################################################
> Use REPLY-ALL to reply to list
>
> To unsubscribe from the XROOTD-L list, click the following link:
> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
>
########################################################################
Use REPLY-ALL to reply to list
To unsubscribe from the XROOTD-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
|