Hi Heiko,

There is nothing in xrootd or the client that introduces stalls. Usually,
when such things happen, they are caused by some external factor. I've
seen such stalls in a VM environment when the hypervisor decides you have
used too many resources and either puts the VM to sleep for a while or
migrates it to a less busy host. In a Kubernetes setup, containers
can get evicted for a while due to excessive resource usage. Given that
your test is essentially a high-stress, all-out, resource-hungry test, I
wouldn't be surprised if even the TCP stack throttled you because it
sees too much packet loss. Finding what causes these stalls is really
hard. The first step is to assess the actual environment you are using.
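
One way to start that assessment is to record, for every copy, how long it took and how many TCP segments the kernel retransmitted in the meantime. The following is only a minimal sketch, assuming a Linux host that exposes /proc/net/snmp. The names retrans_segs, timed_run and SNMP_FILE are made up for illustration and are not part of xrootd. The counter is system-wide, so other traffic on the host is included in the delta.

```shell
#!/usr/bin/env bash
# Sketch: print the copy duration and the TCP retransmit delta per run,
# so that a stall can be correlated with packet loss on the host.

# Print the cumulative RetransSegs counter from a /proc/net/snmp-style file.
# The file holds a "Tcp:" header line followed by a "Tcp:" value line.
retrans_segs() {
    local snmp_file="${1:-${SNMP_FILE:-/proc/net/snmp}}"
    awk '
        $1 == "Tcp:" && !have_header {
            # Header line: remember which column is RetransSegs.
            for (i = 2; i <= NF; i++) if ($i == "RetransSegs") col = i
            have_header = 1
            next
        }
        $1 == "Tcp:" && col { print $col; exit }
    ' "$snmp_file"
}

# Run the given command, then print the elapsed wall-clock seconds and the
# number of segments retransmitted system-wide while it was running.
timed_run() {
    local before after start end
    before=$(retrans_segs)
    start=$(date +%s)
    "$@"
    end=$(date +%s)
    after=$(retrans_segs)
    echo "elapsed=$((end - start))s retrans_delta=$((after - before))"
}

# Possible use inside the test loop (not executed here):
#   timed_run xrdcp -d 3 -v root://REDIRECTOR//xrootd/test.dat /mnt/ramdisk/test.dat
```

If the slow iterations show a large retrans_delta, the network path is the first suspect. If the delta stays near zero during a stall, it is more useful to look at the host itself (hypervisor, scheduler, memory pressure).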

Andy

On Wed, 4 Apr 2018, Michal Kamil Simon wrote:

> Hi Heiko,
>
> After a quick look at the debug output, from the client's perspective it
> looks OK: the client sent 4 parallel read requests (the default configuration)
> and is waiting for the response from the server. If you had left it running
> a bit longer, the client would have hit a stream timeout and reset the connection.
>
> Cheers,
> Michal
> ________________________________________
> From: [log in to unmask] [[log in to unmask]] on behalf of Heiko Schröter [[log in to unmask]]
> Sent: 04 April 2018 15:01
> To: [log in to unmask]
> Subject: xrdcp 4.8.1 stalls
>
> Hello all,
>
> I am new to xrootd and am evaluating a test setup in a 10 Gbit network.
>
> During a looped copy job of a 1 GB test file, xrootd 4.8.1 stalls
> every 4 to 7 copy jobs.
>
> The copy goes from RAM over the network to RAM, to exclude disk I/O.
> Sometimes xrootd comes back, sometimes the copy job has to be killed.
> Restarting the xrootd job reverts everything to normal until the stall
> reappears.
>
> I would expect xrootd not to stall even under such circumstances. But I
> agree this is a somewhat artificial use case.
>
>
> Best
>
> Heiko
>
>
> xrootd.cf, Ver 4.8.1:
>
> all.export /xrootd
>
> set xrdr=REDIRECTOR
> set inventory=/var/log/xrootd/inventory
> all.manager $(xrdr):3121
>
> if $(xrdr) && named cns
>       all.export $(inventory)
>       xrd.port 1095
> else if $(xrdr)
>       all.role manager
>       ofs.forward 3way $(xrdr):1095 mv rm rmdir trunc
>       xrd.port 1094
> else
>       all.role server
>       ofs.notify closew create mkdir mv rm rmdir trunc | /usr/bin/XrdCnsd -d -D 2 -i 90 -b $(xrdr):1095:$(inventory)
>       ofs.notifymsg create $TID create $FMODE $LFN?$CGI
>       ofs.notifymsg closew $TID closew $LFN $FSIZE
> fi
>
>
> The brute force test:
>
> for ((i=0;i<=100;i++)); do
>     rm -f /mnt/ramdisk/test.dat
>     xrdcp -d 3 -v root://REDIRECTOR//xrootd/test.dat /mnt/ramdisk/test.dat
>     rm -f /mnt/ramdisk/test.dat
>     sleep 1
> done
>
>
> xrdcp debug output:
>
> [2018-04-04 14:42:59.221301 +0200][Debug  ][File              ]
> [0x24e27a0@file://localhost/mnt/ramdisk/test.dat?oss.asize=1073741824]
> Sending a write command for handle 0xb to localhost
> [2018-04-04 14:42:59.228163 +0200][Dump   ][Utility           ] URL:
> file://localhost/mnt/ramdisk/test.dat?oss.asize=1073741824
> [2018-04-04 14:42:59.228163 +0200][Dump   ][Utility           ]
> Protocol:  file
> [2018-04-04 14:42:59.228163 +0200][Dump   ][Utility           ] User Name:
> [2018-04-04 14:42:59.228163 +0200][Dump   ][Utility           ] Password:
> [2018-04-04 14:42:59.228163 +0200][Dump   ][Utility           ] Host
> Name: localhost
> [2018-04-04 14:42:59.228163 +0200][Dump   ][Utility           ]
> Port:      1094
> [2018-04-04 14:42:59.228163 +0200][Dump   ][Utility           ]
> Path:      /mnt/ramdisk/test.dat
> [2018-04-04 14:42:59.228229 +0200][Debug  ][File              ]
> [0x24dd0d0@root://REDIRECTOR:1094//xrootd/test.dat] Sending a read
> command for handle 0x0 to 192.168.16.120:1094
> [2018-04-04 14:42:59.228233 +0200][Dump   ][File              ]
> [0x24e27a0@file://localhost/mnt/ramdisk/test.dat?oss.asize=1073741824]
> Got state response for message kXR_write (handle: 0x0b000000, offset:
> 503316480, size: 16777216)
> [2018-04-04 14:42:59.228254 +0200][Dump   ][XRootD            ]
> [192.168.16.120:1094] Sending message kXR_read (handle: 0x00000000,
> offset: 570425344, size: 16777216)
> [2018-04-04 14:42:59.228272 +0200][Dump   ][PostMaster        ]
> [192.168.16.120:1094 #0] Sending message kXR_read (handle: 0x00000000,
> offset: 570425344, size: 16777216) (0x24dd9e0) through substream 0
> expecting answer at 0
> [2018-04-04 14:42:59.228305 +0200][Dump   ][AsyncSock         ]
> [192.168.16.120:1094 #0.0] Wrote a message: kXR_read (handle:
> 0x00000000, offset: 570425344, size: 16777216) (0x24dd9e0), 32 bytes
> [2018-04-04 14:42:59.228329 +0200][Dump   ][AsyncSock         ]
> [192.168.16.120:1094 #0.0] Successfully sent message: kXR_read (handle:
> 0x00000000, offset: 570425344, size: 16777216) (0x24dd9e0).
> [2018-04-04 14:42:59.228340 +0200][Dump   ][XRootD            ]
> [192.168.16.120:1094] Message kXR_read (handle: 0x00000000, offset:
> 570425344, size: 16777216) has been successfully sent.
> [2018-04-04 14:42:59.228353 +0200][Dump   ][PostMaster        ]
> [192.168.16.120:1094 #0.0] All messages consumed, disable uplink
> [2018-04-04 14:42:59.750894 +0200][Dump   ][TaskMgr           ] Running
> task: "FileTimer task"
> [2018-04-04 14:42:59.750934 +0200][Dump   ][TaskMgr           ] Will
> rerun task "FileTimer task" at [2018-04-04 14:43:14 +0200]
> [2018-04-04 14:43:13.464015 +0200][Dump   ][XRootDTransport   ]
> [REDIRECTOR:1094 #0.0] Stream inactive since 15 seconds, TTL: 1200,
> allocated SIDs: 0, open files: 0
> [2018-04-04 14:43:13.464039 +0200][Dump   ][XRootDTransport   ]
> [REDIRECTOR:1094 #0.0] Stream inactive since 15 seconds, stream timeout:
> 60, allocated SIDs: 0, wait barrier: 2018-04-04 14:42:58 +0200
> [2018-04-04 14:43:13.751694 +0200][Dump   ][TaskMgr           ] Running
> task: "TickGeneratorTask for: REDIRECTOR:1094"
> [2018-04-04 14:43:13.751737 +0200][Dump   ][TaskMgr           ] Will
> rerun task "TickGeneratorTask for: REDIRECTOR:1094" at [2018-04-04
> 14:43:28 +0200]
> [2018-04-04 14:43:13.751753 +0200][Dump   ][TaskMgr           ] Running
> task: "TickGeneratorTask for: 192.168.16.120:1094"
> [2018-04-04 14:43:13.751764 +0200][Dump   ][TaskMgr           ] Will
> rerun task "TickGeneratorTask for: 192.168.16.120:1094" at [2018-04-04
> 14:43:28 +0200]
> [2018-04-04 14:43:14.751830 +0200][Dump   ][TaskMgr           ] Running
> task: "FileTimer task"
> [2018-04-04 14:43:14.751849 +0200][Dump   ][TaskMgr           ] Will
> rerun task "FileTimer task" at [2018-04-04 14:43:29 +0200]
> [2018-04-04 14:43:28.752586 +0200][Dump   ][TaskMgr           ] Running
> task: "TickGeneratorTask for: REDIRECTOR:1094"
> [2018-04-04 14:43:28.752656 +0200][Dump   ][TaskMgr           ] Will
> rerun task "TickGeneratorTask for: REDIRECTOR:1094" at [2018-04-04
> 14:43:43 +0200]
> [2018-04-04 14:43:28.752691 +0200][Dump   ][TaskMgr           ] Running
> task: "TickGeneratorTask for: 192.168.16.120:1094"
> [2018-04-04 14:43:28.752727 +0200][Dump   ][TaskMgr           ] Will
> rerun task "TickGeneratorTask for: 192.168.16.120:1094" at [2018-04-04
> 14:43:43 +0200]
> [2018-04-04 14:43:28.785950 +0200][Dump   ][XRootDTransport   ]
> [REDIRECTOR:1094 #0.0] Stream inactive since 30 seconds, TTL: 1200,
> allocated SIDs: 0, open files: 0
> [2018-04-04 14:43:28.786026 +0200][Dump   ][XRootDTransport   ]
> [REDIRECTOR:1094 #0.0] Stream inactive since 30 seconds, stream timeout:
> 60, allocated SIDs: 0, wait barrier: 2018-04-04 14:42:58 +0200
> [2018-04-04 14:43:29.752822 +0200][Dump   ][TaskMgr           ] Running
> task: "FileTimer task"
> [2018-04-04 14:43:29.752892 +0200][Dump   ][TaskMgr           ] Will
> rerun task "FileTimer task" at [2018-04-04 14:43:44 +0200]
> [2018-04-04 14:43:40.846051 +0200][Dump   ][XRootDTransport   ]
> [192.168.16.120:1094 #0.0] Stream inactive since 15 seconds, TTL: 300,
> allocated SIDs: 4, open files: 1
> [2018-04-04 14:43:40.846125 +0200][Dump   ][XRootDTransport   ]
> [192.168.16.120:1094 #0.0] Stream inactive since 15 seconds, stream
> timeout: 60, allocated SIDs: 4, wait barrier: 2018-04-04 14:42:59 +0200
> [2018-04-04 14:43:43.753676 +0200][Dump   ][TaskMgr           ] Running
> task: "TickGeneratorTask for: REDIRECTOR:1094"
> [2018-04-04 14:43:43.753760 +0200][Dump   ][TaskMgr           ] Will
> rerun task "TickGeneratorTask for: REDIRECTOR:1094" at [2018-04-04
> 14:43:58 +0200]
> [2018-04-04 14:43:43.753775 +0200][Dump   ][TaskMgr           ] Running
> task: "TickGeneratorTask for: 192.168.16.120:1094"
> [2018-04-04 14:43:43.753786 +0200][Dump   ][TaskMgr           ] Will
> rerun task "TickGeneratorTask for: 192.168.16.120:1094" at [2018-04-04
> 14:43:58 +0200]
> [2018-04-04 14:43:43.854343 +0200][Dump   ][XRootDTransport   ]
> [REDIRECTOR:1094 #0.0] Stream inactive since 45 seconds, TTL: 1200,
> allocated SIDs: 0, open files: 0
> [2018-04-04 14:43:43.854399 +0200][Dump   ][XRootDTransport   ]
> [REDIRECTOR:1094 #0.0] Stream inactive since 45 seconds, stream timeout:
> 60, allocated SIDs: 0, wait barrier: 2018-04-04 14:42:58 +0200
> [2018-04-04 14:43:44.753880 +0200][Dump   ][TaskMgr           ] Running
> task: "FileTimer task"
> [2018-04-04 14:43:44.753958 +0200][Dump   ][TaskMgr           ] Will
> rerun task "FileTimer task" at [2018-04-04 14:43:59 +0200]
>
>
>
> --
> -----------------------------------------------------------------------
> Heiko Schröter
> Institute of Environmental Physics (IUP)   phone: ++49-(0)421-218-62092
> Institute of Remote Sensing (IFE)          fax:   ++49-(0)421-218-62070
> University of Bremen (FB1)
> P.O. Box 330440               email:  [log in to unmask]
> Otto-Hahn-Allee 1
> 28359 Bremen
> Germany
> -----------------------------------------------------------------------
>
> ########################################################################
> Use REPLY-ALL to reply to list
>
> To unsubscribe from the XROOTD-L list, click the following link:
> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
>