Hi Andy,

Hmm, that doesn't agree with my observations so far, but I might still be 
wrong. So I set up xrootd-4.8.3-rc1 with 12 data servers totaling 
620 TB.

At 8:33am the redirector was restarted, and at 8:38am an ls command was 
sent and answered without stalling. At 8:39am an xrdcp job was started 
and was promptly stalled for 5 sec.
The stalling can be forced by using a new file name that is not yet known 
to xrootd (caching??).
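
To make the "new file name" case easy to repeat, something like the 
following could be used. It is only a sketch: the host name "rdr" and the 
directory are taken from the xrdfs command in the log below, while the 
"stalltest-" prefix and the function names are made up for illustration. 
It assumes xrdcp is on the PATH.

```python
import subprocess
import time
import uuid


def build_copy_command(source, redirector, directory):
    """Build an xrdcp command whose target name has never been used before.

    The 'stalltest-' prefix is a hypothetical naming choice; any name the
    redirector has not seen yet should do.
    """
    name = f"stalltest-{uuid.uuid4().hex}.tst"
    # directory is absolute, so this yields root://host//path/name
    target = f"root://{redirector}/{directory}/{name}"
    return ["xrdcp", source, target]


def timed_copy(source, redirector, directory):
    """Run one copy and return the wall-clock duration in seconds."""
    cmd = build_copy_command(source, redirector, directory)
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return time.monotonic() - start
```

Calling timed_copy("hd.tst", "rdr", "/xrootd/schroete") several times 
should then show whether each fresh name costs roughly the stall time.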

Yes, you are right that the stalling occurs repeatedly when the 
redirector has not been "resumed" after startup.

Please let me know if I can do some more debugging, since this is still 
a test setup.

Best
Heiko


The log of the Redirector:

8:33am: Restart of Redirector for 12 data servers with 620TB total space
<snip>
180425 08:32:02 23926 XrdXeq: Admin traffic thread started
180425 08:32:02 23924 XrdXeq: async callback thread started
180425 08:33:32 23923 Receive glogin1 0 bytes on 0
180425 08:33:32 23923 setStatus REDIRECTOR sent resume event
180425 08:33:32 23923 cms_setStatus: Manager REDIRECTOR resumed
<snap>

8:38am: xrdfs rdr ls /xrootd/schroete  (no stalling)
<snip>
180425 08:38:22 23916 XrdSched: Now have 3 workers
180425 08:38:22 23916 XrdSched: running main accept inq=0
180425 08:38:22 24065 XrdXeq: Worker thread started
180425 08:38:22 23915 XrdInet: Accepted connection from 7@CLIENT
180425 08:38:22 23915 XrdProtocol: matched protocol xrootd
180425 08:38:22 23915 ?:7@CLIENT XrdPoll: FD 7 attached to poller 0; num=1
180425 08:38:22 23915 ?:7@CLIENT XrootdProtocol: 0000 req=login dlen=108
180425 08:38:22 23915 schroete.64810:7@CLIENT XrootdResponse: 0000 
sending 16 data bytes
180425 08:38:22 23915 XrootdXeq: schroete.64810:7@CLIENT pvt IPv4 login
180425 08:38:22 23915 schroete.64810:7@CLIENT XrootdProtocol: 0100 
req=locate dlen=18
180425 08:38:22 23915 schroete.64810:7@CLIENT XrootdProtocol: 0100 
locate n */xrootd/schroete/
180425 08:38:22 23915 schroete.64810:7@CLIENT ofs_fsctl: 
fn=*/xrootd/schroete/
180425 08:38:22 23923 Receive glogin1 315 bytes on 3071
180425 08:38:22 23923 Decode glogin1 sent schroete.64810:7@CLIENT 
'Sw[::192.168.16.146]:1094 Sw[::192.168.16.147]:1094 
Sw[::192.168.16.127]:1094 Sw[::192.168.16.97]:1094 
Sw[::192.168.16.120]:1094 Sw[::192.168.16.139]:1094 
Sw[::192.168.16.217]:1094 Sw[::192.168.16.144]:1094 
Sw[::192.168.16.134]:1094 Sw[::192.168.16.196]:1094 
Sw[::192.168.16.195]:1094 Sw[::192.168.16.121]:1094' */xrootd/schroete/
180425 08:38:22 23915 schroete.64810:7@CLIENT XrootdProtocol: 0100 
rc=-1024 locate */xrootd/schroete/
180425 08:38:22 23915 schroete.64810:7@CLIENT XrootdResponse: 0100 
sending 311 data bytes
180425 08:38:22 23915 XrootdXeq: schroete.64810:7@CLIENT disc 0:00:00
180425 08:38:22 23915 schroete.64810:7@CLIENT XrdPoll: FD 7 detached 
from poller 0; num=0
<snap>

8:39am: xrdcp and stalling 5sec
<snip>
180425 08:39:07 24065 XrdSched: running main accept inq=0
180425 08:39:07 23916 XrdInet: Accepted connection from 20@CLIENT
180425 08:39:07 23916 XrdProtocol: matched protocol xrootd
180425 08:39:07 23916 ?:20@CLIENT XrdPoll: FD 20 attached to poller 0; num=1
180425 08:39:07 23916 ?:20@CLIENT XrootdProtocol: 0000 req=login dlen=108
180425 08:39:07 23916 schroete.64831:20@CLIENT XrootdResponse: 0000 
sending 16 data bytes
180425 08:39:07 23916 XrootdXeq: schroete.64831:20@CLIENT pvt IPv4 login
180425 08:39:07 23916 schroete.64831:20@CLIENT XrootdProtocol: 0100 
req=stat dlen=17
180425 08:39:07 23916 schroete.64831:20@CLIENT ofs_stat: 
fn=/xrootd/schroete/
180425 08:39:07 23923 Receive glogin1 19 bytes on 4095
180425 08:39:07 23923 Decode glogin1 redirects schroete.64831:20@CLIENT 
to 192.168.16.144:1094 /xrootd/schroete/
180425 08:39:07 23916 schroete.64831:20@CLIENT XrootdProtocol: 0100 
rc=-256 stat /xrootd/schroete/
180425 08:39:07 23916 schroete.64831:20@CLIENT XrootdProtocol: 0100 
redirecting to 192.168.16.144:1094
180425 08:39:07 23916 schroete.64831:20@CLIENT XrootdResponse: 0100 
sending 18 data bytes; status=4004
180425 08:39:07 23916 schroete.64831:20@CLIENT XrootdProtocol: 0100 
req=open dlen=43
180425 08:39:07 23916 schroete.64831:20@CLIENT XrootdProtocol: 0100 open 
unmat /xrootd/schroete//hd.tst?oss.asize=52428800
180425 08:39:07 23916 schroete.64831:20@CLIENT ofs_open: 102-40644 
fn=/xrootd/schroete/hd.tst
180425 08:39:07 23923 Receive glogin1 4 bytes on 5119
180425 08:39:07 23923 Decode glogin1 delays schroete.64831:20@CLIENT 5 
/xrootd/schroete/hd.tst
*******************
180425 08:39:07 23916 schroete.64831:20@CLIENT XrootdProtocol: 0100 
stalling client for 5 sec
*******************
180425 08:39:07 23916 schroete.64831:20@CLIENT XrootdResponse: 0100 
sending 4 data bytes; status=4005
180425 08:39:07 23916 schroete.64831:20@CLIENT ofs_close: use=0 fn=dummy
180425 08:39:10 23916 schroete.64831:20@CLIENT XrootdProtocol: 0100 
request timeout; read 0 of 24 bytes
180425 08:39:10 23916 XrdPoll: Poller 0 enabled schroete.64831:20@CLIENT
180425 08:39:12 23915 XrdSched: running schroete.64831:20@CLIENT inq=0
180425 08:39:12 23915 schroete.64831:20@CLIENT XrootdProtocol: 0100 
req=open dlen=43
180425 08:39:12 23915 schroete.64831:20@CLIENT XrootdProtocol: 0100 open 
unmat /xrootd/schroete//hd.tst?oss.asize=52428800
180425 08:39:12 23915 schroete.64831:20@CLIENT ofs_open: 102-40644 
fn=/xrootd/schroete/hd.tst
180425 08:39:12 23923 Receive glogin1 19 bytes on 6143
180425 08:39:12 23923 Decode glogin1 redirects schroete.64831:20@CLIENT 
to 192.168.16.121:1094 /xrootd/schroete/hd.tst
180425 08:39:12 23915 schroete.64831:20@CLIENT XrootdProtocol: 0100 
redirecting to 192.168.16.121:1094
180425 08:39:12 23915 schroete.64831:20@CLIENT XrootdResponse: 0100 
sending 18 data bytes; status=4004
180425 08:39:12 23915 schroete.64831:20@CLIENT ofs_close: use=0 fn=dummy
180425 08:39:12 23915 XrootdXeq: schroete.64831:20@CLIENT disc 0:00:05
180425 08:39:12 23915 schroete.64831:20@CLIENT XrdPoll: FD 20 detached 
from poller 0; num=0
<snap>
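
To check how long after the "resumed" message a stall occurs, the log can 
be scanned with a small script. This is only a sketch under some 
assumptions: the timestamp layout (YYMMDD HH:MM:SS) and the two message 
texts are taken from the excerpts above, the function names are made up, 
and lines wrapped by the mail client are joined back onto their record.

```python
import re
from datetime import datetime

TS = re.compile(r"^(\d{6} \d{2}:\d{2}:\d{2}) ")
STALL = re.compile(r"stalling client for (\d+) sec")


def unwrap(lines):
    """Join wrapped continuation lines onto the timestamped record before them."""
    records, open_record = [], False
    for raw in lines:
        line = raw.strip()
        if TS.match(line):
            records.append(line)
            open_record = True
        elif open_record and line and line[0] not in "<*":
            records[-1] += " " + line
        else:
            # blank lines, <snip>/<snap> and ***** markers end a record
            open_record = False
    return records


def stalls_after_resume(lines):
    """Return (time, stall seconds, seconds since last 'resumed') per stall."""
    resumed, found = None, []
    for rec in unwrap(lines):
        when = datetime.strptime(TS.match(rec).group(1), "%y%m%d %H:%M:%S")
        if "resumed" in rec:
            resumed = when
        hit = STALL.search(rec)
        if hit and resumed is not None:
            found.append((when, int(hit.group(1)), (when - resumed).total_seconds()))
    return found


SAMPLE = """\
180425 08:33:32 23923 cms_setStatus: Manager REDIRECTOR resumed
180425 08:39:07 23916 schroete.64831:20@CLIENT XrootdProtocol: 0100
stalling client for 5 sec
"""

for when, stall, gap in stalls_after_resume(SAMPLE.splitlines()):
    print(f"{when:%H:%M:%S} stalled {stall} sec, {gap:.0f} sec after resume")
```

For the two lines quoted in SAMPLE this gives a gap of 335 seconds between 
the resume and the stall.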





On 25.04.2018 at 04:36, Andrew Hanushevsky wrote:
> Hi Heiko,
>
> It would seem that the stalls are occurring because the redirector a) 
> has not been up long enough (the default requires 30 seconds to pass) 
> or b) does not think it has any working data servers (which will be 
> the case if they log in much later than you started the copy).
>
> Andy
>
>
>
> -----Original Message----- From: Heiko Schröter
> Sent: Thursday, April 19, 2018 5:37 AM
> To: Michal Kamil Simon ; [log in to unmask]
> Cc: [log in to unmask] ; [log in to unmask]
> Subject: Re: Stalling client when copying files (xrdcp 4.8.2)
>
> I spoke too soon. The stalling occurs with 4.8.3-rc1 as well,
> but only for the first connection of a file transfer.
> If you rm and recopy the file, the stalling does not occur.
>
>
> On 19.04.2018 at 10:06, Michal Kamil Simon wrote:
>> Hi Heiko,
>>
>> That's interesting, could you give me more details on your scenario,
>> are you using xrdcp from a script or XrdCl C++ API (or Python bindings)?
>>
>> Could you also provide client side logs from a run when you observed
>> stalling?
>>
>> Cheers,
>> Michal
>> ________________________________________
>> From: [log in to unmask] [[log in to unmask]] on 
>> behalf of Heiko Schröter [[log in to unmask]]
>> Sent: 18 April 2018 20:26
>> To: [log in to unmask]
>> Cc: [log in to unmask]; [log in to unmask]
>> Subject: Re: Stalling client when copying files (xrdcp 4.8.2)
>>
>> This stalling does not occur with the 4.8.3-rc1.
>>
>>
>> On 17.04.2018 at 19:24, Heiko Schröter wrote:
>>> Hello,
>>>
>>> we do observe that when copying a file the client is stalled for 
>>> some time.
>>>
>>> 180417 19:17:28 17252 schroete.97360:7@qc08 XrootdProtocol: 0100
>>> stalling client for 5 sec
>>>
>>> Sometimes it is for 10sec and this gets repeated without a recognizable
>>> pattern.
>>>
>>> The client is not stalled when the copied file is removed at once and
>>> recopied.
>>>
>>> It looks like a similar issue as this one:
>>> https://listserv.slac.stanford.edu/cgi-bin/wa?A2=ind1203&L=XROOTD-L&P=R598&1=XROOTD-L&9=A&I=-3&J=on&d=No+Match%3BMatch%3BMatches&z=4 
>>>
>>>
>>>
>>> Is this a settable parameter, or something we did wrong in our setup?
>>>
>>> We have one redirector and 12 data servers on a 10 Gbit network. Client
>>> access is very limited because this is a test setup. 
