https://github.com/xrootd/xrootd/issues/1739

(It will take me about an hour to get results with XRD_CPUSEPGWRTRD changed.)
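For the record, the test I'm running boils down to the following (the destination path is just a placeholder for our setup; the source URL is the file from the log below):

```shell
# Disable pgread/pgwrite in the XrdCl client, as Michal suggested;
# 0 makes the client fall back to plain kXR_read/kXR_write requests.
export XRD_CPUSEPGWRTRD=0
# Verbose client logging, to compare against the earlier trace.
export XRD_LOGLEVEL=Dump

# Run the transfer only if xrdcp is actually installed here.
if command -v xrdcp >/dev/null 2>&1; then
    xrdcp -f root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2 /tmp/5G.disk2
fi
```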

Petr

On 7/18/22 10:18, Michal Simon wrote:
> Hi Petr,
>
> Thanks for informing us about this problem. Could you please create an 
> issue on our GitHub so we can track this problem properly?
>
> Could you also give us the server configuration? Are both the server 
> and client at version 5.4.3?
>
> Could you try disabling pgreads/pgwrites in the client and see if this 
> helps:
>
> export XRD_CPUSEPGWRTRD=0
>
> Cheers,
> Michal
>
>> On 16 Jul 2022, at 20:16, Petr Vokac <[log in to unmask]> wrote:
>>
>> Hi,
>>
>> we are facing problems with DPM running on top of XRootD 5.4.3 from 
>> EPEL, because transfers with the xroot protocol get stuck with gfal2. 
>> Transfers with plain xrdcopy recover, but only after the client detects 
>> 60 s with no activity and creates a new connection.
>>
>> * why can't gfal2 recover like xrdcopy?
>> * what is the difference between 5.4.2 (DPM works fine with this 
>> version) and 5.4.3 that causes these troubles?
>>
>> On the xrdcopy client side we see something like (details 
>> https://vokac.web.cern.ch/vokac/tmp/xrootd-client-5.4.3.log):
>> ...
>> [2022-07-16 15:23:11.774189 +0200][Debug  ][AsyncSock         ] [dpmhead.farm.particle.cz:1094.0] Async connection call returned
>> ...
>> [2022-07-16 15:23:17.734260 +0200][Debug  ][ExDbgMsg          ] [msg: 0x7803e538] Assigned MsgHandler: 0x67e650.
>> [2022-07-16 15:23:17.741195 +0200][Dump   ][AsyncSock         ] [dpmdisk2.farm.particle.cz:1095.0] Received message 0x7803e538 of 65632 bytes
>> [2022-07-16 15:23:17.741242 +0200][Dump   ][PostMaster        ] [dpmdisk2.farm.particle.cz:1095] Handling received message: 0x7803e538.
>> [2022-07-16 15:23:17.741248 +0200][Dump   ][PostMaster        ] [dpmdisk2.farm.particle.cz:1095] Ignoring the processing handler for: 0x619538.
>> [2022-07-16 15:23:17.741262 +0200][Dump   ][XRootDTransport   ] [msg: 0x7803e5c8] Expecting 24 bytes of message body
>> [2022-07-16 15:23:17.741266 +0200][Dump   ][AsyncSock         ] [dpmdisk2.farm.particle.cz:1095.0] Received message header for 0x7803e5c8 size: 8
>> [2022-07-16 15:23:17.741272 +0200][Dump   ][XRootD            ] [dpmdisk2.farm.particle.cz:1095] Got a kXR_status response to request kXR_pgread (handle: 0x00000000, offset: 419430400, size: 8388608)
>> [2022-07-16 15:23:17.741276 +0200][Debug  ][ExDbgMsg          ] [msg: 0x7803e5c8] Assigned MsgHandler: 0x67e650.
>> [2022-07-16 15:23:26.773143 +0200][Dump   ][TaskMgr           ] Running task: "TickGeneratorTask for:root://dpmhead.farm.particle.cz:1094"
>> [2022-07-16 15:23:26.773215 +0200][Dump   ][TaskMgr           ] Will rerun task "TickGeneratorTask for:root://dpmhead.farm.particle.cz:1094" at [2022-07-16 15:23:41 +0200]
>> [2022-07-16 15:23:26.773221 +0200][Dump   ][TaskMgr           ] Running task: "FileTimer task"
>> [2022-07-16 15:23:26.773229 +0200][Dump   ][TaskMgr           ] Will rerun task "FileTimer task" at [2022-07-16 15:23:41 +0200]
>> [2022-07-16 15:23:28.446164 +0200][Dump   ][XRootDTransport   ] [dpmhead.farm.particle.cz:1094.0] Stream inactive since 15 seconds, TTL: 1200, allocated SIDs: 0, open files: 0, bound file objects: 0
>> [2022-07-16 15:23:28.446217 +0200][Dump   ][XRootDTransport   ] [dpmhead.farm.particle.cz:1094.0] Stream inactive since 15 seconds, stream timeout: 60, allocated SIDs: 0, wait barrier: 2022-07-16 15:23:13 +0200
>> [2022-07-16 15:23:28.775142 +0200][Dump   ][TaskMgr           ] Running task: "TickGeneratorTask for:root://dpmdisk2.farm.particle.cz:1095"
>> [2022-07-16 15:23:28.775208 +0200][Dump   ][TaskMgr           ] Will rerun task "TickGeneratorTask for:root://dpmdisk2.farm.particle.cz:1095" at [2022-07-16 15:23:43 +0200]
>> ...
>> [2022-07-16 15:24:18.625518 +0200][Debug  ][ExDbgMsg          ] [dpmdisk2.farm.particle.cz:1095] Calling MsgHandler: 0x680e50 (message: kXR_pgread (handle: 0x00000000, offset: 436207616, size: 8388608) ) with status: [ERROR] Socket timeout.
>> [2022-07-16 15:24:18.625528 +0200][Dump   ][File              ] [0x676610@root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe] File state error encountered. Message kXR_pgread (handle: 0x00000000, offset: 436207616, size: 8388608) returned with [ERROR] Socket timeout
>> [2022-07-16 15:24:18.625535 +0200][Dump   ][File              ] [0x676610@root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe] Putting message kXR_pgread (handle: 0x00000000, offset: 436207616, size: 8388608) in the recovery list
>> [2022-07-16 15:24:18.625545 +0200][Debug  ][ExDbgMsg          ] [dpmdisk2.farm.particle.cz:1095] Destroying MsgHandler: 0x680e50.
>> [2022-07-16 15:24:18.625555 +0200][Debug  ][ExDbgMsg          ] [dpmdisk2.farm.particle.cz:1095] Calling MsgHandler: 0x67dc90 (message: kXR_pgread (handle: 0x00000000, offset: 411041792, size: 8388608) ) with status: [ERROR] Socket timeout.
>> [2022-07-16 15:24:18.625560 +0200][Dump   ][File              ] [0x676610@root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe] File state error encountered. Message kXR_pgread (handle: 0x00000000, offset: 411041792, size: 8388608) returned with [ERROR] Socket timeout
>> [2022-07-16 15:24:18.625564 +0200][Dump   ][File              ] [0x676610@root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe] Putting message kXR_pgread (handle: 0x00000000, offset: 411041792, size: 8388608) in the recovery list
>> [2022-07-16 15:24:18.625569 +0200][Debug  ][ExDbgMsg          ] [dpmdisk2.farm.particle.cz:1095] Destroying MsgHandler: 0x67dc90.
>> [2022-07-16 15:24:18.625593 +0200][Debug  ][ExDbgMsg          ] [dpmdisk2.farm.particle.cz:1095] Calling MsgHandler: 0x67e650 (message: kXR_pgread (handle: 0x00000000, offset: 419430400, size: 8388608) ) with status: [ERROR] Socket timeout.
>> [2022-07-16 15:24:18.625598 +0200][Dump   ][File              ] [0x676610@root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe] File state error encountered. Message kXR_pgread (handle: 0x00000000, offset: 419430400, size: 8388608) returned with [ERROR] Socket timeout
>> [2022-07-16 15:24:18.625602 +0200][Dump   ][File              ] [0x676610@root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe] Putting message kXR_pgread (handle: 0x00000000, offset: 419430400, size: 8388608) in the recovery list
>> [2022-07-16 15:24:18.625607 +0200][Debug  ][ExDbgMsg          ] [dpmdisk2.farm.particle.cz:1095] Destroying MsgHandler: 0x67e650.
>> [2022-07-16 15:24:18.627151 +0200][Debug  ][ExDbgMsg          ] [dpmdisk2.farm.particle.cz:1095] Calling MsgHandler: 0x67e170 (message: kXR_pgread (handle: 0x00000000, offset: 427819008, size: 8388608) ) with status: [ERROR] Socket timeout.
>> [2022-07-16 15:24:18.627186 +0200][Dump   ][File              ] [0x676610@root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe] File state error encountered. Message kXR_pgread (handle: 0x00000000, offset: 427819008, size: 8388608) returned with [ERROR] Socket timeout
>> [2022-07-16 15:24:18.627193 +0200][Dump   ][File              ] [0x676610@root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe] Putting message kXR_pgread (handle: 0x00000000, offset: 427819008, size: 8388608) in the recovery list
>> [2022-07-16 15:24:18.627197 +0200][Debug  ][File              ] [0x676610@root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe] Running the recovery procedure
>> [2022-07-16 15:24:18.627204 +0200][Dump   ][File              ] [0x676610@root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe] Sending a recovery open command to root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe
>> [2022-07-16 15:24:18.627224 +0200][Dump   ][XRootD            ] [dpmhead.farm.particle.cz:1094] Sending message kXR_open (file: /dpm/farm.particle.cz/home/dteam/5G.disk2, mode: 00, flags: kXR_open_read )
>> [2022-07-16 15:24:18.627241 +0200][Debug  ][ExDbgMsg          ] [dpmhead.farm.particle.cz:1094] MsgHandler created: 0x7c003850 (message: kXR_open (file: /dpm/farm.particle.cz/home/dteam/5G.disk2, mode: 00, flags: kXR_open_read ) ).
>> [2022-07-16 15:24:18.627251 +0200][Dump   ][PostMaster        ] [dpmhead.farm.particle.cz:1094] Sending message kXR_open (file: /dpm/farm.particle.cz/home/dteam/5G.disk2, mode: 00, flags: kXR_open_read ) (0x7c002070) through substream 0 expecting answer at 0
>> [2022-07-16 15:24:18.627273 +0200][Debug  ][ExDbgMsg          ] [dpmdisk2.farm.particle.cz:1095] Destroying MsgHandler: 0x67e170.
>> [2022-07-16 15:24:18.627827 +0200][Dump   ][AsyncSock         ] [dpmhead.farm.particle.cz:1094.0] Wrote a message: kXR_open (file: /dpm/farm.particle.cz/home/dteam/5G.disk2, mode: 00, flags: kXR_open_read ) (0x7c002070), 65 bytes
>> [2022-07-16 15:24:18.627875 +0200][Dump   ][AsyncSock         ] [dpmhead.farm.particle.cz:1094.0] Successfully sent message: kXR_open (file: /dpm/farm.particle.cz/home/dteam/5G.disk2, mode: 00, flags: kXR_open_read ) (0x7c002070).
>> [2022-07-16 15:24:18.627881 +0200][Dump   ][XRootD            ] [dpmhead.farm.particle.cz:1094] Message kXR_open (file: /dpm/farm.particle.cz/home/dteam/5G.disk2, mode: 00, flags: kXR_open_read ) has been successfully sent.
>> [2022-07-16 15:24:18.627886 +0200][Debug  ][ExDbgMsg          ] [dpmhead.farm.particle.cz:1094] Moving MsgHandler: 0x7c003850 (message: kXR_open (file: /dpm/farm.particle.cz/home/dteam/5G.disk2, mode: 00, flags: kXR_open_read ) ) from out-queue to in-queue.
>> ...
>> It seems the client and server clocks were about 1 s out of sync; the 
>> server side looks like this (details: 
>> https://vokac.web.cern.ch/vokac/tmp/xrootd-server-5.4.3.log):
>> 220716 15:23:13 28043 Xrd_Inet: Accepted connection on port 1095 [log in to unmask]
>> ...
>> 220716 15:23:18 28023 oss_AioWait: read completed for vokac.1773:[log in to unmask]; rc=0 result=65536 aiocb=0x7f039df2a700
>> 220716 15:23:18 28023 oss_AioWait: read completed for vokac.1773:[log in to unmask]; rc=0 result=65536 aiocb=0x7f039df8ee00
>> 220716 15:23:18 28023 oss_AioWait: read completed for vokac.1773:[log in to unmask]; rc=0 result=65536 aiocb=0x7f039df8e000
>> 220716 15:23:18 28023 oss_AioWait: read completed for vokac.1773:[log in to unmask]; rc=0 result=65536 aiocb=0x7f039307ee00
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  ofs_aioread: 65536@444530688 fn=/dpm/farm.particle.cz/home/dteam/5G.disk2
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  oss_AioRead: fd=8000 read 65536@444530688 started; aiocb=0x7f039df90a00
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  Xrootd_PgrwAio: 0100 pgrd beg 65536@444530688 inF=8
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  Xrootd_PgrwAio: 0100 pgrd end 65536@444071936 result=65536 D-S=False-1 inF=7
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  Xrootd_Response: 0100 sending partial 8 info and 65600 data bytes
>> 220716 15:23:18 28019 Xrootd_PgrwBuff:  Recycle 65536@444071936 numF=8
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  Xrootd_PgrwAio: 0100 pgrd end 65536@444137472 result=65536 D-S=False-1 inF=6
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  Xrootd_Response: 0100 sending partial 8 info and 65600 data bytes
>> 220716 15:23:18 28019 Xrootd_PgrwBuff:  Recycle 65536@444137472 numF=9
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  Xrootd_PgrwAio: 0100 pgrd end 65536@444203008 result=65536 D-S=False-1 inF=5
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  Xrootd_Response: 0100 sending partial 8 info and 65600 data bytes
>> 220716 15:23:18 28019 Xrootd_PgrwBuff:  Recycle 65536@444203008 numF=10
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  Xrootd_PgrwAio: 0100 pgrd end 65536@444268544 result=65536 D-S=False-1 inF=4
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  Xrootd_Response: 0100 sending partial 8 info and 65600 data bytes
>> 220716 15:23:18 28019 Xrootd_PgrwBuff:  Recycle 65536@444268544 numF=11
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  Xrootd_PgrwAio: 0100 pgrd end 65536@444334080 result=65536 D-S=False-1 inF=3
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  Xrootd_Response: 0100 sending partial 8 info and 65600 data bytes
>> 220716 15:23:18 28019 Xrootd_PgrwBuff:  Recycle 65536@444334080 numF=12
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  Xrootd_PgrwAio: 0100 pgrd end 65536@444399616 result=65536 D-S=False-1 inF=2
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  Xrootd_Response: 0100 sending partial 8 info and 65600 data bytes
>> 220716 15:23:18 28019 Xrootd_PgrwBuff:  Recycle 65536@444399616 numF=13
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  Xrootd_PgrwAio: 0100 pgrd end 65536@444465152 result=65536 D-S=False-1 inF=1
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  Xrootd_Response: 0100 sending partial 8 info and 65600 data bytes
>> 220716 15:23:18 28019 Xrootd_PgrwBuff:  Recycle 65536@444465152 numF=14
>> 220716 15:23:18 28023 oss_AioWait: read completed for vokac.1773:[log in to unmask]; rc=0 result=65536 aiocb=0x7f039df90a00
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  Xrootd_PgrwAio: 0100 pgrd end 65536@444530688 result=65536 D-S=False-1 inF=0
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  Xrootd_Response: 0100 sending final 8 info and 65600 data bytes
>> 220716 15:23:18 28019 Xrootd_PgrwBuff:  Recycle 65536@444530688 numF=15
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  Xrd_Link: Setting FD 30 ref to 2+-1 post=0
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  Xrootd_File: File::Ref 0 after +-1
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask]  Xrootd_PgrwAio: 0100 pgrw recycle R D-S=True-1
>> 220716 15:23:20 28043 vokac.1773:[log in to unmask]  Xrootd_Protocol: 0100 request timeout; read 0 of 24 bytes
>> 220716 15:23:20 28043 Xrd_Poll: Poller 0 enabled vokac.1773:[log in to unmask]
>> 220716 15:23:37 28018 Xrd_Inet: Accepted connection on port 1095 [log in to unmask]
>> 220716 15:23:37 28018 Xrd_ProtLoad: matched port 1095 protocol XrdHttp
>> 220716 15:23:37 28018 anon.0:28@dpmhead Xrd_Poll: FD 28 attached to poller 1; num=1
>> 220716 15:23:37 28018 Xrd_Poll: Poller 1 enabled anon.0:28@dpmhead
>> 220716 15:23:37 28042 Xrd_Sched: running main accept inq=0
>> 220716 15:23:37 28021 anon.0:28@dpmhead Xrd_Poll: Link 28 terminating: hangup
>> 220716 15:23:37 28045 Xrd_Sched: running anon.0:28@dpmhead inq=0
>> 220716 15:23:37 28045 anon.0:28@dpmhead Xrd_Poll: Poller 1 removing FD 28
>> 220716 15:23:37 28045 anon.0:28@dpmhead Xrd_Poll: FD 28 detached from poller 1; num=0
>> 220716 15:23:41 28027 cms_Finder: Waiting for cms path /var/spool/xrootd/dpmdisk/.olb/olbd.admin
>> 220716 15:24:18 28020 vokac.1773:[log in to unmask]  Xrd_Poll: Link 30 terminating: hangup
>> 220716 15:24:18 28019 Xrd_Sched: running vokac.1773:[log in to unmask]  inq=0
>> 220716 15:24:18 28019 XrootdXeq: vokac.1773:[log in to unmask]  disc 0:01:05
>> 220716 15:24:18 28019 vokac.1773:[log in to unmask]  Xrootd_File: closing r /dpm/farm.particle.cz/home/dteam/5G.disk2
>> 220716 15:24:18 28019 vokac.1773:[log in to unmask]  ofs_close: use=1 fn=/dpm/farm.particle.cz/home/dteam/5G.disk2
>> 220716 15:24:18 28019 vokac.1773:[log in to unmask]  dpmoss_Close: return 0
>> 220716 15:24:18 28019 vokac.1773:[log in to unmask]  Xrd_Poll: Poller 0 removing FD 30
>> 220716 15:24:18 28019 vokac.1773:[log in to unmask]  Xrd_Poll: FD 30 detached from poller 0; num=0
>> 220716 15:24:18 28042 Xrd_Inet: Accepted connection on port 1095 [log in to unmask]
>>
>> Petr

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1