https://github.com/xrootd/xrootd/issues/1739

(I'll need about an hour to get results with XRD_CPUSEPGWRTRD changed.)

Petr

On 7/18/22 10:18, Michal Simon wrote:
> Hi Petr,
>
> Thanks for informing us about this problem. Could you please create an
> issue in our GitHub so we can properly track it?
>
> Could you also give us the server configuration? Are both the server
> and the client at version 5.4.3?
>
> Could you try disabling pgreads/pgwrites in the client and see if this
> helps:
>
>     export XRD_CPUSEPGWRTRD=0
>
> Cheers,
> Michal
>
>> On 16 Jul 2022, at 20:16, Petr Vokac <[log in to unmask]> wrote:
>>
>> Hi,
>>
>> we are facing problems with DPM running on top of XRootD 5.4.3 from
>> EPEL, because transfers with the xroot protocol get stuck with gfal2.
>> Transfers with plain xrdcopy recover, but only after the client detects
>> 60s with no activity and creates a new connection.
>>
>> * Why can't gfal2 recover the way xrdcopy does?
>> * What is the difference between 5.4.2 (DPM with this version works
>>   fine) and 5.4.3 that causes these troubles?
>>
>> On the xrdcopy client side we see something like this (details:
>> https://vokac.web.cern.ch/vokac/tmp/xrootd-client-5.4.3.log):
>>
>> ...
>> [2022-07-16 15:23:11.774189 +0200][Debug ][AsyncSock ] [dpmhead.farm.particle.cz:1094.0] Async connection call returned
>> ...
>> [2022-07-16 15:23:17.734260 +0200][Debug ][ExDbgMsg ] [msg: 0x7803e538] Assigned MsgHandler: 0x67e650.
>> [2022-07-16 15:23:17.741195 +0200][Dump ][AsyncSock ] [dpmdisk2.farm.particle.cz:1095.0] Received message 0x7803e538 of 65632 bytes
>> [2022-07-16 15:23:17.741242 +0200][Dump ][PostMaster ] [dpmdisk2.farm.particle.cz:1095] Handling received message: 0x7803e538.
>> [2022-07-16 15:23:17.741248 +0200][Dump ][PostMaster ] [dpmdisk2.farm.particle.cz:1095] Ignoring the processing handler for: 0x619538.
>> [2022-07-16 15:23:17.741262 +0200][Dump ][XRootDTransport ] [msg: 0x7803e5c8] Expecting 24 bytes of message body
>> [2022-07-16 15:23:17.741266 +0200][Dump ][AsyncSock ] [dpmdisk2.farm.particle.cz:1095.0] Received message header for 0x7803e5c8 size: 8
>> [2022-07-16 15:23:17.741272 +0200][Dump ][XRootD ] [dpmdisk2.farm.particle.cz:1095] Got a kXR_status response to request kXR_pgread (handle: 0x00000000, offset: 419430400, size: 8388608)
>> [2022-07-16 15:23:17.741276 +0200][Debug ][ExDbgMsg ] [msg: 0x7803e5c8] Assigned MsgHandler: 0x67e650.
>> [2022-07-16 15:23:26.773143 +0200][Dump ][TaskMgr ] Running task: "TickGeneratorTask for:root://dpmhead.farm.particle.cz:1094"
>> [2022-07-16 15:23:26.773215 +0200][Dump ][TaskMgr ] Will rerun task "TickGeneratorTask for:root://dpmhead.farm.particle.cz:1094" at [2022-07-16 15:23:41 +0200]
>> [2022-07-16 15:23:26.773221 +0200][Dump ][TaskMgr ] Running task: "FileTimer task"
>> [2022-07-16 15:23:26.773229 +0200][Dump ][TaskMgr ] Will rerun task "FileTimer task" at [2022-07-16 15:23:41 +0200]
>> [2022-07-16 15:23:28.446164 +0200][Dump ][XRootDTransport ] [dpmhead.farm.particle.cz:1094.0] Stream inactive since 15 seconds, TTL: 1200, allocated SIDs: 0, open files: 0, bound file objects: 0
>> [2022-07-16 15:23:28.446217 +0200][Dump ][XRootDTransport ] [dpmhead.farm.particle.cz:1094.0] Stream inactive since 15 seconds, stream timeout: 60, allocated SIDs: 0, wait barrier: 2022-07-16 15:23:13 +0200
>> [2022-07-16 15:23:28.775142 +0200][Dump ][TaskMgr ] Running task: "TickGeneratorTask for:root://dpmdisk2.farm.particle.cz:1095"
>> [2022-07-16 15:23:28.775208 +0200][Dump ][TaskMgr ] Will rerun task "TickGeneratorTask for:root://dpmdisk2.farm.particle.cz:1095" at [2022-07-16 15:23:43 +0200]
>> ...
>> [2022-07-16 15:24:18.625518 +0200][Debug ][ExDbgMsg ] [dpmdisk2.farm.particle.cz:1095] Calling MsgHandler: 0x680e50 (message: kXR_pgread (handle: 0x00000000, offset: 436207616, size: 8388608) ) with status: [ERROR] Socket timeout.
>> [2022-07-16 15:24:18.625528 +0200][Dump ][File ] [0x676610@root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe] File state error encountered. Message kXR_pgread (handle: 0x00000000, offset: 436207616, size: 8388608) returned with [ERROR] Socket timeout
>> [2022-07-16 15:24:18.625535 +0200][Dump ][File ] [0x676610@root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe] Putting message kXR_pgread (handle: 0x00000000, offset: 436207616, size: 8388608) in the recovery list
>> [2022-07-16 15:24:18.625545 +0200][Debug ][ExDbgMsg ] [dpmdisk2.farm.particle.cz:1095] Destroying MsgHandler: 0x680e50.
>> [2022-07-16 15:24:18.625555 +0200][Debug ][ExDbgMsg ] [dpmdisk2.farm.particle.cz:1095] Calling MsgHandler: 0x67dc90 (message: kXR_pgread (handle: 0x00000000, offset: 411041792, size: 8388608) ) with status: [ERROR] Socket timeout.
>> [2022-07-16 15:24:18.625560 +0200][Dump ][File ] [0x676610@root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe] File state error encountered. Message kXR_pgread (handle: 0x00000000, offset: 411041792, size: 8388608) returned with [ERROR] Socket timeout
>> [2022-07-16 15:24:18.625564 +0200][Dump ][File ] [0x676610@root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe] Putting message kXR_pgread (handle: 0x00000000, offset: 411041792, size: 8388608) in the recovery list
>> [2022-07-16 15:24:18.625569 +0200][Debug ][ExDbgMsg ] [dpmdisk2.farm.particle.cz:1095] Destroying MsgHandler: 0x67dc90.
>> [2022-07-16 15:24:18.625593 +0200][Debug ][ExDbgMsg ] [dpmdisk2.farm.particle.cz:1095] Calling MsgHandler: 0x67e650 (message: kXR_pgread (handle: 0x00000000, offset: 419430400, size: 8388608) ) with status: [ERROR] Socket timeout.
>> [2022-07-16 15:24:18.625598 +0200][Dump ][File ] [0x676610@root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe] File state error encountered. Message kXR_pgread (handle: 0x00000000, offset: 419430400, size: 8388608) returned with [ERROR] Socket timeout
>> [2022-07-16 15:24:18.625602 +0200][Dump ][File ] [0x676610@root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe] Putting message kXR_pgread (handle: 0x00000000, offset: 419430400, size: 8388608) in the recovery list
>> [2022-07-16 15:24:18.625607 +0200][Debug ][ExDbgMsg ] [dpmdisk2.farm.particle.cz:1095] Destroying MsgHandler: 0x67e650.
>> [2022-07-16 15:24:18.627151 +0200][Debug ][ExDbgMsg ] [dpmdisk2.farm.particle.cz:1095] Calling MsgHandler: 0x67e170 (message: kXR_pgread (handle: 0x00000000, offset: 427819008, size: 8388608) ) with status: [ERROR] Socket timeout.
>> [2022-07-16 15:24:18.627186 +0200][Dump ][File ] [0x676610@root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe] File state error encountered. Message kXR_pgread (handle: 0x00000000, offset: 427819008, size: 8388608) returned with [ERROR] Socket timeout
>> [2022-07-16 15:24:18.627193 +0200][Dump ][File ] [0x676610@root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe] Putting message kXR_pgread (handle: 0x00000000, offset: 427819008, size: 8388608) in the recovery list
>> [2022-07-16 15:24:18.627197 +0200][Debug ][File ] [0x676610@root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe] Running the recovery procedure
>> [2022-07-16 15:24:18.627204 +0200][Dump ][File ] [0x676610@root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe] Sending a recovery open command to root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2?xrdcl.requuid=79be2b53-1bfc-414b-baa7-4ade1508f5fe
>> [2022-07-16 15:24:18.627224 +0200][Dump ][XRootD ] [dpmhead.farm.particle.cz:1094] Sending message kXR_open (file: /dpm/farm.particle.cz/home/dteam/5G.disk2, mode: 00, flags: kXR_open_read )
>> [2022-07-16 15:24:18.627241 +0200][Debug ][ExDbgMsg ] [dpmhead.farm.particle.cz:1094] MsgHandler created: 0x7c003850 (message: kXR_open (file: /dpm/farm.particle.cz/home/dteam/5G.disk2, mode: 00, flags: kXR_open_read ) ).
>> [2022-07-16 15:24:18.627251 +0200][Dump ][PostMaster ] [dpmhead.farm.particle.cz:1094] Sending message kXR_open (file: /dpm/farm.particle.cz/home/dteam/5G.disk2, mode: 00, flags: kXR_open_read ) (0x7c002070) through substream 0 expecting answer at 0
>> [2022-07-16 15:24:18.627273 +0200][Debug ][ExDbgMsg ] [dpmdisk2.farm.particle.cz:1095] Destroying MsgHandler: 0x67e170.
>> [2022-07-16 15:24:18.627827 +0200][Dump ][AsyncSock ] [dpmhead.farm.particle.cz:1094.0] Wrote a message: kXR_open (file: /dpm/farm.particle.cz/home/dteam/5G.disk2, mode: 00, flags: kXR_open_read ) (0x7c002070), 65 bytes
>> [2022-07-16 15:24:18.627875 +0200][Dump ][AsyncSock ] [dpmhead.farm.particle.cz:1094.0] Successfully sent message: kXR_open (file: /dpm/farm.particle.cz/home/dteam/5G.disk2, mode: 00, flags: kXR_open_read ) (0x7c002070).
>> [2022-07-16 15:24:18.627881 +0200][Dump ][XRootD ] [dpmhead.farm.particle.cz:1094] Message kXR_open (file: /dpm/farm.particle.cz/home/dteam/5G.disk2, mode: 00, flags: kXR_open_read ) has been successfully sent.
>> [2022-07-16 15:24:18.627886 +0200][Debug ][ExDbgMsg ] [dpmhead.farm.particle.cz:1094] Moving MsgHandler: 0x7c003850 (message: kXR_open (file: /dpm/farm.particle.cz/home/dteam/5G.disk2, mode: 00, flags: kXR_open_read ) ) from out-queu to in-queue.
>> ...
>>
>> It seems that the client vs. server time was 1s out of sync. The server
>> side looks like this (details:
>> https://vokac.web.cern.ch/vokac/tmp/xrootd-server-5.4.3.log):
>>
>> 220716 15:23:13 28043 Xrd_Inet: Accepted connection on port 1095 [log in to unmask]
>> ...
>> 220716 15:23:18 28023 oss_AioWait: read completed for vokac.1773:[log in to unmask]; rc=0 result=65536 aiocb=0x7f039df2a700
>> 220716 15:23:18 28023 oss_AioWait: read completed for vokac.1773:[log in to unmask]; rc=0 result=65536 aiocb=0x7f039df8ee00
>> 220716 15:23:18 28023 oss_AioWait: read completed for vokac.1773:[log in to unmask]; rc=0 result=65536 aiocb=0x7f039df8e000
>> 220716 15:23:18 28023 oss_AioWait: read completed for vokac.1773:[log in to unmask]; rc=0 result=65536 aiocb=0x7f039307ee00
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] ofs_aioread: 65536@444530688 fn=/dpm/farm.particle.cz/home/dteam/5G.disk2
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] oss_AioRead: fd=8000 read 65536@444530688 started; aiocb=0x7f039df90a00
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] Xrootd_PgrwAio: 0100 pgrd beg 65536@444530688 inF=8
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] Xrootd_PgrwAio: 0100 pgrd end 65536@444071936 result=65536 D-S=False-1 inF=7
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] Xrootd_Response: 0100 sending partial 8 info and 65600 data bytes
>> 220716 15:23:18 28019 Xrootd_PgrwBuff: Recycle 65536@444071936 numF=8
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] Xrootd_PgrwAio: 0100 pgrd end 65536@444137472 result=65536 D-S=False-1 inF=6
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] Xrootd_Response: 0100 sending partial 8 info and 65600 data bytes
>> 220716 15:23:18 28019 Xrootd_PgrwBuff: Recycle 65536@444137472 numF=9
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] Xrootd_PgrwAio: 0100 pgrd end 65536@444203008 result=65536 D-S=False-1 inF=5
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] Xrootd_Response: 0100 sending partial 8 info and 65600 data bytes
>> 220716 15:23:18 28019 Xrootd_PgrwBuff: Recycle 65536@444203008 numF=10
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] Xrootd_PgrwAio: 0100 pgrd end 65536@444268544 result=65536 D-S=False-1 inF=4
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] Xrootd_Response: 0100 sending partial 8 info and 65600 data bytes
>> 220716 15:23:18 28019 Xrootd_PgrwBuff: Recycle 65536@444268544 numF=11
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] Xrootd_PgrwAio: 0100 pgrd end 65536@444334080 result=65536 D-S=False-1 inF=3
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] Xrootd_Response: 0100 sending partial 8 info and 65600 data bytes
>> 220716 15:23:18 28019 Xrootd_PgrwBuff: Recycle 65536@444334080 numF=12
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] Xrootd_PgrwAio: 0100 pgrd end 65536@444399616 result=65536 D-S=False-1 inF=2
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] Xrootd_Response: 0100 sending partial 8 info and 65600 data bytes
>> 220716 15:23:18 28019 Xrootd_PgrwBuff: Recycle 65536@444399616 numF=13
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] Xrootd_PgrwAio: 0100 pgrd end 65536@444465152 result=65536 D-S=False-1 inF=1
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] Xrootd_Response: 0100 sending partial 8 info and 65600 data bytes
>> 220716 15:23:18 28019 Xrootd_PgrwBuff: Recycle 65536@444465152 numF=14
>> 220716 15:23:18 28023 oss_AioWait: read completed for vokac.1773:[log in to unmask]; rc=0 result=65536 aiocb=0x7f039df90a00
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] Xrootd_PgrwAio: 0100 pgrd end 65536@444530688 result=65536 D-S=False-1 inF=0
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] Xrootd_Response: 0100 sending final 8 info and 65600 data bytes
>> 220716 15:23:18 28019 Xrootd_PgrwBuff: Recycle 65536@444530688 numF=15
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] Xrd_Link: Setting FD 30 ref to 2+-1 post=0
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] Xrootd_File: File::Ref 0 after +-1
>> 220716 15:23:18 28019 vokac.1773:[log in to unmask] Xrootd_PgrwAio: 0100 pgrw recycle R D-S=True-1
>> 220716 15:23:20 28043 vokac.1773:[log in to unmask] Xrootd_Protocol: 0100 request timeout; read 0 of 24 bytes
>> 220716 15:23:20 28043 Xrd_Poll: Poller 0 enabled vokac.1773:[log in to unmask]
>> 220716 15:23:37 28018 Xrd_Inet: Accepted connection on port 1095 [log in to unmask]
>> 220716 15:23:37 28018 Xrd_ProtLoad: matched port 1095 protocol XrdHttp
>> 220716 15:23:37 28018 anon.0:28@dpmhead Xrd_Poll: FD 28 attached to poller 1; num=1
>> 220716 15:23:37 28018 Xrd_Poll: Poller 1 enabled anon.0:28@dpmhead
>> 220716 15:23:37 28042 Xrd_Sched: running main accept inq=0
>> 220716 15:23:37 28021 anon.0:28@dpmhead Xrd_Poll: Link 28 terminating: hangup
>> 220716 15:23:37 28045 Xrd_Sched: running anon.0:28@dpmhead inq=0
>> 220716 15:23:37 28045 anon.0:28@dpmhead Xrd_Poll: Poller 1 removing FD 28
>> 220716 15:23:37 28045 anon.0:28@dpmhead Xrd_Poll: FD 28 detached from poller 1; num=0
>> 220716 15:23:41 28027 cms_Finder: Waiting for cms path /var/spool/xrootd/dpmdisk/.olb/olbd.admin
>> 220716 15:24:18 28020 vokac.1773:[log in to unmask] Xrd_Poll: Link 30 terminating: hangup
>> 220716 15:24:18 28019 Xrd_Sched: running vokac.1773:[log in to unmask] inq=0
>> 220716 15:24:18 28019 XrootdXeq: vokac.1773:[log in to unmask] disc 0:01:05
>> 220716 15:24:18 28019 vokac.1773:[log in to unmask] Xrootd_File: closing r /dpm/farm.particle.cz/home/dteam/5G.disk2
>> 220716 15:24:18 28019 vokac.1773:[log in to unmask] ofs_close: use=1 fn=/dpm/farm.particle.cz/home/dteam/5G.disk2
>> 220716 15:24:18 28019 vokac.1773:[log in to unmask] dpmoss_Close: return 0
>> 220716 15:24:18 28019 vokac.1773:[log in to unmask] Xrd_Poll: Poller 0 removing FD 30
>> 220716 15:24:18 28019 vokac.1773:[log in to unmask] Xrd_Poll: FD 30 detached from poller 0; num=0
>> 220716 15:24:18 28042 Xrd_Inet: Accepted connection on port 1095 [log in to unmask]
>>
>> Petr
>
> ------------------------------------------------------------------------
>
> Use REPLY-ALL to reply to list
>
> To unsubscribe from the XROOTD-L list, click the following link:
> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
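For anyone wanting to try Michal's suggestion or capture an equivalent client-side trace, here is a minimal sketch of the steps discussed above. XRD_CPUSEPGWRTRD is the variable from Michal's reply; XRD_LOGLEVEL, XRD_LOGFILE, and XRD_STREAMTIMEOUT are the standard XrdCl client environment variables; the endpoint and file path are the ones appearing in Petr's logs, so substitute your own:

```shell
# Disable pgread/pgwrite in the XrdCl client (Michal's suggestion).
export XRD_CPUSEPGWRTRD=0

# Standard XrdCl tracing knobs, to capture a client log like the one above.
export XRD_LOGLEVEL=Dump
export XRD_LOGFILE=/tmp/xrdcl-5.4.3.log

# Stream timeout: the "60s with no activity" after which xrdcopy drops
# the stuck connection and reconnects (60 is the default).
export XRD_STREAMTIMEOUT=60

# Source file taken from Petr's logs; replace with your own endpoint.
SRC=root://dpmhead.farm.particle.cz:1094//dpm/farm.particle.cz/home/dteam/5G.disk2

# Run the transfer only if the xrootd client tools are installed.
if command -v xrdcp >/dev/null 2>&1; then
  xrdcp -f "$SRC" /tmp/5G.disk2
fi
```

Comparing a trace captured this way against the 5.4.2 client under the same settings should show whether the stall is tied to the pgread path.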