I'll complement the thread with Elvin's findings, as I think they are crucial to understanding the problem:

Below I've retraced what happened from both the client and the server perspective.

First there is the open of file id 1633520448 for writing:

[2018-03-16 17:14:27.601768 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Message kXR_open (file: [log in to unmask]&[log in to unmask]&castor.pfn2=0:15511:ae8aa0b2-2588-42d5-b758-66526f24d8cc&castor.signature=QZLFVGTqrdWgudfJxCSROSDY/Se446SI26b4WjozzPe5olJk38zqr/OEZSSC1bgHTytJTjtjCBmqCvqFRlp7Mg==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) has been successfully sent.

After this the client continues to write and then at 17:14:37 sends a close request for this file:

[2018-03-16 17:14:37.138279 +0100][Debug  ][File              ] [0x68008a40@root://tpsrv452.cern.ch:[log in to unmask]&castor.pfn1=/diskserver/48/
[log in to unmask]&castor.pfn2=0:15511:ae8aa0b2-2588-42d5-b758-66526f24d8cc&castor.signature=QZLFVGTqrdWgudfJxCSROSDY/Se446SI26b4WjozzPe5olJk38zqr/OEZSSC1bgHTytJTjtjCBmqCvqFRlp7Mg==&castor.txt
ype=tape] Sending a close command for handle 0x0 to tpsrv452.cern.ch:1095

On the server side we have a perfect match of the previous operations:

180316 17:14:27 time=1521216867.602143 func=open                     level=INFO  logid=19b3d1aa-2935-11e8-a8c6-a4bf0112f566 [log in to unmask]:1094 tid=140280503043840 source=XrdxCastor2Ofs:697   ti
dent=stage.27009:30@tpsrv210 [log in to unmask], opaque=castor.exptime=1521220467&[log in to unmask]&castor.pfn2=0:15511:ae8aa0b2-258
8-42d5-b758-66526f24d8cc&castor.signature=QZLFVGTqrdWgudfJxCSROSDY/Se446SI26b4WjozzPe5olJk38zqr/OEZSSC1bgHTytJTjtjCBmqCvqFRlp7Mg==&castor.txtype=tape, isRW=1, open_mode=4200, file_ptr=0x7f95740033d0
180316 17:14:27 165099 stage.27009:30@tpsrv210 castor2ofs_open: 4200-40600 [log in to unmask]
180316 17:14:37 165099 stage.27009:30@tpsrv210 castor2ofs_close: use=1 [log in to unmask]

The nice part of the last line of the logs is the "use=1", which is a sort of reference count for this file. The close then goes into the CASTOR plugin, which appears to block the request; the reason could be checksum recomputation or slowness of the diskmanager in replying to the close message. Either way, the close is blocked.
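To make the use=N lines easier to follow, here is a minimal sketch of a reference-counted open-file table, under the assumption (not verified against the XrdxCastor2Ofs source) that "use" simply counts how many opens are currently attached to the same file; FileTable and Entry are illustrative names, not real classes.

```cpp
// Minimal sketch (not the actual plugin code) of the counter behind the
// "use=N" log lines: every open attaches to a shared entry and bumps the
// count, every close decrements it, and the entry disappears only when
// the last user is gone.
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct Entry {
  int use = 0;   // what the plugin prints as "use=N"
};

class FileTable {
  std::mutex mtx_;
  std::map<std::string, std::shared_ptr<Entry>> files_;

public:
  // open: create the entry (use=1) or attach to an existing one (use=2, ...)
  std::shared_ptr<Entry> Open(const std::string& pfn) {
    std::lock_guard<std::mutex> lock(mtx_);
    auto& e = files_[pfn];
    if (!e) e = std::make_shared<Entry>();
    ++e->use;
    return e;
  }

  // close: detach; the file is really closed only when use drops to 0
  void Close(const std::string& pfn) {
    std::lock_guard<std::mutex> lock(mtx_);
    auto it = files_.find(pfn);
    if (it != files_.end() && --it->second->use == 0)
      files_.erase(it);
  }
};
```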

I've looked over the XRootD code and there is indeed a Link->Serialize() call for both open and close:
https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdXeq.cc#L482  - for close
https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdXeq.cc#L1353  - for open
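For illustration, here is a very simplified model of what that serialization implies, assuming Serialize() effectively means at most one open/close in flight per link; this is a toy sketch with invented names (LinkLike is not XRootD's XrdLink), not the real implementation. The point is only that a close stuck in the plugin keeps the link busy, so the next open on the same link waits.

```cpp
// Toy model: a per-link "serialize" mutex. While the close handler is blocked
// inside the plugin (checksum recomputation, slow diskmanager reply, ...),
// the open arriving on the same link cannot make progress.
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

struct LinkLike {
  std::mutex serial;   // stands in for the Link->Serialize() barrier
};

void handle_close(LinkLike& link) {
  std::lock_guard<std::mutex> lock(link.serial);
  std::this_thread::sleep_for(std::chrono::seconds(2));   // blocked in the plugin
  std::cout << "close finally returned\n";
}

void handle_open(LinkLike& link) {
  std::lock_guard<std::mutex> lock(link.serial);          // waits for the close
  std::cout << "open processed\n";
}

int main() {
  LinkLike link;
  std::thread closer(handle_close, std::ref(link));
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  std::thread opener(handle_open, std::ref(link));         // stalls behind the close
  closer.join();
  opener.join();
}
```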

While the link on the server side is blocked, the client keeps trying to open several other files, including the one we see triggering the open retry, namely file id 1633530374:

[2018-03-16 17:16:21.366785 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Message kXR_open (file: [log in to unmask]&[log in to unmask]&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) has been successfully sent.

This open is never replied to because the link is blocked by the previous close. After 60 seconds the socket is declared dead and all requests sent through that socket are timed out:

[2018-03-16 17:17:22.007993 +0100][Dump   ][XRootDTransport   ] [tpsrv452.cern.ch:1095 #0.0] Stream inactive since 61 seconds, stream timeout: 60, allocated SIDs: 4, wait barrier: 2018-03-16 17:14:37 +0100
[2018-03-16 17:17:22.008001 +0100][Dump   ][File              ] [0x64008ce0@root://tpsrv441.cern.ch:[log in to unmask]&castor.pfn1=/diskserver/91/
[log in to unmask]&castor.pfn2=0:15511:260734cd-4198-4f88-9aca-276ed72935b0&castor.signature=a60v/QBb9YfC0wEfFv22P9xqg4LMXSGIlNaaYWG/+PvtIGH8lSiEVEEKQSrrvocGKcylMkmHKROPNmwWgFsgzw==&castor.txt
ype=tape] Got state response for message kXR_write (handle: 0x00000000, offset: 1630535680, size: 5242880)
[2018-03-16 17:17:22.008032 +0100][Debug  ][AsyncSock         ] [tpsrv452.cern.ch:1095 #0.0] Closing the socket
[2018-03-16 17:17:22.008049 +0100][Debug  ][Poller            ] <[::ffff:137.138.222.81]:42468><--><[::ffff:10.32.133.140]:1095> Removing socket from the poller
[2018-03-16 17:17:22.008093 +0100][Debug  ][PostMaster        ] [tpsrv452.cern.ch:1095 #0] Recovering error for stream #0: [ERROR] Socket error.
[2018-03-16 17:17:22.008106 +0100][Debug  ][PostMaster        ] [tpsrv452.cern.ch:1095 #0] Reporting disconnection to queued message handlers.
[2018-03-16 17:17:22.008118 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Stream event reported for msg kXR_open (file: [log in to unmask]
5&[log in to unmask]&castor.pfn2=0:15511:a665e40e-6c12-406e-abcd-cc770a7b5fd0&castor.signature=AXlRWEpWwCisj0dX1yhcJYywv/nJuBKiXSdNEJ3kjnEdFOT02T/6ysbA6Mi9lhGeWvnucC
kQXdGwa6f+NIM8SA==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat )
[2018-03-16 17:17:22.008130 +0100][Debug  ][XRootD            ] [tpsrv452.cern.ch:1095] Handling error while processing kXR_open (file: [log in to unmask]
555&[log in to unmask]&castor.pfn2=0:15511:a665e40e-6c12-406e-abcd-cc770a7b5fd0&castor.signature=AXlRWEpWwCisj0dX1yhcJYywv/nJuBKiXSdNEJ3kjnEdFOT02T/6ysbA6Mi9lhGeWvnu
cCkQXdGwa6f+NIM8SA==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ): [ERROR] Socket error.
[2018-03-16 17:17:22.008146 +0100][Dump   ][PostMaster        ] [tpsrv452.cern.ch:1095 #0] Sending message kXR_open (file: [log in to unmask]&castor.pf
[log in to unmask]&castor.pfn2=0:15511:a665e40e-6c12-406e-abcd-cc770a7b5fd0&castor.signature=AXlRWEpWwCisj0dX1yhcJYywv/nJuBKiXSdNEJ3kjnEdFOT02T/6ysbA6Mi9lhGeWvnucCkQXdGwa6f+N
IM8SA==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) (0x7400a1a0) through substream 0 expecting answer at 0
[2018-03-16 17:17:22.009109 +0100][Debug  ][PostMaster        ] [tpsrv452.cern.ch:1095] Found 1 address(es): [::ffff:10.32.133.140]:1095
[2018-03-16 17:17:22.009149 +0100][Debug  ][AsyncSock         ] [tpsrv452.cern.ch:1095 #0.0] Attempting connection to [::ffff:10.32.133.140]:1095
[2018-03-16 17:17:22.009182 +0100][Debug  ][Poller            ] Adding socket 0x70005d60 to the poller
[2018-03-16 17:17:22.009209 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Stream event reported for msg kXR_open (file: [log in to unmask]
6&[log in to unmask]&castor.pfn2=0:15511:c3ea1292-06c4-4051-ae16-63c9ffa84884&castor.signature=ekXuu3hq3/ZYlcqOf2zVouol9mM8wJZ3pUkaf1k5474mk38cJPk8hqTpBmUpVgQjNBh75W
lm9fEDYJrtryRDKg==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat )
[2018-03-16 17:17:22.009221 +0100][Debug  ][XRootD            ] [tpsrv452.cern.ch:1095] Handling error while processing kXR_open (file: [log in to unmask]
516&[log in to unmask]&castor.pfn2=0:15511:c3ea1292-06c4-4051-ae16-63c9ffa84884&castor.signature=ekXuu3hq3/ZYlcqOf2zVouol9mM8wJZ3pUkaf1k5474mk38cJPk8hqTpBmUpVgQjNBh7
5Wlm9fEDYJrtryRDKg==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ): [ERROR] Socket error.
[2018-03-16 17:17:22.009244 +0100][Dump   ][PostMaster        ] [tpsrv452.cern.ch:1095 #0] Sending message kXR_open (file: [log in to unmask]&castor.pf
[log in to unmask]&castor.pfn2=0:15511:c3ea1292-06c4-4051-ae16-63c9ffa84884&castor.signature=ekXuu3hq3/ZYlcqOf2zVouol9mM8wJZ3pUkaf1k5474mk38cJPk8hqTpBmUpVgQjNBh75Wlm9fEDYJrtr
yRDKg==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) (0x58005220) through substream 0 expecting answer at 0
[2018-03-16 17:17:22.009255 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Stream event reported for msg kXR_close (handle: 0x00000000)
[2018-03-16 17:17:22.009265 +0100][Debug  ][XRootD            ] [tpsrv452.cern.ch:1095] Handling error while processing kXR_close (handle: 0x00000000): [ERROR] Socket error.
[2018-03-16 17:17:22.009275 +0100][Error  ][XRootD            ] [tpsrv452.cern.ch:1095] Unable to get the response to request kXR_close (handle: 0x00000000)
[2018-03-16 17:17:22.009295 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Stream event reported for msg kXR_open (file: [log in to unmask]
1&[log in to unmask]&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqk
vH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat )
[2018-03-16 17:17:22.009306 +0100][Debug  ][XRootD            ] [tpsrv452.cern.ch:1095] Handling error while processing kXR_open (file: [log in to unmask]&[log in to unmask]&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ): [ERROR] Socket error.
[2018-03-16 17:17:22.009309 +0100][Debug  ][File              ] [0x68008a40@root://tpsrv452.cern.ch:[log in to unmask]&[log in to unmask]&castor.pfn2=0:15511:ae8aa0b2-2588-42d5-b758-66526f24d8cc&castor.signature=QZLFVGTqrdWgudfJxCSROSDY/Se446SI26b4WjozzPe5olJk38zqr/OEZSSC1bgHTytJTjtjCBmqCvqFRlp7Mg==&castor.txtype=tape] Close returned from tpsrv452.cern.ch:1095 with: [ERROR] Socket error

As can be seen from the logs above, there are actually three open requests which are dropped and retried because the socket is declared dead, plus one close request, the one from the first transfer, which never returned.
I think this also matches Eric's observation; he was suspecting cross-talk in the client ...
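The client-side part of the story (socket declared dead after 60 idle seconds, all in-flight requests handed back for recovery) can be summarized with a rough sketch; this is an assumption about the behaviour visible in the log, not the actual XrdCl code, and InFlight/Stream/CheckTimeout are invented names.

```cpp
// Rough sketch of the "Stream inactive since 61 seconds, stream timeout: 60"
// decision: if a stream with outstanding requests has been idle longer than
// the timeout, the socket is torn down and every queued request gets a
// socket error; opens are then retried on a fresh connection, the close fails.
#include <chrono>
#include <iostream>
#include <string>
#include <vector>

struct InFlight {
  std::string request;   // e.g. "kXR_open ..." or "kXR_close handle 0x0"
};

struct Stream {
  std::chrono::steady_clock::time_point lastActivity;
  std::vector<InFlight> outstanding;   // roughly the "allocated SIDs" in the log
};

bool CheckTimeout(Stream& s, std::chrono::seconds timeout) {
  const auto idle = std::chrono::steady_clock::now() - s.lastActivity;
  if (s.outstanding.empty() || idle < timeout)
    return false;                      // stream still considered alive

  // Declare the socket dead and report the error to every message handler.
  for (const auto& r : s.outstanding)
    std::cout << "recovering " << r.request << ": [ERROR] Socket error\n";
  s.outstanding.clear();
  return true;
}

int main() {
  Stream s{std::chrono::steady_clock::now() - std::chrono::seconds(61),
           {{"kXR_open file A"}, {"kXR_open file B"}, {"kXR_close handle 0x0"}}};
  CheckTimeout(s, std::chrono::seconds(60));   // prints one recovery line per request
}
```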

The more interesting part is on the server side, where we see the effect of all these recoveries, for example for file id 1633531326:

180316 17:17:22 time=1521217042.015626 func=open                     level=INFO  logid=81a93368-2935-11e8-8389-a4bf0112f566 [log in to unmask]:1094 tid=140281188312832 source=XrdxCastor2Ofs:697   ti
dent=stage.27009:35@tpsrv210 [log in to unmask], opaque=castor.exptime=1521220555&[log in to unmask]&castor.pfn2=0:15511:a665e40e-6c1
2-406e-abcd-cc770a7b5fd0&castor.signature=AXlRWEpWwCisj0dX1yhcJYywv/nJuBKiXSdNEJ3kjnEdFOT02T/6ysbA6Mi9lhGeWvnucCkQXdGwa6f+NIM8SA==&castor.txtype=tape, isRW=1, open_mode=4200, file_ptr=0x7f95ac104810
180316 17:17:22 153326 stage.27009:35@tpsrv210 castor2ofs_open: 4200-40600 [log in to unmask]

The open above came in through the new socket as a result of the recovery. Later, once the blocking close has finished and the link mutex is released, we finally see the original open going through:

180316 17:18:59 time=1521217139.965050 func=open                     level=INFO  logid=bc0b1b0c-2935-11e8-a8c6-a4bf0112f566 [log in to unmask]:1094 tid=140280503043840 source=XrdxCastor2Ofs:697   tident=stage.27009:30@tpsrv210 [log in to unmask], opaque=castor.exptime=1521220555&[log in to unmask]&castor.pfn2=0:15511:a665e40e-6c12-406e-abcd-cc770a7b5fd0&castor.signature=AXlRWEpWwCisj0dX1yhcJYywv/nJuBKiXSdNEJ3kjnEdFOT02T/6ysbA6Mi9lhGeWvnucCkQXdGwa6f+NIM8SA==&castor.txtype=tape, isRW=1, open_mode=4200, file_ptr=0x7f95740033d0
180316 17:18:59 165099 stage.27009:30@tpsrv210 castor2ofs_open: 4200-40600 [log in to unmask]
180316 17:19:00 165099 stage.27009:30@tpsrv210 castor2ofs_open: attach use=2 [log in to unmask]
180316 17:19:00 165099 stage.27009:30@tpsrv210 castor2ofs_close: use=2 [log in to unmask]
180316 17:19:00 time=1521217140.187768 func=close                    level=ERROR logid=bc0b1b0c-2935-11e8-a8c6-a4bf0112f566 [log in to unmask]:1094 tid=140280503043840 source=XrdxCastor2Ofs:974   tident=stage.27009:30@tpsrv210 [log in to unmask] msg="closed via destructor"
180316 17:19:00 165099 castor2ofs_close: stage.27009:30@tpsrv210 Unable to close file - delete via destructor ; connection reset by peer
180316 17:19:00 time=1521217140.187819 func=close                    level=INFO  logid=bc0b1b0c-2935-11e8-a8c6-a4bf0112f566 [log in to unmask]:1094 tid=140280503043840 source=XrdxCastor2Ofs:1007  tident=stage.27009:30@tpsrv210 msg="send to diskmanager" errc=104 errmsg="Unable to close file - delete via destructor ; connection reset by peer"
180316 17:19:00 165099 stage.27009:30@tpsrv210 castor2ofs_close: use=0 fn=dummy
180316 17:19:00 165099 XrootdXeq: Output file [log in to unmask] is already opened by 1 writer; open denied.
180316 17:19:00 165099 stage.27009:30@tpsrv210 XrootdResponse: sending err 3003: Output file [log in to unmask] is already opened by 1 writer; open denied.

Surprisingly, or not, we can see that XRootD detected that this is an open of a file already opened for write (notice the use=2) and denies the open. But this happens after the CASTOR plugin has been called, so it is useless: the damage is already done. I suppose this is what Andy fixed - making the check happen before the plugin call.
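To spell out that last point, here is a hedged sketch of the ordering issue with invented names (this is not the real XrdXrootdXeq code); the only difference between the two variants is whether the duplicate-writer check runs before or after the plugin's open is invoked.

```cpp
// Illustrative only: the "already opened by 1 writer" check is useful as a
// guard only if it runs before the OFS plugin gets to touch the file.
#include <iostream>
#include <map>
#include <string>

std::map<std::string, int> useCount;            // per-file writer count ("use=N")

int PluginOpen(const std::string& pfn) {        // stands in for the CASTOR plugin
  ++useCount[pfn];
  std::cout << "plugin open " << pfn << " use=" << useCount[pfn] << "\n";
  return 0;
}

// Order observed in the logs: plugin open runs first, the writer check only
// afterwards, so the denial arrives after the damage is already done.
int OpenCheckedAfter(const std::string& pfn) {
  int rc = PluginOpen(pfn);
  if (useCount[pfn] > 1) return 3003;           // "already opened by 1 writer; open denied"
  return rc;
}

// The fix attributed to Andy: reject the duplicate writer before the plugin.
int OpenCheckedBefore(const std::string& pfn) {
  if (useCount[pfn] > 0) return 3003;           // deny up front, plugin never called
  return PluginOpen(pfn);
}

int main() {
  OpenCheckedAfter("pfn");    // first writer, allowed
  OpenCheckedAfter("pfn");    // denied, but only after the second plugin open
}
```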

