Hi Andy,

First of all, thanks for the suggestion related to the SFS_STARTED return code for the XrdOfsFile::open function. I was able to reliably reproduce this "deadlock" in Castor by delaying the initiator of the TPC transfer, so that the destination connects to the source before a tpc.key is registered for the transfer. At this point the OFS layer replies with SFS_STARTED, but I was converting it to SFS_ERROR in the Castor plugin. This is now well understood and fixed.

As a side note: the CastorOfs plugin actually inherits from the default OFS plugin, so the error object is the same and I don't need to call it explicitly. Nevertheless, thanks for the detailed explanation.
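
Just to make it concrete, the fixed open() now looks roughly like the sketch below. This is a minimal illustration, not the actual CastorOfs code: the class name and the body are made up for the example; only the XrdOfsFile base call and the SFS_* return codes are the real interfaces.

```cpp
// Minimal sketch, not the actual CastorOfs code: "CastorOfsFile" and the
// body below are illustrative; only the XrdOfsFile base call and the SFS_*
// return codes are the real interfaces.
#include <sys/types.h>
#include "XrdOfs/XrdOfs.hh"
#include "XrdSfs/XrdSfsInterface.hh"

class CastorOfsFile : public XrdOfsFile
{
public:
  using XrdOfsFile::XrdOfsFile;

  int open(const char*         path,
           XrdSfsFileOpenMode  openMode,
           mode_t              createMode,
           const XrdSecEntity* client,
           const char*         opaque = 0)
  {
    // Let the base OFS layer do the real work. Since we inherit from
    // XrdOfsFile, the 'error' object it fills in is our own inherited
    // member, so nothing needs to be copied back here.
    int rc = XrdOfsFile::open(path, openMode, createMode, client, opaque);

    // The old code collapsed everything != SFS_OK into SFS_ERROR, which
    // turned the legitimate SFS_STARTED reply (open deferred, callback
    // pending) for slow TPC opens into a hard failure. Pass it through.
    if (rc == SFS_OK || rc == SFS_STARTED)
      return rc;

    // ... Castor-specific handling of the remaining return codes ...
    return rc;
  }
};
```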

Now, after fixing the problems in the Castor plugin, the server would not deadlock anymore, but the slow TPC transfers would still not succeed - by "slow" I mean a transfer in which the destination first receives an SFS_STARTED response. When such a response is given, there are several things happening:

[2015-10-06 00:38:03.422196 +0200][Dump   ][XRootDTransport   ] [msg: 0xf8000b70] Expecting 4 bytes of message body
[2015-10-06 00:38:03.422228 +0200][Dump   ][AsyncSock         ] [lxc2dev6d1.cern.ch:1095 #0.0] Received message header for 0xf8000b70 size: 8
[2015-10-06 00:38:03.422258 +0200][Dump   ][AsyncSock         ] [lxc2dev6d1.cern.ch:1095 #0.0] Received message 0xf8000b70 of 12 bytes
[2015-10-06 00:38:03.422285 +0200][Dump   ][PostMaster        ] [lxc2dev6d1.cern.ch:1095 #0] Handling received message: 0xf8000b70.
[2015-10-06 00:38:03.422310 +0200][Dump   ][PostMaster        ] [lxc2dev6d1.cern.ch:1095 #0] Ignoring the processing handler for: 0x60dd18.
[2015-10-06 00:38:07.412154 +0200][Dump   ][PostMaster        ] [lxc2dev6d1.cern.ch:1095 #0] Queuing received message: 0xf8000b70.
[2015-10-06 00:38:07.412250 +0200][Dump   ][XRootD            ] [lxc2dev6d1.cern.ch:1095] Got an async response to message kXR_open (file: /castor/cern.ch/dev/e/esindril/dir_default/test2G_1.dat?tpc.key=00060120712f64035612fbcb&[log in to unmask], mode: 00, flags: kXR_open_read kXR_async kXR_retstat ), processing it
[2015-10-06 00:38:07.412363 +0200][Dump   ][XRootD            ] [lxc2dev6d1.cern.ch:1095] Invalid msg while unmarshalling body, resp->hdr.status=4005
[2015-10-06 00:38:07.412466 +0200][Debug  ][File              ] [0x10dde80@xroot://lxc2dev6d1.cern.ch:1095//castor/cern.ch/dev/e/esindril/dir_default/test2G_1.dat?tpc.key=00060120712f64035612fbcb&[log in to unmask]] Open has returned with status [FATAL] Invalid message
[2015-10-06 00:38:07.412497 +0200][Debug  ][File              ] [0x10dde80@xroot://lxc2dev6d1.cern.ch:1095//castor/cern.ch/dev/e/esindril/dir_default/test2G_1.dat?tpc.key=00060120712f64035612fbcb&[log in to unmask]] Error while opening at lxc2dev6d1.cern.ch:1095: [FATAL] Invalid message

So, in this case the transfer will fail with a [FATAL] Invalid message. What is the expected behaviour for this? Should the unmarshalling be adapted for `kXR_attn` messages?
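
Just to illustrate what I mean by "adapted": something along the lines of the sketch below, where the client checks whether the reply to the kXR_open request is actually a kXR_attn envelope before unmarshalling the body as an open response. All the type and helper names here (Reply, unmarshallOpenBody, unwrapAsynResp, ...) are invented for the example and are not the real XrdCl internals; only the kXR_* status values come from the protocol.

```cpp
// Purely illustrative sketch -- none of these types or helpers are the real
// XrdCl internals; they only exist to show the shape of the check I have in
// mind. Only the kXR_* status values below come from XProtocol.hh.
#include <cstdint>

// Simplified view of a server reply: header status plus an opaque body.
struct ReplyHeader {
  std::uint16_t status;  // kXR_ok, kXR_error, kXR_attn, ...
  std::uint32_t dlen;    // length of the body that follows
};

struct Reply {
  ReplyHeader hdr;
  const char* body;
};

// Response status codes as defined by the protocol.
constexpr std::uint16_t kXR_ok   = 0;
constexpr std::uint16_t kXR_attn = 4001;

// Stub: would decode the normal kXR_open response body (fhandle, stat, ...).
bool unmarshallOpenBody(const Reply& reply)
{
  return reply.hdr.status == kXR_ok;
}

// Stub: would parse the kXR_asynresp payload of a kXR_attn message and
// return the response header + body embedded inside it.
Reply unwrapAsynResp(const Reply& envelope)
{
  Reply inner = envelope;
  inner.hdr.status = kXR_ok;  // pretend the embedded reply was kXR_ok
  return inner;
}

// The point of the sketch: when the open reply arrives asynchronously, the
// real status is wrapped in a kXR_attn/kXR_asynresp envelope, so it has to
// be unwrapped before the body is unmarshalled as an open response.
bool handleOpenReply(const Reply& reply)
{
  if (reply.hdr.status == kXR_attn) {
    Reply inner = unwrapAsynResp(reply);
    return unmarshallOpenBody(inner);
  }

  return unmarshallOpenBody(reply);
}
```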

Thanks,
Elvin

