Hi Andy,
First of all, thanks for the suggestion related to the SFS_STARTED return code for the XrdOfsFile::open
function. I was able to reliably reproduce this "deadlock" in Castor by delaying the initiator of the TPC transfer, therefore having the destination connect to the source but without a tpc.key
registered for the transfer. At this point the OFS layer replies with SFS_STARTED but I was converting it to SFS_ERROR in the Castor plugin. Now, this is well understood and fixed.
As a side node: the CastorOfs plugin actually inherits from the default OFS plugin, therefore the error
object is the same and I don't need to call it explicitly. But nevertheless, thanks for the detailed explanation.
Now, after fixing the problems in the Castor plugin the server would not deadlock anymore but the slow TPC transfers would still not succeed - here by slow I mean a transfer in which the destination would first receive a SFS_STARTED response. When such a response is given there a several things happening:
Ignore
flag:
https://github.com/xrootd/xrootd/blob/master/src/XrdCl/XrdClXRootDMsgHandler.cc#L150
Event without the Ignore
flag the timeout value would still be discarded but it would at least print some info about it here:
https://github.com/xrootd/xrootd/blob/master/src/XrdCl/XrdClXRootDMsgHandler.cc#L563
At the moment such a kXR_waitresp
message appears in the logs of the client as:[2015-10-06 00:38:03.422196 +0200][Dump ][XRootDTransport ] [msg: 0xf8000b70] Expecting 4 bytes of message body
[2015-10-06 00:38:03.422228 +0200][Dump ][AsyncSock ] [lxc2dev6d1.cern.ch:1095 #0.0] Received message header for 0xf8000b70 size: 8
[2015-10-06 00:38:03.422258 +0200][Dump ][AsyncSock ] [lxc2dev6d1.cern.ch:1095 #0.0] Received message 0xf8000b70 of 12 bytes
[2015-10-06 00:38:03.422285 +0200][Dump ][PostMaster ] [lxc2dev6d1.cern.ch:1095 #0] Handling received message: 0xf8000b70.
[2015-10-06 00:38:03.422310 +0200][Dump ][PostMaster ] [lxc2dev6d1.cern.ch:1095 #0] Ignoring the processing handler for: 0x60dd18.
[20
before the SFS_STARTED response was replied, there was a callback registered for the initial open operation which is triggered from here: https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdXeq.cc#L2854
When the callback is executed it performs specific actions depending on the initial operation. In this case i.e. for open it will send a kXR_wait
response to the client with a value of 0. See: https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdCallBack.cc#L146
this kXR_wait
response will actually be wrapped in a kXR_attn
response for kXR_asynresp
. The format of the message is here: http://www.xrootd.org/doc/dev42/XRdv300.htm#_Toc414892715
on the client side the parsing of this message is done in here: https://github.com/xrootd/xrootd/blob/master/src/XrdCl/XrdClXRootDMsgHandler.cc#L258
What happens here is that the fields specific to the kXR_attn
response are stripped and we are only left with the so called embeddedMsg
.
in our case this message is actually a kXR_wait
response. Therefore, when processing the response we call the unmarshaling function corresponding to this type of response: https://github.com/xrootd/xrootd/blob/master/src/XrdCl/XrdClXRootDTransport.cc#L881
Here we expect that the message has the format for the kXR_wait
response i.e.: http://www.xrootd.org/doc/dev42/XRdv300.htm#_Toc414892723
but if we look at our embedded message we see that it has the following structure which represents the last 4 entries from kXR_attn
response:
kXR_char streamid[2]
kXR_unt16 status
kXR_int32 dlen
kXR_char data[dlen]
so we are missing the seconds value which normally appears in the kXR_wait
response. Therefore, the unmarshaling of this response fails and in the client logs one can see the following message (scroll all the way to the right for the interesting part):
[2015-10-06 00:38:07.412154 +0200][Dump ][PostMaster ] [lxc2dev6d1.cern.ch:1095 #0] Queuing received message: 0xf8000b70.
[2015-10-06 00:38:07.412250 +0200][Dump ][XRootD ] [lxc2dev6d1.cern.ch:1095] Got an async response to message kXR_open (file: /castor/cern.ch/dev/e/esindril/dir_default/test2G_1.dat?tpc.key=00060120712f64035612fbcb&[log in to unmask], mode: 00, flags: kXR_open_read kXR_async kXR_retstat ), processing it
[2015-10-06 00:38:07.412363 +0200][Dump ][XRootD ] [lxc2dev6d1.cern.ch:1095] Invalid msg while unmarshalling body, resp->hdr.status=4005
[2015-10-06 00:38:07.412466 +0200][Debug ][File ] [0x10dde80@xroot://lxc2dev6d1.cern.ch:1095//castor/cern.ch/dev/e/esindril/dir_default/test2G_1.dat?tpc.key=00060120712f64035612fbcb&[log in to unmask]] Open has returned with status [FATAL] Invalid message
[2015-10-06 00:38:07.412497 +0200][Debug ][File ] [0x10dde80@xroot://lxc2dev6d1.cern.ch:1095//castor/cern.ch/dev/e/esindril/dir_default/test2G_1.dat?tpc.key=00060120712f64035612fbcb&[log in to unmask]] Error while opening at lxc2dev6d1.cern.ch:1095: [FATAL] Invalid message
So, in this case the transfer will fail with a [FATAL] Invalid message. What is the expected behaviour for this? Should the unmarshaling be adapted for
kXR_attn` messages?
Thanks,
Elvin
—
Reply to this email directly or view it on GitHub.
Use REPLY-ALL to reply to list
To unsubscribe from the XROOTD-DEV list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1