Print

Print


Hi Elvin,

When I traced through the code, the wait time inthe waitresp is actually
set in the message handler early on and that value controls the wait (at
least that is what I saw). So, having it ignored should be just fine in
this context. The SFS_WAIT with 0 seconds is simply our way of saying
retry the operation. You should apply my latest fix (f8ec5c61d) as the TPC
code was indeed sending the message back without a value (well, it was
actually sending an SFS_OK which has no wait value). I caught that and it
should have made the client complain about an invalid message. Anyway,
with those fixes my stuff works here.

Andy

On Mon, 5 Oct 2015, Elvin Sindrilaru wrote:

>
> Hi Andy,
>
> First of all, thanks for the suggestion related to the SFS_STARTED return code for the `XrdOfsFile::open` function. I was able to reliably reproduce this "deadlock" in Castor by delaying the initiator of the TPC transfer, therefore having the destination connect to the source but without a `tpc.key` registered for the transfer. At this point the OFS layer replies with SFS_STARTED but I was converting it to SFS_ERROR in the Castor plugin. Now, this is well understood and fixed.
>
> As a side node: the CastorOfs plugin actually inherits from the default OFS plugin, therefore the `error` object is the same and I don't need to call it explicitly. But nevertheless, thanks for the detailed explanation.
>
> Now, after fixing the problems in the Castor plugin the server would not deadlock anymore but the slow TPC transfers would still not succeed - here by slow I mean a transfer in which the destination would first receive a SFS_STARTED response. When such a response is given there a several things happening:
>
> - there is kXR_waitresp message sent to the client which is completely ignored.
> https://github.com/xrootd/xrootd/blob/master/src/XrdCl/XrdClStream.cc#L449
> The time value specified by this response is not taken into consideration at all since the message is discarded. I believe the follwoing line has been modified recently by adding the `Ignore` flag:
> https://github.com/xrootd/xrootd/blob/master/src/XrdCl/XrdClXRootDMsgHandler.cc#L150
> Event without the `Ignore` flag the timeout value would still be discarded but it would at least print some info about it here:
> https://github.com/xrootd/xrootd/blob/master/src/XrdCl/XrdClXRootDMsgHandler.cc#L563
> At the moment such a `kXR_waitresp` message appears in the logs of the client as:
> ```bash
> [2015-10-06 00:38:03.422196 +0200][Dump ][XRootDTransport ] [msg: 0xf8000b70] Expecting 4 bytes of message body
> [2015-10-06 00:38:03.422228 +0200][Dump ][AsyncSock ] [lxc2dev6d1.cern.ch:1095 #0.0] Received message header for 0xf8000b70 size: 8
> [2015-10-06 00:38:03.422258 +0200][Dump ][AsyncSock ] [lxc2dev6d1.cern.ch:1095 #0.0] Received message 0xf8000b70 of 12 bytes
> [2015-10-06 00:38:03.422285 +0200][Dump ][PostMaster ] [lxc2dev6d1.cern.ch:1095 #0] Handling received message: 0xf8000b70.
> [2015-10-06 00:38:03.422310 +0200][Dump ][PostMaster ] [lxc2dev6d1.cern.ch:1095 #0] Ignoring the processing handler for: 0x60dd18.
> [20
> ```
>
> - before the SFS_STARTED response was replied, there was a callback registered for the initial open operation which is triggered from here: https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdXeq.cc#L2854
> When the callback is executed it performs specific actions depending on the initial operation. In this case i.e. for open it will send a `kXR_wait` response to the client with a value of 0. See: https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdCallBack.cc#L146
>
> - this `kXR_wait` response will actually be wrapped in a `kXR_attn` response for `kXR_asynresp`. The format of the message is here: http://www.xrootd.org/doc/dev42/XRdv300.htm#_Toc414892715
>
> - on the client side the parsing of this message is done in here: https://github.com/xrootd/xrootd/blob/master/src/XrdCl/XrdClXRootDMsgHandler.cc#L258
> What happens here is that the fields specific to the `kXR_attn` response are stripped and we are only left with the so called `embeddedMsg`.
>
> - in our case this message is actually a `kXR_wait` response. Therefore, when processing the response we call the unmarshaling function corresponding to this type of response: https://github.com/xrootd/xrootd/blob/master/src/XrdCl/XrdClXRootDTransport.cc#L881
> Here we expect that the message has the format for the `kXR_wait` response i.e.: http://www.xrootd.org/doc/dev42/XRdv300.htm#_Toc414892723
>
> - but if we look at our embedded message we see that it has the following structure which represents the last 4 entries from `kXR_attn` response:
> kXR_char streamid[2]
> kXR_unt16 status
> kXR_int32 dlen
> kXR_char data[dlen]
>
> - so we are missing the seconds value which normally appears in the `kXR_wait` response. Therefore, the unmarshaling of this response fails and in the client logs one can see the following message (scroll all the way to the right for the interesting part):
>
> ```bash
> [2015-10-06 00:38:07.412154 +0200][Dump ][PostMaster ] [lxc2dev6d1.cern.ch:1095 #0] Queuing received message: 0xf8000b70.
> [2015-10-06 00:38:07.412250 +0200][Dump ][XRootD ] [lxc2dev6d1.cern.ch:1095] Got an async response to message kXR_open (file: /castor/cern.ch/dev/e/esindril/dir_default/test2G_1.dat?tpc.key=00060120712f64035612fbcb&[log in to unmask], mode: 00, flags: kXR_open_read kXR_async kXR_retstat ), processing it
> [2015-10-06 00:38:07.412363 +0200][Dump ][XRootD ] [lxc2dev6d1.cern.ch:1095] Invalid msg while unmarshalling body, resp->hdr.status=4005
> [2015-10-06 00:38:07.412466 +0200][Debug ][File ] [0x10dde80@xroot://lxc2dev6d1.cern.ch:1095//castor/cern.ch/dev/e/esindril/dir_default/test2G_1.dat?tpc.key=00060120712f64035612fbcb&[log in to unmask]] Open has returned with status [FATAL] Invalid message
> [2015-10-06 00:38:07.412497 +0200][Debug ][File ] [0x10dde80@xroot://lxc2dev6d1.cern.ch:1095//castor/cern.ch/dev/e/esindril/dir_default/test2G_1.dat?tpc.key=00060120712f64035612fbcb&[log in to unmask]] Error while opening at lxc2dev6d1.cern.ch:1095: [FATAL] Invalid message
> ```
> So, in this case the transfer will fail with a `[FATAL] Invalid message. What is the expected behaviour for this? Should the unmarshaling be adapted for `kXR_attn` messages?
>
> Thanks,
> Elvin
>
>
> ---
> Reply to this email directly or view it on GitHub:
> https://github.com/xrootd/xrootd/issues/290#issuecomment-145699862


Reply to this email directly or view it on GitHub.



Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-DEV list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1