Print

Print


Hi. I'm adding some information outline some scenarios where this patch should avoid a problem.

First example uses the reproducer described for issue #1808, and I've added some extra logging to see how the current method ( XRootDTransport::GetBody ) is working. Here is the extra logging added to GetBody, i.e.

 XRootDStatus XRootDTransport::GetBody( Message &message, Socket *socket )
  {
    //--------------------------------------------------------------------------
    // Retrieve the body
    //--------------------------------------------------------------------------
    size_t   leftToBeRead = 0;
    uint32_t bodySize = 0;
    ServerResponseHeader* rsphdr = (ServerResponseHeader*)message.GetBuffer();
    if( rsphdr->status != kXR_status ) {
      bodySize = rsphdr->dlen;
      std::cerr << "Entered GetBody, non kXR_status call, set bodySize=" << bodySize << std::endl;
    } else
    {
      size_t stlen = sizeof( ServerResponseStatus ); // we read everything up to the offset
      std::cerr << "Entered GetBody, with a kXR_status comparing message.GetCursor()=" << message.GetCursor() << " with stlen=" << stlen << std::endl;
      if( message.GetCursor() < stlen ) {
        bodySize = rsphdr->dlen;
        std::cerr << "After comparing set bodySize to header dlen=" << bodySize << std::endl;
      } else
      {
        ServerResponseStatus *rspst = (ServerResponseStatus*)message.GetBuffer();
        bodySize = rspst->hdr.dlen + rspst->bdy.dlen;
        std::cerr << "After comparing set bodySize to header + body dlen=" << bodySize << std::endl;
      }

    }

    if( message.GetSize() < bodySize + 8 )
      message.ReAllocate( bodySize + 8 );

    leftToBeRead = bodySize-(message.GetCursor()-8);

    std::cerr << "In GetBody entering read loop with leftToBeRead initially " << leftToBeRead << std::endl;
    while( leftToBeRead )
    {
      int bytesRead = 0;
      XRootDStatus status = socket->Read( message.GetBufferAtCursor(), leftToBeRead, bytesRead );

      if( !status.IsOK() || status.code == suRetry ) {
        std::cerr << "Leaving GetBody " << (status.IsOK() ? "(suRery)" : "(!status.IsOK)") << " leftToBeRead=" << leftToBeRead << std::endl;
        return status;
      }

      leftToBeRead -= bytesRead;
      message.AdvanceCursor( bytesRead );
    }

    std::cerr << "Leaving GetBody suDone" << std::endl;
    return XRootDStatus( stOK, suDone );
  }

making the transfer, as in issue #1808

$ export XRD_LOGLEVEL=Dump
$ LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/data/dhsmith/xrootd4/xrootd-pybindings/bbuild/src/XrdCl  gfal-copy
--just-copy -f -t 60 root://xcache1.farm.particle.cz//root://eosatlas.cern.ch//eos/atlas/atlasscratchdisk/SAM/1G file:///tmp/1G.xrdcopy
[...]
[2022-11-24 11:30:21.044938 +0100][Dump   ][XRootDTransport   ] [msg: 0x9800dc50] Expecting 24 bytes of message body
[2022-11-24 11:30:21.044953 +0100][Dump   ][AsyncSock         ] [xcache1.farm.particle.cz:1094.0] Received message header for 0x9800dc50 size: 8
[2022-11-24 11:30:21.044966 +0100][Dump   ][XRootD            ] [xcache1.farm.particle.cz:1094] Got a kXR_status response to request kXR_pgread (handle: 0x00000000, offset: 352321536, size: 8388608)
[2022-11-24 11:30:21.044980 +0100][Debug  ][ExDbgMsg          ] [msg: 0x9800dc50] Assigned MsgHandler: 0xa400b1e0.
Entered GetBody, with a kXR_status comparing message.GetCursor()=8 with stlen=24
After comparing set bodySize to header dlen=24
In GetBody entering read loop with leftToBeRead initially 24
Leaving GetBody suDone
[2022-11-24 11:30:21.045084 +0100][Dump   ][AsyncSock         ] [xcache1.farm.particle.cz:1094.0] Received message 0x9800dc50 of 65632 bytes
[2022-11-24 11:30:21.045100 +0100][Dump   ][PostMaster        ] [xcache1.farm.particle.cz:1094] Handling received message: 0x9800dc50.
[2022-11-24 11:30:21.045110 +0100][Dump   ][PostMaster        ] [xcache1.farm.particle.cz:1094] Ignoring the processing handler for: 0xb7e693f8.
[2022-11-24 11:30:21.045132 +0100][Dump   ][XRootDTransport   ] [msg: 0x9800dd80] Expecting 24 bytes of message body
[2022-11-24 11:30:21.045147 +0100][Dump   ][AsyncSock         ] [xcache1.farm.particle.cz:1094.0] Received message header for 0x9800dd80 size: 8
[2022-11-24 11:30:21.045157 +0100][Dump   ][XRootD            ] [xcache1.farm.particle.cz:1094] Got a kXR_status response to request kXR_pgread (handle: 0x00000000, offset: 352321536, size: 8388608)
[2022-11-24 11:30:21.045169 +0100][Debug  ][ExDbgMsg          ] [msg: 0x9800dd80] Assigned MsgHandler: 0xa400b1e0.
Entered GetBody, with a kXR_status comparing message.GetCursor()=8 with stlen=24
After comparing set bodySize to header dlen=24
In GetBody entering read loop with leftToBeRead initially 24
Leaving GetBody suDone
[2022-11-24 11:30:21.045268 +0100][Dump   ][AsyncSock         ] [xcache1.farm.particle.cz:1094.0] Received message 0x9800dd80 of 65632 bytes
[2022-11-24 11:30:21.045283 +0100][Dump   ][PostMaster        ] [xcache1.farm.particle.cz:1094] Handling received message: 0x9800dd80.
[2022-11-24 11:30:21.045294 +0100][Dump   ][PostMaster        ] [xcache1.farm.particle.cz:1094] Ignoring the processing handler for: 0xb7e693f8.
[2022-11-24 11:30:21.045310 +0100][Dump   ][XRootDTransport   ] [msg: 0x9800de00] Expecting 24 bytes of message body
[2022-11-24 11:30:21.045321 +0100][Dump   ][AsyncSock         ] [xcache1.farm.particle.cz:1094.0] Received message header for 0x9800de00 size: 8
[2022-11-24 11:30:21.045331 +0100][Dump   ][XRootD            ] [xcache1.farm.particle.cz:1094] Got a kXR_status response to request kXR_pgread (handle: 0x00000000, offset: 352321536, size: 8388608)
[2022-11-24 11:30:21.045342 +0100][Debug  ][ExDbgMsg          ] [msg: 0x9800de00] Assigned MsgHandler: 0xa400b1e0.
Entered GetBody, with a kXR_status comparing message.GetCursor()=8 with stlen=24
After comparing set bodySize to header dlen=24
In GetBody entering read loop with leftToBeRead initially 24
Leaving GetBody (suRery) leftToBeRead=4
Entered GetBody, with a kXR_status comparing message.GetCursor()=28 with stlen=24
After comparing set bodySize to header + body dlen=1073742104
In GetBody entering read loop with leftToBeRead initially 1073742084
Leaving GetBody (suRery) leftToBeRead=1073485044
Entered GetBody, with a kXR_status comparing message.GetCursor()=257068 with stlen=24
After comparing set bodySize to header + body dlen=1073742104
In GetBody entering read loop with leftToBeRead initially 1073485044
Leaving GetBody (suRery) leftToBeRead=1073475048
Entered GetBody, with a kXR_status comparing message.GetCursor()=267064 with stlen=24
After comparing set bodySize to header + body dlen=1073742104
In GetBody entering read loop with leftToBeRead initially 1073475048
Leaving GetBody (suRery) leftToBeRead=1073356524
[...]

During an initial part of the transfer there are a series of calls to GetBody, one for each kxr_pgread (partial or full repose): it is understood that for each a call to GetHeader has already read 8 bytes from the stream and then GetBody reads 24 bytes further and returns complete (suDone). Eventually during one GetBody call, only 20 bytes are available to read and GetBody returns suRetry. On retry GetBody establishes a target of 1073742084 bytes to read, after which are many calls each returning suRetry, slowly reading the incoming stream towards the target count. At this point the transfer is stalled.

It is understood that the expected behaviour for kXR_status (like kxr_pgread) responses is that GetBody should read the up data[] and then return suDone. It was understood that GetBody also has the functionality to read the data[] portion of the message with another set of call(s) after the first suDone. This latter functionality is not usually used for kxr_pgread responses as the data portion is usually read by the AsyncPageReader.

For a kXR_pgread response the bytes before the data[] are a ServerResponseStatus+ServerResponseBody_pgRead (i.e. 32 bytes). This is what is read when all 8 (GetHeader) +24 (GetBody) = 32 bytes are available and there's 1 call of GetBody, at the end of which it returns suDone.

However if, e.g. 20 bytes are read during the first call of GetBody the method needs to be called again to read the reaming 4. However enough bytes have been read to have a complete ServerResponseStatus structure. The current check of "message.GetCursor() < stlen" is false, and the method goes ahead and includes the body length (i.e. ServerResponseBody_Status .dlen). At this point that quantity is still in network byte order. The bytes are 40 00 01 00 intending to indicate a length of 65600 (i.e. 16 pages of data+crc), but in reverse order this length is 1073742080, leading to the large size for leftToBeRead.

The patch attempts to ensure that the first suDone (possibly with suReady before) happens only when exactly the response header, i.e. 8 + (ServerResponseHeader.dlen), bytes have been read.

In a separate issue (i.e. not an issue on github; an Alice issue with a client side slow transfer, reported via eos channels and highlighted on a discussion channel at cern) there was also a problem which seems consistent with this; this time client 5.4.3 was involved. There was this logged:

[2022-11-18 09:58:02.241001 +0100][Dump   ][XRootDTransport   ] [msg: 0x70002a10] Expecting 24 bytes of message body
[2022-11-18 09:58:02.241007 +0100][Dump   ][AsyncSock         ] [st-096-hh1510b1.cern.ch:1095.0] Received message header for 0x70002a10 size: 8
[2022-11-18 09:58:02.241012 +0100][Dump   ][XRootD            ] [st-096-hh1510b1.cern.ch:1095] Got a kXR_status response to request kXR_pgread (handle: 0x00000000, offset: 83886080, size: 8388608)
[2022-11-18 09:58:02.241017 +0100][Debug  ][ExDbgMsg          ] [msg: 0x70002a10] Assigned MsgHandler: 0x1e893a0.
[2022-11-18 09:58:02.241029 +0100][Debug  ][ExDbgMsg          ] [handler: 0x1e893a0] Removed MsgHandler: 0x1e893a0 from the in-queue.
[2022-11-18 09:58:02.242113 +0100][Dump   ][File              ] [0x1e86770@file://localhost/var/tmp/dump?oss.asize=1086088693&xrdcl.requuid=4655718c-1150-4931-a35a-32b1d890b16a] Got state response for message kXR_write (handle: 0x0e000000, offset: 75497472, size: 8388608)
[2022-11-18 09:58:02.242941 +0100][Dump   ][AsyncSock         ] [st-096-hh1510b1.cern.ch:1095.0] Received message 0x70002a10 of 2099232 bytes
[2022-11-18 09:58:02.242959 +0100][Dump   ][PostMaster        ] [st-096-hh1510b1.cern.ch:1095] Handling received message: 0x70002a10.
[2022-11-18 09:58:02.242970 +0100][Dump   ][XRootDTransport   ] [msg: 0x70002a60] Expecting 24 bytes of message body
[2022-11-18 09:58:02.242975 +0100][Dump   ][AsyncSock         ] [st-096-hh1510b1.cern.ch:1095.0] Received message header for 0x70002a60 size: 8
[2022-11-18 09:58:02.242980 +0100][Dump   ][XRootD            ] [st-096-hh1510b1.cern.ch:1095] Got a kXR_status response to request kXR_pgread (handle: 0x00000000, offset: 92274688, size: 8388608)
[2022-11-18 09:58:02.242980 +0100][Dump   ][XRootD            ] [st-096-hh1510b1.cern.ch:1095] Got a kXR_status response to request kXR_pgread (handle: 0x00000000, offset: 83886080, size: 8388608)
[2022-11-18 09:58:02.242987 +0100][Debug  ][ExDbgMsg          ] [msg: 0x70002a60] Assigned MsgHandler: 0x1e8a7d0.
[2022-11-18 09:58:02.242994 +0100][Debug  ][ExDbgMsg          ] [st-096-hh1510b1.cern.ch:1095] Calling MsgHandler: 0x1e893a0 (message: kXR_pgread (handle: 0x00000000, offset: 83886080, size: 8388608) ) with status: [SUCCESS] .
[2022-11-18 09:58:02.243004 +0100][Dump   ][XRootD            ] [st-096-hh1510b1.cern.ch:1095] Parsing the response to kXR_pgread (handle: 0x00000000, offset: 83886080, size: 8388608) as PageInfo
[2022-11-18 09:58:02.243010 +0100][Dump   ][File              ] [0x1e816b0@root://localhost:1094//eos/alice/grid/02/54455/db3a3be1-6045-11ed-8086-0242b76f9cbf?eos.ruid=0&xrdcl.requuid=f1738628-3ea8-4333-82e9-1602356be81a] Got state response for message kXR_pgread (handle: 0x00000000, offset: 83886080, size: 8388608)
[2022-11-18 09:58:02.244544 +0100][Debug  ][ExDbgMsg          ] [st-096-hh1510b1.cern.ch:1095] Destroying MsgHandler: 0x1e893a0.
[2022-11-18 09:58:02.245610 +0100][Debug  ][File              ] [0x1e86770@file://localhost/var/tmp/dump?oss.asize=1086088693&xrdcl.requuid=4655718c-1150-4931-a35a-32b1d890b16a] Sending a write command for handle 0xe to localhost
[2022-11-18 09:58:02.245686 +0100][Debug  ][File              ] [0x1e816b0@root://localhost:1094//eos/alice/grid/02/54455/db3a3be1-6045-11ed-8086-0242b76f9cbf?eos.ruid=0&xrdcl.requuid=f1738628-3ea8-4333-82e9-1602356be81a] Sending a pgread command for handle 0x0 to st-096-hh1510b1.cern.ch:1095
[2022-11-18 09:58:02.245698 +0100][Dump   ][XRootD            ] [st-096-hh1510b1.cern.ch:1095] Sending message kXR_pgread (handle: 0x00000000, offset: 117440512, size: 8388608)
[2022-11-18 09:58:02.245706 +0100][Debug  ][ExDbgMsg          ] [st-096-hh1510b1.cern.ch:1095] MsgHandler created: 0x1e893a0 (message: kXR_pgread (handle: 0x00000000, offset: 117440512, size: 8388608) ).
[2022-11-18 09:58:02.245718 +0100][Dump   ][PostMaster        ] [st-096-hh1510b1.cern.ch:1095] Sending message kXR_pgread (handle: 0x00000000, offset: 117440512, size: 8388608) (0x1e884e0) through substream 0 expecting answer at 0
[2022-11-18 09:58:02.246397 +0100][Dump   ][AsyncSock         ] [st-096-hh1510b1.cern.ch:1095.0] Wrote a message: kXR_pgread (handle: 0x00000000, offset: 117440512, size: 8388608) (0x1e884e0), 26 bytes
[2022-11-18 09:58:02.246422 +0100][Dump   ][AsyncSock         ] [st-096-hh1510b1.cern.ch:1095.0] Successfully sent message: kXR_pgread (handle: 0x00000000, offset: 117440512, size: 8388608) (0x1e884e0).
[2022-11-18 09:58:02.246433 +0100][Dump   ][XRootD            ] [st-096-hh1510b1.cern.ch:1095] Message kXR_pgread (handle: 0x00000000, offset: 117440512, size: 8388608) has been successfully sent.
[2022-11-18 09:58:02.246441 +0100][Debug  ][ExDbgMsg          ] [st-096-hh1510b1.cern.ch:1095] Moving MsgHandler: 0x1e893a0 (message: kXR_pgread (handle: 0x00000000, offset: 117440512, size: 8388608) ) from out-queu to in-queue.
[2022-11-18 09:58:02.246451 +0100][Dump   ][PostMaster        ] [st-096-hh1510b1.cern.ch:1095.0] All messages consumed, disable uplink
[2022-11-18 09:58:02.246516 +0100][Dump   ][AsyncSock         ] [st-096-hh1510b1.cern.ch:1095.0] Received message 0x70002a60 of 2631712 bytes
[2022-11-18 09:58:02.246529 +0100][Dump   ][PostMaster        ] [st-096-hh1510b1.cern.ch:1095] Handling received message: 0x70002a60.
[2022-11-18 09:58:02.246535 +0100][Dump   ][PostMaster        ] [st-096-hh1510b1.cern.ch:1095] Ignoring the processing handler for: 0x618438.
[2022-11-18 09:58:02.246548 +0100][Dump   ][XRootDTransport   ] [msg: 0x700008d0] Expecting -1100653922 bytes of message body

i.e. generally response messages of 2099232 bytes were being received, until one of 2631712 after which there's a notice of an unlikely expected number of bytes '-1100653922' and then the transfer does not proceed for 60 seconds. The interpretation I saw was that the actual data size in the kxr_pgread response (ServerResponseBody_Status.dlen) was likely 00 20 08 00 (2099200 = 512 pages+crcs). As in the previously described scenario, 32 + 0x82000 = 532512 bytes are read by GetHeader+GetBody. Then the header is unmarshalled, and AsyncReader will read the intended 0x200800 pages+crcs; making a total of 2631712 bytes, as logged. (This covers all of the first response and part of the next, which is possible as there are several requests outstanding on the stream). The -1100653922 is then the attempt to read a next repose and will probably correspond to some non-length part of the stream, possibly page or crc data. [The above is an attempt to establish if a fix like the one in this PR should also help with this problem - if there's any uncertainty of this, this second issue would deserve its own tracker entry to investigate].


Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you are subscribed to this thread.Message ID: <xrootd/xrootd/pull/1829/c1326434429@github.com>

[ { "@context": "http://schema.org", "@type": "EmailMessage", "potentialAction": { "@type": "ViewAction", "target": "https://github.com/xrootd/xrootd/pull/1829#issuecomment-1326434429", "url": "https://github.com/xrootd/xrootd/pull/1829#issuecomment-1326434429", "name": "View Pull Request" }, "description": "View this Pull Request on GitHub", "publisher": { "@type": "Organization", "name": "GitHub", "url": "https://github.com" } } ]

Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-DEV list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1