Hi Tom,

I guess I am a bit confused by...

"Again, this may be somewhat naive, but from my digging today it looks
like we might be able to make things a lot better if we could stop XCache
allowing vector reads to be passed through to the origin. I will
experiment with increasing the amount of ram it can use (pfc.ram) and
reducing the block sizes so larger vector reads can be unwrapped, but I
would be interested in understanding if we could explicitly disallow this
behaviour."

The reason the reads are being forwarded to the origin is that
XCache doesn't have enough resources (memory in this case) to handle them.
If the reads were not forwarded, the only two possible actions would
be a) waiting until enough memory becomes available (probably ending in an
operation expired anyway), or b) simply declaring an error (resulting in a
failed job anyway).

It would seem better to make the Ceph layer coalesce vector reads
and extract as much as possible out of each page to satisfy the incoming
vread, to avoid rereading pages. That said, it doesn't necessarily avoid a
failure. If we could capture the vector reads (which we can via monitoring),
then we could make an educated guess. For instance, if 80% of the small reads
can usually be satisfied from one 64MB page, then the coalescing solution
would likely work. If it's much less than that, likely not.
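
A minimal sketch of that estimate, assuming the (offset, length) chunk lists
are captured from the readv monitoring (the chunk values below are made up
for illustration):

```python
# Sketch: for one captured vector read, count how many 64MB pages it touches
# and what fraction of the chunk hits land on the single busiest page.
from collections import Counter

PAGE = 64 * 1024 * 1024  # 64MB, matching the Echo object size

def page_coverage(chunks):
    """chunks: iterable of (offset, length) pairs from one kXR_readv."""
    hits = Counter()
    for offset, length in chunks:
        first = offset // PAGE
        last = (offset + length - 1) // PAGE
        for page in range(first, last + 1):
            hits[page] += 1
    busiest_page, busiest_hits = hits.most_common(1)[0]
    return len(hits), busiest_hits / sum(hits.values())

# Illustrative chunks (loosely shaped like the readV entries in the logs below):
chunks = [(71780326, 674), (71782997, 356), (71783601, 813), (105144524, 29117)]
pages, fraction = page_coverage(chunks)
print(f"{pages} page(s) touched; {fraction:.0%} of chunk hits on the busiest page")
```

If that fraction comes out around the 80% mark for typical jobs, coalescing
looks worthwhile; if it comes out much lower, probably not.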

Andy


On Mon, 10 Aug 2020, Tom Byrne wrote:

> Hi all,
>
> Tracing one 'operation expired' failure with tracing turned up, I got the following.
>
> Client:
>
> ```
> [2020-08-10 07:25:02.640039 +0000][Dump][XRootD] [xrootd.echo.stfc.ac.uk:1094] Message kXR_readv (handle: 0x00000000, chunks: 837, total size: 9403078) has been successfully sent.
> ...
> [2020-08-10 07:30:58.618908 +0000][Debug][ExDbgMsg] [xrootd.echo.stfc.ac.uk:1094] Calling MsgHandler: 0x5c001300 (message: kXR_readv (handle: 0x00000000, chunks: 837, total size: 9403078) ) with status: [ERROR] Operation expired.
> ```
>
> XCache:
>
> ```
> 200810 07:25:02 7314 tlhcb008.220:[log in to unmask] XrootdProtocol: 0100 open r lhcb:prod/lhcb/LHCb/Collision15/TURBO.MDST/00051290/0002/00051290_00025876_1.turbo.mdst
> 200810 07:25:02 7314 acc_Audit: tlhcb008.220:[log in to unmask] grant gsi [log in to unmask] read lhcb:prod/lhcb/LHCb/Collision15/TURBO.MDST/00051290/0002/000
> 200810 07:25:02 7314 tlhcb008.220:[log in to unmask] XrootdResponse: 0100 sending 4 data bytes
> 200810 07:25:02 7314 tlhcb008.220:[log in to unmask] XrootdProtocol: 0200 req=readv dlen=13392
> 200810 07:25:02 7314 tlhcb008.220:[log in to unmask] XrootdProtocol: 0200 fh=0 readV 674@71780326
> 200810 07:25:02 7314 tlhcb008.220:[log in to unmask] XrootdProtocol: 0200 fh=0 readV 356@71782997
> 200810 07:25:02 7314 tlhcb008.220:[log in to unmask] XrootdProtocol: 0200 fh=0 readV 813@71783601
> ==== ~800 readVs snipped ====
> 200810 07:25:02 7314 tlhcb008.220:[log in to unmask] XrootdProtocol: 0200 fh=0 readV 29117@105144524
> 200810 07:25:02 7314 tlhcb008.220:[log in to unmask] XrootdProtocol: 0200 fh=0 readV 2104@105366736
> 200810 07:25:02 7314 tlhcb008.220:[log in to unmask] XrootdProtocol: 0200 fh=0 readV 14813@105372804
>
> ```
>
> xrd-ceph:
>
> ```
> 200810 07:25:17 5116 u102.7:4210@localhost XrootdProtocol: 0300 open r lhcb:prod/lhcb/LHCb/Collision15/TURBO.MDST/00051290/0002/00051290_00025876_1.turbo.mdst?pss.tid=tlhcb008.220:[log in to unmask]
> 200810 07:25:17 5116 u102.7:4210@localhost XrootdResponse: 0300 sending 4 data bytes
> 200810 07:25:17 5116 u102.7:4210@localhost XrootdProtocol: 0200 req=readv dlen=13392
> 200810 07:25:17 5116 u102.7:4210@localhost XrootdProtocol: 0200 fh=2 readV 674@71780326
> 200810 07:25:17 5116 u102.7:4210@localhost XrootdProtocol: 0200 fh=2 readV 356@71782997
> 200810 07:25:17 5116 u102.7:4210@localhost XrootdProtocol: 0200 fh=2 readV 813@71783601
> ==== ~800 readVs snipped ====
> 200810 07:25:17 5116 u102.7:4210@localhost XrootdProtocol: 0200 fh=2 readV 29117@105144524
> 200810 07:25:17 5116 u102.7:4210@localhost XrootdProtocol: 0200 fh=2 readV 2104@105366736
> 200810 07:25:17 5116 u102.7:4210@localhost XrootdProtocol: 0200 fh=2 readV 14813@105372804
> 200810 07:26:17 3212 u102.7:4236@localhost XrootdResponse: 0000 sending 55 data bytes; status=0
> 200810 07:26:17 3212 u102.7:4236@localhost XrootdProtocol: 0000 req=auth dlen=151
> 200810 07:26:17 3212 u102.7:4236@localhost XrootdResponse: 0000 sending OK
> ```
>
> Which looks to me like the vector reads are being passed through, as you suspected. I didn't realise this could happen with XCache; I was naively assuming everything would be passed through to the origin in blocksize-sized requests.
>
> Looking through the logs at the requests getting through to the xrd-ceph gateways, I see a mix of small, irregularly sized vector reads and properly aligned 64MB reads. I assume this mix of full block reads and vector reads corresponds to the mix of "unwrapped" and "pass-through" reads from XCache?
>
> Amusingly(!!?), the ratio of vector reads to reads varies wildly depending on the generation of workernode, and positively correlates with the number of "Operation expired" failures seen on that generation (more vreads getting to xrd-ceph = more job failures of this type). All generations are running the same xrootd configs and LHCb jobs, but have different numbers of job slots and different types of local storage. It might be notable that the newest generation of workernodes has SSDs for local storage, sees no vector reads hitting the xrd-ceph container, and sees no operation expired failures, despite having 3-4x as many job slots as the other workernodes.
>
> Looking at the Echo monitoring of read operations coming from the LHCb test workernodes over the past day, it looks like the majority of reads Ceph is serving are 64MB as expected:
>
> ![lhcb_osd_read_operation_size](https://user-images.githubusercontent.com/9048425/89809177-1e83f180-db33-11ea-9c63-efaae9a4ec52.png)
>
> However, looking at the totals of reads and vector reads from the xrd-ceph container logs for the same period, there are an order of magnitude more vector reads than full block reads. I assume from these two observations that most of the vector reads hitting Echo never finish, as something times out long before the stack of reads can be actioned when they do hit Ceph.
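>
> (For reference, a minimal sketch of how I'd tally these from the xrd-ceph container logs; the log file name is illustrative, and it assumes plain reads appear as `req=read` in the same XrootdProtocol trace format as the `req=readv` lines shown above:)
>
> ```python
> # Count vector-read requests vs single-block read requests in an
> # xrd-ceph XrootdProtocol trace log.
> from collections import Counter
>
> counts = Counter()
> with open("xrdceph-container.log") as log:  # illustrative file name
>     for line in log:
>         if "req=readv" in line:
>             counts["readv"] += 1
>         elif "req=read" in line:  # checked after readv, so only plain reads match
>             counts["read"] += 1
>
> print(counts)
> ```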
>
> Again, this may be somewhat naive, but from my digging today it looks like we might be able to make things a lot better if we could stop XCache allowing vector reads to be passed through to the origin. I will experiment with increasing the amount of RAM it can use (pfc.ram) and reducing the block sizes so larger vector reads can be unwrapped, but I would be interested in understanding whether we could explicitly disallow this behaviour. As stated above, xrd-ceph doesn't handle vector reads particularly elegantly, and Echo block sizes are super inefficient for small reads, so if a vector read is too big to be unwrapped by XCache, it's probably not going to fare much better being passed through to Echo either.
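>
> (For reference, the cache directives I'm planning to experiment with would look something like this; the values are purely illustrative starting points, not recommendations:)
>
> ```
> # XCache proxy file cache tuning (illustrative values)
> pfc.ram       16g
> pfc.blocksize 1m
> ```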
>
> Other things I have looked at/thought about from your comments:
>
> * Looking at the Ceph -> workernode traffic, I'm not seeing a vast amount more traffic than I would expect. It's definitely asymmetrical, with ~50% more data going into the xrootd container network from Ceph than data coming out of the container network to the job containers, but that sounds pretty good to me, especially given the block sizes. It's certainly not at the "39GB of data for a 9MB read" level of bad, as far as I can tell.
>
> * @osschar's comments on block sizes - I do appreciate that large block sizes are never going to be particularly efficient for sparse vector reads, and we have experimented with smaller block sizes. However, as the underlying block size in Echo is 64MB, reducing the various xrootd block sizes just moves the inefficiency to the Ceph layer, as the 64MB blocks still have to be reconstructed from their erasure-coded shards for each operation. Reducing the block sizes also significantly impacts the full file copy speed. I would say I'm happy with it somewhere between 16MB and 64MB.
>
> * pss.setopt DebugLevel 3 - I've set this, see the example below. Again, I assume from this that the mix of full block reads and vector reads is the mix of "unwrapped" and "pass-through" reads?
> ```
> [2020-08-10 17:27:29.570050 +0000][Error ][XRootD ] [[log in to unmask]:1095] Unable to get the response to request kXR_readv (handle: 0x00000000, chunks: 634, total size: 8740387)
> [2020-08-10 17:27:29.570107 +0000][Error ][XRootD ] [[log in to unmask]:1095] Unable to get the response to request kXR_read (handle: 0x00000000, offset: 1543503872, size: 67108864)
> [2020-08-10 17:27:29.570231 +0000][Error ][XRootD ] [[log in to unmask]:1095] Unable to get the response to request kXR_readv (handle: 0x03000000, chunks: 899, total size: 2441099)
> [2020-08-10 17:27:29.570271 +0000][Error ][XRootD ] [[log in to unmask]:1095] Unable to get the response to request kXR_readv (handle: 0x03000000, chunks: 1024, total size: 2757650)
> [2020-08-10 17:27:29.570326 +0000][Error ][XRootD ] [[log in to unmask]:1095] Unable to get the response to request kXR_read (handle: 0x03000000, offset: 67108864, size: 67108864)
> [2020-08-10 17:28:05.373496 +0000][Error ][XRootD ] [[log in to unmask]:1095] Unable to get the response to request kXR_readv (handle: 0x02000000, chunks: 685, total size: 9232202)
> [2020-08-10 17:28:05.373542 +0000][Error ][XRootD ] [[log in to unmask]:1095] Unable to get the response to request kXR_readv (handle: 0x02000000, chunks: 685, total size: 9232202)
> ```
>
> Thanks again for all your time on this, I really appreciate all the input.
>
>
> --
> You are receiving this because you were mentioned.
> Reply to this email directly or view it on GitHub:
> https://github.com/xrootd/xrootd/issues/1259#issuecomment-671519357

