Hi Simon,

On Mon, 9 Mar 2020, Simon Liu wrote:

> Hi, Andy
>
> Here is the update: I haven't found any errors from the two servers where I 
> switched async off, while the other two were still showing the same aio 
> errors, so I switched async off on them too.
Well, at least we have a bypass.

> In addition to that, I took a look at POSIX aio on the Xcache disk server, 
> here is what I have.
>
> For a small file, 4G, I tested aio 1024, 2048, 4096, 8192, 16384.
> Native xfs and glusterfs over fuse were both ok; higher aio meant higher CPU 
> usage but lower throughput.
I assume the 1k, 2k, etc. are segsize parameters. The higher CPU is to be 
expected as Linux implements POSIX aio in software, using multiple threads at 
user level. Nothing special happens in the kernel.
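
To make that concrete, here is a minimal C sketch (not from any of our code; 
the file name, depth, and segment size are placeholders) of what an "aio at 
depth N" read test boils down to. Each request still in progress is backed by 
a glibc user-level thread, hence the extra CPU:

  /* Minimal sketch only; build with:  cc -o aio_sketch aio_sketch.c -lrt */
  #include <aio.h>
  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  #define DEPTH 8            /* outstanding requests (1024..16384 in the tests above) */
  #define SEGSZ (64 * 1024)  /* bytes per request (placeholder) */

  int main(int argc, char **argv)
  {
      if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
      int fd = open(argv[1], O_RDONLY);
      if (fd < 0) { perror("open"); return 1; }

      static struct aiocb cb[DEPTH];          /* zero-initialized */
      char *buf = malloc((size_t)DEPTH * SEGSZ);
      if (!buf) { perror("malloc"); return 1; }

      /* Queue DEPTH reads at consecutive offsets. */
      for (int i = 0; i < DEPTH; i++) {
          cb[i].aio_fildes = fd;
          cb[i].aio_buf    = buf + (size_t)i * SEGSZ;
          cb[i].aio_nbytes = SEGSZ;
          cb[i].aio_offset = (off_t)i * SEGSZ;
          if (aio_read(&cb[i]) < 0) { perror("aio_read"); return 1; }
      }

      /* Wait for each request to complete and collect its result. */
      for (int i = 0; i < DEPTH; i++) {
          const struct aiocb *one[1] = { &cb[i] };
          while (aio_error(&cb[i]) == EINPROGRESS)
              aio_suspend(one, 1, NULL);
          int err = aio_error(&cb[i]);
          if (err != 0)
              fprintf(stderr, "request %d failed: %s\n", i, strerror(err));
          else
              printf("request %d read %zd bytes\n", i, aio_return(&cb[i]));
      }

      free(buf);
      close(fd);
      return 0;
  }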

> For a very large file, 64G, more than twice the server memory size,
> I see the same performance/cost pattern, and with 16384 aio operations 
> both the native xfs and glusterfs i/o tests got stuck.
Yes, I suppose that's to be expected as the server probably ran out of 
threads and the underlying POSIX aio implementation likely doesn't deal 
with it well. Then again, was this on a single connection (i.e. one 
client)? If so, the "limit" value would apply and one should not run out 
of threads. BTW what is the thread limit for the server (it should be 
shown in the log at start-up)?
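
For reference, the server thread limits are set with the xrd.sched directive. 
The values below are placeholders for illustration only; the limits actually 
in effect are reported in the start-up log:

  # Placeholder values, not a recommendation
  xrd.sched mint 8 maxt 2048 avlt 4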

> With the same tests and no aio, I see i/o performance and cost are actually 
> better.
Unfortunate. I never liked the POSIX aio implementation in Linux.

> I copied this line from the Xcache setup example:
> xrootd.async maxtot 16384 limit 32
Hmmm, that was Matevz's take on the values. I haven't verified whether those 
are good or not. The defaults are "maxtot 4096 limit 8", which might work 
better here.
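
For clarity, the settings being compared are:

  # Value from the Xcache setup example
  xrootd.async maxtot 16384 limit 32

  # The defaults, which might work better here
  xrootd.async maxtot 4096 limit 8

  # Or the bypass you are already running on some servers
  xrootd.async off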

> Do you have an idea which environment was used? OS, filesystem, network, etc.?
Well, the original aio implementation was coded for Solaris using ZFS. 
The network at the time was 1Gb per server over a 10Gb backbone but was 
upgraded to 10Gb per server over a 40Gb backbone before we switched to 
Linux servers (much higher now). It worked quite well for that combination 
as the Solaris kernel handled the asynchronous part of the I/O. We just 
kept it for Linux while planning to reimplement it either with thread queues 
per device (a large undertaking) or using the Linux-only epoll syscall 
(io_submit doesn't really work well here). That said, more important stuff 
always comes up and it's been on the back burner for years.

One of the reasons for keeping it for Xcache is to allow better overlap of 
I/O when doing xrdcp's through the cache. People complained that the 
performance of xrdcp through the cache was worse than simply going to the 
origin server to do the xrdcp. A lot of that had to do with latency, because 
xrdcp uses large block sizes and doing synchronous I/O winds up in the 
position of being a store-and-forward operation. The aio broke the blocks 
into smaller pieces, which allowed for a faster path through the cache. I 
wasn't a fan of this approach, but the complaints were pretty heavy even 
when I pointed out it was an improper use of a cache.
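
As an aside (a sketch on my part, not something we ship): with async off, a 
client can approximate the smaller-block behaviour itself via the XrdCl copy 
settings, for example:

  # Placeholder values and paths; XRD_CPCHUNKSIZE and XRD_CPPARALLELCHUNKS are
  # XrdCl client settings for the copy block size and the number of chunks
  # kept in flight.
  XRD_CPCHUNKSIZE=1048576 XRD_CPPARALLELCHUNKS=8 \
    xrdcp -f root://<xcache-host>:1094//root://<origin-host>:1094//<path> /tmp/copy.root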

Thanks for the measurements. Actual figures would be nice, maybe even 
plots if you have the time.

Andy


>
> Thanks
>
> Simon
> On 2020-03-06 4:34 p.m., Simon Liu wrote:
>> Hi, Andy
>>
>> Thanks for the quick reply. Indeed it's glusterfs under Xcache, but it's through fuse, so in principle aio should be supported. I'll take a further look at this.
>>
>> On the other hand, this glusterfs runs in replicated mode, so writes from fuse go twice, to different blocks; this may affect performance.
>>
>> This problem contributes about 1-2% of transfer failures. I switched aio off on 2 of the Xcache servers and left it on on the other two; I will update you when I have more to report.
>>
>> thanks and have a good weekend.
>>
>> Simon
>>
>> On 3/6/2020 2:42 PM, Andrew Hanushevsky wrote:
>>> Hi Simon,
>>>
>>> I see that the cache space sits in glusterfs, right? It may be that async i/o is not supported by gluster. Could you add the directive:
>>>
>>> xrootd.async off
>>>
>>> and see if the problem persists?
>>>
>>> Andy
>>>
>>> On Fri, 6 Mar 2020, Simon Liu wrote:
>>>
>>>> Hi, Andrew
>>>>
>>>> We noticed a strange issue with our Xcache instance: a file gets partially cached each time from a different client access, using the same xrdcp command. All client copies failed until the full file was cached, and the interesting thing is that the copy now in Xcache is correct in size and checksum.
>>>>
>>>> Here is the job log
>>>>
>>>>
>>>> https://bigpanda.cern.ch/job?pandaid=4659771981
>>>> 2020-03-02 14:51:05,903 | INFO     | copytool_in         | pilot.copytool.xrdcp             | _resolve_checksum_option  | Use --cksum adler32:print option to get the checksum for xrdcp command
>>>> 2020-03-02 14:51:05,903 | INFO     | copytool_in         | pilot.util.container             | execute                   | executing command: xrdcp -np -f --cksum adler32:print root://atxcache.lcg.triumf.ca:1094//root://xrootd.lcg.triumf.ca:1094//atlas/atlasdatadisk/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1 /home/condor/execute/dir_22570/atlas_k9L2RKT8/PanDA_Pilot-4659771981/DAOD_EXOT4.20012950._001037.pool.root.1
>>>> 2020-03-02 14:51:45,264 | INFO     | copytool_in         | pilot.copytool.xrdcp             | _stagefile                | rcode=54, stdout=, stderr=Run: [ERROR] Server responded with an error: [3007] XrdXrootdAio: Unable to read /root:/xrootd.lcg.triumf.ca:1094/atlas/atlasdatadisk/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1; Input/output error
>>>>
>>>> Here is the log in Xcache; the other tries are similar, until the full size was cached.
>>>>
>>>> 200302 06:01:12 717509 XrootdXeq: atprd001.24778:[log in to unmask] pvt IPv4 login as e7589647.0
>>>>
>>>> 200302 06:01:12 717509 Posix_P2L: file /dev/shm/atlas/atlas/atlasdatadisk/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1.meta4 pfn2lfn /atlas/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1
>>>>
>>>> 200302 06:01:12 717509 oss_Alloc: free=470126380974040-4218299221 path=/glusterfs/xcache/meta/
>>>>
>>>> 200302 06:01:12 717509 oss_Alloc_Cache: data cache for /glusterfs/xcache/data/BB/42EB5B5E5E341F000000ce0c01cfBB010000124%
>>>>
>>>> 200302 06:01:12 717509 xrootd oss_Open_ufs: fd=16386 flags=2 mode=600 path=/glusterfs/xcache/namespace/atlas/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1
>>>>
>>>> 200302 06:01:12 717509 oss_Alloc: free=470122162674819-65536 path=/glusterfs/xcache/meta/
>>>>
>>>> 200302 06:01:13 717509 oss_Alloc_Cache: meta cache for /glusterfs/xcache/meta/BC/42EB5B5E5E341F000000ce0c01cfBC010000124%
>>>>
>>>> 200302 06:01:13 717509 xrootd oss_Open_ufs: fd=16387 flags=2 mode=600 path=/glusterfs/xcache/namespace/atlas/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1.cinfo
>>>>
>>>> 200302 06:01:14 719947 XrdFileCache_File: error File::ReadBlocksFromDisk incomplete size = 0 idx = 49 /atlas/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1
>>>>
>>>> 200302 06:01:14 719947 XrdFileCache_File: error File::Read() failed read from disk /atlas/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1
>>>>
>>>> 200302 06:01:14 719947 XrdFileCache_IO: warning IOEntireFile::Read() pass to origin, File::Read() exit status=-5, error=Input/output error root://u28@localfile:1094//atlas/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1?pss.tid=atprd001.24778:[log in to unmask]&oss.lcl=1
>>>>
>>>> 200302 06:01:14 722657 Xrootdaio: atprd001.24778:[log in to unmask] XrdXrootdAio: Unable to read /root:/xrootd.lcg.triumf.ca:1094/atlas/atlasdatadisk/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1; Input/output error
>>>>
>>>> 200302 06:01:15 717509 XrootdXeq: atprd001.24778:[log in to unmask] disc 0:00:03
>>>>
>>>>
>>>>
>>>> I have no idea how much of the file got cached in Xcache each time; in the dCache log, all xroot accesses to this file succeeded:
>>>>
>>>>         datestamp          |          client | cellname             | errorcode | transfersize | fullsize
>>>> ----------------------------+--------------------------+---------------------------------+-----------+--------------+------------
>>>> 2020-03-02 06:01:15.592-08 | 206.12.1.207             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |     77594624 | 4218299221
>>>> 2020-03-02 06:01:21.093-08 | 206.12.1.210             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |     45088768 | 4218299221
>>>> 2020-03-02 06:01:26.147-08 | 206.12.1.210             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |     93323264 | 4218299221
>>>> 2020-03-02 06:27:24.072-08 | 206.12.1.208             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |     73400320 | 4218299221
>>>> 2020-03-02 06:27:44.108-08 | 206.12.1.208             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |    240123904 | 4218299221
>>>> 2020-03-02 06:28:04.102-08 | 206.12.1.208             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |     95420416 | 4218299221
>>>> 2020-03-02 06:38:33.069-08 | 206.12.1.210             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |     89128960 | 4218299221
>>>> 2020-03-02 06:38:55.781-08 | 206.12.1.209             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |    112197632 | 4218299221
>>>> 2020-03-02 06:39:32.314-08 | 206.12.1.209             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |    831520768 | 4218299221
>>>> 2020-03-02 06:51:45.26-08  | 206.12.1.208             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |     57671680 | 4218299221
>>>> 2020-03-02 06:52:42.488-08 | 206.12.1.210             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |    263192576 | 4218299221
>>>> 2020-03-02 06:53:43.511-08 | 206.12.1.210             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |    499122176 | 4218299221
>>>> 2020-03-05 09:16:08.258-08 | 2607:f8f0:660:3:0:0:0:51 | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |   4218299221 | 4218299221
>>>> 2020-03-05 09:16:22.397-08 | 206.12.1.209             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |   1740514133 | 4218299221
>>>>
>>>> The very last two tries are from my tests. The one where transfersize equals fullsize (2020-03-05 09:16:08) shows a full-size transfer, i.e. Xcache was bypassed.
>>>>
>>>> The other one (2020-03-05 09:16:22) shows the last chunk of the file being cached in Xcache; after that, access to the file through Xcache succeeded, with no read from dCache.
>>>>
>>>> Is this known to you ?
>>>>
>>>> thanks
>>>>
>>>> Simon
>>>>
>>>>
>>>>
>
>
