Print

Print


Hi Simon,

I see that the cache space sits in glusterfs, right? It may be that 
async i/o is not supported by gluster. Could you add the directive:

xrootd.async off

and see if the problem persists?

Andy

On Fri, 6 Mar 2020, Simon Liu wrote:

> Hi, Andrew
>
> We noticed a strange issue with our Xcache instance, a file gets particaially cached each time from different client access, using the same xrdcp cmd.  All clients copy failed till full file cached. and interesting thing is that the copy in Xcache now  is correct, size and checksum.
>
> Here is the job log
>
>
> https://can01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbigpanda.cern.ch%2Fjob%3Fpandaid%3D4659771981&data=02%7C01%7C%7Cc69f3e37ed9244f09d7c08d7c12ea060%7Cc20535109cb34679a2d38f442e03b587%7C1%7C0%7C637190278308973950&sdata=V1AHQfkKtrF0vyPLiJvrssic3UZk3hY4%2F5dvgyOmKlc%3D&reserved=0
> 2020-03-02 14:51:05,903 | INFO     | copytool_in         | pilot.copytool.xrdcp             | _resolve_checksum_option  | Use --cksum adler32:print option to get the checksum for xrdcp command
> 2020-03-02 14:51:05,903 | INFO     | copytool_in         | pilot.util.container             | execute                   | executing command: xrdcp -np -f --cksum adler32:print root://atxcache.lcg.triumf.ca:1094//root://xrootd.lcg.triumf.ca:1094//atlas/atlasdatadisk/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1 /home/condor/execute/dir_22570/atlas_k9L2RKT8/PanDA_Pilot-4659771981/DAOD_EXOT4.20012950._001037.pool.root.1
> 2020-03-02 14:51:45,264 | INFO     | copytool_in         | pilot.copytool.xrdcp             | _stagefile                | rcode=54, stdout=, stderr=Run: [ERROR] Server responded with an error: [3007] XrdXrootdAio: Unable to read /root:/xrootd.lcg.triumf.ca:1094/atlas/atlasdatadisk/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1; Input/output error
>
> Here is the log in Xcache, other tries are similar, till full size cached.
>
> 200302 06:01:12 717509 XrootdXeq: atprd001.24778:[log in to unmask]<mailto:atprd001.24778:[log in to unmask]> pvt IPv4 login as e7589647.0
>
> 200302 06:01:12 717509 Posix_P2L: file /dev/shm/atlas/atlas/atlasdatadisk/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1.meta4 pfn2lfn /atlas/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1
>
> 200302 06:01:12 717509 oss_Alloc: free=470126380974040-4218299221 path=/glusterfs/xcache/meta/
>
> 200302 06:01:12 717509 oss_Alloc_Cache: data cache for /glusterfs/xcache/data/BB/42EB5B5E5E341F000000ce0c01cfBB010000124%
>
> 200302 06:01:12 717509 xrootd oss_Open_ufs: fd=16386 flags=2 mode=600 path=/glusterfs/xcache/namespace/atlas/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1
>
> 200302 06:01:12 717509 oss_Alloc: free=470122162674819-65536 path=/glusterfs/xcache/meta/
>
> 200302 06:01:13 717509 oss_Alloc_Cache: meta cache for /glusterfs/xcache/meta/BC/42EB5B5E5E341F000000ce0c01cfBC010000124%
>
> 200302 06:01:13 717509 xrootd oss_Open_ufs: fd=16387 flags=2 mode=600 path=/glusterfs/xcache/namespace/atlas/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1.cinfo
>
> 200302 06:01:14 719947 XrdFileCache_File: error File::ReadBlocksFromDisk incomplete size = 0 idx = 49 /atlas/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1
>
> 200302 06:01:14 719947 XrdFileCache_File: error File::Read() failed read from disk /atlas/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1
>
> 200302 06:01:14 719947 XrdFileCache_IO: warning IOEntireFile::Read() pass to origin, File::Read() exit status=-5, error=Input/output error root://u28@localfile:1094//atlas/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1?pss.tid=atprd001.24778:[log in to unmask]&oss.lcl=1<mailto:root://u28@localfile:1094//atlas/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1?pss.tid=atprd001.24778:[log in to unmask]&oss.lcl=1>
>
> 200302 06:01:14 722657 Xrootdaio: atprd001.24778:[log in to unmask]<mailto:atprd001.24778:[log in to unmask]> XrdXrootdAio: Unable to read /root:/xrootd.lcg.triumf.ca:1094/atlas/atlasdatadisk/rucio/mc16_13TeV/7e/be/DAOD_EXOT4.20012950._001037.pool.root.1; Input/output error
>
> 200302 06:01:15 717509 XrootdXeq: atprd001.24778:[log in to unmask]<mailto:atprd001.24778:[log in to unmask]> disc 0:00:03
>
>
>
> I have no idea how much each time the file cached in Xcache, in dCache log, all xroot access to this file succeeded,
>
>         datestamp          |          client          |            cellname             | errorcode | transfersize |  fullsize
> ----------------------------+--------------------------+---------------------------------+-----------+--------------+------------
> 2020-03-02 06:01:15.592-08 | 206.12.1.207             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |     77594624 | 4218299221
> 2020-03-02 06:01:21.093-08 | 206.12.1.210             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |     45088768 | 4218299221
> 2020-03-02 06:01:26.147-08 | 206.12.1.210             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |     93323264 | 4218299221
> 2020-03-02 06:27:24.072-08 | 206.12.1.208             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |     73400320 | 4218299221
> 2020-03-02 06:27:44.108-08 | 206.12.1.208             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |    240123904 | 4218299221
> 2020-03-02 06:28:04.102-08 | 206.12.1.208             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |     95420416 | 4218299221
> 2020-03-02 06:38:33.069-08 | 206.12.1.210             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |     89128960 | 4218299221
> 2020-03-02 06:38:55.781-08 | 206.12.1.209             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |    112197632 | 4218299221
> 2020-03-02 06:39:32.314-08 | 206.12.1.209             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |    831520768 | 4218299221
> 2020-03-02 06:51:45.26-08  | 206.12.1.208             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |     57671680 | 4218299221
> 2020-03-02 06:52:42.488-08 | 206.12.1.210             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |    263192576 | 4218299221
> 2020-03-02 06:53:43.511-08 | 206.12.1.210             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |    499122176 | 4218299221
> 2020-03-05 09:16:08.258-08 | 2607:f8f0:660:3:0:0:0:51 | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |   4218299221 | 4218299221
> 2020-03-05 09:16:22.397-08 | 206.12.1.209             | sfa14kx_1_lun41@sfa14kx_1_lun41 |         0 |   1740514133 | 4218299221
>
> The very last two tries from my tests, red one shows fullsize transfer, Xcache bypassed.
>
> The blue one shows the last chunk of file cached in Xcache, after that, access the file through Xcache succeeded, and no read from dCache.
>
> Is this known to you ?
>
> thanks
>
> Simon
>
>
>

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1