Hi Simon,

Well, it's hard to tell at this point. However, could you send me the log 
from the Xcache server that covers the time when you got the "file not 
found" error? Don't post it to the xrootd-l list, just send it to me.
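
A minimal sketch of how the relevant window could be cut out of the log, assuming the two timestamp layouts seen in the excerpts quoted below and treating them as local time. The function names are hypothetical and not part of any xrootd tooling.

```python
import re
from datetime import datetime

# Two timestamp layouts appear in the log excerpts quoted in this thread:
#   [2019-10-02 13:30:17.876434 -0700][Error  ][Utility ...
#   191002 12:26:41 21236 Posix_Open: ...
BRACKETED = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
COMPACT = re.compile(r"^(\d{6} \d{2}:\d{2}:\d{2}) ")


def line_time(line):
    """Return the timestamp at the start of a log line, or None if absent."""
    m = BRACKETED.match(line)
    if m:
        return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    m = COMPACT.match(line)
    if m:
        return datetime.strptime(m.group(1), "%y%m%d %H:%M:%S")
    return None


def lines_in_window(lines, start, end):
    """Yield lines whose timestamp lies in [start, end].

    Lines without a timestamp (blank or continuation lines) follow the
    decision made for the most recent timestamped line.
    """
    keep = False
    for line in lines:
        ts = line_time(line)
        if ts is not None:
            keep = start <= ts <= end
        if keep:
            yield line
```

To use it, open the Xcache log file (its location depends on the site setup) and pass the file object as `lines`, with `start` and `end` set a few minutes either side of the failed transfer.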

Andy

On Wed, 9 Oct 2019, Xinli (Simon) Liu wrote:

> Hi, all
> Not sure whether my email is being handled. Here is an update.
>
> We noticed one job failed at stage-in. I did some debugging and found more detail.
>
> I tried a manual copy from one of the xcache nodes. It failed with this error.
>
>
> xrdcp -np -f --cksum adler32:print
> root://atpool06.lcg.triumf.ca:1094//root://xrootd.lcg.triumf.ca:1094//atlas/atlasdatadisk/rucio/mc16_13TeV/fa/cf/HITS.19328425._000001.pool.root.1
>      ./aa -f
>
>
>      Run: [ERROR] Server responded with an error: [3011] Unable to open
> /root:/xrootd.lcg.triumf.ca:1094/atlas/atlasdatadisk/rucio/mc16_13TeV/fa/cf/HITS.19328425._000001.pool.root.1;
>      no such file or directory
>
> In the above command, xrootd.lcg.triumf.ca is an xrootd reverse proxy server. I then tried to get the file through our internal xrootd DNS RR, and it works.
>
>
> $ xrdcp -np -f --cksum adler32:print
>      root://atpool06.lcg.triumf.ca:1094//root://xrootd.triumf.lcg:1094//atlas/atlasdatadisk/rucio/mc16_13TeV/fa/cf/HITS.19328425._000001.pool.root.1
>      ./aa -f
>      adler32: 645c670b /home/xrootd/bin/./aa 938931212
>
>
> I restarted the xcache service on atpool06, then tried the same command again through the reverse proxy service. It works again.
>
>
> $ xrdcp -np -f --cksum adler32:print
>      root://atpool06.lcg.triumf.ca:1094//root://proxy.lcg.triumf.ca:1094//atlas/atlasdatadisk/rucio/mc16_13TeV/fa/cf/HITS.19328425._000001.pool.root.1
>      ./aa -f
>
>
>      adler32: 645c670b /home/xrootd/bin/./aa 938931212
>
> I'm confused: is this an xcache service issue or an xrootd reverse proxy issue?
>
> thanks
>
> Simon
> On 2019-10-03 1:39 p.m., Simon Liu wrote:
>
> Hi, xrootd support
>
> I see strange issues at our diskless site, which uses an xcache setup. It works most of the time; however, it sometimes gets 'stuck' and doesn't recover by itself.
>
> I see two types of problems; both require restarting the xrootd service. I now have to run a cron job to detect and kill the problematic process and restart xrootd.
>
> I hope they are known to you, thanks.
>
> See logs below,
>
> The first one is from the caching function.
>
>
> [2019-10-02 13:30:17.876434 -0700][Error  ][Utility           ] Unable to resolve localfile:1094: Name or service not known
>
> [2019-10-02 13:30:17.876485 -0700][Error  ][PostMaster        ] [localfile:1094 #0] Unable to resolve IP address for the host
>
> [2019-10-02 13:30:17.876503 -0700][Error  ][XRootD            ] [localfile:1094] Unable to send the message kXR_open (file: /dev/shm/atlas/atlas/atlasdatadisk/rucio/mc15_13TeV/5b/62/EVNT.04972714._000040.pool.root.1.meta4, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ): [FATAL] Invalid address
>
> The xrootd daemon is still running, but it does not take any requests until restarted.
>
> All remaining requests get this type of error.
>
> 191002 12:26:41 21236 Posix_Open: [FATAL] Invalid address open root://u28@localfile:1094//dev/shm/atlas/atlas/atlasdatadisk/rucio/data18_13TeV/9b/92/data18_13TeV.00354309.physics_Main.merge.AOD.f947_m1993._lb0217._0006.1.meta4?pss.tid=atprd001.29289:[log in to unmask]&oss.lcl=1
>
> 191002 12:26:41 21236 ofs_open: atprd001.29289:[log in to unmask] Unable to open /root:/xrootd.lcg.triumf.ca:1094/atlas/atlasdatadisk/rucio/data18_13TeV/9b/92/data18_13TeV.00354309.physics_Main.merge.AOD.f947_m1993._lb0217._0006.1; no route to host
>
>
> The other type is likely related to the forwarding function. Is it an authentication issue?
>
> [2019-10-03 04:18:14.410005 -0700][Error  ][XRootDTransport   ] [[log in to unmask]:1094 #0.0] No protocols left to try
> [2019-10-03 04:18:14.410041 -0700][Error  ][AsyncSock         ] [[log in to unmask]:1094 #0.0] Socket error while handshaking: [FATAL] Auth failed
> [2019-10-03 04:18:14.410083 -0700][Error  ][PostMaster        ] [[log in to unmask]:1094 #0] elapsed = 0, pConnectionWindow = 120 seconds.
> [2019-10-03 04:18:14.410103 -0700][Error  ][PostMaster        ] [[log in to unmask]:1094 #0] Unable to recover: [FATAL] Auth failed.
> [2019-10-03 04:18:14.410116 -0700][Error  ][XRootD            ] [[log in to unmask]:1094] Impossible to send message kXR_open (file: /atlas/atlasdatadisk/rucio/mc16_13TeV/41/26/HITS.10701323._000037.pool.root.1?oss.lcl=1&pss.tid=xrootd.976399:39@atpool05, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ). Trying to recover.
>
>
> It tried to recover, but never succeeded until xrootd was restarted. Requests after the problem get the same error message, except those for cached files.
>
>
> [2019-10-02 19:02:00.755505 -0700][Warning][PostMaster        ] Please note that the 'root://localfile//path/filename.meta4' semantic is now deprecated, use 'file://localhost/path/filename.meta4'instead!
>
> 191002 19:02:00 291696 XrdFileCache_File: error File::ProcessBlockResponse block 0x7f17e8002620  1 error=-2 /atlas/rucio/mc16_13TeV/5a/1e/HITS.10701323._000279.pool.root.1
>
> 191002 19:02:00 291696 XrdFileCache_File: error File::Read() io 0x7f178c002200, block 1 finished with error 2 No such file or directory /atlas/rucio/mc16_13TeV/5a/1e/HITS.10701323._000279.pool.root.1
>
> 191002 19:02:00 291696 XrdFileCache_IO: warning IOEntireFile::Read() pass to origin, File::Read() exit status=-2, error=No such file or directory root://u42@localfile:1094//atlas/rucio/mc16_13TeV/5a/1e/HITS.10701323._000279.pool.root.1?xrd.gsiusrpxy=/home/condor/execute/dir_14037/MwbNDmRj4ZvnjgorRmsz1l6ogsA7HpABFKDmrtRTDmABFKDmIOfXBn/user.proxy&xrd.wantprot=gsi,unix&pss.tid=atprd001.17282:[log in to unmask]&oss.lcl=1
>
> 191002 19:02:00 80678 Xrootdaio: atprd001.17282:[log in to unmask] XrdXrootdAio: Unable to read /root:/xrootd.lcg.triumf.ca:1094/atlas/atlasdatadisk/rucio/mc16_13TeV/5a/1e/HITS.10701323._000279.pool.root.1; No such file or directory
>
> The configuration is quite simple; here it is.
>
> ## X509 configuration
> xrootd.seclib /usr/lib64/libXrdSec.so
> sec.protparm gsi -vomsfun:/usr/lib64/libXrdSecgsiVOMS.so -vomsfunparms:certfmt=raw|vos=atlas|grps=/atlas
> sec.protocol /usr/lib64 gsi -ca:1 -crl:0 -gridmap:/dev/null
> acc.authdb /etc/xrootd/auth_file
> #ofs.authorize
> acc.authrefresh 60
> ##
>
> # This is TRIUMF
> all.sitename TRIUMF-LCG2
> all.adminpath /var/run/xrootd
> all.pidpath /var/run/xrootd
> all.adminpath /var/spool/xrootd
> xrootd.prepare /var/log/xrootd/
>
> all.export /atlas r/o
> all.export /root:/
> all.export /xroot:/
>
> all.role server
> xrootd.async maxtot 16384 limit 32
>
> ofs.osslib /usr/lib64/libXrdPss.so
>
> # xcache local file caching, cephfs though
> oss.localroot  /xcachecephfs/xcache/namespace
> oss.space meta /xcachecephfs/xcache/meta
> oss.space data /xcachecephfs/xcache/data
>
> oss.path /atlas/rucio r/w
> oss.path /root:       r/w
>
> pss.origin localfile:1094
> #pss.origin xrootd.triumf.lcg:1094
> pss.cachelib /usr/lib64/libXrdFileCache.so
> pss.config streams 512
> pss.namelib -lfncache -lfn2pfn /usr/lib64/XrdName2NameDCP4RUCIO.so
> #
> pfc.ram 12g
> pfc.diskusage 0.8 0.9
> pfc.spaces data meta
> pfc.blocksize 1M
> pfc.prefetch 0
>
>
> #https
>
> #http.desthttps yes
> http.secxtractor /usr/lib64/libXrdHttpVOMS.so
>
> xrd.protocol XrdHttp:2880 /usr/lib64/libXrdHttp-4.so
> http.cadir   /etc/grid-security/certificates
> http.cert    /etc/grid-security/xrd/xrdcert.pem
> http.key     /etc/grid-security/xrd/xrdkey.pem
> http.listingredir https://atpool05:2880/
>
>
>
> Thanks
>
> Simon
>
>
>
> ########################################################################
> Use REPLY-ALL to reply to list
>
> To unsubscribe from the XROOTD-L list, click the following link:
> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
>
