Hi all,

I'm not sure whether my earlier email is being handled, so here is an update.

We noticed one job that failed at stage-in, so I did some debugging and found more detail. I tried a manual copy from one of the xcache nodes, and it failed with this error:

$ xrdcp -np -f --cksum adler32:print root://atpool06.lcg.triumf.ca:1094//root://xrootd.lcg.triumf.ca:1094//atlas/atlasdatadisk/rucio/mc16_13TeV/fa/cf/HITS.19328425._000001.pool.root.1 ./aa -f
Run: [ERROR] Server responded with an error: [3011] Unable to open /root:/xrootd.lcg.triumf.ca:1094/atlas/atlasdatadisk/rucio/mc16_13TeV/fa/cf/HITS.19328425._000001.pool.root.1; no such file or directory

In the command above, xrootd.lcg.triumf.ca is an xrootd reverse proxy server. I then tried to fetch the file through our internal xrootd DNS RR, and it worked:

$ xrdcp -np -f --cksum adler32:print root://atpool06.lcg.triumf.ca:1094//root://xrootd.triumf.lcg:1094//atlas/atlasdatadisk/rucio/mc16_13TeV/fa/cf/HITS.19328425._000001.pool.root.1 ./aa -f
adler32: 645c670b /home/xrootd/bin/./aa 938931212

I restarted the xcache service on atpool06, then tried the same command again through the reverse proxy service. This time it worked:

$ xrdcp -np -f --cksum adler32:print root://atpool06.lcg.triumf.ca:1094//root://proxy.lcg.triumf.ca:1094//atlas/atlasdatadisk/rucio/mc16_13TeV/fa/cf/HITS.19328425._000001.pool.root.1 ./aa -f
adler32: 645c670b /home/xrootd/bin/./aa 938931212

I'm confused: is this an xcache service issue or an xrootd reverse proxy issue?

Thanks,
Simon

On 2019-10-03 1:39 p.m., Simon Liu wrote:

Hi xrootd support,

I see strange issues at our diskless site, which uses an xcache setup. It works most of the time, but sometimes it gets stuck and does not recover by itself. I see two types of problems, and both require restarting the xrootd service. I now have to run a cron job to detect and kill the problematic process and restart xrootd. I hope these are known to you. Thanks.

See the logs below. The first one is from the caching function.

[2019-10-02 13:30:17.876434 -0700][Error ][Utility ] Unable to resolve localfile:1094: Name or service not known
[2019-10-02 13:30:17.876485 -0700][Error ][PostMaster ] [localfile:1094 #0] Unable to resolve IP address for the host
[2019-10-02 13:30:17.876503 -0700][Error ][XRootD ] [localfile:1094] Unable to send the message kXR_open (file: /dev/shm/atlas/atlas/atlasdatadisk/rucio/mc15_13TeV/5b/62/EVNT.04972714._000040.pool.root.1.meta4, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ): [FATAL] Invalid address

The xrootd daemon is still running, but it does not take any requests until it is restarted. All subsequent requests get this type of error:

191002 12:26:41 21236 Posix_Open: [FATAL] Invalid address open root://u28@localfile:1094//dev/shm/atlas/atlas/atlasdatadisk/rucio/data18_13TeV/9b/92/data18_13TeV.00354309.physics_Main.merge.AOD.f947_m1993._lb0217._0006.1.meta4?pss.tid=atprd001.29289:[log in to unmask]&oss.lcl=1
191002 12:26:41 21236 ofs_open: atprd001.29289:[log in to unmask] Unable to open /root:/xrootd.lcg.triumf.ca:1094/atlas/atlasdatadisk/rucio/data18_13TeV/9b/92/data18_13TeV.00354309.physics_Main.merge.AOD.f947_m1993._lb0217._0006.1; no route to host

The other type is likely related to the forwarding function. Is it an authentication issue?

[2019-10-03 04:18:14.410005 -0700][Error ][XRootDTransport ] [[log in to unmask]:1094 #0.0] No protocols left to try
[2019-10-03 04:18:14.410041 -0700][Error ][AsyncSock ] [[log in to unmask]:1094 #0.0] Socket error while handshaking: [FATAL] Auth failed
[2019-10-03 04:18:14.410083 -0700][Error ][PostMaster ] [[log in to unmask]:1094 #0] elapsed = 0, pConnectionWindow = 120 seconds.
[2019-10-03 04:18:14.410103 -0700][Error ][PostMaster ] [[log in to unmask]:1094 #0] Unable to recover: [FATAL] Auth failed.
[2019-10-03 04:18:14.410116 -0700][Error ][XRootD ] [[log in to unmask]:1094] Impossible to send message kXR_open (file: /atlas/atlasdatadisk/rucio/mc16_13TeV/41/26/HITS.10701323._000037.pool.root.1?oss.lcl=1&pss.tid=xrootd.976399:39@atpool05, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ). Trying to recover.

It tries to recover but never succeeds until xrootd is restarted. Requests made after the problem starts get the same error message, except for files that are already cached.

[2019-10-02 19:02:00.755505 -0700][Warning][PostMaster ] Please note that the 'root://localfile//path/filename.meta4' semantic is now deprecated, use 'file://localhost/path/filename.meta4'instead!
191002 19:02:00 291696 XrdFileCache_File: error File::ProcessBlockResponse block 0x7f17e8002620 1 error=-2 /atlas/rucio/mc16_13TeV/5a/1e/HITS.10701323._000279.pool.root.1
191002 19:02:00 291696 XrdFileCache_File: error File::Read() io 0x7f178c002200, block 1 finished with error 2 No such file or directory /atlas/rucio/mc16_13TeV/5a/1e/HITS.10701323._000279.pool.root.1
191002 19:02:00 291696 XrdFileCache_IO: warning IOEntireFile::Read() pass to origin, File::Read() exit status=-2, error=No such file or directory root://u42@localfile:1094//atlas/rucio/mc16_13TeV/5a/1e/HITS.10701323._000279.pool.root.1?xrd.gsiusrpxy=/home/condor/execute/dir_14037/MwbNDmRj4ZvnjgorRmsz1l6ogsA7HpABFKDmrtRTDmABFKDmIOfXBn/user.proxy&xrd.wantprot=gsi,unix&pss.tid=atprd001.17282:[log in to unmask]&oss.lcl=1
191002 19:02:00 80678 Xrootdaio: atprd001.17282:[log in to unmask] XrdXrootdAio: Unable to read /root:/xrootd.lcg.triumf.ca:1094/atlas/atlasdatadisk/rucio/mc16_13TeV/5a/1e/HITS.10701323._000279.pool.root.1; No such file or directory

The configuration is quite simple. Here it is:

## X509 configuration
xrootd.seclib /usr/lib64/libXrdSec.so
sec.protparm gsi -vomsfun:/usr/lib64/libXrdSecgsiVOMS.so -vomsfunparms:certfmt=raw|vos=atlas|grps=/atlas
sec.protocol /usr/lib64 gsi -ca:1 -crl:0 -gridmap:/dev/null
acc.authdb /etc/xrootd/auth_file
#ofs.authorize
acc.authrefresh 60
##
# This is TRIUMF
all.sitename TRIUMF-LCG2
all.adminpath /var/run/xrootd
all.pidpath /var/run/xrootd
all.adminpath /var/spool/xrootd
xrootd.prepare /var/log/xrootd/
all.export /atlas r/o
all.export /root:/
all.export /xroot:/
all.role server
xrootd.async maxtot 16384 limit 32
ofs.osslib /usr/lib64/libXrdPss.so
# xcache local file caching, cephfs though
oss.localroot /xcachecephfs/xcache/namespace
oss.space meta /xcachecephfs/xcache/meta
oss.space data /xcachecephfs/xcache/data
oss.path /atlas/rucio r/w
oss.path /root: r/w
pss.origin localfile:1094
#pss.origin xrootd.triumf.lcg:1094
pss.cachelib /usr/lib64/libXrdFileCache.so
pss.config streams 512
pss.namelib -lfncache -lfn2pfn /usr/lib64/XrdName2NameDCP4RUCIO.so
# pfc.ram 12g
pfc.diskusage 0.8 0.9
pfc.spaces data meta
pfc.blocksize 1M
pfc.prefetch 0
#https
#http.desthttps yes
http.secxtractor /usr/lib64/libXrdHttpVOMS.so
xrd.protocol XrdHttp:2880 /usr/lib64/libXrdHttp-4.so
http.cadir /etc/grid-security/certificates
http.cert /etc/grid-security/xrd/xrdcert.pem
http.key /etc/grid-security/xrd/xrdkey.pem
http.listingredir https://atpool05:2880/

Thanks,
Simon
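
For anyone who needs the same kind of stopgap, here is a minimal sketch of the detection half of a cron watchdog like the one mentioned above. It is not the script running at the site. The three patterns are copied from the log excerpts in this thread; the labels, the function names (`count_signatures`, `needs_restart`), and the threshold are hypothetical choices for illustration. The restart itself is deliberately left out, since the log path and service name are site-specific.

```python
import re
from collections import Counter

# Error signatures taken from the log excerpts in this thread.
# The labels on the left are hypothetical names, not xrootd terms.
SIGNATURES = {
    "origin_unresolved": re.compile(r"Unable to resolve localfile:1094"),
    "invalid_address": re.compile(r"\[FATAL\] Invalid address"),
    "auth_failed": re.compile(r"\[FATAL\] Auth failed"),
}


def count_signatures(lines):
    """Count how many log lines match each known error signature."""
    counts = Counter()
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


def needs_restart(lines, threshold=3):
    """Return True if any single signature appears at least `threshold` times.

    The threshold is an assumption; it only exists so that one transient
    error does not trigger a restart.
    """
    counts = count_signatures(lines)
    return any(n >= threshold for n in counts.values())
```

A cron job could feed the last few hundred lines of the xrootd log into `needs_restart` and restart the service only when it returns True, which keeps the restart decision separate from the site-specific restart command.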