Hi all,
I'm not sure whether my earlier email is being handled. Here is an update with more detail.

We noticed one job that failed at stage-in. I did some debugging and found more detail.

I tried the transfer manually from one of the xcache nodes. It failed with this error.


xrdcp -np -f --cksum adler32:print
root://atpool06.lcg.triumf.ca:1094//root://xrootd.lcg.triumf.ca:1094//atlas/atlasdatadisk/rucio/mc16_13TeV/fa/cf/HITS.19328425._000001.pool.root.1
      ./aa -f


      Run: [ERROR] Server responded with an error: [3011] Unable to open
/root:/xrootd.lcg.triumf.ca:1094/atlas/atlasdatadisk/rucio/mc16_13TeV/fa/cf/HITS.19328425._000001.pool.root.1;
      no such file or directory

In the command above, xrootd.lcg.triumf.ca is an xrootd reverse proxy server. I then tried to fetch the file through our internal xrootd DNS round-robin alias (xrootd.triumf.lcg), and that works.


$ xrdcp -np -f --cksum adler32:print
      root://atpool06.lcg.triumf.ca:1094//root://xrootd.triumf.lcg:1094//atlas/atlasdatadisk/rucio/mc16_13TeV/fa/cf/HITS.19328425._000001.pool.root.1
      ./aa -f
      adler32: 645c670b /home/xrootd/bin/./aa 938931212


I restarted the xcache service on atpool06, then tried the same command again through the reverse proxy service. It works again.


$ xrdcp -np -f --cksum adler32:print
      root://atpool06.lcg.triumf.ca:1094//root://proxy.lcg.triumf.ca:1094//atlas/atlasdatadisk/rucio/mc16_13TeV/fa/cf/HITS.19328425._000001.pool.root.1
      ./aa -f


      adler32: 645c670b /home/xrootd/bin/./aa 938931212
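
The three attempts above differ only in the origin host embedded in the cache URL. A minimal Python sketch that builds the same xrdcp commands from one template, so the comparison can be repeated after each restart. The helper names `cache_url` and `xrdcp_command` are hypothetical, and the flags are simply the ones used above:

```python
# Hosts and path are copied from the commands above.
CACHE_HOST = "atpool06.lcg.triumf.ca"
ORIGINS = ["xrootd.lcg.triumf.ca", "xrootd.triumf.lcg", "proxy.lcg.triumf.ca"]
PATH = "/atlas/atlasdatadisk/rucio/mc16_13TeV/fa/cf/HITS.19328425._000001.pool.root.1"


def cache_url(cache_host, origin_host, path, port=1094):
    """Build root://<cache>:<port>//root://<origin>:<port>/<path> (hypothetical helper)."""
    # `path` starts with "/", so the result contains "//atlas/..." as in the commands above.
    return f"root://{cache_host}:{port}//root://{origin_host}:{port}/{path}"


def xrdcp_command(url, dest="./aa"):
    """Return the argument list for one xrdcp attempt (hypothetical helper)."""
    return ["xrdcp", "-np", "-f", "--cksum", "adler32:print", url, dest]


# One command per origin; nothing is executed here.
commands = [xrdcp_command(cache_url(CACHE_HOST, origin, PATH)) for origin in ORIGINS]

for cmd in commands:
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` on an xcache node to see which origins fail before and after a restart.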

I'm confused: is this an xcache service issue, or an xrootd reverse proxy issue?

thanks

Simon
On 2019-10-03 1:39 p.m., Simon Liu wrote:

Hi, xrootd support

I see strange issues at our diskless site, which uses an xcache setup. It works most of the time, but sometimes it gets 'stuck' and does not recover by itself.

I see two types of problems, and both require a restart of the xrootd service. I now have to run a cron job to detect and kill the problematic process and restart xrootd.
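
The detection step of such a cron job can be sketched as a scan of recent log lines for the fatal signatures quoted in this message. This is a minimal Python sketch, not the actual job: the function name `needs_restart`, the choice of patterns, and the threshold are all assumptions.

```python
import re

# Signatures copied from the log excerpts in this message. Treating them as
# restart triggers is an assumption, not xrootd guidance.
FATAL_PATTERNS = [
    re.compile(r"Unable to resolve localfile:1094"),
    re.compile(r"\[FATAL\] Invalid address"),
    re.compile(r"Unable to recover: \[FATAL\] Auth failed"),
]


def needs_restart(log_lines, threshold=1):
    """Return True if at least `threshold` lines match a known fatal signature."""
    hits = sum(
        1 for line in log_lines
        if any(pattern.search(line) for pattern in FATAL_PATTERNS)
    )
    return hits >= threshold
```

The cron job would feed this the tail of the xrootd log and run the site-specific kill and restart commands only when it returns True.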

I hope these problems are known to you, thanks.

See the logs below.

The first one is from the caching function.


[2019-10-02 13:30:17.876434 -0700][Error  ][Utility           ] Unable to resolve localfile:1094: Name or service not known

[2019-10-02 13:30:17.876485 -0700][Error  ][PostMaster        ] [localfile:1094 #0] Unable to resolve IP address for the host

[2019-10-02 13:30:17.876503 -0700][Error  ][XRootD            ] [localfile:1094] Unable to send the message kXR_open (file: /dev/shm/atlas/atlas/atlasdatadisk/rucio/mc15_13TeV/5b/62/EVNT.04972714._000040.pool.root.1.meta4, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ): [FATAL] Invalid address

The xrootd daemon is still running, but it does not serve any requests until it is restarted.

All subsequent requests get this type of error.

191002 12:26:41 21236 Posix_Open: [FATAL] Invalid address open root://u28@localfile:1094//dev/shm/atlas/atlas/atlasdatadisk/rucio/data18_13TeV/9b/92/data18_13TeV.00354309.physics_Main.merge.AOD.f947_m1993._lb0217._0006.1.meta4?pss.tid=atprd001.29289:[log in to unmask]&oss.lcl=1

191002 12:26:41 21236 ofs_open: atprd001.29289:[log in to unmask] Unable to open /root:/xrootd.lcg.triumf.ca:1094/atlas/atlasdatadisk/rucio/data18_13TeV/9b/92/data18_13TeV.00354309.physics_Main.merge.AOD.f947_m1993._lb0217._0006.1; no route to host


The other type is likely related to the forwarding function; perhaps an authentication issue?

[2019-10-03 04:18:14.410005 -0700][Error  ][XRootDTransport   ] [[log in to unmask]:1094 #0.0] No protocols left to try
[2019-10-03 04:18:14.410041 -0700][Error  ][AsyncSock         ] [[log in to unmask]:1094 #0.0] Socket error while handshaking: [FATAL] Auth failed
[2019-10-03 04:18:14.410083 -0700][Error  ][PostMaster        ] [[log in to unmask]:1094 #0] elapsed = 0, pConnectionWindow = 120 seconds.
[2019-10-03 04:18:14.410103 -0700][Error  ][PostMaster        ] [[log in to unmask]:1094 #0] Unable to recover: [FATAL] Auth failed.
[2019-10-03 04:18:14.410116 -0700][Error  ][XRootD            ] [[log in to unmask]:1094] Impossible to send message kXR_open (file: /atlas/atlasdatadisk/rucio/mc16_13TeV/41/26/HITS.10701323._000037.pool.root.1?oss.lcl=1&pss.tid=xrootd.976399:39@atpool05, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ). Trying to recover.


It tried to recover, but never succeeded until xrootd was restarted. All requests after the problem started get the same error message, except those for cached files.


[2019-10-02 19:02:00.755505 -0700][Warning][PostMaster        ] Please note that the 'root://localfile//path/filename.meta4' semantic is now deprecated, use 'file://localhost/path/filename.meta4'instead!

191002 19:02:00 291696 XrdFileCache_File: error File::ProcessBlockResponse block 0x7f17e8002620  1 error=-2 /atlas/rucio/mc16_13TeV/5a/1e/HITS.10701323._000279.pool.root.1

191002 19:02:00 291696 XrdFileCache_File: error File::Read() io 0x7f178c002200, block 1 finished with error 2 No such file or directory /atlas/rucio/mc16_13TeV/5a/1e/HITS.10701323._000279.pool.root.1

191002 19:02:00 291696 XrdFileCache_IO: warning IOEntireFile::Read() pass to origin, File::Read() exit status=-2, error=No such file or directory root://u42@localfile:1094//atlas/rucio/mc16_13TeV/5a/1e/HITS.10701323._000279.pool.root.1?xrd.gsiusrpxy=/home/condor/execute/dir_14037/MwbNDmRj4ZvnjgorRmsz1l6ogsA7HpABFKDmrtRTDmABFKDmIOfXBn/user.proxy&xrd.wantprot=gsi,unix&pss.tid=atprd001.17282:[log in to unmask]&oss.lcl=1

191002 19:02:00 80678 Xrootdaio: atprd001.17282:[log in to unmask] XrdXrootdAio: Unable to read /root:/xrootd.lcg.triumf.ca:1094/atlas/atlasdatadisk/rucio/mc16_13TeV/5a/1e/HITS.10701323._000279.pool.root.1; No such file or directory

The configuration is quite simple; here it is.

## X509 configuration
xrootd.seclib /usr/lib64/libXrdSec.so
sec.protparm gsi -vomsfun:/usr/lib64/libXrdSecgsiVOMS.so -vomsfunparms:certfmt=raw|vos=atlas|grps=/atlas
sec.protocol /usr/lib64 gsi -ca:1 -crl:0 -gridmap:/dev/null
acc.authdb /etc/xrootd/auth_file
#ofs.authorize
acc.authrefresh 60
##

# This is TRIUMF
all.sitename TRIUMF-LCG2
all.adminpath /var/run/xrootd
all.pidpath /var/run/xrootd
all.adminpath /var/spool/xrootd
xrootd.prepare /var/log/xrootd/

all.export /atlas r/o
all.export /root:/
all.export /xroot:/

all.role server
xrootd.async maxtot 16384 limit 32

ofs.osslib /usr/lib64/libXrdPss.so

# xcache local file caching, cephfs though
oss.localroot  /xcachecephfs/xcache/namespace
oss.space meta /xcachecephfs/xcache/meta
oss.space data /xcachecephfs/xcache/data

oss.path /atlas/rucio r/w
oss.path /root:       r/w

pss.origin localfile:1094
#pss.origin xrootd.triumf.lcg:1094
pss.cachelib /usr/lib64/libXrdFileCache.so
pss.config streams 512
pss.namelib -lfncache -lfn2pfn /usr/lib64/XrdName2NameDCP4RUCIO.so
#
pfc.ram 12g
pfc.diskusage 0.8 0.9
pfc.spaces data meta
pfc.blocksize 1M
pfc.prefetch 0


#https

#http.desthttps yes
http.secxtractor /usr/lib64/libXrdHttpVOMS.so

xrd.protocol XrdHttp:2880 /usr/lib64/libXrdHttp-4.so
http.cadir   /etc/grid-security/certificates
http.cert    /etc/grid-security/xrd/xrdcert.pem
http.key     /etc/grid-security/xrd/xrdkey.pem
http.listingredir https://atpool05:2880/



Thanks

Simon


