Hi Andy,

The config is literally copied from the boilerplate example in the Disk Caching Proxy Cluster section of the documentation :) - I did the replication because that's how the example does it.

It is entirely correct that there is just one server for now - I was building the system up from a single server once it actually worked. (If you read back up the thread, you will see that in this case it's not even redirecting when there's just one server to consider.)

Sam

On Wed, 4 Dec 2019 at 22:38, Andrew Hanushevsky <[log in to unmask]> wrote:
>
> Hi Sam,
>
> Well, xrdmapc shows a redirector (cephs03) with one server (cephc01). Is
> that not correct? I assume you expected more nodes, yes? The config file
> looks reasonable, though it doesn't have to repeat so much information.
> Also, for consistency oss.localroot should be the same everywhere (though
> as specified it's essentially the same).
>
> Andy
>
> On Wed, 4 Dec 2019, Sam Skipsey wrote:
>
> > Hi Matevz,
> >
> > So:
> >
> > [root@cephs03 ~]# xrdmapc cephs03:1094
> > 0**** cephs03.beowulf.cluster:1094
> >       Srv cephc01.beowulf.cluster:1094
> >
> > I agree, though, that it looks like the cmsd and xrootd aren't
> > properly connecting - I see none of the expected stuff in the xrootd
> > log on the redirector.
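(A quick way to check the cmsd/xrootd hookup on the redirector: the xrootd log should contain a "cms_ClientMan: Connected" line, as in the healthy example Matevz pastes further down this thread. A minimal sketch, run here against an embedded sample snippet so it is self-contained; in practice, point grep at your actual redirector xrootd log file:)

```shell
# Minimal sketch: a healthy redirector xrootd log contains a
# "cms_ClientMan: Connected" hookup line; if it is absent, the xrootd
# and its local cmsd never connected.  The sample log is embedded so
# this runs standalone; substitute your real log file in practice.
sample_log='------ File system manager initialization completed.
191203 11:14:07 51109 XrdInet: Connected to xrootd.t2.ucsd.edu:2041
191203 11:14:07 51109 cms_ClientMan: Connected to xrootd.t2.ucsd.edu v 3
------ xrootd protocol initialization completed.'

if printf '%s\n' "$sample_log" | grep -q 'cms_ClientMan: Connected'; then
    status=connected
else
    status=not-connected
fi
echo "cmsd hookup: $status"
```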
> >
> > The config file (shared between the redirector and the server) is:
> >
> > #Clustered cache config (redirector also does authentication with lcmaps)
> >
> > all.manager cephs03.beowulf.cluster:1213
> > all.export /xrootd:/ stage r/o
> > all.export /root:/ stage r/o
> > all.export * stage r/o
> >
> > if cephs03+
> >
> > all.role manager
> > all.export /xrootd:/ stage r/o
> > all.export /root:/ stage r/o
> > all.export * stage r/o
> >
> > xrootd.trace emsg login stall redirect
> > xrd.trace conn
> > cms.trace defer files redirect stage
> >
> > #the cmsd on the cache node
> > else if exec cmsd
> >
> > all.role server
> > all.export /xrootd:/ stage
> > all.export /root:/ stage
> > all.export * stage
> > oss.localroot /cache
> >
> > #the xrootd on the cache node
> > else
> >
> > all.role server
> >
> > ofs.osslib libXrdPss.so
> > pss.cachelib libXrdFileCache.so
> > pfc.ram 16g
> > pfc.trace info
> > pfc.diskusage 0.90 0.95
> > oss.localroot /cache/
> > cms.trace defer redirect stage
> >
> > all.export /xroot:/ r/w
> > all.export /root:/ r/w
> > all.export * r/w
> >
> > pss.origin localhost:1095
> >
> > xrd.allow host *.beowulf.cluster
> >
> > #inbound security protocol, for authenticating to the xrootd-ceph
> > #sec.protocol
> > xrootd.seclib /opt/xrootd/lib64/libXrdSec.so
> > sec.protocol sss -s /etc/gateway/xrootd/sss.keytab.grp -c
> > /etc/gateway/xrootd/sss.keytab.grp
> > sec.protbind localhost:1095 only sss
> >
> > xrd.report 127.0.0.1:9527 every 5s all
> >
> > #try changing some pss options to tune this connection
> >
> > pss.setopt DebugLevel 3
> > pss.setopt ParStreamsPerPhyConn 15
> > pfc.blocksize 16M
> >
> > fi
> >
> > [Note that the shared secret is the same for all parts of the cluster,
> > at the moment]
> > At various points, I've messed with bits of the file (removing some of
> > the stages, or even entire exports, and a bunch of other things), but
> > this is the current config - using the default setup derived from the
> > Proxy config documentation pg 38, and adding your debugging lines.
> >
> > Sam
> >
> > On Tue, 3 Dec 2019 at 20:37, Matevz Tadel <[log in to unmask]> wrote:
> >>
> >> On 2019-12-03 06:38, Sam Skipsey wrote:
> >>> Okay, so what I have, including logs from all the relevant bits, is below.
> >>>
> >>> I restarted all of the services (after adding the traces as requested)
> >>> at around 14:30, so all of the servers come up.
> >>> About 2 minutes later, I tried (from cephc01, using xrdcp) to copy a
> >>> file via the redirector xrootd (on cephs03). This only seems to show
> >>> up on the redirector xrootd itself - there's no record of anything in
> >>> the other logs as far as I can see.
> >>
> >> Hmmh, this is strange: the cmsd on the redirector never seems to ask the servers.
> >>
> >> redirector cmsd:
> >> 191203 11:57:35 51827 Select seeking /store/data/Run2018D/MET/MINIAOD/PromptReco-v2/000/322/431/00000/4AD1B758-3BB8-E811-92B8-FA163EA98227.root
> >> 191203 11:57:35 51827 redirector.51097:17@xrootd XrdLink: Setting ref to 2+-1 post=0
> >> 191203 11:57:35 51475 Dispatch server.6875:24@xcache-05:1094 for have dlen=107
> >> 191203 11:57:35 51475 server.6875:24@xcache-05 XrdLink: Setting ref to 1+1 post=0
> >> 191203 11:57:35 51341 XrdSched: running server inq=0
> >> 191203 11:57:35 51341 server.6875:24@xcache-05:1094 do_Have: /store/data/Run2018D/MET/MINIAOD/PromptReco-v2/000/322/431/00000/4AD1B758-3BB8-E811-92B8-FA163EA98227.root
> >>
> >> redirector xrootd:
> >> 191203 11:57:35 51514 nobody.351:[log in to unmask] XrootdProtocol: 0100 req=open dlen=106
> >> 191203 11:57:35 51514 nobody.351:[log in to unmask] XrootdProtocol: 0100 open r /store/data/Run2018D/MET/MINIAOD/PromptReco-v2/000/322/431/00000/4AD1B758-3BB8-E811-92B8-FA163EA98227.root
> >> 191203 11:57:35 51514 nobody.351:[log in to unmask] ofs_open: 0-660 fn=/store/data/Run2018D/MET/MINIAOD/PromptReco-v2/000/322/431/00000/4AD1B758-3BB8-E811-92B8-FA163EA98227.root
> >> 191203 11:57:35 51109 Receive xrootd 26
> >> bytes on 60327927
> >> 191203 11:57:35 51109 Decode xrootd redirects nobody.351:[log in to unmask] to xcache-05.t2.ucsd.edu:1094 /store/data/Run2018D/MET/MINIAOD/PromptReco-v2/000/322/431/00000/4AD1B758-3BB8-E811-92B8-FA163EA98227.root
> >> 191203 11:57:35 51514 nobody.351:[log in to unmask] XrootdProtocol: 0100 redirecting to xcache-05.t2.ucsd.edu:1094
> >> 191203 11:57:35 51514 nobody.351:[log in to unmask] XrootdResponse: 0100 sending 25 data bytes; status=4004
> >> 191203 11:57:35 51514 nobody.351:[log in to unmask] ofs_close: use=0 fn=dummy
> >> 191203 11:57:35 51668 XrdSched: running cuser3.270:1838@cabinet-5-5-5 inq=0
> >>
> >> So I'm suspecting the redirector cmsd and xrootd never get connected; do you see something like this in the redirector xrootd log:
> >>
> >> #### Initialization
> >> ------ File system manager initialization completed.
> >> 191203 11:14:07 51097 XrootdAioReq: Max aio/req=8; aio/srv=4096; Quantum=131072
> >> 191203 11:14:07 51097 XrootdAioReq: Adding 18 aioreq objects.
> >> 191203 11:14:07 51097 XrootdAio: Adding 18 aio objects; 4096 pending.
> >> 191203 11:14:07 51097 XrdSched: scheduling xrootd protocol anchor in 3600 seconds
> >> 191203 11:14:07 51097 XrdSched: scheduling transit protocol anchor in 3600 seconds
> >> Config warning: 'xrootd.prepare logdir' not specified; prepare tracking disabled.
> >> 191203 11:14:07 51111 XrdXeq: Admin traffic thread started
> >> 191203 11:14:07 51109 XrdInet: Connected to xrootd.t2.ucsd.edu:2041 <-------
> >> 191203 11:14:07 51109 cms_ClientMan: Connected to xrootd.t2.ucsd.edu v 3 <-------
> >> 191203 11:14:07 51109 Hookup xrootd.t2.ucsd.edu qt=178ms rw=2
> >> ------ xrootd protocol initialization completed.
> >>
> >> What does 'xrdmapc redirector:1094' give you?
> >>
> >> Would you mind sharing your redirector config?
> >>
> >> Andy, do you have a better idea what to look for / try?
> >>
> >> Matevz
> >>
> >>> Sam
> >>>
> >>> redirector cmsd
> >>>
> >>> 191203 14:30:14 18957 Protocol: Primary server.202124:22@cephc01:1094 logged in.
> >>> 191203 14:30:14 18957 Protocol: server.202124:22@cephc01:1094 system
> >>> ID: [log in to unmask] 1213cephs03.beowulf.cluster
> >>> =====> Routing for 10.1.50.11: local pub4 prv4
> >>> =====> Route all4: 10.1.50.11 Dest=[::10.1.50.11]:1094
> >>> 191203 14:31:16 18939 Config: manager service enabled.
> >>> 191203 14:31:16 18953 State: Status changed to active + staging
> >>>
> >>> ----
> >>> redirector xrootd
> >>>
> >>> ------ xrootd [log in to unmask]:1094 initialization completed.
> >>> 191203 14:32:26 18970 XrootdXeq: root.218779:20@cephc01 pvt IPv4 login
> >>> 191203 14:32:26 18970 root.218779:20@cephc01 XrootdResponse: sending
> >>> err 3011: No servers have read access to the file
> >>> 191203 14:32:26 18970 XrootdXeq: root.218779:20@cephc01 disc 0:00:00
> >>>
> >>> -----
> >>> server cmsd
> >>>
> >>> ------ cmsd [log in to unmask]:46427 initialization completed.
> >>> 191203 14:30:14 218768 do_Login:: Primary server 218748 logged in;
> >>> data port is 1094
> >>> Config Connecting to 1 manager and 1 site.
> >>> 191203 14:30:14 218729 Config: server service enabled.
> >>> 191203 14:30:14 218770 State: Status changed to active + staging
> >>> 191203 14:30:14 218740 ManTree: Now connected to 1 root node(s)
> >>> 191203 14:30:14 218740 Protocol: Logged into cephs03
> >>>
> >>> -----
> >>> server xrootd
> >>>
> >>> 191203 14:30:14 218765 cms_Finder: Connected to cmsd via
> >>> /tmp/cache/.olb/olbd.admin
> >>> ------ xrootd protocol initialization completed.
> >>> ------ xrootd [log in to unmask]:1094 initialization completed.
> >>> 191203 14:30:15 218764 XrdFileCache_Manager: info Cache::Purge() Started.
> >>> 191203 14:30:15 218764 XrdFileCache_Manager: info Cache::Purge()
> >>> Finished, removed 0 data files, total size 0, bytes to remove at end:
> >>> 0
> >>> 191203 14:35:15 218764 XrdFileCache_Manager: info Cache::Purge() Started.
> >>> 191203 14:35:15 218764 XrdFileCache_Manager: info Cache::Purge()
> >>> Finished, removed 0 data files, total size 0, bytes to remove at end:
> >>> 0
> >>>
> >>> On Mon, 2 Dec 2019 at 19:33, Matevz Tadel <[log in to unmask]> wrote:
> >>>>
> >>>> I'd try this:
> >>>>
> >>>> redirector:
> >>>> xrootd.trace emsg login stall redirect
> >>>> xrd.trace conn
> >>>> cms.trace defer files redirect stage
> >>>>
> >>>> server:
> >>>> # For debug, to see files being searched
> >>>> # cms.trace defer files redirect stage
> >>>> cms.trace defer redirect stage
> >>>>
> >>>> You say xrdmapc shows the configured servers, right?
> >>>>
> >>>> We had some trouble with IPv4/6 at UCSD lately; clients will be redirected to
> >>>> IPv6 servers only if they come in via IPv6 to the redirector.
> >>>>
> >>>> Can you restart the redirector cmsd and then (after 30 sec) look at:
> >>>>
> >>>> [1131] root@xrootd /var/log/xrootd/xcacheucsd# grep Routing cmsd.log | sort
> >>>>
> >>>> =====> Routing for bcache-1.t2.ucsd.edu: local pub4 prv4 pub6 prv6
> >>>> =====> Routing for bcache-1.t2.ucsd.edu: local pub4 prv4 pub6 prv6
> >>>> =====> Routing for xcache-00.t2.ucsd.edu: local pub4 prv4 pub6 prv6
> >>>> =====> Routing for xcache-01.t2.ucsd.edu: local pub4 prv4 pub6 prv6
> >>>> =====> Routing for xcache-02.t2.ucsd.edu: local pub4 prv4 pub6 prv6
> >>>> =====> Routing for xcache-03.t2.ucsd.edu: local pub4 prv4 pub6 prv6
> >>>> =====> Routing for xcache-04.t2.ucsd.edu: local pub4 prv4 pub6 prv6
> >>>> =====> Routing for xcache-05.t2.ucsd.edu: local pub4 prv4 pub6 prv6
> >>>> =====> Routing for xcache-06.t2.ucsd.edu: local pub4 prv4 pub6 prv6
> >>>> =====> Routing for xcache-07.t2.ucsd.edu: local pub4 prv4 pub6 prv6
> >>>> =====> Routing for xcache-08.t2.ucsd.edu: local pub4 prv4 pub6 prv6
> >>>> =====> Routing for xcache-09.t2.ucsd.edu: local pub4 prv4 pub6 prv6
> >>>> =====> Routing for xcache-10.t2.ucsd.edu: local pub4 prv4
> >>>> =====> Routing for xcache-11.t2.ucsd.edu: local pub4 prv4 pub6 prv6
> >>>> =====> Routing for xrd-cache-1.ultralight.org: local pub4 prv4 pub6 prv6
> >>>> =====> Routing for xrd-cache-2.ultralight.org: local pub4 prv4 pub6 prv6
> >>>>
> >>>> Matevz
> >>>>
> >>>> On 2019-12-02 11:24, Sam Skipsey wrote:
> >>>>> No, I explicitly did that. (As I noted, there's a typo for that in the
> >>>>> example, as it uses "rw" not "r/w", which doesn't work.)
> >>>>>
> >>>>> I've tried basically every variation of stage/nostage and r/w / r/o at
> >>>>> different parts of the network, but the manager cmsd never seems to
> >>>>> actually consider the servers (even when I've already pre-staged the
> >>>>> file it's looking for by directly talking to the server xrootd service
> >>>>> and getting it to cache).
> >>>>>
> >>>>> Sam
> >>>>>
> >>>>> On Mon, 2 Dec 2019 at 19:09, Matevz Tadel <[log in to unmask]> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Before I go looking at what's wrong on the web page, have a look at this, page 21:
> >>>>>>
> >>>>>> https://indico.cern.ch/event/727208/contributions/3444604/
> >>>>>>
> >>>>>> Maybe you're missing the "r/w for xrootd, stage r/o for cmsd" trick?
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Matevz
> >>>>>>
> >>>>>> On 2019-11-28 08:22, Sam Skipsey wrote:
> >>>>>>> Hello everyone,
> >>>>>>>
> >>>>>>> So, I have another question, working entirely from the documentation
> >>>>>>> on xrootd.org.
> >>>>>>>
> >>>>>>> In the documentation for cache configuration, there's an example of
> >>>>>>> how to set up a cluster of disk caching proxies:
> >>>>>>>
> >>>>>>> https://xrootd.slac.stanford.edu/doc/dev410/pss_config.pdf [page 38;
> >>>>>>> you can't copy it because, weirdly, it's an image]
> >>>>>>>
> >>>>>>> I'm following that exactly (except for fixing the typo where the
> >>>>>>> example has an export using "rw" and not "r/w" as an option), and,
> >>>>>>> well, it just doesn't seem to work.
> >>>>>>>
> >>>>>>> If I talk directly to the server that the proxies talk to: I can get a file.
> >>>>>>> If I talk to an individual proxy: I can also get a file (and it is cached).
> >>>>>>> If I talk to the *redirector*, I get, with debugging on: "Open has
> >>>>>>> returned with status [ERROR] Server responded with an error: [3011] No
> >>>>>>> servers have read access to the file"
> >>>>>>>
> >>>>>>> The redirector logs show that the cmsd on the proxy logs in (and is
> >>>>>>> listed as a "server" in its list of servers), and the proxy cmsd logs
> >>>>>>> also show that it happily registers to the redirector.
> >>>>>>>
> >>>>>>> How do I debug this?
> >>>>>>>
> >>>>>>> I've already tried adding and removing options to the various exports,
> >>>>>>> making sure that all the relevant ports are open, etc.
> >>>>>>>
> >>>>>>> Sam
> >>>>>>>
> >>>>>>> ########################################################################
> >>>>>>> Use REPLY-ALL to reply to list
> >>>>>>>
> >>>>>>> To unsubscribe from the XROOTD-L list, click the following link:
> >>>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
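(For reference, the "r/w for xrootd, stage r/o for cmsd" trick Matevz points to on page 21 of the linked slides can be sketched roughly as below. This is an illustrative fragment built only from directives already shown in this thread, not a verified working config; the exported path is a placeholder.)

```
# Sketch of the export-split trick: the cmsd advertises the path as
# stage r/o so the redirector will select the cache node even for
# files it has not cached yet, while the xrootd exports the same
# path r/w so the caching proxy can write.  Placeholder path; adapt
# to your own namespace.
if exec cmsd
all.role server
all.export /root:/ stage r/o
else
all.role server
all.export /root:/ r/w
fi
```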