Hi Sam,

Well, xrdmapc shows a redirector (cephs03) with one server (cephc01). Is
that not correct? I assume you expected more nodes, yes? The config file
looks reasonable, though it doesn't have to repeat so much information.
Also, for consistency, oss.localroot should be the same everywhere (though
as specified it's essentially the same).
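
The oss.localroot point can be checked mechanically. What follows is only
a minimal sketch, not part of any XRootD tooling: the helper names
(localroot_values, inconsistent_localroots) are made up for illustration,
and it assumes the config is a plain text file with one directive per line
and "#" comments, as in the config quoted below.

```python
import re


def localroot_values(config_text):
    """Return every oss.localroot value in the config text, in file order."""
    values = []
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # skip comment lines
        match = re.match(r"oss\.localroot\s+(\S+)", line)
        if match:
            values.append(match.group(1))
    return values


def inconsistent_localroots(config_text):
    """Return the distinct spellings if more than one is used, else an empty set."""
    distinct = set(localroot_values(config_text))
    return distinct if len(distinct) > 1 else set()


# Reduced excerpt of the shared config: only the two localroot lines matter.
sample = """
else if exec cmsd
oss.localroot /cache
else
oss.localroot /cache/
fi
"""

if __name__ == "__main__":
    print(sorted(inconsistent_localroots(sample)))
```

The comparison is deliberately literal: "/cache" and "/cache/" are reported
as different spellings even though they name the same directory, which is
the kind of inconsistency worth tidying up.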

Andy

On Wed, 4 Dec 2019, Sam Skipsey wrote:

> Hi Matevz,
>
> So:
>
> [root@cephs03 ~]# xrdmapc cephs03:1094
> 0**** cephs03.beowulf.cluster:1094
>      Srv cephc01.beowulf.cluster:1094
>
>
> I agree, though, that it looks like the cmsd and xrootd aren't
> properly connecting - I see none of the expected stuff in the xrootd
> log on the redirector.
>
> The config file (shared between the redirector and the server) is:
>
> #Clustered cache config (redirector also does authentication with lcmaps)
>
> all.manager cephs03.beowulf.cluster:1213
> all.export /xrootd:/ stage r/o
> all.export /root:/ stage r/o
> all.export * stage r/o
>
> if cephs03+
>
> all.role manager
> all.export /xrootd:/ stage r/o
> all.export /root:/ stage r/o
> all.export * stage r/o
>
> xrootd.trace emsg login stall redirect
> xrd.trace conn
> cms.trace defer files redirect stage
>
>
>
>
> #the cmsd on the cache node
> else if exec cmsd
>
> all.role server
> all.export /xrootd:/ stage
> all.export /root:/ stage
> all.export * stage
> oss.localroot /cache
>
> #the xrootd on the cache node
> else
>
> all.role server
>
> ofs.osslib    libXrdPss.so
> pss.cachelib  libXrdFileCache.so
> pfc.ram      16g
> pfc.trace     info
> pfc.diskusage 0.90 0.95
> oss.localroot /cache/
> cms.trace defer redirect stage
>
> all.export /xroot:/ r/w
> all.export /root:/ r/w
> all.export * r/w
>
> pss.origin localhost:1095
>
> xrd.allow host *.beowulf.cluster
>
> #inbound security protocol, for authenticating to the xrootd-ceph
> #sec.protocol
> xrootd.seclib /opt/xrootd/lib64/libXrdSec.so
> sec.protocol sss -s /etc/gateway/xrootd/sss.keytab.grp -c /etc/gateway/xrootd/sss.keytab.grp
> sec.protbind localhost:1095 only sss
>
> xrd.report 127.0.0.1:9527 every 5s all
>
>
> #try changing some pss options to tune this connection
>
> pss.setopt DebugLevel 3
> pss.setopt ParStreamsPerPhyConn 15
> pfc.blocksize 16M
>
>
> fi
>
>
> [Note that the shared secret is the same for all parts of the cluster,
> at the moment]
> At various points, I've messed with bits of the file (removing some of
> the stages, or even entire exports, and a bunch of other things), but
> this is the current config - using the default setup derived from the
> Proxy config documentation pg 38, and adding your debugging lines.
>
> Sam
>
>
>
> On Tue, 3 Dec 2019 at 20:37, Matevz Tadel <[log in to unmask]> wrote:
>>
>> On 2019-12-03 06:38, Sam Skipsey wrote:
>>> Okay, so what I have, including logs from all the relevant bits is below.
>>>
>>> I restarted all of the services (after adding the traces as requested)
>>> at around 14:30, so all of the servers come up.
>>> About 2 minutes later, I tried (from cephc01, using xrdcp) to copy a
>>> file via the redirector xrootd (on cephs03). This only seems to show
>>> up on the redirector xrootd itself - there's no record of anything in
>>> the other logs as far as I can see.
>>
>> Hmmh, this is strange: the cmsd on the redirector never seems to ask the servers.
>>
>> redirector cmsd:
>> 191203 11:57:35 51827 Select seeking /store/data/Run2018D/MET/MINIAOD/PromptReco-v2/000/322/431/00000/4AD1B758-3BB8-E811-92B8-FA163EA98227.root
>> 191203 11:57:35 51827 redirector.51097:17@xrootd XrdLink: Setting ref to 2+-1 post=0
>> 191203 11:57:35 51475 Dispatch server.6875:24@xcache-05:1094 for have dlen=107
>> 191203 11:57:35 51475 server.6875:24@xcache-05 XrdLink: Setting ref to 1+1 post=0
>> 191203 11:57:35 51341 XrdSched: running server inq=0
>> 191203 11:57:35 51341 server.6875:24@xcache-05:1094 do_Have: /store/data/Run2018D/MET/MINIAOD/PromptReco-v2/000/322/431/00000/4AD1B758-3BB8-E811-92B8-FA163EA98227.root
>>
>> redirector xrootd:
>> 191203 11:57:35 51514 nobody.351:[log in to unmask] XrootdProtocol: 0100 req=open dlen=106
>> 191203 11:57:35 51514 nobody.351:[log in to unmask] XrootdProtocol: 0100 open r /store/data/Run2018D/MET/MINIAOD/PromptReco-v2/000/322/431/00000/4AD1B758-3BB8-E811-92B8-FA163EA98227.root
>> 191203 11:57:35 51514 nobody.351:[log in to unmask] ofs_open: 0-660 fn=/store/data/Run2018D/MET/MINIAOD/PromptReco-v2/000/322/431/00000/4AD1B758-3BB8-E811-92B8-FA163EA98227.root
>> 191203 11:57:35 51109 Receive xrootd 26 bytes on 60327927
>> 191203 11:57:35 51109 Decode xrootd redirects nobody.351:[log in to unmask] to xcache-05.t2.ucsd.edu:1094 /store/data/Run2018D/MET/MINIAOD/PromptReco-v2/000/322/431/00000/4AD1B758-3BB8-E811-92B8-FA163EA98227.root
>> 191203 11:57:35 51514 nobody.351:[log in to unmask] XrootdProtocol: 0100 redirecting to xcache-05.t2.ucsd.edu:1094
>> 191203 11:57:35 51514 nobody.351:[log in to unmask] XrootdResponse: 0100 sending 25 data bytes; status=4004
>> 191203 11:57:35 51514 nobody.351:[log in to unmask] ofs_close: use=0 fn=dummy
>> 191203 11:57:35 51668 XrdSched: running cuser3.270:1838@cabinet-5-5-5 inq=0
>>
>> So, I suspect the redirector cmsd and xrootd never get connected. Do you see something like this in the redirector xrootd log:
>> #### Initialization
>> ------ File system manager initialization completed.
>> 191203 11:14:07 51097 XrootdAioReq: Max aio/req=8; aio/srv=4096; Quantum=131072
>> 191203 11:14:07 51097 XrootdAioReq: Adding 18 aioreq objects.
>> 191203 11:14:07 51097 XrootdAio: Adding 18 aio objects; 4096 pending.
>> 191203 11:14:07 51097 XrdSched: scheduling xrootd protocol anchor in 3600 seconds
>> 191203 11:14:07 51097 XrdSched: scheduling transit protocol anchor in 3600 seconds
>> Config warning: 'xrootd.prepare logdir' not specified; prepare tracking disabled.
>> 191203 11:14:07 51111 XrdXeq: Admin traffic thread started
>> 191203 11:14:07 51109 XrdInet: Connected to xrootd.t2.ucsd.edu:2041        <-------
>> 191203 11:14:07 51109 cms_ClientMan: Connected to xrootd.t2.ucsd.edu v 3   <-------
>> 191203 11:14:07 51109 Hookup xrootd.t2.ucsd.edu qt=178ms rw=2
>> ------ xrootd protocol initialization completed.
>>
>> What does 'xrdmapc redirector:1094' give you?
>>
>> Would you mind sharing your redirector config?
>>
>> Andy, do you have a better idea what to look for / try?
>>
>> Matevz
>>
>>
>>
>>
>>> Sam
>>>
>>>
>>>
>>> redirector cmsd
>>>
>>> 191203 14:30:14 18957 Protocol: Primary server.202124:22@cephc01:1094 logged in.
>>> 191203 14:30:14 18957 Protocol: server.202124:22@cephc01:1094 system ID: [log in to unmask] 1213cephs03.beowulf.cluster
>>> =====> Routing for 10.1.50.11: local pub4 prv4
>>> =====> Route all4: 10.1.50.11 Dest=[::10.1.50.11]:1094
>>> 191203 14:31:16 18939 Config: manager service enabled.
>>> 191203 14:31:16 18953 State: Status changed to active + staging
>>>
>>>
>>> ----
>>> redirector xrootd
>>>
>>> ------ xrootd [log in to unmask]:1094 initialization completed.
>>> 191203 14:32:26 18970 XrootdXeq: root.218779:20@cephc01 pvt IPv4 login
>>> 191203 14:32:26 18970 root.218779:20@cephc01 XrootdResponse: sending err 3011: No servers have read access to the file
>>> 191203 14:32:26 18970 XrootdXeq: root.218779:20@cephc01 disc 0:00:00
>>>
>>>
>>> -----
>>> server cmsd
>>>
>>> ------ cmsd [log in to unmask]:46427 initialization completed.
>>> 191203 14:30:14 218768 do_Login:: Primary server 218748 logged in; data port is 1094
>>> Config Connecting to 1 manager and 1 site.
>>> 191203 14:30:14 218729 Config: server service enabled.
>>> 191203 14:30:14 218770 State: Status changed to active + staging
>>> 191203 14:30:14 218740 ManTree: Now connected to 1 root node(s)
>>> 191203 14:30:14 218740 Protocol: Logged into cephs03
>>>
>>>
>>> -----
>>> server xrootd
>>>
>>> 191203 14:30:14 218765 cms_Finder: Connected to cmsd via /tmp/cache/.olb/olbd.admin
>>> ------ xrootd protocol initialization completed.
>>> ------ xrootd [log in to unmask]:1094 initialization completed.
>>> 191203 14:30:15 218764 XrdFileCache_Manager: info Cache::Purge() Started.
>>> 191203 14:30:15 218764 XrdFileCache_Manager: info Cache::Purge() Finished, removed 0 data files, total size 0, bytes to remove at end: 0
>>> 191203 14:35:15 218764 XrdFileCache_Manager: info Cache::Purge() Started.
>>> 191203 14:35:15 218764 XrdFileCache_Manager: info Cache::Purge() Finished, removed 0 data files, total size 0, bytes to remove at end: 0
>>>
>>> On Mon, 2 Dec 2019 at 19:33, Matevz Tadel <[log in to unmask]> wrote:
>>>>
>>>> I'd try this:
>>>>
>>>> redirector:
>>>> xrootd.trace emsg login stall redirect
>>>> xrd.trace conn
>>>> cms.trace defer files redirect stage
>>>>
>>>> server:
>>>> # For debug, to see files being searched
>>>> # cms.trace    defer files redirect stage
>>>> cms.trace    defer redirect stage
>>>>
>>>>
>>>> You say xrdmapc shows the configured servers, right?
>>>>
>>>> We had some trouble with ipv4/6 at ucsd lately, clients will be redirected to
>>>> ipv6 servers only if they come in via ipv6 to the redirector.
>>>>
>>>> Can you restart redirector cmsd and then (after 30sec) look at:
>>>>
>>>> [1131] root@xrootd /var/log/xrootd/xcacheucsd# grep Routing cmsd.log | sort
>>>>
>>>> =====> Routing for bcache-1.t2.ucsd.edu: local pub4 prv4 pub6 prv6
>>>> =====> Routing for bcache-1.t2.ucsd.edu: local pub4 prv4 pub6 prv6
>>>> =====> Routing for xcache-00.t2.ucsd.edu: local pub4 prv4 pub6 prv6
>>>> =====> Routing for xcache-01.t2.ucsd.edu: local pub4 prv4 pub6 prv6
>>>> =====> Routing for xcache-02.t2.ucsd.edu: local pub4 prv4 pub6 prv6
>>>> =====> Routing for xcache-03.t2.ucsd.edu: local pub4 prv4 pub6 prv6
>>>> =====> Routing for xcache-04.t2.ucsd.edu: local pub4 prv4 pub6 prv6
>>>> =====> Routing for xcache-05.t2.ucsd.edu: local pub4 prv4 pub6 prv6
>>>> =====> Routing for xcache-06.t2.ucsd.edu: local pub4 prv4 pub6 prv6
>>>> =====> Routing for xcache-07.t2.ucsd.edu: local pub4 prv4 pub6 prv6
>>>> =====> Routing for xcache-08.t2.ucsd.edu: local pub4 prv4 pub6 prv6
>>>> =====> Routing for xcache-09.t2.ucsd.edu: local pub4 prv4 pub6 prv6
>>>> =====> Routing for xcache-10.t2.ucsd.edu: local pub4 prv4
>>>> =====> Routing for xcache-11.t2.ucsd.edu: local pub4 prv4 pub6 prv6
>>>> =====> Routing for xrd-cache-1.ultralight.org: local pub4 prv4 pub6 prv6
>>>> =====> Routing for xrd-cache-2.ultralight.org: local pub4 prv4 pub6 prv6
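
The Routing listing above can also be scanned automatically for hosts with
no IPv6 entry. This is a rough sketch only: the function name
hosts_without_ipv6 is invented for illustration, and it assumes that the
pub6/prv6 tokens on a "Routing for" line indicate IPv6 routes, which is
how the lines read but is not confirmed anywhere in this thread.

```python
import re

# Matches cmsd log lines such as:
#   =====> Routing for xcache-10.t2.ucsd.edu: local pub4 prv4
ROUTING = re.compile(r"=====> Routing for (\S+): (.*)")


def hosts_without_ipv6(log_text):
    """Return hosts whose Routing line carries neither pub6 nor prv6."""
    missing = []
    for line in log_text.splitlines():
        match = ROUTING.search(line)
        if not match:
            continue
        host = match.group(1)
        flags = match.group(2).split()
        if "pub6" not in flags and "prv6" not in flags and host not in missing:
            missing.append(host)
    return missing


# Two lines copied from the listing above.
sample = """\
=====> Routing for xcache-09.t2.ucsd.edu: local pub4 prv4 pub6 prv6
=====> Routing for xcache-10.t2.ucsd.edu: local pub4 prv4
"""

if __name__ == "__main__":
    for host in hosts_without_ipv6(sample):
        print(host)
```

Feeding it the full cmsd.log instead of the sample would list every server
that, under the assumption above, is reachable over IPv4 only.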
>>>>
>>>> Matevz
>>>>
>>>> On 2019-12-02 11:24, Sam Skipsey wrote:
>>>>> No, I explicitly did that. (As I noted, there's a typo for that in the
>>>>> example, as it uses "rw" not "r/w" , which doesn't work).
>>>>>
>>>>> I've tried basically every variation of stage/nostage/ r/w / r/o at
>>>>> different parts of the network, but the manager cmsd never seems to
>>>>> actually consider the servers (even when I've already pre-staged the
>>>>> file it's looking for by directly talking to the server xrootd service
>>>>> and getting it to cache).
>>>>>
>>>>> Sam
>>>>>
>>>>> On Mon, 2 Dec 2019 at 19:09, Matevz Tadel <[log in to unmask]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Before I go looking at what's wrong on the web page, have a look at this, page 21:
>>>>>>
>>>>>> https://indico.cern.ch/event/727208/contributions/3444604/
>>>>>>
>>>>>> Maybe you're missing the r/w for xrootd, stage r/o for cmsd trick?
>>>>>>
>>>>>> Cheers,
>>>>>> Matevz
>>>>>>
>>>>>> On 2019-11-28 08:22, Sam Skipsey wrote:
>>>>>>> Hello everyone,
>>>>>>>
>>>>>>> So, I have another question, working entirely from the documentation
>>>>>>> on xrootd.org
>>>>>>>
>>>>>>> In the documentation for cache configuration, there's an example of
>>>>>>> how to set up a cluster of disk caching proxies:
>>>>>>>
>>>>>>> https://xrootd.slac.stanford.edu/doc/dev410/pss_config.pdf [page 38,
>>>>>>> you can't copy it because, weirdly, it's an image]
>>>>>>>
>>>>>>> I'm following that exactly (except for fixing the typo where the
>>>>>>> example has an export using "rw" and not "r/w" as an option), and,
>>>>>>> well, it just doesn't seem to work.
>>>>>>>
>>>>>>> If I talk directly to the server that the proxies talk to: I can get a file.
>>>>>>> If I talk to an individual proxy: I can also get a file (and it is cached)
>>>>>>> If I talk to the *redirector*, I get, with debugging on "Open has
>>>>>>> returned with status [ERROR] Server responded with an error: [3011] No
>>>>>>> servers have read access to the file"
>>>>>>>
>>>>>>> The redirector logs show that the cmsd on the proxy logs in (and is
>>>>>>> listed as a "server" in its list of servers), and the proxy cmsd logs
>>>>>>> also show that it happily registers to the redirector.
>>>>>>>
>>>>>>> How do I debug this?
>>>>>>>
>>>>>>> I've already tried adding and removing options to the various exports,
>>>>>>> making sure that all the relevant ports are open, etc.
>>>>>>>
>>>>>>> Sam
>>>>>>>
>>>>>>> ########################################################################
>>>>>>> Use REPLY-ALL to reply to list
>>>>>>>
>>>>>>> To unsubscribe from the XROOTD-L list, click the following link:
>>>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
>>>>>>>
>>>>>>
>>>>
>>
>
