Andy's reply ...


-------- Forwarded Message --------
Subject: Re: tried= ignored on redirection
Date: Thu, 23 Oct 2014 17:16:49 -0700
From: Andrew Hanushevsky <[log in to unmask]>
Organization: Stanford University/SLAC
To: Matevz Tadel <[log in to unmask]>

Matevz,

I am not surprised you get back to UCSD. There are two paths to UCSD and the
tried only identifies one of them in a particular circuit. So, you are
almost guaranteed to trip over UCSD again unless the file is found
elsewhere.

Andy
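
[Illustration, not part of the original message] The point about two paths
can be sketched with a toy model. It assumes a manager that sees several
subscribers, two of which belong to the same cluster; the host names, the
cluster labels and both function names are hypothetical and are not taken
from XRootD. Excluding only the tried host name leaves the sibling host
selectable, while excluding the whole cluster of the tried host does not.

```python
from typing import NamedTuple


class Subscriber(NamedTuple):
    host: str
    cluster_id: str


# Hypothetical subscribers: two hosts front the same cluster.
SUBSCRIBERS = [
    Subscriber("redir-a.example.org", "cluster-1"),
    Subscriber("redir-b.example.org", "cluster-1"),
    Subscriber("redir-c.example.org", "cluster-2"),
]


def candidates_by_host(subscribers, tried_hosts):
    """Drop only the hosts named literally in the tried list."""
    tried = set(tried_hosts)
    return [s.host for s in subscribers if s.host not in tried]


def candidates_by_cluster(subscribers, tried_hosts):
    """Drop every host that shares a cluster ID with a tried host."""
    tried = set(tried_hosts)
    tried_clusters = {s.cluster_id for s in subscribers if s.host in tried}
    return [s.host for s in subscribers if s.cluster_id not in tried_clusters]


if __name__ == "__main__":
    tried = ["redir-a.example.org"]
    print("by host:   ", candidates_by_host(SUBSCRIBERS, tried))
    print("by cluster:", candidates_by_cluster(SUBSCRIBERS, tried))
```

In this sketch the by-host filter still offers redir-b.example.org, which
leads back into the cluster the client wanted to avoid.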

-----Original Message-----
From: Matevz Tadel
Sent: Thursday, October 23, 2014 4:26 PM
To: Andrew Hanushevsky
Cc: Lukasz Janyst ; xrootd-dev ; Jeff Dost
Subject: Re: tried= ignored on redirection

On 10/23/14 16:22, Matevz Tadel wrote:
> Thanks Andy!
>
> I'm still confused about why the UNL redirects me back to UCSD for the initial
> stat (and cmsxrootd1.fnal.gov does not).
>
> Retrying the same xrdcps now, I never get redirected back to UCSD for the actual
> open request -- but this is probably because the file can always be opened in EU
> now.

Yes, indeed ... if I take a file that is only at UCSD I still get back to UCSD
despite the initial tried. Shouldn't open fail then?

Did we manage to criss-cross the redirectors, meta-managers and meta-meta
managers beyond failure? :)

Matevz

> I managed to gather all configs and versions from involved redirectors, they are
> all here:
> http://uaf-2.t2.ucsd.edu/~matevz/tmp/xrd-tried/
>
> \m
>
> On 10/20/14 22:28, Andrew Hanushevsky wrote:
>>
>> xrdfed09.cern.ch
>> xrdfed10.cern.ch
>>
>> for 2:
>> cmsxrootd.fnal.gov is actually two hosts:
>> cmsxrootd1.fnal.gov
>> xrootd.unl.edu
>>
>> In such cases, to handle "tried" correctly, the client must specify the cluster
>> ID on the tried not the actual host that it used as that still leaves the other
>> host free to be tried. So, it may look to you as if the "tried" was ignored.
>> That didn't happen, it was honored but the other path was free to be used and
>> likely chosen.
>>
>> The whole idea of using cluster ID is good but from the client's perspective it
>> is problematic as the client needs to ask one of the servers (another
>> interaction) what its cluster ID is and use that in the "tried" string. Servers
>> will actually use their cluster ID when they need to tack on a tried on a static
>> redirect, which is why that works the same way every time.
>>
>> Now, that is the case in all releases prior to 4.x. In 4.x we realized that this
>> would be a problem and the cmsd resolves the cluster ID ahead of time. If two
>> hosts have the same cluster ID then one is considered the primary and the other
>> is considered the backup (this extends to 3 and so on). This makes it impossible
>> for the "tried" with a host name to get back into a cluster when you are trying
>> to ignore the cluster.
>>
>> While this solves the "tried" problem, it does have side-effects. It means that
>> only one of n redirectors will always be used for all requests and we won't
>> switch to another one unless that one fails. In many ways that's good because it
>> dramatically cuts down on duplicate queries. The alternative would be to resolve
>> a host name to the group of hosts that are actually the "same" and automatically
>> exclude the group. However, that would increase the duplicate queries and we
>> made the trade-off that duplicate queries were worse than always using the same
>> redirector until it failed.
>>
>> Andy
>>
>> On Mon, 20 Oct 2014, Matevz Tadel wrote:
>>
>>> Hi Andy,
>>>
>>> I guess this scrolled off the context window :) Do these logs help? Any ideas
>>> what I should still try?
>>>
>>> I can try getting config files from all cms meta managers ... I guess this
>>> would come in handy in any case :)
>>>
>>> Cheers,
>>> Matevz
>>>
>>> On 10/10/14 14:20, Matevz Tadel wrote:
>>>> I ran the same xrdcp to UNL and FNAL, 3 times each, all within a span of a
>>>> couple minutes [1]. Here are the logs (.txt) and results of
>>>> grep -e kXR_stat -e kXR_open -e kXR_redirect (.grep):
>>>>
>>>>    http://uaf-2.t2.ucsd.edu/~matevz/tmp/xrd-tried/
>>>>
>>>> Observations:
>>>>
>>>> The initial stat has two modes:
>>>>     1. it fails in fnal-1 and fnal-3;
>>>>     2. it is redirected back to UCSD for fnal-2 and all unls.
>>>> I find it really strange fnal-2 is different than 1 and 3 in this respect.
>>>>
>>>> For 1, redirection then goes -> cms-xrd-global.cern.ch ->
>>>> xrootd-redic.pi.infn.it -> madhatter.csc.fi -> server where file is opened ok.
>>>>
>>>> For 2 redirection to xrootd-redic.pi.infn.it doesn't happen and we get
>>>> redirected back to cmsxrootd1.fnal.gov (for both fnal-2 and all unls) which
>>>> then sends us to UCSD where we open the file -- but this is the place we were
>>>> not supposed to come back to.
>>>>
>>>> I assume the real question is why I get redirected back to
>>>> cmsxrootd1.fnal.gov (despite tried=). Another thing ... why don't I get sent
>>>> to Pisa on other connection attempts?
>>>>
>>>> Could it be that cms-xrd-global.cern.ch has:
>>>> a) too short timeouts (but it should be in the cache!);
>>>> b) wrong address for the US peer metamanager (cmsxrootd1.fnal.gov
>>>> instead of
>>>> the DNS alias cmsxrootd.fnal.gov+)?
>>>>
>>>> Matevz
>>>>
>>>>
>>>> [1] The commands that were run:
>>>>
>>>> XRD_NETWORKSTACK=IPv4 xrdcp --debug 3 --force \
>>>> 'root://xrootd.unl.edu:1094//store/mc/Summer12_DR53X/DYJetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/AODSIM/PU_S10_START53_V7A-v1/00000/064C50C4-DA1B-E211-BA43-848F69FD289B.root?hdfs_block_size=134217728&tried=xrootd.t2.ucsd.edu' \
>>>> /dev/null > ~/buf/xrdcp-tried-unl-1.txt 2>&1
>>>>
>>>> XRD_NETWORKSTACK=IPv4 xrdcp --debug 3 --force \
>>>> 'root://cmsxrootd.fnal.gov//store/mc/Summer12_DR53X/DYJetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/AODSIM/PU_S10_START53_V7A-v1/00000/064C50C4-DA1B-E211-BA43-848F69FD289B.root?hdfs_block_size=134217728&tried=xrootd.t2.ucsd.edu' \
>>>> /dev/null > ~/buf/xrdcp-tried-fnal-1.txt 2>&1
>>>>
>>>> XRD_NETWORKSTACK=IPv4 xrdcp --debug 3 --force \
>>>> 'root://xrootd.unl.edu:1094//store/mc/Summer12_DR53X/DYJetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/AODSIM/PU_S10_START53_V7A-v1/00000/064C50C4-DA1B-E211-BA43-848F69FD289B.root?hdfs_block_size=134217728&tried=xrootd.t2.ucsd.edu' \
>>>> /dev/null > ~/buf/xrdcp-tried-unl-2.txt 2>&1
>>>>
>>>> XRD_NETWORKSTACK=IPv4 xrdcp --debug 3 --force \
>>>> 'root://cmsxrootd.fnal.gov//store/mc/Summer12_DR53X/DYJetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/AODSIM/PU_S10_START53_V7A-v1/00000/064C50C4-DA1B-E211-BA43-848F69FD289B.root?hdfs_block_size=134217728&tried=xrootd.t2.ucsd.edu' \
>>>> /dev/null > ~/buf/xrdcp-tried-fnal-2.txt 2>&1
>>>>
>>>> XRD_NETWORKSTACK=IPv4 xrdcp --debug 3 --force \
>>>> 'root://xrootd.unl.edu:1094//store/mc/Summer12_DR53X/DYJetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/AODSIM/PU_S10_START53_V7A-v1/00000/064C50C4-DA1B-E211-BA43-848F69FD289B.root?hdfs_block_size=134217728&tried=xrootd.t2.ucsd.edu' \
>>>> /dev/null > ~/buf/xrdcp-tried-unl-3.txt 2>&1
>>>>
>>>> XRD_NETWORKSTACK=IPv4 xrdcp --debug 3 --force \
>>>> 'root://cmsxrootd.fnal.gov//store/mc/Summer12_DR53X/DYJetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/AODSIM/PU_S10_START53_V7A-v1/00000/064C50C4-DA1B-E211-BA43-848F69FD289B.root?hdfs_block_size=134217728&tried=xrootd.t2.ucsd.edu' \
>>>> /dev/null > ~/buf/xrdcp-tried-fnal-3.txt 2>&1
>>>>
>>>>
>>>>
>>>> On 10/03/14 13:01, Andrew Hanushevsky wrote:
>>>>> On Fri, 3 Oct 2014, Matevz Tadel wrote:
>>>>>
>>>>>>> Could you clean up the log and follow through with all of the
>>>>>>> redirections?
>>>>>>
>>>>>> You want me to run with debug 3 and only grep out redirection and stat/open
>>>>>> messages?
>>>>> Yes, that would give us the request and the response only.
>>>>>
>>>>>>> I still think the client version you are using may be dropping the tried
>>>>>>> history.
>>>>>> OK, I will take the head of master next time, I had 4.0.x-stable now (but
>>>>>> maybe forgot to pull in latest changes).
>>>>> You could try that but according to Lukasz, that should not happen in the new
>>>>> client.
>>>>>
>>>>>> What I've noticed:
>>>>>> 1. If I go to UNL redirector it will send me back to UCSD (v4.0.2).
>>>>>> 2. If I go to FNAL one, it sends me off to EU, as it should (v3.3.3).
>>>>>> 3. If I use the DNS alias for both of those, one of the two happens,
>>>>>> obviously.
>>>>> Odd, there shouldn't be a difference between versions here. Then again, from
>>>>> the above you aren't doing exactly the same thing. If you go to UNL what is
>>>>> the difference between V4 and V3, if any? Same question for FNAL.
>>>>>
>>>>>> Is it possible UNL has the file in cache and tried= gets ignored in this
>>>>>> case?
>>>>> Nope, the tried gets processed before the cache is inspected. So, even if the
>>>>> location has been cached, it is ignored. Now the big difference between V3 and
>>>>> V4 is that if your cluster has two replicated redirectors subscribing to a
>>>>> manager, V3 would treat both as separate entities. In V4, it picks one of the
>>>>> two and only uses that one while the other is held as a hot backup. So, if the
>>>>> one fails it will automatically switch to the other one.
>>>>>
>>>>> Andy
>>>>>
>>>>> ########################################################################
>>>>> Use REPLY-ALL to reply to list
>>>>>
>>>>> To unsubscribe from the XROOTD-DEV list, click the following link:
>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1