On 10/23/14 6:53 PM, Matevz Tadel wrote:
> Andy's reply ...
>
>
> -------- Forwarded Message --------
> Subject: Re: tried= ignored on redirection
> Date: Thu, 23 Oct 2014 17:16:49 -0700
> From: Andrew Hanushevsky <[log in to unmask]>
> Organization: Stanford University/SLAC
> To: Matevz Tadel <[log in to unmask]>
>
> Matevz,
>
> I am not surprised you get back to UCSD. There are two paths to UCSD and the
> tried only identifies one of them in a particular circuit. So, you are
> almost guaranteed to trip over UCSD again unless the file is found
> elsewhere.

Hi Andy,

OK, I give up ... let's talk about this in person :) I still think there's 
something fishy going on in here.

I might end up hacking the client to have the option to avoid dropping the 
original tried= on every redirection (the code and logic that Lukasz pointed me 
to in the beginning of this thread). But I don't think I'll get to it before we 
meet anyway.
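
A minimal sketch of what such an option could do, assuming tried= carries a comma-separated host list in the CGI part of the URL. merge_tried is a hypothetical helper, not a function from the XRootD client, and /store/f.root is a placeholder path. It only illustrates carrying the original list into the URL that comes back with a redirect:

```python
def merge_tried(redirect_url, previously_tried):
    """Return redirect_url with the previously tried hosts kept in tried=.

    Hypothetical helper for illustration only; assumes tried= holds a
    comma-separated list of host names.
    """
    base, _, query = redirect_url.partition("?")
    tried = list(previously_tried)
    other_params = []
    for param in query.split("&"):
        if not param:
            continue
        key, _, value = param.partition("=")
        if key == "tried":
            # Append the hosts named in the redirect, skipping duplicates.
            for host in value.split(","):
                if host and host not in tried:
                    tried.append(host)
        else:
            other_params.append(param)
    if tried:
        other_params.append("tried=" + ",".join(tried))
    return base + "?" + "&".join(other_params) if other_params else base


merged = merge_tried(
    "root://cmsxrootd1.fnal.gov//store/f.root"
    "?hdfs_block_size=134217728&tried=cmsxrootd1.fnal.gov",
    ["xrootd.t2.ucsd.edu"],
)
print(merged)
```

With this, the host from the original request stays in the list after the redirect instead of being replaced by whatever the redirector tacked on.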

Matevz

> Andy
>
> -----Original Message-----
> From: Matevz Tadel
> Sent: Thursday, October 23, 2014 4:26 PM
> To: Andrew Hanushevsky
> Cc: Lukasz Janyst ; xrootd-dev ; Jeff Dost
> Subject: Re: tried= ignored on redirection
>
> On 10/23/14 16:22, Matevz Tadel wrote:
>> Thanks Andy!
>>
>> I'm still confused about why the UNL redirects me back to UCSD for the initial
>> stat (and cmsxrootd1.fnal.gov does not).
>>
>> Retrying the same xrdcps now, I never get redirected back to UCSD for the actual
>> open request -- but this is probably because the file can always be opened in EU
>> now.
>
> Yes, indeed ... if I take a file that is only at UCSD I still get back to UCSD
> despite the initial tried. Shouldn't open fail then?
>
> Did we manage to criss-cross the redirectors, meta-managers and meta-meta
> managers beyond failure? :)
>
> Matevz
>
>> I managed to gather all configs and versions from involved redirectors, they are
>> all here:
>> http://uaf-2.t2.ucsd.edu/~matevz/tmp/xrd-tried/
>>
>> \m
>>
>> On 10/20/14 22:28, Andrew Hanushevsky wrote:
>>>
>>> xrdfed09.cern.ch
>>> xrdfed10.cern.ch
>>>
>>> for 2:
>>> cmsxrootd.fnal.gov is actually two hosts:
>>> cmsxrootd1.fnal.gov
>>> xrootd.unl.edu
>>>
>>> In such cases, to handle "tried" correctly, the client must specify the cluster
>>> ID on the tried, not the actual host that it used, as that still leaves the other
>>> host free to be tried. So, it may look to you as if the "tried" was ignored.
>>> That didn't happen; it was honored, but the other path was free to be used and
>>> was likely chosen.
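
The point about host names versus cluster IDs can be illustrated with a toy model. This is only a sketch under stated assumptions: the two host names are the ones listed above, the cluster ID "us-cms" is invented, and usable_hosts is a hypothetical function, not part of cmsd or the client.

```python
# Host -> cluster ID. The hosts come from the thread; the ID is made up.
CLUSTER_OF = {
    "cmsxrootd1.fnal.gov": "us-cms",
    "xrootd.unl.edu": "us-cms",
}


def usable_hosts(tried):
    """Return hosts that remain eligible after applying the tried list.

    An entry in tried may be a host name or a cluster ID (toy assumption).
    """
    return sorted(
        host
        for host, cluster_id in CLUSTER_OF.items()
        if host not in tried and cluster_id not in tried
    )


# Naming one host leaves the other path into the same cluster open.
print(usable_hosts(["cmsxrootd1.fnal.gov"]))
# Naming the cluster ID closes both paths.
print(usable_hosts(["us-cms"]))
```

In this model a tried entry naming a single host is honored, yet the request can still reach the same cluster through the second host.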
>>>
>>> The whole idea of using cluster ID is good, but from the client's perspective it
>>> is problematic, as the client needs to ask one of the servers (another
>>> interaction) what its cluster ID is and use that in the "tried" string. Servers
>>> will actually use their cluster ID when they need to tack on a tried on a static
>>> redirect, which is why that works the same way every time.
>>>
>>> Now, that is the case in all releases prior to 4.x. In 4.x we realized that this
>>> would be a problem and the cmsd resolves the cluster ID ahead of time. If two
>>> hosts have the same cluster ID then one is considered the primary and the other
>>> is considered the backup (this extends to 3 and so on). This makes it impossible
>>> for the "tried" with a host name to get back into a cluster when you are trying
>>> to ignore the cluster.
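
A rough sketch of the 4.x behaviour described above, not the cmsd implementation. It assumes the first host to subscribe with a given cluster ID becomes the primary (the thread does not say how the primary is picked); the cluster IDs and the example.org host are placeholders.

```python
def group_by_cluster(subscribers):
    """Group (host, cluster_id) pairs; first host per cluster is the primary."""
    clusters = {}
    for host, cluster_id in subscribers:
        clusters.setdefault(cluster_id, []).append(host)
    return clusters


def select_targets(clusters, tried):
    """Return one primary per cluster, skipping clusters with a tried host."""
    targets = []
    for hosts in clusters.values():
        if any(host in tried for host in hosts):
            # A tried host name rules out its whole cluster, backups included.
            continue
        targets.append(hosts[0])
    return targets


subscribers = [
    ("cmsxrootd1.fnal.gov", "cluster-a"),      # primary (assumed: first seen)
    ("xrootd.unl.edu", "cluster-a"),           # backup
    ("redirector.example.org", "cluster-b"),   # placeholder host
]
clusters = group_by_cluster(subscribers)
print(select_targets(clusters, set()))
print(select_targets(clusters, {"xrootd.unl.edu"}))
```

Because only the primary is ever returned, the second host cannot act as an alternative path into a cluster that the request is trying to avoid.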
>>>
>>> While this solves the "tried" problem, it does have side-effects. It means that
>>> only one of n redirectors will always be used for all requests and we won't
>>> switch to another one unless that one fails. In many ways that's good because it
>>> dramatically cuts down on duplicate queries. The alternative would be to resolve
>>> a host name to the group of hosts that are actually the "same" and automatically
>>> exclude the group. However, that would increase the duplicate queries and we
>>> made the trade-off that duplicate queries were worse than always using the same
>>> redirector until it failed.
>>>
>>> Andy
>>>
>>> On Mon, 20 Oct 2014, Matevz Tadel wrote:
>>>
>>>> Hi Andy,
>>>>
>>>> I guess this scrolled off the context window :) Do these logs help? Any ideas
>>>> what I should still try?
>>>>
>>>> I can try getting config files from all cms meta managers ... I guess this
>>>> would come in handy in any case :)
>>>>
>>>> Cheers,
>>>> Matevz
>>>>
>>>> On 10/10/14 14:20, Matevz Tadel wrote:
>>>>> I ran the same xrdcp to UNL and FNAL, 3 times each, all within a span of a
>>>>> couple of minutes [1]. Here are the logs (.txt) and results of
>>>>> grep -e kXR_stat -e kXR_open -e kXR_redirect (.grep):
>>>>>
>>>>>    http://uaf-2.t2.ucsd.edu/~matevz/tmp/xrd-tried/
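
For reference, the same filtering can be done in Python. This is a small sketch equivalent to the grep above; the sample lines are invented placeholders and do not reproduce the real xrdcp debug output format.

```python
KEYWORDS = ("kXR_stat", "kXR_open", "kXR_redirect")


def filter_log(lines):
    """Keep only lines that mention a stat, open or redirect message."""
    return [line for line in lines if any(key in line for key in KEYWORDS)]


# Made-up sample lines, only to show the filter at work.
sample = [
    "sending kXR_stat request",
    "socket connected",
    "got kXR_redirect response",
    "sending kXR_open request",
]
for line in filter_log(sample):
    print(line)
```

To process a saved log, pass the open file object to filter_log, since iterating over a file yields its lines.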
>>>>>
>>>>> Observations:
>>>>>
>>>>> The initial stat has two modes:
>>>>>     1. it fails in fnal-1 and fnal-3;
>>>>>     2. it is redirected back to UCSD for fnal-2 and all unls.
>>>>> I find it really strange fnal-2 is different than 1 and 3 in this respect.
>>>>>
>>>>> For 1, redirection then goes -> cms-xrd-global.cern.ch ->
>>>>> xrootd-redic.pi.infn.it -> madhatter.csc.fi -> server where file is
>>>>> opened ok.
>>>>>
>>>>> For 2 redirection to xrootd-redic.pi.infn.it doesn't happen and we get
>>>>> redirected back to cmsxrootd1.fnal.gov (for both fnal-2 and all unls) which
>>>>> then sends us to UCSD where we open the file -- but this is the place we were
>>>>> not supposed to come back to.
>>>>>
>>>>> I assume the real question is why I get redirected back to
>>>>> cmsxrootd1.fnal.gov (despite tried=). Another thing ... why don't I get sent
>>>>> to Pisa on other connection attempts?
>>>>>
>>>>> Could it be that cms-xrd-global.cern.ch has:
>>>>> a) too short timeouts (but it should be in the cache!);
>>>>> b) wrong address for the US peer metamanager (cmsxrootd1.fnal.gov instead of
>>>>> the DNS alias cmsxrootd.fnal.gov+)?
>>>>>
>>>>> Matevz
>>>>>
>>>>>
>>>>> [1] The commands that were run:
>>>>>
>>>>> XRD_NETWORKSTACK=IPv4 xrdcp --debug 3 --force \
>>>>>   'root://xrootd.unl.edu:1094//store/mc/Summer12_DR53X/DYJetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/AODSIM/PU_S10_START53_V7A-v1/00000/064C50C4-DA1B-E211-BA43-848F69FD289B.root?hdfs_block_size=134217728&tried=xrootd.t2.ucsd.edu' \
>>>>>   /dev/null > ~/buf/xrdcp-tried-unl-1.txt 2>&1
>>>>>
>>>>> XRD_NETWORKSTACK=IPv4 xrdcp --debug 3 --force \
>>>>>   'root://cmsxrootd.fnal.gov//store/mc/Summer12_DR53X/DYJetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/AODSIM/PU_S10_START53_V7A-v1/00000/064C50C4-DA1B-E211-BA43-848F69FD289B.root?hdfs_block_size=134217728&tried=xrootd.t2.ucsd.edu' \
>>>>>   /dev/null > ~/buf/xrdcp-tried-fnal-1.txt 2>&1
>>>>>
>>>>> XRD_NETWORKSTACK=IPv4 xrdcp --debug 3 --force \
>>>>>   'root://xrootd.unl.edu:1094//store/mc/Summer12_DR53X/DYJetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/AODSIM/PU_S10_START53_V7A-v1/00000/064C50C4-DA1B-E211-BA43-848F69FD289B.root?hdfs_block_size=134217728&tried=xrootd.t2.ucsd.edu' \
>>>>>   /dev/null > ~/buf/xrdcp-tried-unl-2.txt 2>&1
>>>>>
>>>>> XRD_NETWORKSTACK=IPv4 xrdcp --debug 3 --force \
>>>>>   'root://cmsxrootd.fnal.gov//store/mc/Summer12_DR53X/DYJetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/AODSIM/PU_S10_START53_V7A-v1/00000/064C50C4-DA1B-E211-BA43-848F69FD289B.root?hdfs_block_size=134217728&tried=xrootd.t2.ucsd.edu' \
>>>>>   /dev/null > ~/buf/xrdcp-tried-fnal-2.txt 2>&1
>>>>>
>>>>> XRD_NETWORKSTACK=IPv4 xrdcp --debug 3 --force \
>>>>>   'root://xrootd.unl.edu:1094//store/mc/Summer12_DR53X/DYJetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/AODSIM/PU_S10_START53_V7A-v1/00000/064C50C4-DA1B-E211-BA43-848F69FD289B.root?hdfs_block_size=134217728&tried=xrootd.t2.ucsd.edu' \
>>>>>   /dev/null > ~/buf/xrdcp-tried-unl-3.txt 2>&1
>>>>>
>>>>> XRD_NETWORKSTACK=IPv4 xrdcp --debug 3 --force \
>>>>>   'root://cmsxrootd.fnal.gov//store/mc/Summer12_DR53X/DYJetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/AODSIM/PU_S10_START53_V7A-v1/00000/064C50C4-DA1B-E211-BA43-848F69FD289B.root?hdfs_block_size=134217728&tried=xrootd.t2.ucsd.edu' \
>>>>>   /dev/null > ~/buf/xrdcp-tried-fnal-3.txt 2>&1
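
Since the six commands differ only in the redirector and the run number, they could also be generated. The sketch below rebuilds the same argument lists and log names; build_runs and run_all are hypothetical helper names, and run_all (which would actually invoke xrdcp) is defined but not called here.

```python
import os
import subprocess

FILE_PATH = (
    "/store/mc/Summer12_DR53X/DYJetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/"
    "AODSIM/PU_S10_START53_V7A-v1/00000/064C50C4-DA1B-E211-BA43-848F69FD289B.root"
)
CGI = "hdfs_block_size=134217728&tried=xrootd.t2.ucsd.edu"
REDIRECTORS = {"unl": "xrootd.unl.edu:1094", "fnal": "cmsxrootd.fnal.gov"}


def build_runs(repeats=3):
    """Return (argv, log_name) pairs in the order unl-1, fnal-1, unl-2, ..."""
    runs = []
    for run in range(1, repeats + 1):
        for label, host in REDIRECTORS.items():
            url = f"root://{host}/{FILE_PATH}?{CGI}"
            argv = ["xrdcp", "--debug", "3", "--force", url, "/dev/null"]
            runs.append((argv, f"~/buf/xrdcp-tried-{label}-{run}.txt"))
    return runs


def run_all(runs):
    """Run each command with stdout and stderr captured in its log file."""
    env = dict(os.environ, XRD_NETWORKSTACK="IPv4")
    for argv, log_name in runs:
        with open(os.path.expanduser(log_name), "w") as log:
            subprocess.run(argv, env=env, stdout=log, stderr=subprocess.STDOUT)


runs = build_runs()
for argv, log_name in runs:
    print(log_name, argv[4])
```

Passing the URL as a single list element avoids the shell quoting that the ampersand in the CGI string otherwise requires.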
>>>>>
>>>>>
>>>>>
>>>>> On 10/03/14 13:01, Andrew Hanushevsky wrote:
>>>>>> On Fri, 3 Oct 2014, Matevz Tadel wrote:
>>>>>>
>>>>>>>> Could you clean up the log and follow through with all of the
>>>>>>>> redirections?
>>>>>>>
>>>>>>> You want me to run with debug 3 and only grep out redirection and stat/open
>>>>>>> messages?
>>>>>> Yes, that would give us the request and the response only.
>>>>>>
>>>>>>>> I still think the client version you are using may be dropping the tried
>>>>>>>> history.
>>>>>>> OK, I will take the head of master next time, I had 4.0.x-stable now (but
>>>>>>> maybe forgot to pull in latest changes).
>>>>>> You could try that but according to Lukasz, that should not happen in the new
>>>>>> client.
>>>>>>
>>>>>>> What I've noticed:
>>>>>>> 1. If I go to UNL redirector it will send me back to UCSD (v4.0.2).
>>>>>>> 2. If I go to FNAL one, it sends me off to EU, as it should (v3.3.3).
>>>>>>> 3. If I use the DNS alias for both of those, one of the two happens,
>>>>>>> obviously.
>>>>>> Odd, there shouldn't be a difference between versions here. Then again, from
>>>>>> the above you aren't doing exactly the same thing. If you go to UNL, what is
>>>>>> the difference between V4 and V3, if any? Same question for FNAL.
>>>>>>
>>>>>>> Is it possible UNL has the file in cache and tried= gets ignored in this
>>>>>>> case?
>>>>>> Nope, the tried gets processed before the cache is inspected. So, even if the
>>>>>> location has been cached, it is ignored. Now the big difference between V3 and
>>>>>> V4 is that if your cluster has two replicated redirectors subscribing to a
>>>>>> manager, V3 would treat both as separate entities. In V4, it picks one of the
>>>>>> two and only uses that one while the other is held as a hot backup. So, if the
>>>>>> one fails it will automatically switch to the other one.
>>>>>> Andy
>>>>>>
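
The statement that tried is applied before the cache is consulted can be pictured with a toy lookup. This is a sketch only: locate, the cache dictionary and the holders table are hypothetical and do not describe cmsd internals, and other.example.org is a placeholder host.

```python
def locate(path, tried, cache, holders):
    """Toy lookup: a cached location is used only if it is not in tried."""
    cached = cache.get(path)
    if cached is not None and cached not in tried:
        return cached
    # Cached entry missing or excluded: fall back to any other holder.
    for host in holders.get(path, []):
        if host not in tried:
            return host
    return None  # nothing eligible is left


cache = {"/f": "xrootd.t2.ucsd.edu"}
holders = {"/f": ["xrootd.t2.ucsd.edu", "other.example.org"]}
print(locate("/f", {"xrootd.t2.ucsd.edu"}, cache, holders))
print(locate("/f", set(), cache, holders))
```

In this model a cached UCSD location never overrides a tried entry naming UCSD, and if UCSD were the only holder the lookup would return nothing.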
>>>>>> ########################################################################
>>>>>> Use REPLY-ALL to reply to list
>>>>>>
>>>>>> To unsubscribe from the XROOTD-DEV list, click the following link:
>>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1