On 10/23/14 6:53 PM, Matevz Tadel wrote:
> Andy's reply ...
>
> -------- Forwarded Message --------
> Subject: Re: tried= ignored on redirection
> Date: Thu, 23 Oct 2014 17:16:49 -0700
> From: Andrew Hanushevsky
> Organization: Stanford University/SLAC
> To: Matevz Tadel
>
> Matevz,
>
> I am not surprised you get back to UCSD. There are two paths to UCSD and the
> tried only identifies one of them in a particular circuit. So, you are
> almost guaranteed to trip over UCSD again unless the file is found
> elsewhere.

Hi Andy,

OK, I give up ... let's talk about this in person :)

I still think there's something fishy going on in here. I might end up hacking
the client to have the option to avoid dropping the original tried= on every
redirection (the code and logic that Lukasz pointed me to in the beginning of
this thread). But I don't think I'll get to it before we meet anyway.

Matevz

> Andy
>
> -----Original Message-----
> From: Matevz Tadel
> Sent: Thursday, October 23, 2014 4:26 PM
> To: Andrew Hanushevsky
> Cc: Lukasz Janyst; xrootd-dev; Jeff Dost
> Subject: Re: tried= ignored on redirection
>
> On 10/23/14 16:22, Matevz Tadel wrote:
>> Thanks Andy!
>>
>> I'm still confused about why the UNL redirects me back to UCSD for the
>> initial stat (and cmsxrootd1.fnal.gov does not).
>>
>> Retrying the same xrdcps now, I never get redirected back to UCSD for the
>> actual open request -- but this is probably because the file can always be
>> opened in EU now.
>
> Yes, indeed ... if I take a file that is only at UCSD I still get back to
> UCSD despite the initial tried. Shouldn't open fail then?
>
> Did we manage to criss-cross the redirectors, meta-managers and meta-meta
> managers beyond failure?
> :)
>
> Matevz
>
>> I managed to gather all configs and versions from involved redirectors,
>> they are all here:
>> http://uaf-2.t2.ucsd.edu/~matevz/tmp/xrd-tried/
>>
>> \m
>>
>> On 10/20/14 22:28, Andrew Hanushevsky wrote:
>>>
>>> xrdfed09.cern.ch
>>> xrdfed10.cern.ch
>>>
>>> for 2:
>>> cmsxrootd.fnal.gov is actually two hosts:
>>> cmsxrootd1.fnal.gov
>>> xrootd.unl.edu
>>>
>>> In such cases, to handle 'tried' correctly, the client must specify the
>>> cluster ID on the tried, not the actual host that it used, as that still
>>> leaves the other host free to be tried. So, it may look to you as if the
>>> 'tried' was ignored. That didn't happen; it was honored but the other path
>>> was free to be used and likely chosen.
>>>
>>> The whole idea of using cluster ID is good but from the client's
>>> perspective it is problematic as the client needs to ask one of the
>>> servers (another interaction) what its cluster ID is and use that in the
>>> 'tried' string. Servers will actually use their cluster ID when they need
>>> to tack on a tried on a static redirect, which is why that works the same
>>> way every time.
>>>
>>> Now, that is the case in all releases prior to 4.x. In 4.x we realized
>>> that this would be a problem and the cmsd resolves the cluster ID ahead of
>>> time. If two hosts have the same cluster ID then one is considered the
>>> primary and the other is considered the backup (this extends to 3 and so
>>> on). This makes it impossible for the 'tried' with a host name to get back
>>> into a cluster when you are trying to ignore the cluster.
>>>
>>> While this solves the 'tried' problem, it does have side-effects. It
>>> means that only one of n redirectors will always be used for all requests
>>> and we won't switch to another one unless that one fails. In many ways
>>> that's good because it dramatically cuts down on duplicate queries.
>>> The alternative would be to resolve a host name to the group of hosts
>>> that are actually the 'same' and automatically exclude the group. However,
>>> that would increase the duplicate queries and we made the trade-off that
>>> duplicate queries were worse than always using the same redirector until
>>> it failed.
>>>
>>> Andy
>>>
>>> On Mon, 20 Oct 2014, Matevz Tadel wrote:
>>>
>>>> Hi Andy,
>>>>
>>>> I guess this scrolled off the context window :) Do these logs help? Any
>>>> ideas what I should still try?
>>>>
>>>> I can try getting config files from all cms meta managers ... I guess
>>>> this would come in handy in any case :)
>>>>
>>>> Cheers,
>>>> Matevz
>>>>
>>>> On 10/10/14 14:20, Matevz Tadel wrote:
>>>>> I ran the same xrdcp to UNL and FNAL, 3 times each, all within a span
>>>>> of a couple minutes [1]. Here are the logs (.txt) and results of
>>>>> grep -e kXR_stat -e kXR_open -e kXR_redirect (.grep):
>>>>>
>>>>> http://uaf-2.t2.ucsd.edu/~matevz/tmp/xrd-tried/
>>>>>
>>>>> Observations:
>>>>>
>>>>> The initial stat has two modes:
>>>>> 1. it fails in fnal-1 and fnal-3;
>>>>> 2. it is redirected back to UCSD for fnal-2 and all unls.
>>>>> I find it really strange fnal-2 is different than 1 and 3 in this
>>>>> respect.
>>>>>
>>>>> For 1, redirection then goes -> cms-xrd-global.cern.ch ->
>>>>> xrootd-redic.pi.infn.it -> madhatter.csc.fi -> server where file is
>>>>> opened ok.
>>>>>
>>>>> For 2 redirection to xrootd-redic.pi.infn.it doesn't happen and we get
>>>>> redirected back to cmsxrootd1.fnal.gov (for both fnal-2 and all unls)
>>>>> which then sends us to UCSD where we open the file -- but this is the
>>>>> place we were not supposed to come back to.
>>>>>
>>>>> I assume the real question is why I get redirected back to
>>>>> cmsxrootd1.fnal.gov (despite tried=). Another thing ... why don't I get
>>>>> sent to Pisa on other connection attempts?
>>>>>
>>>>> Could it be that cms-xrd-global.cern.ch has:
>>>>> a) too short timeouts (but it should be in the cache!);
>>>>> b) wrong address for the US peer metamanager (cmsxrootd1.fnal.gov
>>>>>    instead of the DNS alias cmsxrootd.fnal.gov+)?
>>>>>
>>>>> Matevz
>>>>>
>>>>>
>>>>> [1] The commands that were run:
>>>>>
>>>>> XRD_NETWORKSTACK=IPv4 xrdcp --debug 3 --force
>>>>> 'root://xrootd.unl.edu:1094//store/mc/Summer12_DR53X/DYJetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/AODSIM/PU_S10_START53_V7A-v1/00000/064C50C4-DA1B-E211-BA43-848F69FD289B.root?hdfs_block_size=134217728&tried=xrootd.t2.ucsd.edu'
>>>>> /dev/null > ~/buf/xrdcp-tried-unl-1.txt 2>&1
>>>>>
>>>>> XRD_NETWORKSTACK=IPv4 xrdcp --debug 3 --force
>>>>> 'root://cmsxrootd.fnal.gov//store/mc/Summer12_DR53X/DYJetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/AODSIM/PU_S10_START53_V7A-v1/00000/064C50C4-DA1B-E211-BA43-848F69FD289B.root?hdfs_block_size=134217728&tried=xrootd.t2.ucsd.edu'
>>>>> /dev/null > ~/buf/xrdcp-tried-fnal-1.txt 2>&1
>>>>>
>>>>> XRD_NETWORKSTACK=IPv4 xrdcp --debug 3 --force
>>>>> 'root://xrootd.unl.edu:1094//store/mc/Summer12_DR53X/DYJetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/AODSIM/PU_S10_START53_V7A-v1/00000/064C50C4-DA1B-E211-BA43-848F69FD289B.root?hdfs_block_size=134217728&tried=xrootd.t2.ucsd.edu'
>>>>> /dev/null > ~/buf/xrdcp-tried-unl-2.txt 2>&1
>>>>>
>>>>> XRD_NETWORKSTACK=IPv4 xrdcp --debug 3 --force
>>>>> 'root://cmsxrootd.fnal.gov//store/mc/Summer12_DR53X/DYJetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/AODSIM/PU_S10_START53_V7A-v1/00000/064C50C4-DA1B-E211-BA43-848F69FD289B.root?hdfs_block_size=134217728&tried=xrootd.t2.ucsd.edu'
>>>>> /dev/null > ~/buf/xrdcp-tried-fnal-2.txt 2>&1
>>>>>
>>>>> XRD_NETWORKSTACK=IPv4 xrdcp --debug 3 --force
>>>>> 'root://xrootd.unl.edu:1094//store/mc/Summer12_DR53X/DYJetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/AODSIM/PU_S10_START53_V7A-v1/00000/064C50C4-DA1B-E211-BA43-848F69FD289B.root?hdfs_block_size=134217728&tried=xrootd.t2.ucsd.edu'
>>>>> /dev/null > ~/buf/xrdcp-tried-unl-3.txt 2>&1
>>>>>
>>>>> XRD_NETWORKSTACK=IPv4 xrdcp --debug 3 --force
>>>>> 'root://cmsxrootd.fnal.gov//store/mc/Summer12_DR53X/DYJetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/AODSIM/PU_S10_START53_V7A-v1/00000/064C50C4-DA1B-E211-BA43-848F69FD289B.root?hdfs_block_size=134217728&tried=xrootd.t2.ucsd.edu'
>>>>> /dev/null > ~/buf/xrdcp-tried-fnal-3.txt 2>&1
>>>>>
>>>>>
>>>>> On 10/03/14 13:01, Andrew Hanushevsky wrote:
>>>>>> On Fri, 3 Oct 2014, Matevz Tadel wrote:
>>>>>>
>>>>>>>> Could you clean up the log and follow through with all of the
>>>>>>>> redirections?
>>>>>>>
>>>>>>> You want me to run with debug 3 and only grep out redirection and
>>>>>>> stat/open messages?
>>>>>> Yes, that would give us the request and the response only.
>>>>>>
>>>>>>>> I still think the client version you are using may be dropping the
>>>>>>>> tried history.
>>>>>>> OK, I will take the head of master next time, I had 4.0.x-stable now
>>>>>>> (but maybe forgot to pull in latest changes).
>>>>>> You could try that but according to Lukasz, that should not happen in
>>>>>> the new client.
>>>>>>
>>>>>>> What I've noticed:
>>>>>>> 1. If I go to UNL redirector it will send me back to UCSD (v4.0.2).
>>>>>>> 2. If I go to FNAL one, it sends me off to EU, as it should (v3.3.3).
>>>>>>> 3. If I use the DNS alias for both of those, one of the two happens,
>>>>>>> obviously.
>>>>>> Odd, there shouldn't be a difference between versions here. Then again,
>>>>>> from the above you aren't doing exactly the same thing. If you go to UNL
>>>>>> what is the difference between V4 and V3, if any? Same question for FNAL.
>>>>>>
>>>>>>> Is it possible UNL has the file in cache and tried= gets ignored in
>>>>>>> this case?
>>>>>> Nope, the tried gets processed before the cache is inspected. So, even
>>>>>> if the location has been cached, it is ignored. Now the big difference
>>>>>> between V3 and V4 is that if your cluster has two replicated redirectors
>>>>>> subscribing to a manager, V3 would treat both as separate entities. In
>>>>>> V4, it picks one of the two and only uses that one while the other is
>>>>>> held as a hot backup. So, if the one fails it will automatically switch
>>>>>> to the other one.
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> ########################################################################
>>>>>> Use REPLY-ALL to reply to list
>>>>>>
>>>>>> To unsubscribe from the XROOTD-DEV list, click the following link:
>>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1
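
For reference, here is a minimal Python sketch of the host-name versus cluster-ID behaviour Andy describes in this thread. It is only a toy model under stated assumptions, not the cmsd's actual selection code: the Redirector class, the eligible_pre4 and eligible_v4 functions, the cluster ID labels, and the idea that these three hosts all subscribe to one manager are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Redirector:
    host: str
    cluster_id: str  # hypothetical label; not a real XRootD cluster ID


def eligible_pre4(subscribers, tried):
    """Pre-4.x model: every subscribed host is its own entity, so a tried
    host name excludes only that exact host."""
    return [r.host for r in subscribers if r.host not in tried]


def eligible_v4(subscribers, tried):
    """4.x model: hosts sharing a cluster ID collapse into one entry (the
    first one seen acts as primary, the rest as backups), and a tried host
    name excludes its whole cluster."""
    tried_clusters = {r.cluster_id for r in subscribers if r.host in tried}
    primaries = {}
    for r in subscribers:
        # Keep only the first host per cluster ID as the primary.
        primaries.setdefault(r.cluster_id, r.host)
    return [h for cid, h in primaries.items() if cid not in tried_clusters]


# Illustrative topology only: two hosts behind one cluster, one elsewhere.
subscribers = [
    Redirector("cmsxrootd1.fnal.gov", "us-cluster"),
    Redirector("xrootd.unl.edu", "us-cluster"),
    Redirector("xrootd-redic.pi.infn.it", "eu-cluster"),
]
tried = {"cmsxrootd1.fnal.gov"}

print(eligible_pre4(subscribers, tried))
print(eligible_v4(subscribers, tried))
```

In the pre-4.x model the second host of the same cluster stays eligible, which is how a tried= can look ignored even though it was honored. In the 4.x model the whole cluster drops out, at the cost of always using one primary per cluster.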