ciao Andy!
Frankly, I would have hoped so (that the experiments take care of additional cycling themselves), but this is a clear counterexample:

http://dashb-cms-sum.cern.ch/dashboard/request.py/getMetricResultDetails?hostName=ce08-lcg.cr.cnaf.infn.it&flavour=CREAM-CE&metric=org.cms.WN-xrootd-fallback&timeStamp=2014-04-17T03:59:07Z

The second URL never seems to be tried.

I will also investigate this on the CMS side.

tom


On Fri, Apr 18, 2014 at 4:11 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
Hi Tommaso,
 
Well, I will look into this some more. Yes, TFile (actually, TXNetFile) would be affected, as the cycling is a core feature. Frankly, I don't think this affects CMS since, as far as I understand, they cycle through the redirectors manually themselves in the CMSSW framework. Perhaps that is why I never heard a complaint from them.
 
Andy
 
From: Tommaso Boccali <[log in to unmask]>
Sent: Wednesday, April 16, 2014 11:32 PM
To: Andrew Hanushevsky <[log in to unmask]>
Cc: [log in to unmask]
Subject: Re: problem with aliased redirectors
 
So TFile::Open is broken as well, correct?
And then why is the logic of xrdcopy different?

If the former assumption is correct, we have a real problem.
CMS relies on DNS-aliased redirectors to ensure HA access to the federation, assuming that having ANY one of the N aliased redirectors up would be enough.
If I understand the situation correctly, ANY redirector that is down will instead cause problems for the fraction of sites whose clients keep using it (instead of cycling randomly...).
 
 
thanks for your explanations!
 
tom
 
 


On Thu, Apr 17, 2014 at 8:27 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
Hi Tommaso,

Whether you use xrdcp or TFile::Open does not matter. All of that goes through the same logic. So, what you see in xrdcp is what you should expect from TFile::Open(). Either test will be valid.


Andy

On Thu, 17 Apr 2014, Tommaso Boccali wrote:

Ciao Andrew, thanks a lot.
What worries me is the test I did with

TFile::Open

which seems to be broken as well. Since we use that in CMSSW, I am afraid we
have a problem in the real use case, which is the one that counts.

Could you try that? Unfortunately I have now had to restart the redirector,
since I got MANY failures during the night: a couple of sites are
already using the redundant setup.

tom


On Thu, Apr 17, 2014 at 7:09 AM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Tommaso,

I have checked the new client and it works as expected. I couldn't copy the
file because I am not authorized, but it did find a working redirector.
Oddly enough, it also found things in Germany and the like. I assume you
redirect upstream? If not, then I need to check with Lukasz how he managed
to get to your global redirector.

I agree that the old client is broken. I don't know when that regression
happened as it worked the last time I checked this. The question is
whether we should fix the old client since with release 4 the old client is
deprecated and no one would actually use it even if it got fixed. R4 is due
out in the middle of May.

Andy



On Thu, 17 Apr 2014, Tommaso Boccali wrote:

ciao Andrew!
I think you do not need one at this level, because by choice we have left the
redirectors w/o authentication.

So if you look at a command like

xrdcp -d 10 root://xrootd-cms.infn.it//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root . >& log


(after a voms-proxy-destroy in my case) I still see the usual fixed order

-bash-3.2$ grep ShowUrls log
140417 01:12:15 001 Xrd: ShowUrls: The converted URLs count is 2
140417 01:12:15 001 Xrd: ShowUrls: URL n.1: root://xrootd-redic.pi.infn.it:1094//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root.
140417 01:12:15 001 Xrd: ShowUrls: URL n.2: root://xrootd.ba.infn.it:1094//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root.
140417 01:12:15 001 Xrd: ShowUrls: The converted URLs count is 2
140417 01:12:15 001 Xrd: ShowUrls: URL n.1: root://xrootd-redic.pi.infn.it:1094//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root.
140417 01:12:15 001 Xrd: ShowUrls: URL n.2: root://xrootd.ba.infn.it:1094//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root.
140417 01:12:20 001 Xrd: ShowUrls: The converted URLs count is 2
140417 01:12:20 001 Xrd: ShowUrls: URL n.1: root://xrootd-redic.pi.infn.it:1094//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root.
140417 01:12:20 001 Xrd: ShowUrls: URL n.2: root://xrootd.ba.infn.it:1094//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root.
...

If this did not fail, you would then get an error when trying
to access the real file, but at least in my case I "die" before that.

Of course for this to make sense I need to leave off one of the
redirectors
(xrootd-redic.pi.infn.it).  Also, you can test the same behavior with

xrdcp -d 10 root://xrootd-redic.pi.infn.it,xrootd.ba.infn.it//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root . >& log1 &


again, I get

-bash-3.2$ grep ShowUrls log1
140417 01:15:59 001 Xrd: ShowUrls: The converted URLs count is 2
140417 01:15:59 001 Xrd: ShowUrls: URL n.1: root://xrootd-redic.pi.infn.it:1094//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root.
140417 01:15:59 001 Xrd: ShowUrls: URL n.2: root://xrootd.ba.infn.it:1094//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root.
140417 01:15:59 001 Xrd: ShowUrls: The converted URLs count is 2
140417 01:15:59 001 Xrd: ShowUrls: URL n.1: root://xrootd-redic.pi.infn.it:1094//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root.
140417 01:15:59 001 Xrd: ShowUrls: URL n.2: root://xrootd.ba.infn.it:1094//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root.
140417 01:16:04 001 Xrd: ShowUrls: The converted URLs count is 2
140417 01:16:04 001 Xrd: ShowUrls: URL n.1: root://xrootd-redic.pi.infn.it:1094//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root.
140417 01:16:04 001 Xrd: ShowUrls: URL n.2: root://xrootd.ba.infn.it:1094//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root.
...


I will try to leave the redirector OFF for the night, in case you want to try. I
hope I will not get big side effects  :(

tom





On Thu, Apr 17, 2014 at 12:33 AM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Tommaso,

So I need a certificate to reproduce your test from here? I also can
supply you with access to xrdcopy if you happen to have AFS installed.

Andy

*From:* Tommaso Boccali <[log in to unmask]>
*Sent:* Tuesday, April 15, 2014 10:02 PM
*To:* Andrew Hanushevsky <[log in to unmask]>
*Cc:* [log in to unmask]
*Subject:* Re: problem with aliased redirectors


ciao Andrew!
I have problems checking with xrdcopy: since it is not distributed with
CMS software, I have to find a way. For the moment, here is another hint that
something is not OK in the randomization in xrdcp:

I tried (with xrootd.ba.infn.it ON and xrootd-redic.pi.infn.it OFF)

xrdcp -d 10 root://xrootd-redic.pi.infn.it,xrootd.ba.infn.it//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root .


i.e. putting the list of servers explicitly on the command line.
This always fails (xrootd-redic.pi.infn.it is always tried, 8 times,
and the other is never reached).

Instead

xrdcp -d 10 root://xrootd.ba.infn.it,xrootd-redic.pi.infn.it//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root .


always works at the first attempt.
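
For reference, the comma-separated host list used in the two xrdcp commands appears in the ShowUrls log lines as one URL per host, with port 1094 added. The following is only a minimal Python sketch of that expansion, not the xrdcp implementation; the function name `expand_hosts` is hypothetical, and keeping the hosts in the listed order is an assumption based on the logs in this thread.

```python
from __future__ import annotations

DEFAULT_PORT = 1094  # port that appears in the ShowUrls log lines


def expand_hosts(url: str, default_port: int = DEFAULT_PORT) -> list[str]:
    """Split scheme://h1,h2//path into one URL per host, in the listed order."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError(f"not a URL: {url!r}")
    # The host list ends at the first '/'; the remainder keeps one leading '/'.
    hosts, _, path = rest.partition("/")
    expanded = []
    for host in hosts.split(","):
        host = host.strip()
        if not host:
            continue
        if ":" not in host:
            host = f"{host}:{default_port}"
        expanded.append(f"{scheme}://{host}/{path}")
    return expanded


urls = expand_hosts(
    "root://xrootd-redic.pi.infn.it,xrootd.ba.infn.it//store/data/file.root"
)
for n, u in enumerate(urls, start=1):
    print(f"URL n.{n}: {u}")
```

With the first command's host order, the host that was switched off comes first in the list, which matches the ShowUrls output quoted in this thread.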

In any case, I think we basically care about the behavior of TFile::Open()
from our SW, not direct copy commands.


This for example should not fail:

root [5] TFile* ii = TFile::Open("root://xrootd-redic.pi.infn.it,xrootd.ba.infn.it//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root")

140416 06:58:05 001 Xrd: Connect: can't open connection to [xrootd-redic.pi.infn.it:1094]
140416 06:58:05 001 Xrd: XrdNetFile: Error creating logical connection to xrootd-redic.pi.infn.it:1094
Error in <TXNetFile::CreateXClient>: open attempt failed on root://xrootd-redic.pi.infn.it,xrootd.ba.infn.it//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root

(does not seem to give a second try to the other server)

and this seems even worse:

root [7] TFile* ii = TFile::Open("root://xrootd-cms.infn.it//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root")

140416 06:59:11 001 Xrd: Connect: can't open connection to [xrootd-redic.pi.infn.it:1094]
140416 06:59:11 001 Xrd: XrdNetFile: Error creating logical connection to xrootd-redic.pi.infn.it:1094
Error in <TXNetFile::CreateXClient>: open attempt failed on root://xrootd-cms.infn.it//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root

so not even a second attempt is tried ....

this instead works

root [1] TFile* ii = TFile::Open("root://xrootd.ba.infn.it,xrootd-redic.pi.infn.it//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root")

140416 07:01:18 001 Xrd: GoToAnotherServer: Going to: t2-cms-xrootd01.desy.de:1094
140416 07:01:18 001 Xrd: GoToAnotherServer: Going to: dcache-cms-xrootd.desy.de:1094
140416 07:01:18 001 Xrd: GoToAnotherServer: Going to: 131.169.191.230:20982

tommaso




On Tue, Apr 15, 2014 at 11:08 PM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Tommaso,

DNS round-robin, while it looks good in small-scale tests, rarely works
all that well. The reason is that DNS round-robins whenever a lookup is
made, regardless of the reason for the lookup. With a lot of clients that may
very well lead to suboptimal ordering. So, the xrootd client gets all of
the addresses and uses an algorithm that better spreads the access.
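
As an illustration of the idea only: a client can fetch every address behind the alias and pick its own order, so that the starting host does not depend on the order of the DNS answer. The sketch below uses a plain uniform shuffle, which is an assumption; the algorithm the xrootd client actually uses is not described in this thread. The names `resolve_all` and `client_order` are hypothetical.

```python
import random
import socket


def resolve_all(alias, port=1094):
    """Return every distinct IPv4 address behind a DNS alias (live lookup)."""
    infos = socket.getaddrinfo(alias, port, socket.AF_INET, socket.SOCK_STREAM)
    addresses = []
    for *_, sockaddr in infos:
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses


def client_order(addresses, rng=None):
    """Return a shuffled copy, so the first host tried varies between clients."""
    rng = rng or random.Random()
    order = list(addresses)
    rng.shuffle(order)
    return order


# The two addresses behind xrootd-cms.infn.it, as quoted in this thread.
addrs = ["193.205.76.83", "90.147.66.75"]
print(client_order(addrs))
```

In a real client the input would come from something like `resolve_all("xrootd-cms.infn.it")`; the literal list is used here so the example runs without network access.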

As for why xrdcp didn't go after the second entry, that is mysterious, but I
would say it's a bug. Could you try the same test again but use xrdcopy?
That's the new version of the client.

Andy

*From:* Tommaso Boccali <[log in to unmask]>
*Sent:* Tuesday, April 15, 2014 3:48 AM
*To:* [log in to unmask]
*Subject:* Re: problem with aliased redirectors


As additional info, the DNS seems to do its RR job well: from the same
machine

-bash-3.2$ host xrootd-cms.infn.it
xrootd-cms.infn.it has address 193.205.76.83
xrootd-cms.infn.it has address 90.147.66.75
-bash-3.2$ host xrootd-cms.infn.it
xrootd-cms.infn.it has address 90.147.66.75
xrootd-cms.infn.it has address 193.205.76.83
-bash-3.2$ host xrootd-cms.infn.it
xrootd-cms.infn.it has address 90.147.66.75
xrootd-cms.infn.it has address 193.205.76.83
-bash-3.2$ host xrootd-cms.infn.it
xrootd-cms.infn.it has address 90.147.66.75
xrootd-cms.infn.it has address 193.205.76.83
-bash-3.2$ host xrootd-cms.infn.it
xrootd-cms.infn.it has address 193.205.76.83
xrootd-cms.infn.it has address 90.147.66.75

So the order returned is random each time, in case xrootd needs to
depend on this.
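
A small sketch for checking this more systematically: it tallies which address comes first in each lookup of pasted `host` output. The parsing rule (any line without " has address " starts a new lookup) is an assumption about the output format shown above, and `first_addresses` is a hypothetical helper name.

```python
from __future__ import annotations

from collections import Counter

# The five lookups quoted above, pasted verbatim.
SAMPLE = """\
-bash-3.2$ host xrootd-cms.infn.it
xrootd-cms.infn.it has address 193.205.76.83
xrootd-cms.infn.it has address 90.147.66.75
-bash-3.2$ host xrootd-cms.infn.it
xrootd-cms.infn.it has address 90.147.66.75
xrootd-cms.infn.it has address 193.205.76.83
-bash-3.2$ host xrootd-cms.infn.it
xrootd-cms.infn.it has address 90.147.66.75
xrootd-cms.infn.it has address 193.205.76.83
-bash-3.2$ host xrootd-cms.infn.it
xrootd-cms.infn.it has address 90.147.66.75
xrootd-cms.infn.it has address 193.205.76.83
-bash-3.2$ host xrootd-cms.infn.it
xrootd-cms.infn.it has address 193.205.76.83
xrootd-cms.infn.it has address 90.147.66.75
"""


def first_addresses(host_output: str) -> list[str]:
    """Return the first address reported by each lookup in `host` output."""
    firsts = []
    expecting_first = True
    for line in host_output.splitlines():
        line = line.strip()
        if " has address " in line:
            if expecting_first:
                firsts.append(line.rsplit(" ", 1)[1])
                expecting_first = False
        else:
            # A prompt or blank line separates one lookup from the next.
            expecting_first = True
    return firsts


counts = Counter(first_addresses(SAMPLE))
print(counts)
```

On the five lookups above, 90.147.66.75 comes first three times and 193.205.76.83 twice, so the DNS answer order does vary.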

BUT: inside the xrdcp log, the order always seems to be the same (*).

Is some caching done inside xrdcp that kills the RR?

tom

*:

-bash-3.2$ grep DNS log
140415 12:32:21 001 Xrd: ConvertDNSAlias: resolving xrootd-cms.infn.it
140415 12:32:21 001 Xrd: ConvertDNSAlias: found host xrootd-redic.pi.infn.it with addr 193.205.76.83
140415 12:32:21 001 Xrd: ConvertDNSAlias: found host xrootd.ba.infn.it with addr 90.147.66.75

140415 12:32:21 001 Xrd: ConvertDNSAlias: resolving xrootd-cms.infn.it
140415 12:32:21 001 Xrd: ConvertDNSAlias: found host xrootd-redic.pi.infn.it with addr 193.205.76.83
140415 12:32:21 001 Xrd: ConvertDNSAlias: found host xrootd.ba.infn.it with addr 90.147.66.75

140415 12:32:26 001 Xrd: ConvertDNSAlias: resolving xrootd-cms.infn.it
140415 12:32:26 001 Xrd: ConvertDNSAlias: found host xrootd-redic.pi.infn.it with addr 193.205.76.83
140415 12:32:26 001 Xrd: ConvertDNSAlias: found host xrootd.ba.infn.it with addr 90.147.66.75

140415 12:32:31 001 Xrd: ConvertDNSAlias: resolving xrootd-cms.infn.it
140415 12:32:31 001 Xrd: ConvertDNSAlias: found host xrootd-redic.pi.infn.it with addr 193.205.76.83
140415 12:32:31 001 Xrd: ConvertDNSAlias: found host xrootd.ba.infn.it with addr 90.147.66.75

140415 12:32:36 001 Xrd: ConvertDNSAlias: resolving xrootd-cms.infn.it
140415 12:32:36 001 Xrd: ConvertDNSAlias: found host xrootd-redic.pi.infn.it with addr 193.205.76.83
140415 12:32:36 001 Xrd: ConvertDNSAlias: found host xrootd.ba.infn.it with addr 90.147.66.75

140415 12:32:41 001 Xrd: ConvertDNSAlias: resolving xrootd-cms.infn.it
140415 12:32:41 001 Xrd: ConvertDNSAlias: found host xrootd-redic.pi.infn.it with addr 193.205.76.83
140415 12:32:41 001 Xrd: ConvertDNSAlias: found host xrootd.ba.infn.it with addr 90.147.66.75

140415 12:32:46 001 Xrd: ConvertDNSAlias: resolving xrootd-cms.infn.it
140415 12:32:46 001 Xrd: ConvertDNSAlias: found host xrootd-redic.pi.infn.it with addr 193.205.76.83
140415 12:32:46 001 Xrd: ConvertDNSAlias: found host xrootd.ba.infn.it with addr 90.147.66.75

140415 12:32:51 001 Xrd: ConvertDNSAlias: resolving xrootd-cms.infn.it
140415 12:32:51 001 Xrd: ConvertDNSAlias: found host xrootd-redic.pi.infn.it with addr 193.205.76.83
140415 12:32:51 001 Xrd: ConvertDNSAlias: found host xrootd.ba.infn.it with addr 90.147.66.75

140415 12:32:56 001 Xrd: ConvertDNSAlias: resolving xrootd-cms.infn.it
140415 12:32:56 001 Xrd: ConvertDNSAlias: found host xrootd-redic.pi.infn.it with addr 193.205.76.83
140415 12:32:56 001 Xrd: ConvertDNSAlias: found host xrootd.ba.infn.it with addr 90.147.66.75


On 15 Apr 2014, at 12:33, Tommaso Boccali <[log in to unmask]> wrote:

Ciao,
following a previous discussion, we have set up a DNS-aliased xrootd
redirector,

which is

-bash-3.2$ host xrootd-cms.infn.it
xrootd-cms.infn.it has address 90.147.66.75
xrootd-cms.infn.it has address 193.205.76.83

I was playing with some crash tests, and I do not understand the result.

So: I switched off the redirector 193.205.76.83, while keeping it in
the alias, and I issued

xrdcp -d 10 root://xrootd-cms.infn.it//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root .

I was assuming that the client would recognize the alias and,
if necessary, try a second host when the first was not available.

In the log ( https://www.dropbox.com/s/zmp9uyreqm4qwhg/xrootd.log )
I see that the client does recognize the situation:


140415 12:25:11 001 Xrd: ConvertDNSAlias: found host xrootd-redic.pi.infn.it with addr 193.205.76.83
140415 12:25:11 001 Xrd: ConvertDNSAlias: found host xrootd.ba.infn.it with addr 90.147.66.75

140415 12:25:11 001 Xrd: ShowUrls: The converted URLs count is 2
140415 12:25:11 001 Xrd: ShowUrls: URL n.1: root://xrootd-redic.pi.infn.it:1094//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root.
140415 12:25:11 001 Xrd: ShowUrls: URL n.2: root://xrootd.ba.infn.it:1094//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root.

but then

140415 12:25:46 001 Xrd: Open: Trying to connect to xrootd-redic.pi.infn.it:1094. Connect try 8
140415 12:25:46 001 Xrd: XrdClientConn: Trying to connect to 193.205.76.83:1094
140415 12:25:46 001 Xrd: Connect: Creating a logical connection...
140415 12:25:46 001 Xrd: Connect: Physical connection not found. Creating a new one...
140415 12:25:46 001 Xrd: Connect: Connecting to [xrootd-redic.pi.infn.it:1094]
140415 12:25:46 001 Xrd: ClientSock::TryConnect_low: Trying to connect to xrootd-redic.pi.infn.it(193.205.76.83):1094 Windowsize=0 Timeout=120
140415 12:25:46 001 Xrd: ClientSock::TryConnect_low: Connection to xrootd-redic.pi.infn.it:1094 failed. (-1)
140415 12:25:46 001 Xrd: Connect: can't open connection to [xrootd-redic.pi.infn.it:1094]
140415 12:25:46 001 Xrd: PhyConnection: Disconnecting socket...
140415 12:25:46 001 Xrd: Connect: Connect(xrootd-redic.pi.infn.it, 1094) returned -1
140415 12:25:46 001 Xrd: XrdNetFile: Error creating logical connection to xrootd-redic.pi.infn.it:1094
140415 12:25:46 001 Xrd: Open: Disconnecting.
140415 12:25:46 001 Xrd: Cache: Cache Status --------------------------
140415 12:25:46 001 Xrd: Cache: --------------------------------------
fTotalByteCount = 0
Last server error 10000 ('')
Error accessing path/file for root://xrootd-cms.infn.it//store/data/Run2013A/MinimumBias/RECO/PromptReco-v1/000/212/188/00000/6C246B92-C67B-E211-BE02-003048D2BC62.root

So no attempt is made on the other one. What is wrong here? All in all it
tries 8 times to connect to the SAME server, and 0 times to the other
...
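
To make the expected behavior concrete, here is a minimal Python sketch of a retry loop that rotates through the candidate URLs instead of retrying the same one. It is not the xrootd client code: `open_with_failover` and the `connect` callable are hypothetical, and `max_attempts=8` simply mirrors the "Connect try 8" seen in the log.

```python
def open_with_failover(urls, connect, max_attempts=8):
    """Try the URLs in rotation; return (url, connection) on first success.

    `connect` is any callable that takes a URL and either returns a
    connection object or raises OSError (hypothetical interface).
    """
    if not urls:
        raise ValueError("no URLs to try")
    failures = []
    for attempt in range(max_attempts):
        # Rotate through the list rather than retrying a single host.
        url = urls[attempt % len(urls)]
        try:
            return url, connect(url)
        except OSError as exc:
            failures.append((url, exc))
    raise ConnectionError(
        f"all {len(urls)} URLs failed after {len(failures)} attempts"
    )
```

With two URLs and the first host down, the second attempt already reaches the working host, whereas the log above shows all 8 attempts going to xrootd-redic.pi.infn.it.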


thanks a lot

tom

--
Tommaso Boccali
INFN Pisa

------------------------------


Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1


