Hi Pavel,

  the "problem" is the DNS round robin (RR). With RR the client receives 
only one IP address and keeps it. Instead of using RR at the DNS level, 
you should just create an alias and make sure that the DNS returns all 
the addresses behind it when asked to resolve the name.

  For example, this is the output of nslookup in the case of the 
redirectors at SLAC:

fabrizio@bradipo 10:01:26 ~>nslookup kanolb-a.slac.stanford.edu
Server:         192.84.143.16
Address:        192.84.143.16#53

Non-authoritative answer:
Name:   kanolb-a.slac.stanford.edu
Address: 134.79.85.23
Name:   kanolb-a.slac.stanford.edu
Address: 134.79.85.24


  as you can see, both IPs are returned to the client at the same time, 
and they will be considered during the connection phase.
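
  To make the idea concrete, here is a minimal sketch in Python of what 
"considering all the returned addresses" can look like. It is only an 
illustration and not the XrdClient code: the function names, the 
timeout, and the injectable resolver/connector parameters are 
assumptions made for this example.

```python
import socket


def resolve_all(host, port, resolver=socket.getaddrinfo):
    """Return every distinct IPv4 address the resolver reports, in order."""
    infos = resolver(host, port, socket.AF_INET, socket.SOCK_STREAM)
    addrs = []
    for _family, _socktype, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addrs:  # the resolver may list an address more than once
            addrs.append(ip)
    return addrs


def connect_first_available(host, port, timeout=5.0,
                            resolver=socket.getaddrinfo,
                            connector=socket.create_connection):
    """Try each resolved address in turn; return (ip, socket) for the first success."""
    last_error = None
    for ip in resolve_all(host, port, resolver):
        try:
            return ip, connector((ip, port), timeout=timeout)
        except OSError as err:
            # This host is down or unreachable: fall through to the next address.
            last_error = err
    raise ConnectionError(
        f"no address for {host}:{port} accepted a connection"
    ) from last_error
```

  This only helps if the name server hands back the full address list in 
one answer, as in the nslookup output above. With a round robin that 
returns a single address per query, the loop has nothing to fall back on.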

  Fabrizio

Pavel Jakl wrote:
> Hi Fabrizio and Andy,
> 
> I am not sure whether we discussed this before, but let me explain my 
> problem. When Andy implemented multiple redirectors for clusters 
> bigger than 64 servers, I thought that it would give us full 
> recoverability in case something happened to the host acting as 
> redirector.
> I did a few tests and found that the client is not ready for that. Maybe 
> I am doing something wrong, so let me explain.
> We have a DNS round-robin (RR) entry containing 2 servers configured as 
> full redirectors and managers of the cluster.
> [The example: DNS RR - xrdstar.rcf.bnl.gov and 2 redirectors (rcas6132, 
> rcas6182)]
> 
> The problem is that the client initially resolves one of the 
> redirectors, but if that particular host is down, the client doesn't try 
> to connect to the second redirector. It doesn't even keep track of the 
> servers that are available under the DNS RR. I don't mind the client 
> trying to connect a fixed number of times, but when it is not 
> successful, it should move on to the other server under the DNS RR.
> 
> As you can see in the example, it resolves rcas6132, which was 
> temporarily down, but doesn't try the second one, rcas6182... I am not 
> sure how you handle this at SLAC. What am I doing wrong?
> 
> Thanks
> Pavel
> 
> CINT/ROOT C/C++ Interpreter version 5.16.13, June 8, 2006
> Type ? for help. Commands must be C++ statements.
> Enclose multiple statements between { }.
> *** Float Point Exception is OFF ***
> *** Start at Date : Sun Apr 22 08:07:06 2007
> QAInfo:You are using STAR_LEVEL : dev, ROOT_LEVEL : 5.12.00 and node : 
> rcas6009.rcf.bnl.gov
> root4star [0]
> Processing XROOTD_macro.C...
> 070422 08:07:06 001 Xrd: Create: (C) 2004 SLAC INFN XrdClient 
> kXR_ver002+kXR_asyncap
> 070422 08:07:06 001 Xrd: TakeUrl: parsing url:
> 070422 08:07:06 001 Xrd: GetDomainToMatch: 
> GetHostName(rcas6009.rcf.bnl.gov) returned name=rcas6009.rcf.bnl.gov
> 070422 08:07:06 001 Xrd: GetDomainToMatch: 
> GetDomain(rcas6009.rcf.bnl.gov) --> rcf.bnl.gov
> 070422 08:07:06 001 Xrd: XrdClientUrlSet: parsing: 
> root://xrdstar.rcf.bnl.gov:1097//data1/reco/productionCentral/FullField/P05ic/2004/053/st_physics_5053078_raw_3020016.MuDst.root 
> 
> 070422 08:07:06 001 Xrd: XrdClientUrlSet: protocol: root
> 070422 08:07:06 001 Xrd: XrdClientUrlSet: file: 
> /data1/reco/productionCentral/FullField/P05ic/2004/053/st_physics_5053078_raw_3020016.MuDst.root 
> 
> 070422 08:07:06 001 Xrd: XrdClientUrlSet: list of [host:port] : 
> xrdstar.rcf.bnl.gov:1097
> 070422 08:07:06 001 Xrd: XrdClientUrlSet: Remote file to open is 
> '/data1/reco/productionCentral/FullField/P05ic/2004/053/st_physics_5053078_raw_3020016.MuDst.root' 
> 
> 070422 08:07:06 001 Xrd: XrdClientUrlSet: parsing entity: 
> xrdstar.rcf.bnl.gov:1097
> 070422 08:07:06 001 Xrd: TakeUrl: parsing url: xrdstar.rcf.bnl.gov:1097
> 070422 08:07:06 001 Xrd: TakeUrl:    HostWPort:   xrdstar.rcf.bnl.gov:1097
> 070422 08:07:06 001 Xrd: TakeUrl:    File:   /
> 070422 08:07:06 001 Xrd: TakeUrl:    Host:   xrdstar.rcf.bnl.gov
> 070422 08:07:06 001 Xrd: TakeUrl:    Port:   1097
> 070422 08:07:06 001 Xrd: ConvertDNSAlias: resolving 
> xrdstar.rcf.bnl.gov:1097
> 070422 08:07:06 001 Xrd: CheckPort: specified port (1097) potentially 
> valid.
> 070422 08:07:06 001 Xrd: ConvertDNSAlias: found host 
> rcas6132.rcf.bnl.gov with addr 130.199.206.182
> 070422 08:07:06 001 Xrd: ShowUrls: The converted URLs count is 1
> 070422 08:07:06 001 Xrd: ShowUrls: URL n.1: 
> root://rcas6132.rcf.bnl.gov:1097//data1/reco/productionCentral/FullField/P05ic/2004/053/st_physics_5053078_raw_3020016.MuDst.root. 
> 
> 070422 08:07:06 001 Xrd: GetDomainToMatch: 
> GetHostName(rcas6132.rcf.bnl.gov) returned name=rcas6132.rcf.bnl.gov
> 070422 08:07:06 001 Xrd: GetDomainToMatch: 
> GetDomain(rcas6132.rcf.bnl.gov) --> rcf.bnl.gov
> 070422 08:07:06 001 Xrd: CheckHostDomain: Resolved 
> [rcas6132.rcf.bnl.gov]'s domain name into [rcf.bnl.gov]
> 070422 08:07:06 001 Xrd: DomainMatcher: search for 'rcf.bnl.gov' in 
> '<unknown>'
> 070422 08:07:06 001 Xrd: DomainMatcher: checking domain: <unknown>
> 070422 08:07:06 001 Xrd: DomainMatcher: no domain matching 'rcf.bnl.gov' 
> found in '<unknown>'
> 070422 08:07:06 001 Xrd: DomainMatcher: search for 'rcf.bnl.gov' in 
> 'rcf.bnl.gov|usatlas.bnl.gov'
> 070422 08:07:06 001 Xrd: DomainMatcher: checking domain: rcf.bnl.gov
> 070422 08:07:06 001 Xrd: DomainMatcher: domain: rcf.bnl.gov matches 
> 'rcf.bnl.gov' (matching chars: 11)
> 070422 08:07:06 001 Xrd: CheckHostDomain: Access granted to the domain 
> of [rcas6132.rcf.bnl.gov].
> 070422 08:07:06 001 Xrd: Open: Trying to connect to 
> rcas6132.rcf.bnl.gov:1097. Connect try 1
> 070422 08:07:06 001 Xrd: XrdClientConn: Trying to connect to 
> 130.199.206.182:1097
> 070422 08:07:06 001 Xrd: Connect: Creating a logical connection...
> 070422 08:07:06 001 Xrd: Connect: Physical connection not found. 
> Creating a new one...
> 070422 08:07:06 001 Xrd: Touch: Setting last use to current time1177243626
> 070422 08:07:06 001 Xrd: Connect: Connecting to [rcas6132.rcf.bnl.gov:1097]
> 070422 08:07:06 001 Xrd: ClientSock::TryConnect: Trying to connect 
> torcas6132.rcf.bnl.gov(130.199.206.182):1097 Timeout=60
> 070422 08:07:06 001 Xrd: ClientSock::TryConnect: Connection 
> torcas6132.rcf.bnl.gov:1097 failed. (-1)
> 070422 08:07:06 001 Xrd: Connect: can't open connection to 
> [rcas6132.rcf.bnl.gov:1097]
> 070422 08:07:06 001 Xrd: PhyConnection: Disconnecting socket...
> 070422 08:07:06 001 Xrd: XrdClientPhyConnection: Destroying. [:-1]
> 070422 08:07:06 001 Xrd: PhyConnection: Disconnecting socket...
> 070422 08:07:06 001 Xrd: Connect: Connect(rcas6132.rcf.bnl.gov, 1097) 
> returned -1
> 070422 08:07:06 001 Xrd: XrdNetFile: Error creating logical connection 
> to rcas6132.rcf.bnl.gov:1097
> 070422 08:07:06 001 Xrd: Open: Disconnecting.
> 070422 08:07:06 001 Xrd: Open: Connection attempt failed. Sleeping 20 
> seconds.