I thought that DNS RR served as light-weight load balancing
between multiple redirectors. But now I see that each redirector knows
about the others, and that they can be configured either for fail-over
or to distribute the load among themselves.
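
A minimal sketch of the client-side idea, assuming the DNS name returns
every address behind it in a single answer: collect all the addresses
and try each one in turn until a redirector responds. This is not the
XrdClient code; the function names (resolve_all,
connect_first_available) and the 5-second timeout are hypothetical and
only illustrate the behaviour.

```python
import socket


def resolve_all(host, port, resolver=socket.getaddrinfo):
    """Return every distinct IPv4 address reported for host, in answer order."""
    infos = resolver(host, port, socket.AF_INET, socket.SOCK_STREAM)
    addrs = []
    for _family, _socktype, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addrs:  # the resolver may repeat an address
            addrs.append(ip)
    return addrs


def connect_first_available(host, port, timeout=5.0,
                            resolver=socket.getaddrinfo,
                            connector=socket.create_connection):
    """Try each resolved address in order; return (ip, connection) for the
    first one that accepts the connection."""
    last_error = None
    for ip in resolve_all(host, port, resolver):
        try:
            return ip, connector((ip, port), timeout)
        except OSError as err:
            # This redirector is unreachable: remember why, try the next one.
            last_error = err
    raise ConnectionError(
        f"no redirector reachable behind {host}:{port}") from last_error
```

The resolver and connector are passed in as parameters only so that the
fail-over logic can be exercised without a real network; by default they
are the standard socket.getaddrinfo and socket.create_connection.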
Thanks, I will try that
Pavel
Fabrizio Furano wrote:
> Hi Pavel,
>
> the "problem" is the DNS RR. The client receives only one IP address and
> keeps it. Instead of using RR at the DNS level, you should create an
> alias and make sure that the DNS returns all the addresses behind it
> when asked to translate the name.
>
> For example, this is the output of nslookup in the case of the
> redirectors at SLAC:
>
> fabrizio@bradipo 10:01:26 ~>nslookup kanolb-a.slac.stanford.edu
> Server: 192.84.143.16
> Address: 192.84.143.16#53
>
> Non-authoritative answer:
> Name: kanolb-a.slac.stanford.edu
> Address: 134.79.85.23
> Name: kanolb-a.slac.stanford.edu
> Address: 134.79.85.24
>
>
> as you can see, both IPs are returned to the client at the same time,
> and they will be considered during the connection phase.
>
> Fabrizio
>
> Pavel Jakl wrote:
>> Hi Fabrizio and Andy,
>>
>> I am not sure if we discussed this before, but let me explain my
>> problem. When Andy implemented multiple redirectors for clusters
>> bigger than 64 servers, I thought that it would give us full
>> recoverability in case something happened to the host acting
>> as redirector.
>> I did a few tests and found that the client is not ready for that.
>> Maybe I am doing something wrong, so let me explain.
>> We have a DNS RR containing 2 servers configured as full redirectors
>> and managers of the cluster.
>> [The example: DNS RR - xrdstar.rcf.bnl.gov and 2 redirectors
>> (rcas6132, rcas6182)]
>>
>> The problem is that the client initially resolves one of the
>> redirectors, but if that particular host is down, the client doesn't
>> try to connect to the second redirector. It doesn't even keep track
>> of the servers that are available under the DNS RR. I don't mind
>> the client trying to connect a fixed number of times, but when it is
>> not successful, it should move on to another server under the DNS RR.
>>
>> As you can see in the example, it resolves rcas6132, which was
>> temporarily down, but didn't try the second one, rcas6182... I am not
>> sure, but how do you handle this at SLAC? What am I doing wrong?
>>
>> Thanks
>> Pavel
>>
>> CINT/ROOT C/C++ Interpreter version 5.16.13, June 8, 2006
>> Type ? for help. Commands must be C++ statements.
>> Enclose multiple statements between { }.
>> *** Float Point Exception is OFF ***
>> *** Start at Date : Sun Apr 22 08:07:06 2007
>> QAInfo:You are using STAR_LEVEL : dev, ROOT_LEVEL : 5.12.00 and node
>> : rcas6009.rcf.bnl.gov
>> root4star [0]
>> Processing XROOTD_macro.C...
>> 070422 08:07:06 001 Xrd: Create: (C) 2004 SLAC INFN XrdClient
>> kXR_ver002+kXR_asyncap
>> 070422 08:07:06 001 Xrd: TakeUrl: parsing url:
>> 070422 08:07:06 001 Xrd: GetDomainToMatch:
>> GetHostName(rcas6009.rcf.bnl.gov) returned name=rcas6009.rcf.bnl.gov
>> 070422 08:07:06 001 Xrd: GetDomainToMatch:
>> GetDomain(rcas6009.rcf.bnl.gov) --> rcf.bnl.gov
>> 070422 08:07:06 001 Xrd: XrdClientUrlSet: parsing:
>> root://xrdstar.rcf.bnl.gov:1097//data1/reco/productionCentral/FullField/P05ic/2004/053/st_physics_5053078_raw_3020016.MuDst.root
>>
>> 070422 08:07:06 001 Xrd: XrdClientUrlSet: protocol: root
>> 070422 08:07:06 001 Xrd: XrdClientUrlSet: file:
>> /data1/reco/productionCentral/FullField/P05ic/2004/053/st_physics_5053078_raw_3020016.MuDst.root
>>
>> 070422 08:07:06 001 Xrd: XrdClientUrlSet: list of [host:port] :
>> xrdstar.rcf.bnl.gov:1097
>> 070422 08:07:06 001 Xrd: XrdClientUrlSet: Remote file to open is
>> '/data1/reco/productionCentral/FullField/P05ic/2004/053/st_physics_5053078_raw_3020016.MuDst.root'
>>
>> 070422 08:07:06 001 Xrd: XrdClientUrlSet: parsing entity:
>> xrdstar.rcf.bnl.gov:1097
>> 070422 08:07:06 001 Xrd: TakeUrl: parsing url: xrdstar.rcf.bnl.gov:1097
>> 070422 08:07:06 001 Xrd: TakeUrl: HostWPort:
>> xrdstar.rcf.bnl.gov:1097
>> 070422 08:07:06 001 Xrd: TakeUrl: File: /
>> 070422 08:07:06 001 Xrd: TakeUrl: Host: xrdstar.rcf.bnl.gov
>> 070422 08:07:06 001 Xrd: TakeUrl: Port: 1097
>> 070422 08:07:06 001 Xrd: ConvertDNSAlias: resolving
>> xrdstar.rcf.bnl.gov:1097
>> 070422 08:07:06 001 Xrd: CheckPort: specified port (1097) potentially
>> valid.
>> 070422 08:07:06 001 Xrd: ConvertDNSAlias: found host
>> rcas6132.rcf.bnl.gov with addr 130.199.206.182
>> 070422 08:07:06 001 Xrd: ShowUrls: The converted URLs count is 1
>> 070422 08:07:06 001 Xrd: ShowUrls: URL n.1:
>> root://rcas6132.rcf.bnl.gov:1097//data1/reco/productionCentral/FullField/P05ic/2004/053/st_physics_5053078_raw_3020016.MuDst.root.
>>
>> 070422 08:07:06 001 Xrd: GetDomainToMatch:
>> GetHostName(rcas6132.rcf.bnl.gov) returned name=rcas6132.rcf.bnl.gov
>> 070422 08:07:06 001 Xrd: GetDomainToMatch:
>> GetDomain(rcas6132.rcf.bnl.gov) --> rcf.bnl.gov
>> 070422 08:07:06 001 Xrd: CheckHostDomain: Resolved
>> [rcas6132.rcf.bnl.gov]'s domain name into [rcf.bnl.gov]
>> 070422 08:07:06 001 Xrd: DomainMatcher: search for 'rcf.bnl.gov' in
>> '<unknown>'
>> 070422 08:07:06 001 Xrd: DomainMatcher: checking domain: <unknown>
>> 070422 08:07:06 001 Xrd: DomainMatcher: no domain matching
>> 'rcf.bnl.gov' found in '<unknown>'
>> 070422 08:07:06 001 Xrd: DomainMatcher: search for 'rcf.bnl.gov' in
>> 'rcf.bnl.gov|usatlas.bnl.gov'
>> 070422 08:07:06 001 Xrd: DomainMatcher: checking domain: rcf.bnl.gov
>> 070422 08:07:06 001 Xrd: DomainMatcher: domain: rcf.bnl.gov matches
>> 'rcf.bnl.gov' (matching chars: 11)
>> 070422 08:07:06 001 Xrd: CheckHostDomain: Access granted to the
>> domain of [rcas6132.rcf.bnl.gov].
>> 070422 08:07:06 001 Xrd: Open: Trying to connect to
>> rcas6132.rcf.bnl.gov:1097. Connect try 1
>> 070422 08:07:06 001 Xrd: XrdClientConn: Trying to connect to
>> 130.199.206.182:1097
>> 070422 08:07:06 001 Xrd: Connect: Creating a logical connection...
>> 070422 08:07:06 001 Xrd: Connect: Physical connection not found.
>> Creating a new one...
>> 070422 08:07:06 001 Xrd: Touch: Setting last use to current
>> time1177243626
>> 070422 08:07:06 001 Xrd: Connect: Connecting to
>> [rcas6132.rcf.bnl.gov:1097]
>> 070422 08:07:06 001 Xrd: ClientSock::TryConnect: Trying to connect
>> torcas6132.rcf.bnl.gov(130.199.206.182):1097 Timeout=60
>> 070422 08:07:06 001 Xrd: ClientSock::TryConnect: Connection
>> torcas6132.rcf.bnl.gov:1097 failed. (-1)
>> 070422 08:07:06 001 Xrd: Connect: can't open connection to
>> [rcas6132.rcf.bnl.gov:1097]
>> 070422 08:07:06 001 Xrd: PhyConnection: Disconnecting socket...
>> 070422 08:07:06 001 Xrd: XrdClientPhyConnection: Destroying. [:-1]
>> 070422 08:07:06 001 Xrd: PhyConnection: Disconnecting socket...
>> 070422 08:07:06 001 Xrd: Connect: Connect(rcas6132.rcf.bnl.gov, 1097)
>> returned -1
>> 070422 08:07:06 001 Xrd: XrdNetFile: Error creating logical
>> connection to rcas6132.rcf.bnl.gov:1097
>> 070422 08:07:06 001 Xrd: Open: Disconnecting.
>> 070422 08:07:06 001 Xrd: Open: Connection attempt failed. Sleeping 20
>> seconds.
>>
>>
>>
>>