Hi Fabrizio,

After some discussion and a few tests by Jerome and one of our 
facility staff (Chris Hollowell), we concluded that there is nothing wrong 
with our DNS RR. Inspecting the struct returned by gethostbyname() shows 
that both addresses are returned. Even the server side recognizes both 
addresses under the DNS RR. So there has to be something wrong with the 
client.
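
For reference, a check of this kind can be sketched as below. This is only a minimal illustration, not the code Jerome or Chris actually ran; the helper name listAddresses is hypothetical and it assumes an IPv4 (AF_INET) result.

```cpp
#include <arpa/inet.h>   // inet_ntop
#include <netdb.h>       // struct hostent, gethostbyname
#include <netinet/in.h>  // INET_ADDRSTRLEN
#include <sys/socket.h>  // AF_INET
#include <string>
#include <vector>

// Hypothetical helper (not part of XrdNetDNS): collect every IPv4 address
// stored in a hostent as a dotted-quad string.
std::vector<std::string> listAddresses(const hostent *he)
{
    std::vector<std::string> out;
    if (!he || he->h_addrtype != AF_INET) return out;

    // h_addr_list is a NULL-terminated array of pointers to raw addresses.
    for (char **p = he->h_addr_list; p && *p; ++p) {
        char buf[INET_ADDRSTRLEN];
        if (inet_ntop(AF_INET, *p, buf, sizeof(buf))) out.push_back(buf);
    }
    return out;
}
```

Calling listAddresses(gethostbyname("xrdstar.rcf.bnl.gov")) should then give one entry per redirector behind the DNS RR.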

I did a small investigation and found where the problem is. 
To resolve the addresses under the DNS RR, the client uses the same 
function as the server:

int XrdNetDNS::getAddrName(const char *InetName, int maxipa,
                           char **Addr, char **Name, char **errtxt)

The problem is that the function does this immediately on entry:

// Max 10 addresses and names
//
  maxipa = (maxipa > 1 && maxipa < 10) ? maxipa : 1;

where maxipa is the maximum number of IP addresses you want returned.

This is harmless until you call the function with maxipa=10: the test 
maxipa < 10 then fails and maxipa is reset to 1, so only one address is 
returned.

So I see two solutions. Either change the condition as follows:
maxipa = (maxipa > 1 && maxipa <= 10) ? maxipa : 1;

or call the function with a smaller maxipa :-)

I tested this and everything then works correctly.
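
To make the boundary case explicit, here is a minimal standalone sketch of the two clamps. The function names clampOriginal and clampFixed are hypothetical and only wrap the two expressions quoted above; they are not part of XrdNetDNS.

```cpp
// Standalone sketch; clampOriginal and clampFixed are hypothetical names.

// Clamp as currently written: the upper bound is exclusive,
// so a request for 10 addresses falls through to 1.
constexpr int clampOriginal(int maxipa)
{
    return (maxipa > 1 && maxipa < 10) ? maxipa : 1;
}

// Proposed clamp: the upper bound is inclusive, so 10 is kept.
constexpr int clampFixed(int maxipa)
{
    return (maxipa > 1 && maxipa <= 10) ? maxipa : 1;
}

// Compile-time checks of the boundary behaviour.
static_assert(clampOriginal(9) == 9, "9 is accepted by the original clamp");
static_assert(clampOriginal(10) == 1, "10 is reset to 1 by the original clamp");
static_assert(clampFixed(10) == 10, "10 is accepted by the proposed clamp");
static_assert(clampFixed(11) == 1, "values above 10 still fall back to 1");
```

Both versions behave the same for every value except 10, which is exactly the value that triggers the problem.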

Cheers
Pavel

Pavel Jakl wrote:
> I thought that DNS RR served the purpose of light-weight load 
> balancing between multiple redirectors. But now I see that each 
> redirector knows about the others and that they can be configured for 
> fail-over or to distribute the load among themselves.
>
> Thanks, I will try that
> Pavel
>
>
> Fabrizio Furano wrote:
>> Hi Pavel,
>>
>>  the "problem" is the DNS RR. The client receives only one IP addr 
>> and keeps that. You should only create an alias instead of using RR 
>> at the DNS level and make sure that the DNS gives all the aliases 
>> when requested to translate the addr.
>>
>>  For example, this is the output of nslookup in the case of the 
>> redirectors at SLAC:
>>
>> fabrizio@bradipo 10:01:26 ~>nslookup kanolb-a.slac.stanford.edu
>> Server:         192.84.143.16
>> Address:        192.84.143.16#53
>>
>> Non-authoritative answer:
>> Name:   kanolb-a.slac.stanford.edu
>> Address: 134.79.85.23
>> Name:   kanolb-a.slac.stanford.edu
>> Address: 134.79.85.24
>>
>>
>>  as you can see, both IPs are returned to the client at the same 
>> time, and they will be considered during the connection phase.
>>
>>  Fabrizio
>>
>> Pavel Jakl wrote:
>>> Hi Fabrizio and Andy,
>>>
>>> I am not sure if we discussed this before, but let me explain my 
>>> problem. When Andy implemented multiple redirectors for clusters 
>>> bigger than 64 servers, I thought that it would bring us full 
>>> recoverability in case something happened to the host 
>>> acting as redirector.
>>> I did a few tests and found that the client is not ready for that. 
>>> Maybe I am doing something wrong, so let me explain it.
>>> We have a DNS RR containing 2 servers configured as full redirectors 
>>> and managers of the cluster.
>>> [The example: DNS RR - xrdstar.rcf.bnl.gov and 2 redirectors 
>>> (rcas6132, rcas6182)]
>>>
>>> The problem is that the client initially resolves one of the 
>>> redirectors, but if that host is down, the client doesn't 
>>> try to connect to the second redirector. It doesn't even keep 
>>> track of the servers available under the DNS RR. I don't mind 
>>> the client trying to connect a fixed number of times, but 
>>> when it is not successful, it should move to another server under 
>>> the DNS RR.
>>>
>>> As you can see in the example, it resolves rcas6132, which was 
>>> temporarily down, but doesn't try the second one, rcas6182... How 
>>> do you handle this at SLAC? What am I doing wrong?
>>>
>>> Thanks
>>> Pavel
>>>
>>> CINT/ROOT C/C++ Interpreter version 5.16.13, June 8, 2006
>>> Type ? for help. Commands must be C++ statements.
>>> Enclose multiple statements between { }.
>>> *** Float Point Exception is OFF ***
>>> *** Start at Date : Sun Apr 22 08:07:06 2007
>>> QAInfo:You are using STAR_LEVEL : dev, ROOT_LEVEL : 5.12.00 and node 
>>> : rcas6009.rcf.bnl.gov
>>> root4star [0]
>>> Processing XROOTD_macro.C...
>>> 070422 08:07:06 001 Xrd: Create: (C) 2004 SLAC INFN XrdClient 
>>> kXR_ver002+kXR_asyncap
>>> 070422 08:07:06 001 Xrd: TakeUrl: parsing url:
>>> 070422 08:07:06 001 Xrd: GetDomainToMatch: 
>>> GetHostName(rcas6009.rcf.bnl.gov) returned name=rcas6009.rcf.bnl.gov
>>> 070422 08:07:06 001 Xrd: GetDomainToMatch: 
>>> GetDomain(rcas6009.rcf.bnl.gov) --> rcf.bnl.gov
>>> 070422 08:07:06 001 Xrd: XrdClientUrlSet: parsing: 
>>> root://xrdstar.rcf.bnl.gov:1097//data1/reco/productionCentral/FullField/P05ic/2004/053/st_physics_5053078_raw_3020016.MuDst.root 
>>>
>>> 070422 08:07:06 001 Xrd: XrdClientUrlSet: protocol: root
>>> 070422 08:07:06 001 Xrd: XrdClientUrlSet: file: 
>>> /data1/reco/productionCentral/FullField/P05ic/2004/053/st_physics_5053078_raw_3020016.MuDst.root 
>>>
>>> 070422 08:07:06 001 Xrd: XrdClientUrlSet: list of [host:port] : 
>>> xrdstar.rcf.bnl.gov:1097
>>> 070422 08:07:06 001 Xrd: XrdClientUrlSet: Remote file to open is 
>>> '/data1/reco/productionCentral/FullField/P05ic/2004/053/st_physics_5053078_raw_3020016.MuDst.root' 
>>>
>>> 070422 08:07:06 001 Xrd: XrdClientUrlSet: parsing entity: 
>>> xrdstar.rcf.bnl.gov:1097
>>> 070422 08:07:06 001 Xrd: TakeUrl: parsing url: xrdstar.rcf.bnl.gov:1097
>>> 070422 08:07:06 001 Xrd: TakeUrl:    HostWPort:   
>>> xrdstar.rcf.bnl.gov:1097
>>> 070422 08:07:06 001 Xrd: TakeUrl:    File:   /
>>> 070422 08:07:06 001 Xrd: TakeUrl:    Host:   xrdstar.rcf.bnl.gov
>>> 070422 08:07:06 001 Xrd: TakeUrl:    Port:   1097
>>> 070422 08:07:06 001 Xrd: ConvertDNSAlias: resolving 
>>> xrdstar.rcf.bnl.gov:1097
>>> 070422 08:07:06 001 Xrd: CheckPort: specified port (1097) 
>>> potentially valid.
>>> 070422 08:07:06 001 Xrd: ConvertDNSAlias: found host 
>>> rcas6132.rcf.bnl.gov with addr 130.199.206.182
>>> 070422 08:07:06 001 Xrd: ShowUrls: The converted URLs count is 1
>>> 070422 08:07:06 001 Xrd: ShowUrls: URL n.1: 
>>> root://rcas6132.rcf.bnl.gov:1097//data1/reco/productionCentral/FullField/P05ic/2004/053/st_physics_5053078_raw_3020016.MuDst.root. 
>>>
>>> 070422 08:07:06 001 Xrd: GetDomainToMatch: 
>>> GetHostName(rcas6132.rcf.bnl.gov) returned name=rcas6132.rcf.bnl.gov
>>> 070422 08:07:06 001 Xrd: GetDomainToMatch: 
>>> GetDomain(rcas6132.rcf.bnl.gov) --> rcf.bnl.gov
>>> 070422 08:07:06 001 Xrd: CheckHostDomain: Resolved 
>>> [rcas6132.rcf.bnl.gov]'s domain name into [rcf.bnl.gov]
>>> 070422 08:07:06 001 Xrd: DomainMatcher: search for 'rcf.bnl.gov' in 
>>> '<unknown>'
>>> 070422 08:07:06 001 Xrd: DomainMatcher: checking domain: <unknown>
>>> 070422 08:07:06 001 Xrd: DomainMatcher: no domain matching 
>>> 'rcf.bnl.gov' found in '<unknown>'
>>> 070422 08:07:06 001 Xrd: DomainMatcher: search for 'rcf.bnl.gov' in 
>>> 'rcf.bnl.gov|usatlas.bnl.gov'
>>> 070422 08:07:06 001 Xrd: DomainMatcher: checking domain: rcf.bnl.gov
>>> 070422 08:07:06 001 Xrd: DomainMatcher: domain: rcf.bnl.gov matches 
>>> 'rcf.bnl.gov' (matching chars: 11)
>>> 070422 08:07:06 001 Xrd: CheckHostDomain: Access granted to the 
>>> domain of [rcas6132.rcf.bnl.gov].
>>> 070422 08:07:06 001 Xrd: Open: Trying to connect to 
>>> rcas6132.rcf.bnl.gov:1097. Connect try 1
>>> 070422 08:07:06 001 Xrd: XrdClientConn: Trying to connect to 
>>> 130.199.206.182:1097
>>> 070422 08:07:06 001 Xrd: Connect: Creating a logical connection...
>>> 070422 08:07:06 001 Xrd: Connect: Physical connection not found. 
>>> Creating a new one...
>>> 070422 08:07:06 001 Xrd: Touch: Setting last use to current 
>>> time1177243626
>>> 070422 08:07:06 001 Xrd: Connect: Connecting to 
>>> [rcas6132.rcf.bnl.gov:1097]
>>> 070422 08:07:06 001 Xrd: ClientSock::TryConnect: Trying to connect 
>>> torcas6132.rcf.bnl.gov(130.199.206.182):1097 Timeout=60
>>> 070422 08:07:06 001 Xrd: ClientSock::TryConnect: Connection 
>>> torcas6132.rcf.bnl.gov:1097 failed. (-1)
>>> 070422 08:07:06 001 Xrd: Connect: can't open connection to 
>>> [rcas6132.rcf.bnl.gov:1097]
>>> 070422 08:07:06 001 Xrd: PhyConnection: Disconnecting socket...
>>> 070422 08:07:06 001 Xrd: XrdClientPhyConnection: Destroying. [:-1]
>>> 070422 08:07:06 001 Xrd: PhyConnection: Disconnecting socket...
>>> 070422 08:07:06 001 Xrd: Connect: Connect(rcas6132.rcf.bnl.gov, 
>>> 1097) returned -1
>>> 070422 08:07:06 001 Xrd: XrdNetFile: Error creating logical 
>>> connection to rcas6132.rcf.bnl.gov:1097
>>> 070422 08:07:06 001 Xrd: Open: Disconnecting.
>>> 070422 08:07:06 001 Xrd: Open: Connection attempt failed. Sleeping 
>>> 20 seconds.
>>>
>>>
>>>
>>>