Hello Andy,

First of all, thanks for the quick reply.
Andrew Hanushevsky wrote:

> Hi Pavel,
>
> What release are you running? I do recall that we had some end 
> conditions that were corrected in recent releases. I am particularly 
> concerned here because the current version should not be sensitive to 
> strange load values. They are generally ignored (with a message).

We are running the latest production release, i.e. 20050920-0008. You 
are probably wondering why I brought up the load values. 
About a month ago I saw the same strange behavior. I found 
in the olbd manager's log file that some nodes were delivering wrong load 
values, and then the node was scheduled for removal, etc.
I corrected our script for measuring the load, and after that fix 
everything was OK.
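
A minimal sketch of such a sanity check, assuming the reported values are percentages that must stay within 0 to 100 (the wrong values here were greater than 100). The function name `sanitize_load` and the sample numbers are hypothetical; this is not our actual measuring script and not part of olbd.

```python
def sanitize_load(values, low=0, high=100):
    """Return the values as integers forced into the range [low, high]."""
    cleaned = []
    for value in values:
        # Accept ints, floats, or numeric strings, then round to an integer.
        number = int(round(float(value)))
        # Clamp anything outside the allowed range to the nearest bound.
        cleaned.append(max(low, min(high, number)))
    return cleaned


if __name__ == "__main__":
    # Hypothetical raw measurements; 250 and -3 are out of range.
    raw = [42, 250, 7.6, -3, "15"]
    print(" ".join(str(v) for v in sanitize_load(raw)))
```

Clamping hides a broken measurement rather than reporting it, so logging the raw value before it is clamped would probably be wise as well.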

By the current version you probably mean the 20060105-0311 development 
version.

>
> Anyway, here is the expected scenario:
>
> a) A server node drops out (i.e., the redirector cannot communicate 
> with it).
> b) The redirector takes the server "offline" (i.e., scheduled for 
> removal). This means that anyone who would have been redirected to 
> that server is told to wait.
> c) The server now has 10 minutes or so (this is configurable) to 
> reconnect to the redirector.

OK, that is {olb.delay drop 10m}, right?

> d) After 10 minutes, the server is dropped and considered no longer to 
> be in the configuration.
> e) The server in (d), of course, is free to reconnect.
>
> Now, the scenario works backwards as well. The server should 
> eventually see that the redirector is no longer communicating with it. 
> This will cause the server to terminate its redirector connection and 
> try to re-establish that connection. Older versions of the olbd had 
> some problems in that code relative to flaky network connections. That 
> should no longer be the case. What does the server log show?
>
> Assuming you are running the current version, should you be able to 
> get a server in that state (i.e., it cannot reconnect to the 
> redirector), then a gcore of the server along with the complete log 
> file would be extremely helpful.
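
To make sure I read steps (a) to (e) correctly, here is a rough sketch of them as a small state machine. It is only a model of the description above, not olbd code: the class `ServerEntry`, the state names, and the "not available" answer for a dropped server are my own assumptions, and the 10-minute constant simply mirrors the configurable delay mentioned in (c).

```python
from dataclasses import dataclass
from typing import Optional

DROP_DELAY_SECONDS = 10 * 60  # "10 minutes or so (this is configurable)"


@dataclass
class ServerEntry:
    name: str
    state: str = "online"  # "online", "offline", or "dropped"
    offline_since: Optional[float] = None

    def connection_lost(self, now: float) -> None:
        # Steps (a) and (b): the redirector cannot reach the server
        # and takes it offline (scheduled for removal).
        if self.state == "online":
            self.state = "offline"
            self.offline_since = now

    def reconnected(self) -> None:
        # Steps (c) and (e): a reconnect brings the server back,
        # whether it was still offline or already dropped.
        self.state = "online"
        self.offline_since = None

    def tick(self, now: float) -> None:
        # Step (d): once the delay has passed, the server is dropped.
        if (self.state == "offline"
                and now - self.offline_since >= DROP_DELAY_SECONDS):
            self.state = "dropped"

    def redirect_answer(self) -> str:
        # Clients are told to wait only while the server is offline.
        # The answer for "dropped" is an assumption made for this sketch.
        answers = {"online": "redirect", "offline": "wait",
                   "dropped": "not available"}
        return answers[self.state]


if __name__ == "__main__":
    node = ServerEntry("rcas6150")
    node.connection_lost(now=0)
    print(node.state, node.redirect_answer())
    node.tick(now=DROP_DELAY_SECONDS)
    print(node.state, node.redirect_answer())
    node.reconnected()
    print(node.state, node.redirect_answer())
```

In this model, what we observe on rcas6150 corresponds to `reconnected()` never being called, although the olbd process on the node is still running.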

The log files are located at http://www.star.bnl.gov/~pjakl.
At about 060128 19:09:30 you can see the network problems, recorded in 
"rcas6132/rcas6132.olb.log.20060129".
The scenario you describe can be seen in rcas6150.olb.log, but the problem 
comes after that.
The last record is
060128 15:54:45 001 olb_Server: Logged into xrdstar

That is before the removal recorded in "rcas6132.olb.log.20060129" at 
060128 19:09:30, and after that there is nothing.
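
For comparing the two logs, a small sketch that parses records of this form and measures the gap between the last server record and the removal on the manager. It assumes each record is "yymmdd hh:mm:ss", a numeric field, a component name, and a message, which is what the lines quoted in this thread look like; the function name `parse_record` and the interpretation of the numeric field as an id are my assumptions.

```python
import re
from datetime import datetime

# Date and time, a numeric field, a component name, then the message.
LINE_RE = re.compile(r"^(\d{6} \d{2}:\d{2}:\d{2}) (\d+) (\w+): (.*)$")


def parse_record(line):
    """Split one log line into (timestamp, id, component, message), or None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp, ident, component, message = match.groups()
    when = datetime.strptime(stamp, "%y%m%d %H:%M:%S")
    return when, int(ident), component, message


if __name__ == "__main__":
    # Both lines are taken from this thread (the second one unwrapped).
    last_server = parse_record(
        "060128 15:54:45 001 olb_Server: Logged into xrdstar")
    removal = parse_record(
        "060128 19:09:30 20424 olb_Manager: rcas6150:1095 scheduled for "
        "removal; not responding")
    # Time between the server's last record and its removal on the manager.
    print(removal[0] - last_server[0])
```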

I hope that helps.

Pavel

>
> Andy
>
> ----- Original Message ----- From: "Pavel Jakl" <[log in to unmask]>
> To: "Xrootd Mailing List" <[log in to unmask]>
> Cc: "Jerome LAURET" <[log in to unmask]>
> Sent: Monday, January 30, 2006 4:36 PM
> Subject: Host removal on the olbd manager
>
>
>> Hi all,
>>
>> I am seeing very strange behavior in our installation. Let me describe 
>> it:
>> We had some problems on one of the Cisco switch boards to which our 
>> redirector node is also connected. During the network outage, lines 
>> like these appeared in the olb log file.
>> Example for one node:
>>
>> 060128 19:09:30 20424 olb_GetLine: Unable to read request; no route 
>> to host
>> 060128 19:09:30 20424 olb_Manager: rcas6150:1095 scheduled for 
>> removal; not responding
>>
>> This node "rcas6150" never re-established its connection to the 
>> redirector olbd server, although the olbd process is still running on 
>> that node.
>> When someone then requested a file from that node, he was not 
>> redirected to it, even though the file is there.
>> If the olbd process is restarted on that node, everything is back in 
>> order: the user is redirected to that node and the file is opened.
>>
>> You can simulate this strange behavior by giving wrong numbers (i.e., 
>> values greater than 100, etc.) for load, io, etc. to the redirector 
>> node. The node is then scheduled for removal ....
>>
>> Thanks for any advice.
>> Let me know if you need anything.
>> Pavel
>>
>>
>>