Hello Andy,
First of all, thanks for the quick reply.
Andrew Hanushevsky wrote:
> Hi Pavel,
>
> What release are you running? I do recall that we had some end
> conditions that were corrected in recent releases. I am particularly
> concerned here because the current version should not be sensitive to
> strange load values. They are generally ignored (with a message).
We are running the latest production release, i.e. 20050920-0008. You
are probably wondering why I brought up the load values.
About one month ago, I saw the same strange behavior. I found
in the olbd manager's log file that some of the nodes were delivering wrong
load values, and then the node was scheduled for removal, etc.
I corrected our script for measuring the load, and after that fix
everything was OK.
By the current version you probably mean the 20060105-0311 development
version.
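For illustration, the kind of sanity check our load script was missing could look like the minimal sketch below. It assumes each metric is reported as an integer percentage between 0 and 100 (all I know for sure is that values greater than 100 are wrong). The names clamp_percent and sanitize and the metric keys are hypothetical and not part of olbd, and falling back to 0 for unreadable values is only one possible policy.

```python
def clamp_percent(value, fallback=0):
    """Force a measured value into the assumed valid range 0..100.

    Values that cannot be read as a finite number yield `fallback`
    (an assumption made for this sketch, not olbd behavior).
    """
    try:
        number = int(round(float(value)))
    except (TypeError, ValueError, OverflowError):
        return fallback
    return max(0, min(100, number))


def sanitize(metrics):
    """Clamp every metric in a {name: value} mapping."""
    return {name: clamp_percent(value) for name, value in metrics.items()}


if __name__ == "__main__":
    # Hypothetical raw measurements from a load script.
    print(sanitize({"load": 250, "io": "37.4", "mem": -5}))
```

Running the values through such a filter before reporting them keeps a measurement glitch from looking like an out-of-range load.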
>
> Anyway, here is the expected scenario:
>
> a) A server node drops out (i.e., the redirector cannot communicate
> with it).
> b) The redirector takes the server "offline" (i.e., scheduled for
> removal). This means that anyone who would have been redirected to
> that server is told to wait.
> c) The server now has 10 minutes or so (this is configurable) to
> reconnect to the redirector.
OK, that is {olb.delay drop 10m}, right?
> d) After 10 minutes, the server is dropped and considered no longer to
> be in the configuration.
> e) The server in (d), of course, is free to reconnect.
> Now, the scenario works backwards as well. The server should
> eventually see that the redirector is no longer communicating with it.
> This will cause the server to terminate its redirector connection and
> try to re-establish that connection. Older versions of the olbd had
> some problems in that code relative to flaky network connections. That
> should no longer be the case. What does the server log show?
>
> Assuming you are running the current version, should you be able to
> get a server in that state (i.e., it cannot reconnect to the
> redirector), then a gcore of the server along with the complete log
> file would be extremely helpful.
The log files are located at http://www.star.bnl.gov/~pjakl.
At about 060128 19:09:30 you can see the network problems in
"rcas6132/rcas6132.olb.log.20060129".
The scenario you mentioned can be seen in rcas6150.olb.log, but the problem
comes after that.
The last record is
060128 15:54:45 001 olb_Server: Logged into xrdstar
That is before the removal recorded in "rcas6132.olb.log.20060129" at 060128
19:09:30, and then there is nothing.
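The ordering of the two records can be checked by comparing their timestamps directly. This is a minimal sketch, assuming every record starts with a "yymmdd hh:mm:ss" stamp as in the lines quoted in this thread; the helper name record_time is hypothetical.

```python
from datetime import datetime


def record_time(line):
    """Parse the leading 'yymmdd hh:mm:ss' stamp of a log record."""
    stamp = " ".join(line.split()[:2])
    return datetime.strptime(stamp, "%y%m%d %H:%M:%S")


# Both records are copied from the logs quoted in this thread.
last_server_record = "060128 15:54:45 001 olb_Server: Logged into xrdstar"
removal_record = ("060128 19:09:30 20424 olb_Manager: rcas6150:1095 "
                  "scheduled for removal; not responding")

# Time between the server's last record and its removal on the manager.
gap = record_time(removal_record) - record_time(last_server_record)
print(gap)  # prints 3:14:45
```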
Hope that helps.
Pavel
>
> Andy
>
> ----- Original Message -----
> From: "Pavel Jakl"
> To: "Xrootd Mailing List"
> Cc: "Jerome LAURET"
> Sent: Monday, January 30, 2006 4:36 PM
> Subject: Host removal on the olbd manager
>
>
>> Hi all,
>>
>> I am seeing very strange behavior in our installation. Let me describe
>> it:
>> We had some problems with one of the Cisco switch boards to which our
>> redirector node is also connected. We found these lines in the
>> olb log file during the network outage:
>> Example for one node:
>>
>> 060128 19:09:30 20424 olb_GetLine: Unable to read request; no route
>> to host
>> 060128 19:09:30 20424 olb_Manager: rcas6150:1095 scheduled for
>> removal; not responding
>>
>> This node "rcas6150" never re-established its connection to the redirector
>> olbd server, but the olbd process is still running on that node.
>> When someone then requested a file from that node, they were
>> not redirected to it, even though the file is there.
>> If the olbd process is restarted on that node, everything is in
>> order again: the user is redirected to that node and the file is opened.
>>
>> You can simulate this strange behavior by reporting wrong values (i.e.,
>> values greater than 100, etc.) for load, io, etc. to the redirector node.
>> The node is then scheduled for removal ....
>>
>> Thanks for any advice.
>> Let me know if you need anything.
>> Pavel
>>
>>
>>