Hi Pavel,

What release are you running? I do recall that we had some end conditions 
that were corrected in recent releases. I am particularly concerned here 
because the current version should not be sensitive to strange load values. 
They are generally ignored (with a message).

Anyway, here is the expected scenario:

a) A server node drops out (i.e., the redirector cannot communicate with 
it).
b) The redirector takes the server "offline" (i.e., scheduled for removal). 
This means that anyone who would have been redirected to that server is told 
to wait.
c) The server now has 10 minutes or so (this is configurable) to reconnect 
to the redirector.
d) After 10 minutes, the server is dropped and considered no longer to be in 
the configuration.
e) The server in (d), of course, is free to reconnect.
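
As a rough illustration of steps (a) through (e), here is a minimal Python 
sketch of the lifecycle. It is not olbd code: the class, method, and state 
names are hypothetical, and the 600-second default simply stands in for the 
configurable "10 minutes or so" above.

```python
import enum


class State(enum.Enum):
    ONLINE = "online"
    OFFLINE = "offline"   # scheduled for removal; clients are told to wait
    DROPPED = "dropped"   # no longer considered part of the configuration


class ServerEntry:
    """Hypothetical redirector-side record for one server node."""

    def __init__(self, name, drop_delay=600.0):
        self.name = name
        self.drop_delay = drop_delay  # seconds; assumed configurable, step (c)
        self.state = State.ONLINE
        self.offline_since = None

    def lost_contact(self, now):
        # Steps (a) and (b): communication failed, take the server offline.
        if self.state is State.ONLINE:
            self.state = State.OFFLINE
            self.offline_since = now

    def tick(self, now):
        # Step (d): drop the server once the grace period has expired.
        if (self.state is State.OFFLINE
                and now - self.offline_since >= self.drop_delay):
            self.state = State.DROPPED
            self.offline_since = None

    def reconnected(self):
        # Steps (c) and (e): a reconnect is accepted from either state.
        self.state = State.ONLINE
        self.offline_since = None

    def client_action(self):
        # What a client that would have been sent to this server is told.
        if self.state is State.ONLINE:
            return "redirect"
        if self.state is State.OFFLINE:
            return "wait"
        return "skip"
```

In this sketch a server that never reconnects ends up in DROPPED and stays 
there until reconnected() is called.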

Now, the scenario works backwards as well. The server should eventually see 
that the redirector is no longer communicating with it. This will cause the 
server to terminate its redirector connection and try to re-establish that 
connection. Older versions of the olbd had some problems in that code 
relative to flaky network connections. That should no longer be the case. 
What does the server log show?
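
For reference, the server-side behavior described above amounts to a retry 
loop along the following lines. This is only a sketch under assumptions, not 
the olbd implementation: connect is a hypothetical callable that opens the 
redirector connection and raises OSError on failure, and the retry delay and 
attempt limit are made-up parameters.

```python
import time


def reconnect(connect, retry_delay=5.0, max_attempts=None, sleep=time.sleep):
    """Keep calling connect() until it succeeds or attempts run out.

    Returns whatever connect() returns, or None if max_attempts was reached.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            return connect()
        except OSError:
            # For example "no route to host": wait, then try again.
            sleep(retry_delay)
    return None
```

A server stuck in the state you describe would correspond to this loop never 
being entered, or never returning, which is what the log should reveal.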

Assuming you are running the current version, should you be able to get a 
server in that state (i.e., it cannot reconnect to the redirector), then a 
gcore of the server along with the complete log file would be extremely 
helpful.

Andy

----- Original Message ----- 
From: "Pavel Jakl" <[log in to unmask]>
To: "Xrootd Mailing List" <[log in to unmask]>
Cc: "Jerome LAURET" <[log in to unmask]>
Sent: Monday, January 30, 2006 4:36 PM
Subject: Host removal on the olbd manager


> Hi all,
>
> I am seeing very strange behavior in our installation. Let me describe it:
> We had some problems on one of the Cisco switch boards to which our 
> redirector node is also connected. The following lines appeared in the olb 
> log file during the network outage.
> Example for one node:
>
> 060128 19:09:30 20424 olb_GetLine: Unable to read request; no route to 
> host
> 060128 19:09:30 20424 olb_Manager: rcas6150:1095 scheduled for removal; 
> not responding
>
> This node "rcas6150" never recovered its connection to the redirector olbd 
> server, but the olbd process is still running on that node.
> When someone tried to request a file from that node, he wasn't 
> redirected to that node, even though the file is there.
> If the olbd process is restarted on that node, everything is in order 
> again: the user is redirected to that node and the file is opened.
>
> You can simulate this strange behavior by reporting wrong numbers (meaning 
> values greater than 100, etc.) for load, io, etc. to the redirector node. 
> The node is then scheduled for removal ....
>
> Thanks for any advice.
> Let me know if you need anything.
> Pavel
>
>
>