Hello Andy,

first of all, thanks for the quick reply.

Andrew Hanushevsky wrote:
> Hi Pavel,
>
> What release are you running? I do recall that we had some end
> conditions that were corrected in recent releases. I am particularly
> concerned here because the current version should not be sensitive to
> strange load values. They are generally ignored (with a message).

We are running the latest production release, i.e. 20050920-0008.
You are probably wondering why I brought up load values. About one
month ago I saw the same strange behavior. I found in the olbd
manager's log file that some nodes were delivering wrong load values,
and then the node was scheduled for removal etc. I corrected our script
for measuring the load, and everything was OK after the fix.
By "the current version" you probably mean the 20060105-0311
development version.

>
> Anyway, here is the expected scenario:
>
> a) A server node drops out (i.e., the redirector cannot communicate
> with it).
> b) The redirector takes the server "offline" (i.e., scheduled for
> removal). This means that anyone who would have been redirected to
> that server is told to wait.
> c) The server now has 10 minutes or so (this is configurable) to
> reconnect to the redirector.

OK, that is {olb.delay drop 10m}, right?

> d) After 10 minutes, the server is dropped and considered no longer to
> be in the configuration.
> e) The server in (d), of course, is free to reconnect.
> Now, the scenario works backwards as well. The server should
> eventually see that the redirector is no longer communicating with it.
> This will cause the server to terminate its redirector connection and
> try to re-establish that connection. Older versions of the olbd had
> some problems in that code relative to flaky network connections. That
> should no longer be the case. What does the server log show?
>
> Assuming you are running the current version, should you be able to
> get a server in that state (i.e., it cannot reconnect to the
> redirector), then a gcore of the server along with the complete log
> file would be extremely helpful.

The log files are located at http://www.star.bnl.gov/~pjakl.
At about 060128 19:09:30 you can see the network problems, in
"rcas6132/rcas6132.olb.log.20060129". The scenario you mentioned can be
seen in rcas6150.olb.log, but the problem comes after that. The last
record is

060128 15:54:45 001 olb_Server: Logged into xrdstar

That is before the removal recorded in "rcas6132.olb.log.20060129" at
060128 19:09:30, and after it there is nothing.

Hope that will help you.
Pavel

>
> Andy
>
> ----- Original Message ----- From: "Pavel Jakl" <[log in to unmask]>
> To: "Xrootd Mailing List" <[log in to unmask]>
> Cc: "Jerome LAURET" <[log in to unmask]>
> Sent: Monday, January 30, 2006 4:36 PM
> Subject: Host removal on the olbd manager
>
>
>> Hi all,
>>
>> I have seen very strange behavior in our installation. Let me describe
>> it:
>> We had some problems on one of the Cisco switch boards, to which our
>> redirector node is also connected. We found these lines in the
>> olb log file during the network crash:
>> Example for one node:
>>
>> 060128 19:09:30 20424 olb_GetLine: Unable to read request; no route to host
>> 060128 19:09:30 20424 olb_Manager: rcas6150:1095 scheduled for removal; not responding
>>
>> This node "rcas6150" never recovered its connection to the redirector
>> olbd server, but the olbd process is still running on that node.
>> And when someone tried to request a file from that node, he
>> wasn't redirected to that node, even though the file is there.
>> If the olbd process is restarted on that node, everything is in
>> order: the user is redirected to that node and the file is opened.
>>
>> You can simulate this strange behavior by giving wrong numbers (i.e.
>> values greater than 100 etc.) for load, io etc. to the redirector node.
>> Then the node is scheduled for removal ....
>>
>> Thanks for any advice
>> Let me know if you need something
>> Pavel