Hi Pavel,

One more thing... the xrootd log is also confusing, as there is a big gap
that covers the time in question:

060128 15:48:00 9497 fsimon.26647:46@rcas6107 ofs_open: 0-644 fn=/home/starlib/reco/ppProduction/FullField/P05if/2005/121/st_physics_6121076_raw_1030007.MuDst.root
060129 15:20:03 001 (c) 2004 Stanford University/SLAC xrd version 20050920-0008_dbg

There seem to be 24 hours missing here.

Andy

P.S. If you could edit it down to cover the full day of 1/28 and the 1st 12
hours of 1/29 that would be great.

On Tue, 31 Jan 2006, Pavel Jakl wrote:

> Hello Andy,
>
> first of all, thanks for the quick reply.
> Andrew Hanushevsky wrote:
>
> > Hi Pavel,
> >
> > What release are you running? I do recall that we had some end
> > conditions that were corrected in recent releases. I am particularly
> > concerned here because the current version should not be sensitive to
> > strange load values. They are generally ignored (with a message).
>
> We are running the latest production release, i.e. 20050920-0008. You
> are probably wondering why I brought up the load values.
> About a month ago I saw the same strange behavior. I found
> in the olbd manager's log file that some of the nodes were delivering
> wrong load values, and then the node was scheduled for removal, etc.
> I corrected our script for measuring the load and everything was OK
> after the repair.
>
> By the current version you probably mean the 20060105-0311 development
> version.
>
> >
> > Anyway, here is the expected scenario:
> >
> > a) A server node drops out (i.e., the redirector cannot communicate
> > with it).
> > b) The redirector takes the server "offline" (i.e., scheduled for
> > removal). This means that anyone who would have been redirected to
> > that server is told to wait.
> > c) The server now has 10 minutes or so (this is configurable) to
> > reconnect to the redirector.
>
> Ok, that is {olb.delay drop 10m}, right?
>
> > d) After 10 minutes, the server is dropped and considered no longer to
> > be in the configuration.
> > e) The server in (d), of course, is free to reconnect.
> > Now, the scenario works backwards as well. The server should
> > eventually see that the redirector is no longer communicating with it.
> > This will cause the server to terminate its redirector connection and
> > try to re-establish that connection. Older versions of the olbd had
> > some problems in that code relative to flaky network connections. That
> > should no longer be the case. What does the server log show?
> >
> > Assuming you are running the current version, should you be able to
> > get a server in that state (i.e., it cannot reconnect to the
> > redirector), then a gcore of the server along with the complete log
> > file would be extremely helpful.
>
> The log files are located in http://www.star.bnl.gov/~pjakl.
> Around 060128 19:09:30 you can see the network problems, in
> "rcas6132/rcas6132.olb.log.20060129".
> The scenario you describe can be seen in rcas6150.olb.log, but the
> problem comes after that.
> The last record is
> 060128 15:54:45 001 olb_Server: Logged into xrdstar
>
> That is before the removal recorded in "rcas6132.olb.log.20060129" at
> 060128 19:09:30, and then there is nothing.
>
> Hope that will help you.
>
> Pavel
>
> >
> > Andy
> >
> > ----- Original Message ----- From: "Pavel Jakl"
> > To: "Xrootd Mailing List"
> > Cc: "Jerome LAURET"
> > Sent: Monday, January 30, 2006 4:36 PM
> > Subject: Host removal on the olbd manager
> >
> >
> >> Hi all,
> >>
> >> I have seen very strange behavior in our installation. Let me describe
> >> it:
> >> We had some problems on one of the Cisco switch boards, to which our
> >> redirector node is also connected.
> >> These lines were found in the
> >> olb log file during the network crash:
> >> Example for one node:
> >>
> >> 060128 19:09:30 20424 olb_GetLine: Unable to read request; no route to host
> >> 060128 19:09:30 20424 olb_Manager: rcas6150:1095 scheduled for removal; not responding
> >>
> >> This node "rcas6150" never recovered its connection to the redirector
> >> olbd server, but the olbd process is still running on that node.
> >> When someone then tried to request a file from that node, he
> >> wasn't redirected to that node, even though the file is there.
> >> If the olbd process is restarted on that node, everything is in
> >> order: the user is redirected to that node and the file is opened.
> >>
> >> You can simulate this strange behavior by giving wrong numbers (meaning
> >> values greater than 100, etc.) for load, io, etc. to the redirector node.
> >> Then the node is scheduled for removal ....
> >>
> >> Thanks for any advice.
> >> Let me know if you need anything.
> >> Pavel
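
A minimal sketch, in Python, of how a gap like the one Andy points out in
the xrootd log could be located mechanically. It assumes every record
starts with a "yymmdd hh:mm:ss" timestamp, as in the two log lines quoted
at the top of the thread. The name find_gaps and the one-hour threshold
are illustrative assumptions, not part of xrootd or olbd.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%y%m%d %H:%M:%S"  # e.g. "060128 15:48:00"
STAMP_LENGTH = 15                 # number of characters in the timestamp


def find_gaps(lines, threshold=timedelta(hours=1)):
    """Return (previous, current, delta) for each pair of consecutive
    timestamped records that are further apart than `threshold`.

    Lines that do not begin with a parseable timestamp are skipped.
    """
    gaps = []
    previous = None
    for line in lines:
        try:
            current = datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
        except ValueError:
            continue  # not a timestamped record (hypothetical continuation line)
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current, current - previous))
        previous = current
    return gaps


# The two consecutive records quoted at the top of the thread.
sample = [
    "060128 15:48:00 9497 fsimon.26647:46@rcas6107 ofs_open: 0-644 "
    "fn=/home/starlib/reco/ppProduction/FullField/P05if/2005/121/"
    "st_physics_6121076_raw_1030007.MuDst.root",
    "060129 15:20:03 001 (c) 2004 Stanford University/SLAC "
    "xrd version 20050920-0008_dbg",
]

for start, end, delta in find_gaps(sample):
    print(f"gap of {delta} between {start} and {end}")
```

For the two quoted records this reports a single gap of 23:32:03, which
matches the "24 hours missing" observation. To scan a whole file, pass the
open file object as `lines`; the threshold would need tuning for a quiet
server.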