Hi Pavel,

One more thing... the xrootd log is also confusing as there is a big gap
that covers the time in question:

060128 15:48:00 9497 fsimon.26647:46@rcas6107 ofs_open: 0-644
fn=/home/starlib/reco/ppProduction/FullField/P05if/2005/121/st_physics_6121076_raw_1030007.MuDst.root
060129 15:20:03 001 (c) 2004 Stanford University/SLAC xrd version
20050920-0008_dbg

There seem to be nearly 24 hours missing here.
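A gap like this can be spotted mechanically. What follows is only a minimal sketch, assuming every record starts with a `YYMMDD HH:MM:SS` stamp as in the two lines above and that continuation lines (such as the `fn=...` line) carry no stamp. The names `parse_stamp` and `find_gaps` and the one-hour threshold are made up for illustration; they are not part of xrootd.

```python
from datetime import datetime, timedelta


def parse_stamp(line):
    """Return the leading 'YYMMDD HH:MM:SS' stamp of a log line, or None."""
    try:
        return datetime.strptime(line[:15], "%y%m%d %H:%M:%S")
    except ValueError:
        # Continuation lines and blank lines have no stamp.
        return None


def find_gaps(lines, threshold=timedelta(hours=1)):
    """Return (previous, current, delta) for consecutive stamps further apart than threshold."""
    gaps = []
    prev = None
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is None:
            continue
        if prev is not None and stamp - prev > threshold:
            gaps.append((prev, stamp, stamp - prev))
        prev = stamp
    return gaps


if __name__ == "__main__":
    # The two records quoted above, including their wrapped continuation lines.
    sample = [
        "060128 15:48:00 9497 fsimon.26647:46@rcas6107 ofs_open: 0-644",
        "fn=/home/starlib/reco/ppProduction/FullField/P05if/2005/121/st_physics_6121076_raw_1030007.MuDst.root",
        "060129 15:20:03 001 (c) 2004 Stanford University/SLAC xrd version",
        "20050920-0008_dbg",
    ]
    for before, after, delta in find_gaps(sample):
        print(f"gap of {delta} between {before} and {after}")
```

The threshold is arbitrary; on a busy server a much smaller value may be more useful.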

Andy

P.S. If you could edit it down to cover the full day of 1/28 and the 1st
12 hours of 1/29 that would be great.
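For the trimming, something along these lines might do. It is a rough sketch under the same assumption about the `YYMMDD HH:MM:SS` prefix; `trim_log` is a hypothetical helper, not an xrootd tool, and unstamped continuation lines simply follow the record they belong to.

```python
from datetime import datetime


def parse_stamp(line):
    """Return the leading 'YYMMDD HH:MM:SS' stamp of a log line, or None."""
    try:
        return datetime.strptime(line[:15], "%y%m%d %H:%M:%S")
    except ValueError:
        return None


def trim_log(lines, start, end):
    """Keep records with start <= stamp < end, plus their unstamped continuation lines."""
    kept = []
    keep = False
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is not None:
            # A stamped line starts a new record; decide whether to keep it.
            keep = start <= stamp < end
        if keep:
            kept.append(line)
    return kept


# Window requested above: all of 1/28 plus the first 12 hours of 1/29.
WINDOW_START = datetime(2006, 1, 28, 0, 0, 0)
WINDOW_END = datetime(2006, 1, 29, 12, 0, 0)
```

Reading the file with `open(path).read().splitlines()` and passing the result to `trim_log(lines, WINDOW_START, WINDOW_END)` would give the reduced log.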

On Tue, 31 Jan 2006, Pavel Jakl wrote:

> Hello Andy,
>
> first of all, thanks for the quick reply.
> Andrew Hanushevsky wrote:
>
> > Hi Pavel,
> >
> > What release are you running? I do recall that we had some end
> > conditions that were corrected in recent releases. I am particularly
> > concerned here because the current version should not be sensitive to
> > strange load values. They are generally ignored (with a message).
>
> We are running the latest production release, i.e., 20050920-0008. You
> are probably wondering why I brought up the load values.
> About a month ago I saw the same strange behavior. I found
> in the olbd manager's log file that some nodes were delivering wrong load
> values, and then the node was scheduled for removal, etc.
> I corrected our script for measuring the load, and everything was OK
> after the fix.
>
> By the current version you probably mean the 20060105-0311 development
> version.
>
> >
> > Anyway, here is the expected scenario:
> >
> > a) A server node drops out (i.e., the redirector cannot communicate
> > with it).
> > b) The redirector takes the server "offline" (i.e., scheduled for
> > removal). This means that anyone who would have been redirected to
> > that server is told to wait.
> > c) The server now has 10 minutes or so (this is configurable) to
> > reconnect to the redirector.
>
> OK, that is {olb.delay drop 10m}, right?
>
> > d) After 10 minutes, the server is dropped and considered no longer to
> > be in the configuration.
> > e) The server in (d), of course, is free to reconnect.
> > Now, the scenario works backwards as well. The server should
> > eventually see that the redirector is no longer communicating with it.
> > This will cause the server to terminate its redirector connection and
> > try to re-establish that connection. Older versions of the olbd had
> > some problems in that code relative to flaky network connections. That
> > should no longer be the case. What does the server log show?
> >
> > Assuming you are running the current version, should you be able to
> > get a server in that state (i.e., it cannot reconnect to the
> > redirector), then a gcore of the server along with the complete log
> > file would be extremely helpful.
>
> The log files are located at http://www.star.bnl.gov/~pjakl.
> Around 060128 19:09:30 you can see the network problems, in
> "rcas6132/rcas6132.olb.log.20060129".
> The scenario you mentioned can be seen in rcas6150.olb.log, but the
> problem comes after that.
> The last record is
> 060128 15:54:45 001 olb_Server: Logged into xrdstar
>
> That is before the removal in "rcas6132.olb.log.20060129" at 060128
> 19:09:30, and then nothing.
>
> Hope that will help you.
>
> Pavel
>
> >
> > Andy
> >
> > ----- Original Message ----- From: "Pavel Jakl" <[log in to unmask]>
> > To: "Xrootd Mailing List" <[log in to unmask]>
> > Cc: "Jerome LAURET" <[log in to unmask]>
> > Sent: Monday, January 30, 2006 4:36 PM
> > Subject: Host removal on the olbd manager
> >
> >
> >> Hi all,
> >>
> >> I am seeing very strange behavior in our installation. Let me describe
> >> it:
> >> We had some problems with one of the Cisco switch boards, to which our
> >> redirector node is also connected. These lines appeared in the
> >> olb log file during the network outage:
> >> Example for one node:
> >>
> >> 060128 19:09:30 20424 olb_GetLine: Unable to read request; no route
> >> to host
> >> 060128 19:09:30 20424 olb_Manager: rcas6150:1095 scheduled for
> >> removal; not responding
> >>
> >> This node, "rcas6150", never re-established its connection to the
> >> redirector olbd server, but the olbd process is still running on that node.
> >> When someone requested a file from that node, they were not
> >> redirected to it, even though the file is there.
> >> If the olbd process is restarted on that node, everything is back in
> >> order: the user is redirected to that node and the file is opened.
> >>
> >> You can simulate this strange behavior by giving wrong numbers (i.e.,
> >> values greater than 100, etc.) for load, io, etc. to the redirector node.
> >> The node is then scheduled for removal ....
> >>
> >> Thanks for any advice.
> >> Let me know if you need something
> >> Pavel
> >>
> >>
> >>
>
>