Hi,

Any suggestions for extra logging to switch on on the servers before I
head off?

I just had to restart the olbds on 6 out of 20 servers but I'm not going
to be around over the weekend to restart them.
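
In case it is useful while I'm away, here is a rough sketch of the kind
of hourly check described in my earlier message (quoted below): for each
data server, request a file known to live on it via the redirector and
record whether the read succeeds. The redirector name, the
server-to-file map, and the use of "xrdcp ... /dev/null" as the read
test are all assumptions for illustration; substitute whatever client
command and paths you actually use.

```python
import subprocess
from typing import Callable, Dict, List


def build_copy_command(redirector: str, path: str) -> List[str]:
    """Build a client command that reads `path` through the redirector.

    Assumption: xrdcp is the client, and copying to /dev/null is enough
    to show that the file can be located and read.
    """
    return ["xrdcp", f"root://{redirector}/{path}", "/dev/null"]


def run_command(cmd: List[str], timeout: float = 120.0) -> bool:
    """Run the command; True only if it exits with status 0 in time."""
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing client binary or a hung request both count as failures.
        return False
    return done.returncode == 0


def probe(
    redirector: str,
    file_on_server: Dict[str, str],
    runner: Callable[[List[str]], bool] = run_command,
) -> Dict[str, bool]:
    """Request one known file per data server via the redirector."""
    return {
        server: runner(build_copy_command(redirector, path))
        for server, path in file_on_server.items()
    }


def unresponsive(results: Dict[str, bool]) -> List[str]:
    """Sorted list of servers whose test file could not be read."""
    return sorted(server for server, ok in results.items() if not ok)
```

Run from cron once an hour, the output of unresponsive() can be appended
to a log with a timestamp, which makes it easy to see when the list of
non-responsive servers changes between runs.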

Yours,
Chris.

> -----Original Message-----
> From: [log in to unmask] 
> [mailto:[log in to unmask]] On Behalf Of 
> Brew, CAJ (Chris)
> Sent: 18 February 2005 11:27
> To: Andrew Hanushevsky; Peter Elmer
> Cc: [log in to unmask]
> Subject: RE: olbd problems at RAL
> 
> Hi,
> 
> The logs for the last couple of days are in
> /afs/slac.stanford.edu/u/br/brew/olb_problem. There isn't much info
> there. If you suggest the extra traces to add I'll put them in the
> config files.
> 
> The rdr xrootd and olbd haven't been restarted for a while so their
> startup isn't in the period covered by the logs.
> 
> The -t option on the xrootd is identical to the "ofs.redirect target
> hostname" in the xrootd.cf file, correct? We seem to be using that
> rather than -t on the command line. I'll add it now and see if it has
> any effect.
> 
> Before I left last night I set up a job that would try to access a
> file on each server once an hour (the job failed in the early hours
> because my desktop got rebooted to install security patches) but you
> can see in the rdr xrd log that between 18:08 and 19:10 some servers
> stopped answering requests for the files. Strangely, between my test
> job stopping overnight and my test this morning the list of
> non-responsive servers changed.
> 
> We do have a slightly mixed system at the moment and it's in flux.
> 
> All but one test server are running the xrootd-20040907-0403 version,
> one test server (csfnfs45) is running xrootd-20041214-1142 but the
> problem appears on both types.
> 
> We've just added the extra SL3 redirector xrootd107 in addition to our
> old RH73 one csflnx108 but the problem happens via both redirectors.
> 
> Most of the servers have an entry for an extra redirector xrootd108 in
> their config files (but not all the servers experiencing the problem)
> which is what csflnx108 will become when it's reinstalled with SL3 but
> the DNS name does not yet exist so there is some complaint in the olbd
> logs about that.
> 
> Yours,
> Chris.
> 
> > -----Original Message-----
> > From: Andrew Hanushevsky [mailto:[log in to unmask]] 
> > Sent: 17 February 2005 15:22
> > To: Peter Elmer
> > Cc: Brew, CAJ (Chris); [log in to unmask]
> > Subject: Re: olbd problems at RAL
> > 
> > Hi Pete,
> > 
> > Sometimes minor things leak in that have major impacts. Usually I
> > go by what is really running successfully elsewhere to determine
> > the probability of success. However, you do bring up a good point.
> > Chris, please make sure that you are using -w on the olbd
> > consistently with -t on the xrootd data server. If you specify -w
> > but not -t then you will see exactly what you described. Also, the
> > logs from start-up time to hang time would be helpful (i.e.,
> > redirector: xrootd and olbd, and data server: xrootd/olbd).
> > Please clearly identify which is which. Thanks.
> > 
> > Andy
> > 
> > On Thu, 17 Feb 2005, Peter Elmer wrote:
> > 
> > >   Hi Andy,
> > >
> > >   From:
> > >
> > >   http://xrootd.slac.stanford.edu/xrootd.History
> > >
> > > the only differences between version 20040907-0403 (the one we
> > > currently label "production") and 20040830-0105 are small changes
> > > to the ./configure and makefiles, but nothing of substance that
> > > would lead to problems with the olbd. I suspect that there is
> > > something else going on. (e.g. the famous wait/-w problems?)
> > >
> > >                                    Pete
> > >
> > > > On Thu, Feb 17, 2005 at 07:12:08AM -0800, Andrew Hanushevsky wrote:
> > > > Hi Chris,
> > > >
> > > > > Those two particular releases seem to have had some problems.
> > > > > I assume you are not mixing releases here (i.e., running
> > > > > either on all servers causes you to see the problem).
> > > > >
> > > > > I do know that 20040830 is a stable release. We run that
> > > > > everywhere at SLAC for analysis. I'd suggest going with that
> > > > > one until we test out the latest release that should have
> > > > > fixed some other problem relating to writing files.
> > > >
> > > > Andy
> > > >
> > > > On Thu, 17 Feb 2005, Brew, CAJ (Chris) wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > > Since increasing the number of servers at RAL from 8 to 21
> > > > > > we seem to be seeing a new failure mode.
> > > > > >
> > > > > > All the processes seem to be running fine and you can read
> > > > > > a file by going directly to the server that holds it, but
> > > > > > the server does not seem to respond via the olbd network,
> > > > > > so if you try to access a file via the load balancer you
> > > > > > fail.
> > > > > >
> > > > > > Restarting the load balancer on the data server fixes the
> > > > > > problem.
> > > > > >
> > > > > > There is nothing unusual in the logs at either end, nor
> > > > > > anything missing, as far as I can tell.
> > > > > >
> > > > > > This is on data servers running RH73 and
> > > > > > xrootd-20040907-0403 or xrootd-20041214-1142.
> > > > >
> > > > > Has anyone else seen this? Is there a fix?
> > > > >
> > > > > Thanks,
> > > > > Chris.
> > > > >
> > >
> > >
> > >
> > > 
> > --------------------------------------------------------------
> > -----------
> > > Peter Elmer     E-mail: [log in to unmask]      Phone: +41 
> > (22) 767-4644
> > > Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 
> > 23, Switzerland
> > > 
> > --------------------------------------------------------------
> > -----------
> > >
> > 
> 
>