Hi,

The logs for the last couple of days are in
/afs/slac.stanford.edu/u/br/brew/olb_problem. There isn't much info
there. If you suggest the extra traces to add I'll put them in the
config files.

The rdr xrootd and olbd haven't been restarted for a while, so their
startup isn't in the period covered by the logs.

The -t option on the xrootd is identical to the "ofs.redirect target
hostname" in the xrootd.cf file, correct? We seem to be using that
rather than -t on the command line. I'll add it now and see if it has
any effect.

Before I left last night I set up a job that would try to access a
file on each server once an hour. The job failed in the early hours
because my desktop got rebooted to install security patches, but you
can see in the rdr xrd log that between 18:08 and 19:10 some servers
stopped answering requests for the files. Strangely, between my test job
stopping overnight and my test this morning, the list of non-responsive
servers changed.
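
A check of this kind could be sketched roughly as below. This is only an
illustration, not the actual job: the host names, the file paths, the
helper names and the use of xrdcp with /dev/null as the destination are
all assumptions. It requests one file known to live on each data server
through a redirector and reports which servers did not answer, and it
compares two runs to show how the failing set changed.

```python
import subprocess
from typing import Callable, Dict, Set


def probe(host: str, path: str, timeout: float = 60.0) -> bool:
    """Try to copy one file via `host`; return True if the copy succeeded.

    Assumption: an `xrdcp` client is on PATH and accepts /dev/null as the
    destination. Any failure to run it, or a timeout, counts as "no answer".
    """
    url = f"root://{host}/{path}"
    try:
        result = subprocess.run(
            ["xrdcp", url, "/dev/null"],
            capture_output=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def check_servers(
    redirector: str,
    files_by_server: Dict[str, str],
    fetch: Callable[[str, str], bool] = probe,
) -> Set[str]:
    """Return the data servers whose file could not be read via the redirector.

    `files_by_server` maps a data server name to a file that lives only on
    that server (hypothetical layout), so a failed read points at that server.
    """
    return {
        server
        for server, path in files_by_server.items()
        if not fetch(redirector, path)
    }


def changed(before: Set[str], after: Set[str]) -> Dict[str, Set[str]]:
    """Compare the failing sets from two runs."""
    return {"recovered": before - after, "newly_failing": after - before}
```

Calling check_servers() once an hour (from cron, say) and logging the
result with a timestamp would give a record that can be lined up against
the redirector and data server logs.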

We do have a slightly mixed system at the moment and it's in flux.

All but one test server are running the xrootd-20040907-0403 version,
one test server (csfnfs45) is running xrootd-20041214-1142 but the
problem appears on both types.

We've just added the extra SL3 redirector xrootd107 in addition to our
old RH73 one csflnx108 but the problem happens via both redirectors.

Most of the servers (but not all of those experiencing the problem)
have an entry in their config files for an extra redirector, xrootd108,
which is what csflnx108 will become when it's reinstalled with SL3. The
DNS name does not exist yet, so there are some complaints about that in
the olb logs.

Yours,
Chris.

> -----Original Message-----
> From: Andrew Hanushevsky [mailto:[log in to unmask]] 
> Sent: 17 February 2005 15:22
> To: Peter Elmer
> Cc: Brew, CAJ (Chris); [log in to unmask]
> Subject: Re: olbd problems at RAL
> 
> Hi Pete,
> 
> Sometimes minor things leak in that have major impacts. Usually I go by
> what is really running successfully elsewhere to determine the probability
> of success. However, you do bring up a good point. Chris, please make sure
> that you are using -w on the olbd consistently with -t on the xrootd data
> server. If you specify -w but not -t then you will see exactly what you
> described. Also, the logs from start-up time to hang time would be
> helpful (i.e., redirector: xrootd and olbd, and data server xrootd/olbd).
> Please clearly identify which is which. Thanks.
> 
> Andy
> 
> On Thu, 17 Feb 2005, Peter Elmer wrote:
> 
> >   Hi Andy,
> >
> >   From:
> >
> >   http://xrootd.slac.stanford.edu/xrootd.History
> >
> > the only differences between version 20040907-0403 (the one we currently
> > label "production") and 20040830-0105 are small changes to the ./configure
> > and makefiles, but nothing of substance that would lead to problems with
> > the olbd. I suspect that there is something else going on. (e.g. the famous
> > wait/-w problems?)
> >
> >                                    Pete
> >
> > On Thu, Feb 17, 2005 at 07:12:08AM -0800, Andrew Hanushevsky wrote:
> > > Hi Chris,
> > >
> > > Those two particular releases seem to have had some problems. I assume
> > > you are not mixing releases here (i.e., running either on all servers
> > > causes you to see the problem).
> > >
> > > I do know that 20040830 is a stable release. We run that everywhere at
> > > SLAC for analysis. I'd suggest going with that one until we test out
> > > the latest release that should have fixed some other problem relating
> > > to writing files.
> > >
> > > Andy
> > >
> > > On Thu, 17 Feb 2005, Brew, CAJ (Chris) wrote:
> > >
> > > > Hi,
> > > >
> > > > Since increasing the number of servers at RAL from 8 to 21 we seem to be
> > > > seeing a new failure mode.
> > > >
> > > > All the processes seem to be running fine and you can read a file by
> > > > going directly to the server that holds it, but the server does not seem
> > > > to respond via the olbd network, so if you try to access a file via the
> > > > load balancer you fail.
> > > >
> > > > Restarting the load balancer on the data server fixes the problem.
> > > >
> > > > There is nothing unusual in the logs at either end, nor anything
> > > > missing, as far as I can tell.
> > > >
> > > > This is on data servers running RH73 and xrootd-20040907-0403 or
> > > > xrootd-20041214-1142.
> > > >
> > > > Has anyone else seen this? Is there a fix?
> > > >
> > > > Thanks,
> > > > Chris.
> > > >
> >
> >
> >
> > -------------------------------------------------------------------------
> > Peter Elmer     E-mail: [log in to unmask]      Phone: +41 (22) 767-4644
> > Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 23, Switzerland
> > -------------------------------------------------------------------------
> >
>