Hi,
Any suggestions for extra logging to switch on on the servers before I
head off?
I just had to restart the olbds on 6 out of 20 servers but I'm not going
to be around over the weekend to restart them.
Yours,
Chris.
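
For reference, a minimal sketch of the kind of hourly probe job described in the quoted message below. It is an illustration only, not the actual script: it assumes the `xrdcp` client is on the PATH and that copying to /dev/null is an acceptable read test. The function names, host names and file paths are hypothetical placeholders.

```python
import subprocess


def probe_url(redirector, path):
    """Build a root:// URL for an absolute path served via the redirector."""
    # An absolute path gives the usual double slash: root://host//dir/file
    return f"root://{redirector}/{path}"


def probe_all(redirector, test_files, run=subprocess.run, timeout=60):
    """Try to read one known file per data server through the redirector.

    test_files maps a data server name to a file assumed to live only on
    that server (hypothetical data). Returns the sorted list of servers
    whose file could not be read.
    """
    failed = []
    for server, path in sorted(test_files.items()):
        # Assumption: plain "xrdcp <url> /dev/null" is enough as a read test.
        cmd = ["xrdcp", probe_url(redirector, path), "/dev/null"]
        try:
            result = run(cmd, capture_output=True, timeout=timeout)
            ok = result.returncode == 0
        except (subprocess.TimeoutExpired, OSError):
            # A hang or a missing client counts as a failed probe.
            ok = False
        if not ok:
            failed.append(server)
    return failed
```

Run from cron once an hour against each redirector, with the returned list written to a log together with a timestamp, this would show when the set of non-responsive servers changes.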
> -----Original Message-----
> From: [log in to unmask]
> [mailto:[log in to unmask]] On Behalf Of
> Brew, CAJ (Chris)
> Sent: 18 February 2005 11:27
> To: Andrew Hanushevsky; Peter Elmer
> Cc: [log in to unmask]
> Subject: RE: olbd problems at RAL
>
> Hi,
>
> The logs for the last couple of days are in
> /afs/slac.stanford.edu/u/br/brew/olb_problem. There isn't much info
> there. If you suggest the extra traces to add I'll put them in the
> config files.
>
> The rdr xrootd and olbd haven't been restarted for a while so their
> startup isn't in the period covered by the logs.
>
> The -t option on the xrootd is identical to the "ofs.redirect target
> hostname" in the xrootd.cf file, correct? We seem to be using that
> rather than -t on the command line. I'll add it now and see if it has
> any effect.
>
> Before I left last night I set up a job that would try to access a
> file on each server once an hour (the job failed in the early hours
> because my desktop got rebooted to install security patches), but you
> can see in the rdr xrd log that between 18:08 and 19:10 some servers
> stopped answering requests for the files. Strangely, between my test
> job stopping overnight and my test this morning the list of
> non-responsive servers changed.
>
> We do have a slightly mixed system at the moment and it's in flux.
>
> All but one test server are running the xrootd-20040907-0403 version,
> one test server (csfnfs45) is running xrootd-20041214-1142 but the
> problem appears on both types.
>
> We've just added the extra SL3 redirector xrootd107 in addition to our
> old RH73 one csflnx108 but the problem happens via both redirectors.
>
> Most of the servers have an entry for an extra redirector xrootd108 in
> their config files (but not all the servers experiencing the problem)
> which is what csflnx108 will become when it's reinstalled with SL3 but
> the DNS name does not yet exist so there are some complaints in the
> olbd logs about that.
>
> Yours,
> Chris.
>
> > -----Original Message-----
> > From: Andrew Hanushevsky [mailto:[log in to unmask]]
> > Sent: 17 February 2005 15:22
> > To: Peter Elmer
> > Cc: Brew, CAJ (Chris); [log in to unmask]
> > Subject: Re: olbd problems at RAL
> >
> > Hi Pete,
> >
> > Sometimes minor things leak in that have major impacts. Usually I go
> > by what is really running successfully elsewhere to determine the
> > probability of success. However, you do bring up a good point. Chris,
> > please make sure that you are using -w on the olbd consistently with
> > -t on the xrootd data server. If you specify -w but not -t then you
> > will see exactly what you described. Also, the logs from start-up
> > time to hang time would be helpful (i.e., redirector: xrootd and
> > olbd, and data server: xrootd/olbd). Please clearly identify which is
> > which. Thanks.
> >
> > Andy
> >
> > On Thu, 17 Feb 2005, Peter Elmer wrote:
> >
> > > Hi Andy,
> > >
> > > From:
> > >
> > > http://xrootd.slac.stanford.edu/xrootd.History
> > >
> > > the only differences between version 20040907-0403 (the one we
> > > currently label "production") and 20040830-0105 are small changes to
> > > the ./configure and makefiles, but nothing of substance that would
> > > lead to problems with the olbd. I suspect that there is something
> > > else going on (e.g. the famous wait/-w problems?).
> > >
> > > Pete
> > >
> > > On Thu, Feb 17, 2005 at 07:12:08AM -0800, Andrew Hanushevsky wrote:
> > > > Hi Chris,
> > > >
> > > > Those two particular releases seem to have had some problems. I
> > > > assume you are not mixing releases here (i.e., running either on
> > > > all servers causes you to see the problem).
> > > >
> > > > I do know that 20040830 is a stable release. We run that
> > > > everywhere at SLAC for analysis. I'd suggest going with that one
> > > > until we test out the latest release that should have fixed some
> > > > other problem relating to writing files.
> > > >
> > > > Andy
> > > >
> > > > On Thu, 17 Feb 2005, Brew, CAJ (Chris) wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > > Since increasing the number of servers at RAL from 8 to 21 we
> > > > > > seem to be seeing a new failure mode.
> > > > > >
> > > > > > All the processes seem to be running fine and you can read a
> > > > > > file by going directly to the server that holds it, but the
> > > > > > server does not seem to respond via the olbd network, so if
> > > > > > you try to access a file via the load balancer you fail.
> > > > > >
> > > > > > Restarting the load balancer on the data server fixes the
> > > > > > problem.
> > > > > >
> > > > > > There is nothing unusual in the logs at either end, nor
> > > > > > anything missing, as far as I can tell.
> > > > > >
> > > > > > This is on data servers running RH73 and xrootd-20040907-0403
> > > > > > or xrootd-20041214-1142.
> > > > >
> > > > > Has anyone else seen this? Is there a fix?
> > > > >
> > > > > Thanks,
> > > > > Chris.
> > > > >
> > >
> > > -------------------------------------------------------------------------
> > > Peter Elmer E-mail: [log in to unmask] Phone: +41 (22) 767-4644
> > > Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 23, Switzerland
> > > -------------------------------------------------------------------------
> > >
> >
>
>