Hi,

OK, I've done some more debugging and, whilst I haven't caught the
problem on servers that have the debug option switched on, I have got
some more info.

It's pretty definitely to do with the second load balancer. I ran all
day today with one load balancer without running into the problems but
about 30 mins after I restarted the second load balancer I ran into it
again. Stopping the olbd and xrootd on the second LB restored access to
the files.

It's possible I've got this setup misconfigured, so the config file from
the LBs is included below.

I'll continue to try to get debug output from an LB and a dataserver.
The servers don't crash so I won't be able to get core dumps.

Yours,
Chris.

[xrootd107] /opt/xrootd/etc > cat xrootd.cf
# RAL XROOTD base config file 
# CAJB 051004

# Xrd Configuration

# xrootd configuration
xrootd.fslib /opt/xrootd/lib/libXrdOfs.so
xrootd.export /store

#ODC Configuration
odc.manager xrootd107.gridpp.rl.ac.uk 1095
odc.manager xrootd108.gridpp.rl.ac.uk 1095

# OFS Configuration
ofs.redirect remote xrootd107.rl.ac.uk
ofs.redirect remote xrootd108.rl.ac.uk
ofs.redirect target csfnfs35.rl.ac.uk
ofs.redirect target csfnfs41.rl.ac.uk
ofs.redirect target csfnfs45.rl.ac.uk
ofs.redirect target csfnfs46.rl.ac.uk
ofs.redirect target csfnfs47.rl.ac.uk
ofs.redirect target csfnfs48.rl.ac.uk
ofs.redirect target csfnfs49.rl.ac.uk

#OLB Configuration
olb.port 1095
olb.subscribe xrootd107.gridpp.rl.ac.uk 1095
olb.subscribe xrootd108.gridpp.rl.ac.uk 1095
olb.path r /store
olb.wait
olb.trace debug

odc.trace debug
ofs.trace debug

#OSS Configuration
oss.path /store r/o
# Manager config for xrootd redirectors
odc.trace redirect
olb.allow host csfnfs*.rl.ac.uk
olb.allow host csflnx108.rl.ac.uk
olb.allow host xrootd107.gridpp.rl.ac.uk
olb.allow host xrootd108.gridpp.rl.ac.uk
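
One quick way to look for a misconfiguration is to compare the host names given to each manager-related directive. What follows is a minimal sketch in Python, assuming only the directive names and host names that appear in the file above. `hosts_by_directive` and the `CONFIG` excerpt are illustrative names, not part of xrootd. The script only surfaces spelling differences; it cannot say whether they matter to the olbd or xrootd.

```python
from collections import defaultdict

# Excerpt of the manager-related lines from the xrootd.cf shown above.
CONFIG = """\
#ODC Configuration
odc.manager xrootd107.gridpp.rl.ac.uk 1095
odc.manager xrootd108.gridpp.rl.ac.uk 1095
ofs.redirect remote xrootd107.rl.ac.uk
ofs.redirect remote xrootd108.rl.ac.uk
olb.subscribe xrootd107.gridpp.rl.ac.uk 1095
olb.subscribe xrootd108.gridpp.rl.ac.uk 1095
"""


def hosts_by_directive(text):
    """Map each manager-related directive to the set of host names it lists."""
    found = defaultdict(set)
    for line in text.splitlines():
        # Drop comments, then split the remainder into whitespace-separated fields.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        if fields[0] in ("odc.manager", "olb.subscribe"):
            found[fields[0]].add(fields[1])
        elif fields[:2] == ["ofs.redirect", "remote"] and len(fields) >= 3:
            found["ofs.redirect remote"].add(fields[2])
    return dict(found)


hosts = hosts_by_directive(CONFIG)
for directive, names in sorted(hosts.items()):
    print(directive, sorted(names))
```

With the excerpt above, `odc.manager` and `olb.subscribe` list the `gridpp.rl.ac.uk` names while `ofs.redirect remote` lists the shorter `rl.ac.uk` names. That may be intentional, but it is the kind of difference worth checking.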

> -----Original Message-----
> From: Andrew Hanushevsky [mailto:[log in to unmask]] 
> Sent: 01 March 2005 19:49
> To: Brew, CAJ (Chris)
> Cc: Olaiya, EO (Emmanuel); [log in to unmask]
> Subject: RE: olbd tracing
> 
> Hi Chris,
> 
> Yes, especially if it thinks one of the two is not particularly responsive
> to its requests. The xrootd redirectors are programmed to be greedy.
> 
> Andy
> 
> On Tue, 1 Mar 2005, Brew, CAJ (Chris) wrote:
> 
> > Hi,
> >
> > The clients should only know about one of the load balancers but the
> > load balancers know about each other from the olbd "network" since they
> > share a config file and log into each other. So when the client asks the
> > xrootd on the LB server for a file, could it be asking the olbd on the
> > other LB server to find the file for it?
> >
> > That could explain why what looked to be an intermittent problem on the
> > LB server's olbd affected finding files via both LB servers in the same
> > way at the same time.
> >
> > Yours,
> > Chris.
> >
> > > -----Original Message-----
> > > From: Andrew Hanushevsky [mailto:[log in to unmask]]
> > > Sent: 01 March 2005 17:40
> > > To: Brew, CAJ (Chris)
> > > Cc: Olaiya, EO (Emmanuel); [log in to unmask]
> > > Subject: RE: olbd tracing
> > >
> > > Hi Chris,
> > >
> > > Yes and no. Depends on what client you use. The client code has gone
> > > through several changes. Some always ask only one redirector, others
> > > switch back and forth. But, in any case, you should have looked at both
> > > redirector logs. Presumably, the xrootd defaults are being used (i.e.,
> > > fail-over mode). Take a look at the xrootd logs on the redirector to see
> > > if anything strange is going on there.
> > >
> > > Andy
> > >
> > >
> > > On Tue, 1 Mar 2005, Brew, CAJ (Chris) wrote:
> > >
> > > > Hmmm...
> > > >
> > > > Actually there are some debug messages in the log file now:
> > > >
> > > > When I just ran another test I got:
> > > >
> > > > 050301 16:46:39 24895 do_Select Lookup delay xrootd107.gridpp.rl.ac.uk 5
> > > > 050301 16:46:39 24895 Receive From csfnfs49.rl.ac.uk:1094: 7@0 have r /store/test/csfnfs49.01.root
> > > > 050301 16:46:44 24895 Receive From xrootd107.gridpp.rl.ac.uk: 35 select r /store/test/csfnfs49.01.root
> > > > 050301 16:46:44 24895 do_Select Redirect xrootd107.gridpp.rl.ac.uk -> csfnfs49.rl.ac.uk:1094 for /store/test/csfnfs49.01.root
> > > >
> > > > Is it possible that because there was another load balancer in the
> > > > set-up it was asking that to find the files for it? That other machine
> > > > has now gone down to be reinstalled with SL3 and now we're getting more
> > > > logging info.
> > > >
> > > > Weird.
> > > >
> > > > Chris.
> > > >
> > > > > -----Original Message-----
> > > > > From: [log in to unmask]
> > > > > [mailto:[log in to unmask]] On Behalf Of
> > > > > Brew, CAJ (Chris)
> > > > > Sent: 01 March 2005 16:13
> > > > > To: Andrew Hanushevsky; Olaiya, EO (Emmanuel)
> > > > > Cc: [log in to unmask]
> > > > > Subject: RE: olbd tracing
> > > > >
> > > > > cc'd to Manny in case he doesn't catch it on the list.
> > > > >
> > > > > The machine I'm trying to turn the logging on on is
> > > > > xrootd107.gridpp.rl.ac.uk, our new master load balancer. We don't have
> > > > > root access on the box but can restart the daemons and modify the config
> > > > > file.
> > > > >
> > > > > I started the daemon with the -d option by running StopOLB and StartOLB
> > > > > -d rather than using the sudo /sbin/service olbd start|stop we normally
> > > > > do.
> > > > >
> > > > > We haven't had any problems since I reduced the number of xrootd servers
> > > > > in the cluster but I've now only got about 7TB free on the servers left
> > > > > in the cluster so will run out of room in about a week.
> > > > >
> > > > > Yours,
> > > > > Chris.
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Andrew Hanushevsky [mailto:[log in to unmask]]
> > > > > > Sent: 01 March 2005 15:53
> > > > > > To: Brew, CAJ (Chris)
> > > > > > Cc: Andrew Hanushevsky; [log in to unmask]
> > > > > > Subject: RE: olbd tracing
> > > > > >
> > > > > > Hi Chris,
> > > > > >
> > > > > > OK, something seems to be amiss with the overall configuration if even
> > > > > > this doesn't work. Let me get together with Manny and take a look at what
> > > > > > is actually running and how it is put together. Manny, when?
> > > > > >
> > > > > > Andy
> > > > > >
> > > > > > On Tue, 1 Mar 2005, Brew, CAJ (Chris) wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I've got:
> > > > > > >
> > > > > > > olb.trace debug
> > > > > > > odc.trace debug
> > > > > > > ofs.trace debug
> > > > > > >
> > > > > > > in my xrootd.cf file and started the olbd with -d on the LB server:
> > > > > > > [xrootd107] /opt/xrootd/etc > ps fwwwU bbdatsrv
> > > > > > >   PID TTY      STAT   TIME COMMAND
> > > > > > > 18801 ?        S      0:00 sshd: bbdatsrv@pts/0
> > > > > > > 18803 pts/0    S      0:00 -bash
> > > > > > > 25484 pts/0    R      0:00  \_ ps fwwwU bbdatsrv
> > > > > > > 24895 pts/0    S      0:00 /opt/xrootd/bin/olbd -d -m -l /opt/xrootd/logs/olbdlog -c /opt/xrootd//etc/xrootd.cf
> > > > > > > 23940 pts/0    S      0:00 /opt/xrootd/bin/xrootd -r -l /opt/xrootd/logs/xrdlog -c /opt/xrootd/etc/xrootd.cf
> > > > > > > 23975 pts/0    S      0:00 /opt/xrootd/bin/xrootd -r -l /opt/xrootd/logs/xrdlog -c /opt/xrootd/etc/xrootd.cf
> > > > > > >
> > > > > > > but am still not getting any debug info on how it's locating the files.
> > > > > > >
> > > > > > > The olb.trace debug on the data servers does get me:
> > > > > > >
> > > > > > > 050301 11:45:08 616 Receive From xrootd108.gridpp.rl.ac.uk: 7@0  state /store...
> > > > > > >
> > > > > > > when looking for a file.
> > > > > > >
> > > > > > > Anyone know what else I need on the LB server?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Chris.
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Andrew Hanushevsky [mailto:[log in to unmask]]
> > > > > > > > Sent: 23 February 2005 20:59
> > > > > > > > To: Brew, CAJ (Chris)
> > > > > > > > Cc: [log in to unmask]
> > > > > > > > Subject: Re: olbd tracing
> > > > > > > >
> > > > > > > > Hi Chris,
> > > > > > > >
> > > > > > > > That's starting the olbd with the -d option (for debugging).
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > From: "Brew, CAJ (Chris)" <[log in to unmask]>
> > > > > > > > To: "Andrew Hanushevsky" <[log in to unmask]>
> > > > > > > > Cc: <[log in to unmask]>
> > > > > > > > Sent: Wednesday, February 23, 2005 11:01 AM
> > > > > > > > Subject: RE: olbd tracing
> > > > > > > >
> > > > > > > >
> > > > > > > > > Hi Andy,
> > > > > > > > >
> > > > > > > > > I don't think the odc.trace redirect is the one I'm looking for. What's
> > > > > > > > > the directive that puts the "have ?" and "have" replies into the olbd
> > > > > > > > > log?
> > > > > > > > >
> > > > > > > > > Once I narrow it down to the manager not asking the server or the server
> > > > > > > > > not replying correctly I can turn debug on on the relevant machine. I'm
> > > > > > > > > reluctant to turn it on on all machines because it's a fair time before
> > > > > > > > > the problem manifests itself.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Chris.
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Andrew Hanushevsky [mailto:[log in to unmask]]
> > > > > > > > > > Sent: 22 February 2005 21:26
> > > > > > > > > > To: Brew, CAJ (Chris); [log in to unmask]
> > > > > > > > > > Subject: Re: olbd tracing
> > > > > > > > > >
> > > > > > > > > > Hi Chris,
> > > > > > > > > >
> > > > > > > > > > Try:
> > > > > > > > > >
> > > > > > > > > > odc.trace redirect
> > > > > > > > > >
> > > > > > > > > > for the olb try using the '-d' option; though you may get more information
> > > > > > > > > > than really needed.
> > > > > > > > > >
> > > > > > > > > > Andy
> > > > > > > > > >
> > > > > > > > > > ----- Original Message -----
> > > > > > > > > > From: "Brew, CAJ (Chris)" <[log in to unmask]>
> > > > > > > > > > To: <[log in to unmask]>
> > > > > > > > > > Sent: Tuesday, February 22, 2005 6:53 AM
> > > > > > > > > > Subject: olbd tracing
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > What's the trace argument to add to the xrootd.cf file to get it to
> > > > > > > > > > > output the queries to locate files to the logs?
> > > > > > > > > > >
> > > > > > > > > > > We're still having problems at RAL with files/servers not being
> > > > > > > > > > > available via the load balancers when they are if you contact them
> > > > > > > > > > > directly.
> > > > > > > > > > >
> > > > > > > > > > > Yours,
> > > > > > > > > > > Chris.
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > >   Chris Brew  ([log in to unmask])  +44 1235 446326
> > > > > > > > > > >   Particle Physics Department
> > > > > > > > > > >   Rutherford Appleton Laboratory
> > > > > > > > > > >   Chilton, Didcot. Oxfordshire.
> > > > > > > > > > >   OX11 0QX. United Kingdom.