Hi Chris, since you can get this to fail within 30 minutes, could you please
send me the full logs of all the machines involved after the failure
occurs? A gcore of each server (i.e., the failing olbd and the failing
xrootd) would be very much appreciated.
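
A rough sketch of how those cores might be collected on each machine, assuming gdb's gcore and pgrep are installed and the account is allowed to attach to the daemons. gcore snapshots a live process, so the servers do not need to crash. The helper names and the /tmp/cores output directory are illustrative assumptions, not part of the xrootd tooling:

```python
import subprocess
from pathlib import Path


def gcore_command(pid, name, outdir="/tmp/cores"):
    """Build the gcore invocation; gcore writes the core to '<prefix>.<pid>'."""
    return ["gcore", "-o", str(Path(outdir) / name), str(pid)]


def find_pids(name):
    """Return the PIDs of processes named exactly `name` (via pgrep -x)."""
    result = subprocess.run(["pgrep", "-x", name], capture_output=True, text=True)
    return [int(token) for token in result.stdout.split()]


def dump_cores(names=("olbd", "xrootd"), outdir="/tmp/cores"):
    """Dump a core for every running olbd and xrootd process."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    for name in names:
        for pid in find_pids(name):
            # check=False: keep going even if one process cannot be dumped
            subprocess.run(gcore_command(pid, name, outdir), check=False)
```

Calling dump_cores() right after the failure shows up, on both the redirector and the data server, would leave one core file per process in the output directory.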

Andy

On Sun, 6 Mar 2005, Brew, CAJ (Chris) wrote:

> Hi Andy,
>
> Slightly more info to put into the mix.
>
> We've now been running for four days with both redirectors running in
> debug mode without any problem. However, when I ran the olbd that seems to
> be the one doing the redirecting with debugging switched on it exhibited
> the problem in about 30 mins.
>
> I'll try running some more tests but I'm guessing that switching
> debugging on changes the way the code works enough so that the problem
> disappears. I guess that means it might just go away with the next
> production release but it will always be a possibility and I don't
> really want to have to run the redirectors in permanent debug mode.
>
> Yours,
> Chris.
>
> > -----Original Message-----
> > From: Andrew Hanushevsky [mailto:[log in to unmask]]
> > Sent: 02 March 2005 19:27
> > To: Brew, CAJ (Chris)
> > Cc: Olaiya, EO (Emmanuel); [log in to unmask]
> > Subject: RE: olbd tracing
> >
> > Hi Chris,
> >
> > I don't see anything wrong in the config file. I've looked at our setup
> > where we have 3 redirectors (though the client really knows only about
> > two, so it's moot). However, both redirectors are happy and are currently
> > using only one of the redirector olbds (we run in fail-over mode -- the
> > default). The system has been running like this for over 12 hours and is
> > being very heavily used (14k files/hour from just one server probably
> > means 60K/hour on the redirector). So, I guess logs covering the problem
> > would be great and, if possible, gcores of the catatonic servers when that
> > happens.
> >
> > BTW remind me if you are mixing versions (i.e., data servers using one
> > version, redirector using another one). There was a development release
> > we had that ran into problems when versions were mixed.
> >
> > Andy
> >
> > On Wed, 2 Mar 2005, Brew, CAJ (Chris) wrote:
> >
> > > Hi,
> > >
> > > OK I've done some more debugging and whilst I haven't caught the
> > > problem on servers that have got the debug option switched on I have got
> > > some more info.
> > >
> > > It's pretty definitely to do with the second load balancer. I ran all
> > > day today with one load balancer without running into the problems but
> > > about 30 mins after I restarted the second load balancer I ran into it
> > > again. Stopping the olbd and xrootd on the second LB restored access to
> > > the files.
> > >
> > > It's possible I've got this setup misconfigured so the config file from
> > > the LBs is attached below.
> > >
> > > I'll continue to try to get debug output from an LB and a dataserver.
> > > The servers don't crash so I won't be able to get core dumps.
> > >
> > > Yours,
> > > Chris.
> > >
> > > [xrootd107] /opt/xrootd/etc > cat xrootd.cf
> > > # RAL XROOTD base config file
> > > # CAJB 051004
> > >
> > > # Xrd Configuration
> > >
> > > # xrootd configuration
> > > xrootd.fslib /opt/xrootd/lib/libXrdOfs.so
> > > xrootd.export /store
> > >
> > > #ODC Configuration
> > > odc.manager xrootd107.gridpp.rl.ac.uk 1095
> > > odc.manager xrootd108.gridpp.rl.ac.uk 1095
> > >
> > > # OFS Configuration
> > > ofs.redirect remote xrootd107.rl.ac.uk
> > > ofs.redirect remote xrootd108.rl.ac.uk
> > > ofs.redirect target csfnfs35.rl.ac.uk
> > > ofs.redirect target csfnfs41.rl.ac.uk
> > > ofs.redirect target csfnfs45.rl.ac.uk
> > > ofs.redirect target csfnfs46.rl.ac.uk
> > > ofs.redirect target csfnfs47.rl.ac.uk
> > > ofs.redirect target csfnfs48.rl.ac.uk
> > > ofs.redirect target csfnfs49.rl.ac.uk
> > >
> > > #OLB Configuration
> > > olb.port 1095
> > > olb.subscribe xrootd107.gridpp.rl.ac.uk 1095
> > > olb.subscribe xrootd108.gridpp.rl.ac.uk 1095
> > > olb.path r /store
> > > olb.wait
> > > olb.trace debug
> > >
> > > odc.trace debug
> > > ofs.trace debug
> > >
> > > #OSS Configuration
> > > oss.path /store r/o
> > > # Manager config for xrootd redirectors
> > > odc.trace redirect
> > > olb.allow host csfnfs*.rl.ac.uk
> > > olb.allow host csflnx108.rl.ac.uk
> > > olb.allow host xrootd107.gridpp.rl.ac.uk
> > > olb.allow host xrootd108.gridpp.rl.ac.uk
> > >
> > > > -----Original Message-----
> > > > From: Andrew Hanushevsky [mailto:[log in to unmask]]
> > > > Sent: 01 March 2005 19:49
> > > > To: Brew, CAJ (Chris)
> > > > Cc: Olaiya, EO (Emmanuel); [log in to unmask]
> > > > Subject: RE: olbd tracing
> > > >
> > > > Hi Chris,
> > > >
> > > > Yes, especially if it thinks one of the two is not particularly responsive
> > > > to its requests. The xrootd redirectors are programmed to be greedy.
> > > >
> > > > Andy
> > > >
> > > > On Tue, 1 Mar 2005, Brew, CAJ (Chris) wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > The clients should only know about one of the load balancers but the
> > > > > load balancers know about each other from the olbd "network" since they
> > > > > share a config file and log into each other. So when the client asks the
> > > > > xrootd on the LB server for a file could it be asking the olbd on the
> > > > > other LB server to find the file for it?
> > > > >
> > > > > That could explain why what looked to be an intermittent problem on the
> > > > > LB server's olbd affected finding files via both LB servers in the same
> > > > > way at the same time.
> > > > >
> > > > > Yours,
> > > > > Chris.
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Andrew Hanushevsky [mailto:[log in to unmask]]
> > > > > > Sent: 01 March 2005 17:40
> > > > > > To: Brew, CAJ (Chris)
> > > > > > Cc: Olaiya, EO (Emmanuel); [log in to unmask]
> > > > > > Subject: RE: olbd tracing
> > > > > >
> > > > > > Hi Chris,
> > > > > >
> > > > > > Yes and no. Depends on what client you use. The client code has gone
> > > > > > through several changes. Some always ask only one redirector, others
> > > > > > switch back and forth. But, in any case, you should have looked at both
> > > > > > redirector logs. Presumably, the xrootd defaults are being used (i.e.,
> > > > > > fail-over mode). Take a look at the xrootd logs on the redirector to see
> > > > > > if anything strange is going on there.
> > > > > >
> > > > > > Andy
> > > > > >
> > > > > >
> > > > > > On Tue, 1 Mar 2005, Brew, CAJ (Chris) wrote:
> > > > > >
> > > > > > > Hmmm...
> > > > > > >
> > > > > > > Actually there are some debug messages in the log file now:
> > > > > > >
> > > > > > > When I just ran another test I got:
> > > > > > >
> > > > > > > 050301 16:46:39 24895 do_Select Lookup delay xrootd107.gridpp.rl.ac.uk 5
> > > > > > > 050301 16:46:39 24895 Receive From csfnfs49.rl.ac.uk:1094: 7@0 have r /store/test/csfnfs49.01.root
> > > > > > > 050301 16:46:44 24895 Receive From xrootd107.gridpp.rl.ac.uk: 35 select r /store/test/csfnfs49.01.root
> > > > > > > 050301 16:46:44 24895 do_Select Redirect xrootd107.gridpp.rl.ac.uk -> csfnfs49.rl.ac.uk:1094 for /store/test/csfnfs49.01.root
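
The select/redirect pairs in the excerpt above can be checked mechanically. A minimal sketch, assuming each olbd debug entry sits on a single line in the real log file (the mail wrapping splits them) and that the field layout matches the four lines quoted here. The function and pattern names are hypothetical, and the patterns are inferred from this excerpt only, not from any olbd documentation:

```python
import re
from datetime import datetime

# "yymmdd hh:mm:ss pid message", as seen in the quoted olbd log lines
LINE_RE = re.compile(r"^(\d{6} \d{2}:\d{2}:\d{2}) (\d+) (.*)$")
SELECT_RE = re.compile(r"^Receive From (\S+): \d+ select \S+ (\S+)$")
REDIRECT_RE = re.compile(r"^do_Select Redirect (\S+) -> (\S+) for (\S+)$")

SAMPLE = [
    "050301 16:46:39 24895 do_Select Lookup delay xrootd107.gridpp.rl.ac.uk 5",
    "050301 16:46:39 24895 Receive From csfnfs49.rl.ac.uk:1094: 7@0 have r /store/test/csfnfs49.01.root",
    "050301 16:46:44 24895 Receive From xrootd107.gridpp.rl.ac.uk: 35 select r /store/test/csfnfs49.01.root",
    "050301 16:46:44 24895 do_Select Redirect xrootd107.gridpp.rl.ac.uk -> csfnfs49.rl.ac.uk:1094 for /store/test/csfnfs49.01.root",
]


def parse_line(line):
    """Split a log line into (timestamp, pid, message); None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%y%m%d %H:%M:%S")
    return stamp, int(match.group(2)), match.group(3)


def unanswered_selects(lines):
    """Map each path that was selected but never redirected to its last select time."""
    pending = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, _pid, message = parsed
        select = SELECT_RE.match(message)
        if select:
            pending[select.group(2)] = stamp
            continue
        redirect = REDIRECT_RE.match(message)
        if redirect:
            pending.pop(redirect.group(3), None)
    return pending
```

Running unanswered_selects over the olbd log from a failing period would list the files the redirector asked for without ever being sent on to a data server, together with the time of the last request.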
> > > > > > >
> > > > > > > Is it possible that because there was another load balancer in the
> > > > > > > set-up it was asking that to find the files for it? That other machine
> > > > > > > has now gone down to be reinstalled with SL3 and now we're getting more
> > > > > > > logging info.
> > > > > > >
> > > > > > > Weird.
> > > > > > >
> > > > > > > Chris.
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: [log in to unmask] [mailto:[log in to unmask]] On Behalf Of Brew, CAJ (Chris)
> > > > > > > > Sent: 01 March 2005 16:13
> > > > > > > > To: Andrew Hanushevsky; Olaiya, EO (Emmanuel)
> > > > > > > > Cc: [log in to unmask]
> > > > > > > > Subject: RE: olbd tracing
> > > > > > > >
> > > > > > > > cc'd to Manny in case he doesn't catch it on the list.
> > > > > > > >
> > > > > > > > The machine I'm trying to turn the logging on on is
> > > > > > > > xrootd107.gridpp.rl.ac.uk, our new master load balancer. We don't have
> > > > > > > > root access on the box but can restart the daemons and modify the config
> > > > > > > > file.
> > > > > > > >
> > > > > > > > I started the daemon with the -d option by running StopOLB and StartOLB
> > > > > > > > -d rather than using the sudo /sbin/service olbd start|stop we normally
> > > > > > > > do.
> > > > > > > >
> > > > > > > > We haven't had any problems since I reduced the number of xrootd servers
> > > > > > > > in the cluster but I've now only got about 7TB free on the servers left
> > > > > > > > in the cluster so will run out of room in about a week.
> > > > > > > >
> > > > > > > > Yours,
> > > > > > > > Chris.
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Andrew Hanushevsky [mailto:[log in to unmask]]
> > > > > > > > > Sent: 01 March 2005 15:53
> > > > > > > > > To: Brew, CAJ (Chris)
> > > > > > > > > Cc: Andrew Hanushevsky; [log in to unmask]
> > > > > > > > > Subject: RE: olbd tracing
> > > > > > > > >
> > > > > > > > > Hi Chris,
> > > > > > > > >
> > > > > > > > > OK, something seems to be amiss with the overall configuration if even
> > > > > > > > > this doesn't work. Let me get together with Manny and take a look at what
> > > > > > > > > is actually running and how it is put together. Manny when?
> > > > > > > > >
> > > > > > > > > Andy
> > > > > > > > >
> > > > > > > > > On Tue, 1 Mar 2005, Brew, CAJ (Chris) wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I've got:
> > > > > > > > > >
> > > > > > > > > > olb.trace debug
> > > > > > > > > > odc.trace debug
> > > > > > > > > > ofs.trace debug
> > > > > > > > > >
> > > > > > > > > > in my xrootd.cf file and started the olbd with -d on the LB server
> > > > > > > > > > [xrootd107] /opt/xrootd/etc > ps fwwwU bbdatsrv
> > > > > > > > > >   PID TTY      STAT   TIME COMMAND
> > > > > > > > > > 18801 ?        S      0:00 sshd: bbdatsrv@pts/0
> > > > > > > > > > 18803 pts/0    S      0:00 -bash
> > > > > > > > > > 25484 pts/0    R      0:00  \_ ps fwwwU bbdatsrv
> > > > > > > > > > 24895 pts/0    S      0:00 /opt/xrootd/bin/olbd -d -m -l /opt/xrootd/logs/olbdlog -c /opt/xrootd//etc/xrootd.cf
> > > > > > > > > > 23940 pts/0    S      0:00 /opt/xrootd/bin/xrootd -r -l /opt/xrootd/logs/xrdlog -c /opt/xrootd/etc/xrootd.cf
> > > > > > > > > > 23975 pts/0    S      0:00 /opt/xrootd/bin/xrootd -r -l /opt/xrootd/logs/xrdlog -c /opt/xrootd/etc/xrootd.cf
> > > > > > > > > >
> > > > > > > > > > but am still not getting any debug info on how it's locating the files:
> > > > > > > > > >
> > > > > > > > > > the olb.trace debug on the Data Servers does get me:
> > > > > > > > > >
> > > > > > > > > > 050301 11:45:08 616 Receive From xrootd108.gridpp.rl.ac.uk: 7@0  state /store...
> > > > > > > > > >
> > > > > > > > > > when looking for a file.
> > > > > > > > > >
> > > > > > > > > > Anyone know what else I need on the LB server?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Chris.
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Andrew Hanushevsky [mailto:[log in to unmask]]
> > > > > > > > > > > Sent: 23 February 2005 20:59
> > > > > > > > > > > To: Brew, CAJ (Chris)
> > > > > > > > > > > Cc: [log in to unmask]
> > > > > > > > > > > Subject: Re: olbd tracing
> > > > > > > > > > >
> > > > > > > > > > > Hi Chris,
> > > > > > > > > > >
> > > > > > > > > > > That's starting the olbd with the -d option (for debugging).
> > > > > > > > > > >
> > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > From: "Brew, CAJ (Chris)" <[log in to unmask]>
> > > > > > > > > > > To: "Andrew Hanushevsky" <[log in to unmask]>
> > > > > > > > > > > Cc: <[log in to unmask]>
> > > > > > > > > > > Sent: Wednesday, February 23, 2005 11:01 AM
> > > > > > > > > > > Subject: RE: olbd tracing
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Hi Andy,
> > > > > > > > > > > >
> > > > > > > > > > > > I don't think the odc.trace redirect is the one I'm looking for. What's
> > > > > > > > > > > > the directive that puts the "have ?" and "have" replies into the olbd
> > > > > > > > > > > > log?
> > > > > > > > > > > >
> > > > > > > > > > > > Once I narrow it down to the manager not asking the server or the server
> > > > > > > > > > > > not replying correctly I can turn debug on on the relevant machine. I'm
> > > > > > > > > > > > reluctant to turn it on on all machines because it's a fair time before
> > > > > > > > > > > > the problem manifests itself.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Chris.
> > > > > > > > > > > >
> > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > From: Andrew Hanushevsky [mailto:[log in to unmask]]
> > > > > > > > > > > > > Sent: 22 February 2005 21:26
> > > > > > > > > > > > > To: Brew, CAJ (Chris); [log in to unmask]
> > > > > > > > > > > > > Subject: Re: olbd tracing
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Chris,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Try:
> > > > > > > > > > > > >
> > > > > > > > > > > > > odc.trace redirect
> > > > > > > > > > > > >
> > > > > > > > > > > > > for the olb try using the '-d' option; though you may get more information
> > > > > > > > > > > > > than really needed.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Andy
> > > > > > > > > > > > >
> > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > From: "Brew, CAJ (Chris)" <[log in to unmask]>
> > > > > > > > > > > > > To: <[log in to unmask]>
> > > > > > > > > > > > > Sent: Tuesday, February 22, 2005 6:53 AM
> > > > > > > > > > > > > Subject: olbd tracing
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > What's the trace argument to add to the xrootd.cf file to get it to
> > > > > > > > > > > > > > output the queries to locate files to the logs?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We're still having problems at RAL with files/servers not being
> > > > > > > > > > > > > > available via the load balancers when they are if you contact them
> > > > > > > > > > > > > > directly.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yours,
> > > > > > > > > > > > > > Chris.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > >   Chris Brew  ([log in to unmask])  +44 1235 446326
> > > > > > > > > > > > > >   Particle Physics Department
> > > > > > > > > > > > > >   Rutherford Appleton Laboratory
> > > > > > > > > > > > > >   Chilton, Didcot. Oxfordshire.
> > > > > > > > > > > > > >   OX11 0QX. United Kingdom.