Hi Chris,

Since you can get this to fail within 30 minutes, could you please send
me the full logs of all the machines involved after the failure occurs?
A gcore of each server (i.e., the failing olbd and the failing xrootd)
would be very much appreciated.

Andy

On Sun, 6 Mar 2005, Brew, CAJ (Chris) wrote:

> Hi Andy,
>
> Slightly more info to put into the mix.
>
> We've now been running for four days with both redirectors running in
> debug mode without any problem; however, when I ran the olbd that seems
> to be the one doing the redirecting with debugging switched on it
> exhibited the problem in about 30 mins.
>
> I'll try running some more tests but I'm guessing that switching
> debugging on changes the way the code works enough so that the problem
> disappears. I guess that means it might just go away with the next
> production release but it will always be a possibility and I don't
> really want to have to run the redirectors in permanent debug mode.
>
> Yours,
> Chris.
>
> > -----Original Message-----
> > From: Andrew Hanushevsky [mailto:[log in to unmask]]
> > Sent: 02 March 2005 19:27
> > To: Brew, CAJ (Chris)
> > Cc: Olaiya, EO (Emmanuel); [log in to unmask]
> > Subject: RE: olbd tracing
> >
> > Hi Chris,
> >
> > I don't see anything wrong in the config file. I've looked at our
> > setup where we have 3 redirectors (though the client really knows
> > only about two, so it's moot). However, both redirectors are happy
> > and are currently using only one of the redirector olbds (we run in
> > fail-over mode -- the default). The system has been running like
> > this for over 12 hours and is being very heavily used (14k files/hour
> > from just one server probably means 60K/hour on the redirector). So,
> > I guess logs covering the problem would be great and, if possible,
> > gcores of the catatonic servers when that happens.
> >
> > BTW remind me if you are mixing versions (i.e., data servers using
> > one version and the redirector using another one). There was a
> > development release we had that ran into problems when versions were
> > mixed.
> >
> > Andy
> >
> > On Wed, 2 Mar 2005, Brew, CAJ (Chris) wrote:
> >
> > > Hi,
> > >
> > > OK, I've done some more debugging and whilst I haven't caught the
> > > problem on servers that have got the debug option switched on I
> > > have got some more info.
> > >
> > > It's pretty definitely to do with the second load balancer. I ran
> > > all day today with one load balancer without running into the
> > > problems but about 30 mins after I restarted the second load
> > > balancer I ran into it again. Stopping the olbd and xrootd on the
> > > second LB restored access to the files.
> > >
> > > It's possible I've got this setup misconfigured so the config file
> > > from the LBs is attached below.
> > >
> > > I'll continue to try to get debug output from an LB and a
> > > dataserver. The servers don't crash so I won't be able to get core
> > > dumps.
> > >
> > > Yours,
> > > Chris.
> > >
> > > [xrootd107] /opt/xrootd/etc > cat xrootd.cf
> > > # RAL XROOTD base config file
> > > # CAJB 051004
> > >
> > > # Xrd Configuration
> > >
> > > # xrootd configuration
> > > xrootd.fslib /opt/xrootd/lib/libXrdOfs.so
> > > xrootd.export /store
> > >
> > > #ODC Configuration
> > > odc.manager xrootd107.gridpp.rl.ac.uk 1095
> > > odc.manager xrootd108.gridpp.rl.ac.uk 1095
> > >
> > > # OFS Configuration
> > > ofs.redirect remote xrootd107.rl.ac.uk
> > > ofs.redirect remote xrootd108.rl.ac.uk
> > > ofs.redirect target csfnfs35.rl.ac.uk
> > > ofs.redirect target csfnfs41.rl.ac.uk
> > > ofs.redirect target csfnfs45.rl.ac.uk
> > > ofs.redirect target csfnfs46.rl.ac.uk
> > > ofs.redirect target csfnfs47.rl.ac.uk
> > > ofs.redirect target csfnfs48.rl.ac.uk
> > > ofs.redirect target csfnfs49.rl.ac.uk
> > >
> > > #OLB Configuration
> > > olb.port 1095
> > > olb.subscribe xrootd107.gridpp.rl.ac.uk 1095
> > > olb.subscribe xrootd108.gridpp.rl.ac.uk 1095
> > > olb.path r /store
> > > olb.wait
> > > olb.trace debug
> > >
> > > odc.trace debug
> > > ofs.trace debug
> > >
> > > #OSS Configuration
> > > oss.path /store r/o
> > > # Manager config for xrootd redirectors
> > > odc.trace redirect
> > > olb.allow host csfnfs*.rl.ac.uk
> > > olb.allow host csflnx108.rl.ac.uk
> > > olb.allow host xrootd107.gridpp.rl.ac.uk
> > > olb.allow host xrootd108.gridpp.rl.ac.uk
> > >
> > > > -----Original Message-----
> > > > From: Andrew Hanushevsky [mailto:[log in to unmask]]
> > > > Sent: 01 March 2005 19:49
> > > > To: Brew, CAJ (Chris)
> > > > Cc: Olaiya, EO (Emmanuel); [log in to unmask]
> > > > Subject: RE: olbd tracing
> > > >
> > > > Hi Chris,
> > > >
> > > > Yes, especially if it thinks one of the two is not particularly
> > > > responsive to its requests. The xrootd redirectors are programmed
> > > > to be greedy.
> > > >
> > > > Andy
> > > >
> > > > On Tue, 1 Mar 2005, Brew, CAJ (Chris) wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > The clients should only know about one of the load balancers
> > > > > but the load balancers know about each other from the olbd
> > > > > "network" since they share a config file and log into each
> > > > > other. So when the client asks the xrootd on the LB server for
> > > > > a file could it be asking the olbd on the other LB server to
> > > > > find the file for it?
> > > > >
> > > > > That could explain why what looked to be an intermittent
> > > > > problem on the LB server's olbd affected finding files via both
> > > > > LB servers in the same way at the same time.
> > > > >
> > > > > Yours,
> > > > > Chris.
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Andrew Hanushevsky [mailto:[log in to unmask]]
> > > > > > Sent: 01 March 2005 17:40
> > > > > > To: Brew, CAJ (Chris)
> > > > > > Cc: Olaiya, EO (Emmanuel); [log in to unmask]
> > > > > > Subject: RE: olbd tracing
> > > > > >
> > > > > > Hi Chris,
> > > > > >
> > > > > > Yes and no. Depends on what client you use. The client code
> > > > > > has gone through several changes. Some always ask only one
> > > > > > redirector, others switch back and forth. But, in any case,
> > > > > > you should have looked at both redirector logs. Presumably,
> > > > > > the xrootd defaults are being used (i.e., fail-over mode).
> > > > > > Take a look at the xrootd logs on the redirector to see if
> > > > > > anything strange is going on there.
> > > > > >
> > > > > > Andy
> > > > > >
> > > > > >
> > > > > > On Tue, 1 Mar 2005, Brew, CAJ (Chris) wrote:
> > > > > >
> > > > > > > Hmmm...
> > > > > > >
> > > > > > > Actually there are some debug messages in the log file now:
> > > > > > >
> > > > > > > When I just ran another test I got:
> > > > > > >
> > > > > > > 050301 16:46:39 24895 do_Select Lookup delay xrootd107.gridpp.rl.ac.uk 5
> > > > > > > 050301 16:46:39 24895 Receive From csfnfs49.rl.ac.uk:1094: 7@0 have r /store/test/csfnfs49.01.root
> > > > > > > 050301 16:46:44 24895 Receive From xrootd107.gridpp.rl.ac.uk: 35 select r /store/test/csfnfs49.01.root
> > > > > > > 050301 16:46:44 24895 do_Select Redirect xrootd107.gridpp.rl.ac.uk -> csfnfs49.rl.ac.uk:1094 for /store/test/csfnfs49.01.root
> > > > > > >
> > > > > > > Is it possible that because there was another load balancer
> > > > > > > in the set-up it was asking that to find the files for it?
> > > > > > > That other machine has now gone down to be reinstalled with
> > > > > > > SL3 and now we're getting more logging info.
> > > > > > >
> > > > > > > Weird.
> > > > > > >
> > > > > > > Chris.
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: [log in to unmask] [mailto:[log in to unmask]] On Behalf Of Brew, CAJ (Chris)
> > > > > > > > Sent: 01 March 2005 16:13
> > > > > > > > To: Andrew Hanushevsky; Olaiya, EO (Emmanuel)
> > > > > > > > Cc: [log in to unmask]
> > > > > > > > Subject: RE: olbd tracing
> > > > > > > >
> > > > > > > > cc'd to Manny in case he doesn't catch it on the list.
> > > > > > > >
> > > > > > > > The machine I'm trying to turn the logging on on is
> > > > > > > > xrootd107.gridpp.rl.ac.uk, our new master load balancer.
> > > > > > > > We don't have root access on the box but can restart the
> > > > > > > > daemons and modify the config file.
> > > > > > > >
> > > > > > > > I started the daemon with the -d option by running
> > > > > > > > StopOLB and StartOLB -d rather than using the
> > > > > > > > sudo /sbin/service olbd start|stop we normally do.
> > > > > > > >
> > > > > > > > We haven't had any problems since I reduced the number of
> > > > > > > > xrootd servers in the cluster but I've now only got about
> > > > > > > > 7TB free on the servers left in the cluster so will run
> > > > > > > > out of room in about a week.
> > > > > > > >
> > > > > > > > Yours,
> > > > > > > > Chris.
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Andrew Hanushevsky [mailto:[log in to unmask]]
> > > > > > > > > Sent: 01 March 2005 15:53
> > > > > > > > > To: Brew, CAJ (Chris)
> > > > > > > > > Cc: Andrew Hanushevsky; [log in to unmask]
> > > > > > > > > Subject: RE: olbd tracing
> > > > > > > > >
> > > > > > > > > Hi Chris,
> > > > > > > > >
> > > > > > > > > OK, something seems to be amiss with the overall
> > > > > > > > > configuration if even this doesn't work. Let me get
> > > > > > > > > together with Manny and take a look at what is actually
> > > > > > > > > running and how it is put together. Manny when?
> > > > > > > > >
> > > > > > > > > Andy
> > > > > > > > >
> > > > > > > > > On Tue, 1 Mar 2005, Brew, CAJ (Chris) wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I've got:
> > > > > > > > > >
> > > > > > > > > > olb.trace debug
> > > > > > > > > > odc.trace debug
> > > > > > > > > > ofs.trace debug
> > > > > > > > > >
> > > > > > > > > > in my xrootd.cf file and started the olbd with -d on
> > > > > > > > > > the LB server
> > > > > > > > > > [xrootd107] /opt/xrootd/etc > ps fwwwU bbdatsrv
> > > > > > > > > >   PID TTY      STAT   TIME COMMAND
> > > > > > > > > > 18801 ?        S      0:00 sshd: bbdatsrv@pts/0
> > > > > > > > > > 18803 pts/0    S      0:00 -bash
> > > > > > > > > > 25484 pts/0    R      0:00  \_ ps fwwwU bbdatsrv
> > > > > > > > > > 24895 pts/0    S      0:00 /opt/xrootd/bin/olbd -d -m -l /opt/xrootd/logs/olbdlog -c /opt/xrootd//etc/xrootd.cf
> > > > > > > > > > 23940 pts/0    S      0:00 /opt/xrootd/bin/xrootd -r -l /opt/xrootd/logs/xrdlog -c /opt/xrootd/etc/xrootd.cf
> > > > > > > > > > 23975 pts/0    S      0:00 /opt/xrootd/bin/xrootd -r -l /opt/xrootd/logs/xrdlog -c /opt/xrootd/etc/xrootd.cf
> > > > > > > > > >
> > > > > > > > > > but am still not getting any debug info on how it's
> > > > > > > > > > locating the files:
> > > > > > > > > >
> > > > > > > > > > the olb.trace debug on the Data Servers does get me:
> > > > > > > > > >
> > > > > > > > > > 050301 11:45:08 616 Receive From xrootd108.gridpp.rl.ac.uk: 7@0 state /store...
> > > > > > > > > >
> > > > > > > > > > when looking for a file.
> > > > > > > > > >
> > > > > > > > > > Anyone know what else I need on the LB server?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Chris.
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Andrew Hanushevsky [mailto:[log in to unmask]]
> > > > > > > > > > > Sent: 23 February 2005 20:59
> > > > > > > > > > > To: Brew, CAJ (Chris)
> > > > > > > > > > > Cc: [log in to unmask]
> > > > > > > > > > > Subject: Re: olbd tracing
> > > > > > > > > > >
> > > > > > > > > > > Hi Chris,
> > > > > > > > > > >
> > > > > > > > > > > That's starting the olbd with the -d option (for
> > > > > > > > > > > debugging).
> > > > > > > > > > >
> > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > From: "Brew, CAJ (Chris)" <[log in to unmask]>
> > > > > > > > > > > To: "Andrew Hanushevsky" <[log in to unmask]>
> > > > > > > > > > > Cc: <[log in to unmask]>
> > > > > > > > > > > Sent: Wednesday, February 23, 2005 11:01 AM
> > > > > > > > > > > Subject: RE: olbd tracing
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Hi Andy,
> > > > > > > > > > > >
> > > > > > > > > > > > I don't think the odc.trace redirect is the one
> > > > > > > > > > > > I'm looking for. What's the directive that puts
> > > > > > > > > > > > the "have ?" and "have" replies into the olbd
> > > > > > > > > > > > log?
> > > > > > > > > > > >
> > > > > > > > > > > > Once I narrow it down to the manager not asking
> > > > > > > > > > > > the server or the server not replying correctly I
> > > > > > > > > > > > can turn debug on on the relevant machine. I'm
> > > > > > > > > > > > reluctant to turn it on on all machines because
> > > > > > > > > > > > it's a fair time before the problem manifests
> > > > > > > > > > > > itself.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Chris.
> > > > > > > > > > > >
> > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > From: Andrew Hanushevsky [mailto:[log in to unmask]]
> > > > > > > > > > > > > Sent: 22 February 2005 21:26
> > > > > > > > > > > > > To: Brew, CAJ (Chris); [log in to unmask]
> > > > > > > > > > > > > Subject: Re: olbd tracing
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Chris,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Try:
> > > > > > > > > > > > >
> > > > > > > > > > > > > odc.trace redirect
> > > > > > > > > > > > >
> > > > > > > > > > > > > for the olb try using the '-d' option; though
> > > > > > > > > > > > > you may get more information than really
> > > > > > > > > > > > > needed.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Andy
> > > > > > > > > > > > >
> > > > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > > > From: "Brew, CAJ (Chris)" <[log in to unmask]>
> > > > > > > > > > > > > To: <[log in to unmask]>
> > > > > > > > > > > > > Sent: Tuesday, February 22, 2005 6:53 AM
> > > > > > > > > > > > > Subject: olbd tracing
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > What's the trace argument to add to the
> > > > > > > > > > > > > > xrootd.cf file to get it to output the
> > > > > > > > > > > > > > queries to locate files to the logs?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We're still having problems at RAL with
> > > > > > > > > > > > > > files/servers not being available via the
> > > > > > > > > > > > > > load balancers when they are if you contact
> > > > > > > > > > > > > > them directly.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yours,
> > > > > > > > > > > > > > Chris.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Chris Brew ([log in to unmask]) +44 1235 446326
> > > > > > > > > > > > > > Particle Physics Department
> > > > > > > > > > > > > > Rutherford Appleton Laboratory
> > > > > > > > > > > > > > Chilton, Didcot. Oxfordshire.
> > > > > > > > > > > > > > OX11 0QX. United Kingdom.