Hi,

Any suggestions for extra logging to switch on on the servers before I
head off? I just had to restart the olbds on 6 out of 20 servers, but I'm
not going to be around over the weekend to restart them.

Yours,
Chris.

> -----Original Message-----
> From: [log in to unmask]
> [mailto:[log in to unmask]] On Behalf Of
> Brew, CAJ (Chris)
> Sent: 18 February 2005 11:27
> To: Andrew Hanushevsky; Peter Elmer
> Cc: [log in to unmask]
> Subject: RE: olbd problems at RAL
>
> Hi,
>
> The logs for the last couple of days are in
> /afs/slac.stanford.edu/u/br/brew/olb_problem. There isn't much info
> there. If you suggest the extra traces to add I'll put them in the
> config files.
>
> The rdr xrootd and olbd haven't been restarted for a while, so their
> startup isn't in the period covered by the logs.
>
> The -t option on the xrootd is identical to the "ofs.redirect target
> hostname" in the xrootd.cf file, correct? We seem to be using that
> rather than -t on the command line. I'll add it now and see if it has
> any effect.
>
> Before I left last night I set up a job that once an hour would try to
> access a file on each server (the job failed in the early hours because
> my desktop got rebooted to install security patches), but you can see in
> the rdr xrd log that between 18:08 and 19:10 some servers stopped
> answering requests for the files. Strangely, between my test job
> stopping overnight and my test this morning the list of non-responsive
> servers changed.
>
> We do have a slightly mixed system at the moment and it's in flux.
>
> All but one test server are running the xrootd-20040907-0403 version,
> one test server (csfnfs45) is running xrootd-20041214-1142 but the
> problem appears on both types.
>
> We've just added the extra SL3 redirector xrootd107 in addition to our
> old RH73 one csflnx108 but the problem happens via both redirectors.
>
> Most of the servers have an entry for an extra redirector xrootd108 in
> their config files (but not all the servers experiencing the problem),
> which is what csflnx108 will become when it's reinstalled with SL3, but
> the DNS name does not yet exist so there is some complaint in the
> olblogs about that.
>
> Yours,
> Chris.
>
> > -----Original Message-----
> > From: Andrew Hanushevsky [mailto:[log in to unmask]]
> > Sent: 17 February 2005 15:22
> > To: Peter Elmer
> > Cc: Brew, CAJ (Chris); [log in to unmask]
> > Subject: Re: olbd problems at RAL
> >
> > Hi Pete,
> >
> > Sometimes minor things leak in that have major impacts. Usually I go by
> > what is really running successfully elsewhere to determine the
> > probability of success. However, you do bring up a good point. Chris,
> > please make sure that you are using -w on the olbd consistently with -t
> > on the xrootd data server. If you specify -w but not -t then you will
> > see exactly what you described. Also, the logs from start-up time to
> > hang time would be helpful (i.e., redirector: xrootd and olbd, and data
> > server: xrootd/olbd). Please clearly identify which is which. Thanks.
> >
> > Andy
> >
> > On Thu, 17 Feb 2005, Peter Elmer wrote:
> >
> > > Hi Andy,
> > >
> > > From:
> > >
> > > http://xrootd.slac.stanford.edu/xrootd.History
> > >
> > > the only differences between version 20040907-0403 (the one we
> > > currently label "production") and 20040830-0105 are small changes to
> > > the ./configure and makefiles, but nothing of substance that would
> > > lead to problems with the olbd. I suspect that there is something
> > > else going on (e.g. the famous wait/-w problems?).
> > >
> > > Pete
> > >
> > > On Thu, Feb 17, 2005 at 07:12:08AM -0800, Andrew Hanushevsky wrote:
> > > > Hi Chris,
> > > >
> > > > Those two particular releases seem to have had some problems.
> > > > I assume you are not mixing releases here (i.e., running either on
> > > > all servers causes you to see the problem).
> > > >
> > > > I do know that 20040830 is a stable release. We run that everywhere
> > > > at SLAC for analysis. I'd suggest going with that one until we test
> > > > out the latest release that should have fixed some other problem
> > > > relating to writing files.
> > > >
> > > > Andy
> > > >
> > > > On Thu, 17 Feb 2005, Brew, CAJ (Chris) wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Since increasing the number of servers at RAL from 8 to 21 we
> > > > > seem to be seeing a new failure mode.
> > > > >
> > > > > All the processes seem to be running fine and you can read a file
> > > > > by going directly to the server that holds it, but the server
> > > > > does not seem to respond via the olbd network, so if you try to
> > > > > access a file via the load balancer you fail.
> > > > >
> > > > > Restarting the load balancer on the data server fixes the problem.
> > > > >
> > > > > There is nothing unusual in the logs at either end, or anything
> > > > > missing either, as far as I can tell.
> > > > >
> > > > > This is on data servers running RH73 and xrootd-20040907-0403 or
> > > > > xrootd-20041214-1142.
> > > > >
> > > > > Has anyone else seen this? Is there a fix?
> > > > >
> > > > > Thanks,
> > > > > Chris.
> > >
> > > -------------------------------------------------------------------------
> > > Peter Elmer     E-mail: [log in to unmask]     Phone: +41 (22) 767-4644
> > > Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 23, Switzerland
> > > -------------------------------------------------------------------------
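
For anyone wanting to reproduce the hourly per-server check mentioned in
the 18 February message, here is a rough sketch in Python. It is not the
job that was actually run at RAL. The file path and the second hostname
in the usage note are hypothetical placeholders, and it assumes an
xrdcp client on the PATH that accepts "xrdcp root://host//path /dev/null";
swap in whatever copy command your installation provides.

```python
import subprocess


def probe(server, path, copy_cmd=("xrdcp",), timeout=60):
    """Return True if `path` could be copied from `server`, else False.

    `copy_cmd` is the client command to run; the default of plain
    "xrdcp" is an assumption about the local installation.
    """
    # With an absolute path this yields root://host//some/file
    url = f"root://{server}/{path}"
    cmd = [*copy_cmd, url, "/dev/null"]
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Client missing, or the server hung past the timeout.
        return False
    return result.returncode == 0


def probe_all(servers, path, **kwargs):
    """Probe every server once and map each hostname to True/False."""
    return {server: probe(server, path, **kwargs) for server in servers}


def failed_servers(results):
    """Return the sorted hostnames whose probe did not succeed."""
    return sorted(server for server, ok in results.items() if not ok)
```

Run from cron once an hour with something like
failed_servers(probe_all(["csfnfs45", "some-other-server"], "/path/to/test/file"))
and log the result with a timestamp. Comparing the failing set from run
to run would show whether the list of non-responsive servers changes
over time. Pointing the same check at the redirector instead of the
individual data servers would exercise the olbd path.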