  Hi Wilko,

On Sat, Nov 20, 2004 at 11:26:06AM -0800, Wilko Kroeger wrote:
> On Sat, 20 Nov 2004, Peter Elmer wrote:
> > On Fri, Nov 19, 2004 at 04:29:28PM -0800, Remi Mommsen wrote:
> > > I have many erratic problems with the bbrprod0X servers inhibiting the
> > > use of xrootd for the skim production. I cannot reliably reproduce the
> > > errors, but about 30% of the transfers fail. The tracebacks are similar
> > > to the one posted by Alvise and myself to xrootd-l.
> > >
> > > Questions:
> > > - Are you (or somebody else) actively looking into these issues? We
> > > need to get this solved by early next week.
> > > - Which version(s) of xrootd are running on bbrprod0X? Can you please
> > > start the latest version on all of them?
> >
> >   Andy shouldn't be doing this. We arranged things such that this decision
> > should be _entirely_ in the hands of Wilko, Artem, etc. (i.e. Andy
> > shouldn't even need to be in the loop to distribute the software via
> > "taylor" at SLAC). Wilko, is that not true?
> 
> Artem and I do the restart and also do some of the configuration of the
> xrootds but setting up a release so that it gets distributed by taylor
> is done by Andy. 

  Ok, this is different from what Andy, Chuck and I agreed. What we
wanted was that (a) someone (me) makes the releases and (b) you/Artem
have complete control over distributing them with taylor and starting them 
on machines at SLAC. This takes Andy completely out of the operational 
loop. We need to fix that.

  (You are of course supposed to ignore the fact that we are taking the 
person normally ~10 meters from you out of the loop while a person 8000km 
from you remains in the loop...)

> Anyhow, I am restarting the xrootd servers from the
> latest version, 20041118-0948, but right now it doesn't start, which looks
> like a configuration issue.

  What happens?

                                   Pete


> > > - I can get a checksum only from bbrprod05. Do you know what the
> > > problem is?
> >
> >   There is clearly a big mess with the versions. I see:
> >
> >   bbrprod01  20041022-0258
> >   bbrprod02  20040830-0105
> >   bbrprod03  20040830-0105
> >   bbrprod04  20040830-0105
> >   bbrprod05  20041022-0258
> >
> > and of course I've no idea if they have all been started with the new
> > version of the config file which includes the external checksum script.
> >
> >   Actually, you can always check the versions in Ganglia:
> >
> >   http://www-gmon.slac.stanford.edu:8080/ganglia/?m=xrootd_version&r=hour&s=by%2520hostname&c=xrootd-prod&h=&sh=1&hc=4
> >
> >   Wilko, could you please sort this out?
> >
> > > There is a test perl script at
> > > /afs/slac.stanford.edu/u/br/bbrskim/releases/test-16.0.1a/workdir/
> > > testPAdmin.pl
> > > which exercises the functionality which we need.
> > >
> > > BTW: we gave up on getting it to work using olb on the time scale of next
> > > week. We will be happy if the functionality required by testPAdmin.pl
> > > works for all 5 bbrprod0X machines.
> >
> >   I'll take a look at it once they start the latest version of the
> > server (20041118-0948) on all 5 machines with the config file containing
> > the directive with the external checksum script.
> >
> >   BTW, the fact that you are using your own compiled version of (HEAD of) the
> > client instead of the version installed in afs is also a bit confusing. I'll
> > try to sort out the debug version for linux to help this along.
> >
> >                                  thanks,
> >                                    Pete
> >



-------------------------------------------------------------------------
Peter Elmer     E-mail: [log in to unmask]      Phone: +41 (22) 767-4644
Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 23, Switzerland
-------------------------------------------------------------------------