  Hi Dan,

On Mon, Apr 11, 2005 at 10:41:13AM -0500, Dan Bradley wrote:
> Thanks for looking into this.  I have copied the logs into the AFS  
> directory where you were looking.  However, I see nothing reported at  
> the time of the crash.  Perhaps I should turn on more verbose tracing?

  Andy, do you have any idea what the crash reported below might be? (See
traceback at the very end of this mail.)

  Perhaps I confused myself with the time of the core file. Did the
crash happen earlier than "Apr  6 20:01"? I can see from the 20050405 log
that it was restarted around "050405 10:28:20". Is that perhaps the
restart after the crash?

  In any case the last entry before the restart was at "050404 23:16:53"
and it looks like a normal redirect, so there is no extra info (i.e.
"last words") there.

> The redirector hasn't crashed since my last report.  Perhaps this is  
> related to somewhat decreased load.  I've been keeping an eye on open  
> file descriptors, which are well below 1k, but I'm just glancing now  
> and then by hand, so I don't yet know how bursty it is.
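
  For watching the descriptor count without glancing by hand, something
like the following might help. It is only a sketch: it assumes a Linux
/proc filesystem, that you run it as the xrootd user (or root) so that
/proc/<pid>/fd is readable, and that the limit is the default 1024. The
function names are made up for illustration; nothing here comes from
xrootd itself.

```python
import os
import time


def count_open_fds(pid, proc_root="/proc"):
    """Count the entries under <proc_root>/<pid>/fd, one per open descriptor."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def fd_headroom(open_fds, soft_limit=1024):
    """Descriptors left before soft_limit (assumed default: 1024) is reached."""
    return soft_limit - open_fds


def sample_fds(pid, samples=6, interval=10.0, proc_root="/proc"):
    """Take `samples` readings, `interval` seconds apart.

    Returns a list of (timestamp string, open descriptor count) pairs.
    """
    readings = []
    for i in range(samples):
        stamp = time.strftime("%y%m%d %H:%M:%S")
        readings.append((stamp, count_open_fds(pid, proc_root)))
        if i < samples - 1:
            time.sleep(interval)
    return readings
```

Taking the maximum count over a long run of sample_fds() readings for the
redirector's pid, and passing it to fd_headroom(), should give a rough idea
of how bursty the load is and how close the peaks get to the limit.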

                                   Pete


> On Apr 9, 2005, at 2:53 AM, Peter Elmer wrote:
> 
> >   Hi Dan,
> >
> >   Andy is travelling today and tomorrow, so he'll probably be able to take
> > a look at this on Monday. Were there any messages in the (redirector) xrootd
> > log file? I see the core file is time-stamped "Apr  6 20:01", but the log for
> > 20050406 isn't in the "logs" area in your afs area.
> >
> >   One thing occurs to me already: how many clients are now hitting your
> > redirector? Is it possible that you are hitting the default file descriptor
> > limit?
> >
> >   http://xrootd.slac.stanford.edu/hardware_os_config.html
> >
> > In that case, you might see some messages in the xrootd logs about problems
> > starting new threads, for example.
> >
> >   BTW, I found the logs in your area: you are doing a reasonable number of
> > file opens. (261k on 20050326, for example.) Very nice. The average rate
> > isn't huge, but presumably there are some peaks as things start in bunches
> > and whatnot. It also looks like there are about 2k files being opened, so
> > most of them are probably cached in memory, too. Is it all pileup for MC? At
> > least the redirects seem fairly balanced over the servers:
> >
> >   30589 s5n01.hep.wisc.edu:1094
> >   32969 s5n03.hep.wisc.edu:1094
> >   32960 s5n04.hep.wisc.edu:1094
> >   32938 s5n05.hep.wisc.edu:1094
> >   32942 s5n06.hep.wisc.edu:1094
> >   32936 s5n07.hep.wisc.edu:1094
> >   32948 s5n08.hep.wisc.edu:1094
> >   32939 s5n09.hep.wisc.edu:1094
> >
> >   I also see from the 20050405 log file that there were something like 381
> > different machines connecting from all over campus (and perhaps some may have
> > had more than one application connecting if they are dual-cpu). That is still
> > a bit short of 1024 even if I put in the factor of 2, but it depends on what
> > else is happening on the machine. You can probably check this from the xrootd
> > log and with 'lsof'.
> >
> >                                    Pete
> >
> > On Fri, Apr 08, 2005 at 03:46:39PM -0500, Dan Bradley wrote:
> >> I am getting occasional crashes of xrootd on the redirector.  I am
> >> running version 20050328-0656 under Scientific Linux 3.0.4.
> >>
> >> The redirector crashes with the following stack dump:
> >>
> >> #0  0x080845cf in typeinfo name for XrdXrootdPrepare ()
> >> #1  0x0807266f in XrdProtocol_Select::Process (this=0x80917d8, lp=0x83df5dc)
> >>    at XrdProtocol.cc:165
> >> #2  0x0806d622 in XrdLink::DoIt (this=0x83df5dc) at XrdLink.cc:296
> >> #3  0x080739bc in XrdScheduler::Run (this=0x8091640) at XrdScheduler.cc:293
> >> #4  0x08072a1c in XrdStartWorking (carg=0x8091640) at XrdScheduler.cc:82
> >> #5  0x0807f4be in XrdOucThread_Xeq (myargs=0x839eba0) at XrdOucPthread.cc:80
> >> #6  0x00f4adec in start_thread () from /lib/tls/libpthread.so.0
> >> #7  0x0032ea2a in clone () from /lib/tls/libc.so.6
> >>
> >> You may find the core file here:
> >>
> >> /afs/hep.wisc.edu/cms/sw/xrootd/debug/core.5250
> >>
> >> The binary is here:
> >>
> >> /afs/hep.wisc.edu/cms/sw/xrootd/bin/xrootd
> >



-------------------------------------------------------------------------
Peter Elmer     E-mail: [log in to unmask]      Phone: +41 (22) 767-4644
Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 23, Switzerland
-------------------------------------------------------------------------