Hi Dan,

On Mon, Apr 11, 2005 at 10:41:13AM -0500, Dan Bradley wrote:
> Thanks for looking into this. I have copied the logs into the AFS
> directory where you were looking. However, I see nothing reported at
> the time of the crash. Perhaps I should turn on more verbose tracing?

Andy, do you have any idea what the crash reported below might be? (See
traceback at the very end of this mail.)

Perhaps I confused myself with the time of the core file. Did the
crash happen earlier than "Apr 6 20:01"? I can see from the 20050405 log
that it was restarted around "050405 10:28:20". Is that perhaps the
restart after the crash?

In any case the last entry before the restart was at "050404 23:16:53"
and it looks like a normal redirect, so there is no extra info (i.e.
"last words") there.

> The redirector hasn't crashed since my last report. Perhaps this is
> related to somewhat decreased load. I've been keeping an eye on open
> file descriptors, which are well below 1k, but I'm just glancing now
> and then by hand, so I don't yet know how bursty it is.
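
Since you are only glancing at the descriptor count by hand, something like
the following could record it over time and show how bursty it is. This is
only a minimal sketch under some assumptions: it relies on the Linux
/proc/<pid>/fd directory, it must run as a user allowed to read that
directory for the xrootd process, and the function names (count_open_fds,
sample_fds, summarize) are made up for illustration. The default limit of
1024 is simply the figure discussed earlier in this thread, so substitute
whatever 'ulimit -n' actually reports on the redirector.

```python
import os
import time


def count_open_fds(pid):
    """Count the entries in /proc/<pid>/fd (Linux only)."""
    return len(os.listdir("/proc/%d/fd" % pid))


def sample_fds(pid, samples, interval_s):
    """Return descriptor counts taken interval_s seconds apart."""
    counts = []
    for i in range(samples):
        counts.append(count_open_fds(pid))
        # Do not sleep after the final sample.
        if i + 1 < samples:
            time.sleep(interval_s)
    return counts


def summarize(counts, limit=1024):
    """Report the lowest and highest counts and the headroom under limit.

    The default of 1024 is an assumption taken from the discussion,
    not a value read from the running process.
    """
    peak = max(counts)
    return {"min": min(counts), "peak": peak, "headroom": limit - peak}
```

For example, sample_fds(pid, 60, 10) would take sixty readings ten seconds
apart for the given process id, and summarize() on the result shows how
close the peak came to the limit.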
Pete
> On Apr 9, 2005, at 2:53 AM, Peter Elmer wrote:
>
> > Hi Dan,
> >
> > Andy is travelling today and tomorrow, so he'll probably be able to take
> > a look at this on Monday. Were there any messages in the (redirector) xrootd
> > log file? I see the core file is time-stamped "Apr 6 20:01", but the log for
> > 20050406 isn't in the "logs" area in your afs area.
> >
> > One thing occurs to me already: how many clients are now hitting your
> > redirector? Is it possible that you are hitting the default file descriptor
> > limit?
> >
> > http://xrootd.slac.stanford.edu/hardware_os_config.html
> >
> > In that case, you might see some messages in the xrootd logs about problems
> > starting new threads, for example.
> >
> > BTW, I found the logs in your area: you are doing a reasonable number of
> > file opens. (261k on 20050326, for example.) Very nice. The average rate
> > isn't huge, but presumably there are some peaks as things start in bunches
> > and whatnot. It also looks like there are about 2k files being opened, so
> > most of them are probably cached in memory, too. Is it all pileup for MC? At
> > least the redirects seem fairly balanced over the servers:
> >
> > 30589 s5n01.hep.wisc.edu:1094
> > 32969 s5n03.hep.wisc.edu:1094
> > 32960 s5n04.hep.wisc.edu:1094
> > 32938 s5n05.hep.wisc.edu:1094
> > 32942 s5n06.hep.wisc.edu:1094
> > 32936 s5n07.hep.wisc.edu:1094
> > 32948 s5n08.hep.wisc.edu:1094
> > 32939 s5n09.hep.wisc.edu:1094
> >
> > I also see from the 20050405 log file that there were something like 381
> > different machines connecting from all over campus (and perhaps some may have
> > had more than one application connecting if they are dual-cpu). That is still
> > a bit short of 1024 even if I put in the factor of 2, but it depends on what
> > else is happening on the machine. You can probably check this from the xrootd
> > log and with 'lsof'.
> >
> > Pete
> >
> > On Fri, Apr 08, 2005 at 03:46:39PM -0500, Dan Bradley wrote:
> >> I am getting occasional crashes of xrootd on the redirector. I am
> >> running version 20050328-0656 under Scientific Linux 3.0.4.
> >>
> >> The redirector crashes with the following stack dump:
> >>
> >> #0 0x080845cf in typeinfo name for XrdXrootdPrepare ()
> >> #1 0x0807266f in XrdProtocol_Select::Process (this=0x80917d8, lp=0x83df5dc) at XrdProtocol.cc:165
> >> #2 0x0806d622 in XrdLink::DoIt (this=0x83df5dc) at XrdLink.cc:296
> >> #3 0x080739bc in XrdScheduler::Run (this=0x8091640) at XrdScheduler.cc:293
> >> #4 0x08072a1c in XrdStartWorking (carg=0x8091640) at XrdScheduler.cc:82
> >> #5 0x0807f4be in XrdOucThread_Xeq (myargs=0x839eba0) at XrdOucPthread.cc:80
> >> #6 0x00f4adec in start_thread () from /lib/tls/libpthread.so.0
> >> #7 0x0032ea2a in clone () from /lib/tls/libc.so.6
> >>
> >> You may find the core file here:
> >>
> >> /afs/hep.wisc.edu/cms/sw/xrootd/debug/core.5250
> >>
> >> The binary is here:
> >>
> >> /afs/hep.wisc.edu/cms/sw/xrootd/bin/xrootd
> >
-------------------------------------------------------------------------
Peter Elmer E-mail: [log in to unmask] Phone: +41 (22) 767-4644
Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 23, Switzerland
-------------------------------------------------------------------------