Hi Pete & Dan,

I looked through the core file. Everything appears to be normal (except,
perhaps, for the number of existing threads). The core file shows that user
?:[log in to unmask] was connecting (note that FD=161, so we
didn't exceed the file descriptor limit). Everything went fine until the
xrootd protocol processing step was executed. The core file indicates that
a wild branch was taken at that point (or shortly after that point).
The vptr in the object that the protocol pointer refers to is set to
0x8409940 instead of the expected 0x80902c8. That would indicate a
corrupted object (name your favorite cause). It also appears that the count
of protocol objects is wrong. I notice that some tracing is turned on, so
please give me a pointer to the log file; I will need it to understand what
may be going on here.
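
On the file descriptor point above, a rough way to check the headroom is
sketched below. This is only an illustration, assuming a Linux host with
/proc mounted; the function names are hypothetical and not part of xrootd,
and the 1024 figure is the default limit discussed further down in this
thread.

```python
import os
import resource


def fd_headroom(highest_fd, soft_limit):
    """Descriptors left above highest_fd (numbering starts at 0)."""
    return soft_limit - (highest_fd + 1)


def count_open_fds(pid="self"):
    """Count open descriptors of a process via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def nofile_soft_limit():
    """Soft limit on open files for the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


if __name__ == "__main__":
    # FD=161 as seen in the core file, against an assumed limit of 1024.
    print("headroom at FD=161:", fd_headroom(161, 1024))
    # Live numbers for the process running this script.
    print("open now:", count_open_fds(), "soft limit:", nofile_soft_limit())
```

Run periodically against the redirector's pid (passed to count_open_fds),
this would give a picture of how bursty the descriptor usage is, rather
than an occasional glance by hand.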

Andy

On Tue, 12 Apr 2005, Peter Elmer wrote:

>   Hi Dan,
>
> On Mon, Apr 11, 2005 at 10:41:13AM -0500, Dan Bradley wrote:
> > Thanks for looking into this.  I have copied the logs into the AFS
> > directory where you were looking.  However, I see nothing reported at
> > the time of the crash.  Perhaps I should turn on more verbose tracing?
>
>   Andy, do you have any idea what the crash reported below might be? (See
> traceback at the very end of this mail.)
>
>   Perhaps I confused myself with the time of the core file. Did the
> crash happen earlier than "Apr  6 20:01"? I can see from the 20050405 log
> that it was restarted around "050405 10:28:20". Is that perhaps the
> restart after the crash?
>
>   In any case the last entry before the restart was at "050404 23:16:53"
> and it looks like a normal redirect, so there is no extra info (i.e.
> "last words") there.
>
> > The redirector hasn't crashed since my last report.  Perhaps this is
> > related to somewhat decreased load.  I've been keeping an eye on open
> > file descriptors, which are well below 1k, but I'm just glancing now
> > and then by hand, so I don't yet know how bursty it is.
>
>                                    Pete
>
>
> > On Apr 9, 2005, at 2:53 AM, Peter Elmer wrote:
> >
> > >   Hi Dan,
> > >
> > > >   Andy is travelling today and tomorrow, so he'll probably be able to
> > > > take a look at this on Monday. Were there any messages in the
> > > > (redirector) xrootd log file? I see the core file is time-stamped
> > > > "Apr  6 20:01", but the log for 20050406 isn't in the "logs" area in
> > > > your afs area.
> > >
> > > >   One thing occurs to me already: how many clients are now hitting your
> > > > redirector? Is it possible that you are hitting the default file
> > > > descriptor limit?
> > >
> > >   http://xrootd.slac.stanford.edu/hardware_os_config.html
> > >
> > > > In that case, you might see some messages in the xrootd logs about
> > > > problems starting new threads, for example.
> > >
> > > >   BTW, I found the logs in your area: you are doing a reasonable number
> > > > of file opens. (261k on 20050326, for example.) Very nice. The average
> > > > rate isn't huge, but presumably there are some peaks as things start in
> > > > bunches and whatnot. It also looks like there are about 2k files being
> > > > opened, so most of them are probably cached in memory, too. Is it all
> > > > pileup for MC? At least the redirects seem fairly balanced over the
> > > > servers:
> > >
> > >   30589 s5n01.hep.wisc.edu:1094
> > >   32969 s5n03.hep.wisc.edu:1094
> > >   32960 s5n04.hep.wisc.edu:1094
> > >   32938 s5n05.hep.wisc.edu:1094
> > >   32942 s5n06.hep.wisc.edu:1094
> > >   32936 s5n07.hep.wisc.edu:1094
> > >   32948 s5n08.hep.wisc.edu:1094
> > >   32939 s5n09.hep.wisc.edu:1094
> > >
> > > >   I also see from the 20050405 log file that there were something like
> > > > 381 different machines connecting from all over campus (and perhaps
> > > > some may have had more than one application connecting if they are
> > > > dual-cpu). That is still a bit short of 1024 even if I put in the
> > > > factor of 2, but it depends on what else is happening on the machine.
> > > > You can probably check this from the xrootd log and with 'lsof'.
> > >
> > >                                    Pete
> > >
> > > On Fri, Apr 08, 2005 at 03:46:39PM -0500, Dan Bradley wrote:
> > >> I am getting occasional crashes of xrootd on the redirector.  I am
> > >> running version 20050328-0656 under Scientific Linux 3.0.4.
> > >>
> > >> The redirector crashes with the following stack dump:
> > >>
> > > >> #0  0x080845cf in typeinfo name for XrdXrootdPrepare ()
> > > >> #1  0x0807266f in XrdProtocol_Select::Process (this=0x80917d8, lp=0x83df5dc)
> > > >>    at XrdProtocol.cc:165
> > > >> #2  0x0806d622 in XrdLink::DoIt (this=0x83df5dc) at XrdLink.cc:296
> > > >> #3  0x080739bc in XrdScheduler::Run (this=0x8091640) at XrdScheduler.cc:293
> > > >> #4  0x08072a1c in XrdStartWorking (carg=0x8091640) at XrdScheduler.cc:82
> > > >> #5  0x0807f4be in XrdOucThread_Xeq (myargs=0x839eba0) at XrdOucPthread.cc:80
> > > >> #6  0x00f4adec in start_thread () from /lib/tls/libpthread.so.0
> > > >> #7  0x0032ea2a in clone () from /lib/tls/libc.so.6
> > >>
> > >> You may find the core file here:
> > >>
> > >> /afs/hep.wisc.edu/cms/sw/xrootd/debug/core.5250
> > >>
> > >> The binary is here:
> > >>
> > >> /afs/hep.wisc.edu/cms/sw/xrootd/bin/xrootd
> > >
>
>
>
> -------------------------------------------------------------------------
> Peter Elmer     E-mail: [log in to unmask]      Phone: +41 (22) 767-4644
> Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 23, Switzerland
> -------------------------------------------------------------------------
>