Hi Pete & Dan,

No need to send log files; I found the problem. So much for the compiler
warning you about missing parentheses. The statement (x = y ? x->z : 0) is
not evaluated in the expected way unless 'x = y' is in parentheses, because
'=' has lower precedence than '?:'. This means the assignment actually
occurs after the selector is evaluated (strange but true). The result is
that when a very active system goes idle, a segv inevitably follows at
some point. 'Very active' here means that more than 25% of the possible
file descriptors are used for incoming connections. That rarely happens on
most systems. However, given your configuration, it happens often enough
to cause problems. You can immediately bypass this problem by:

a) Increasing the maximum number of file descriptors allowed to more than
1.25 times the maximum expected number of clients.
b) Specifying 'xrd.connections dur 9999h' in the config file. This will
disable the protocol object optimizer for 416.625 days (long enough).

We should have a release with the fix by the end of the week.

Andy

On Tue, 12 Apr 2005, Peter Elmer wrote:

>   Hi Dan,
>
> On Mon, Apr 11, 2005 at 10:41:13AM -0500, Dan Bradley wrote:
> > Thanks for looking into this.  I have copied the logs into the AFS
> > directory where you were looking.  However, I see nothing reported at
> > the time of the crash.  Perhaps I should turn on more verbose tracing?
>
>   Andy, do you have any idea what the crash reported below might be? (See
> traceback at the very end of this mail.)
>
>   Perhaps I confused myself with the time of the core file. Did the
> crash happen earlier than "Apr  6 20:01"? I can see from the 20050405 log
> that it was restarted around "050405 10:28:20". Is that perhaps the
> restart after the crash?
>
>   In any case the last entry before the restart was at "050404 23:16:53"
> and it looks like a normal redirect, so there is no extra info (i.e.
> "last words") there.
>
> > The redirector hasn't crashed since my last report.  Perhaps this is
> > related to somewhat decreased load.  I've been keeping an eye on open
> > file descriptors, which are well below 1k, but I'm just glancing now
> > and then by hand, so I don't yet know how bursty it is.
>
>                                    Pete
>
>
> > On Apr 9, 2005, at 2:53 AM, Peter Elmer wrote:
> >
> > >   Hi Dan,
> > >
> > >   Andy is travelling today and tomorrow, so he'll probably be able
> > > to take a look at this on Monday. Were there any messages in the
> > > (redirector) xrootd log file? I see the core file is time-stamped
> > > "Apr  6 20:01", but the log for 20050406 isn't in the "logs" area in
> > > your afs area.
> > >
> > >   One thing occurs to me already: how many clients are now hitting
> > > your redirector? Is it possible that you are hitting the default
> > > file descriptor limit?
> > >
> > >   http://xrootd.slac.stanford.edu/hardware_os_config.html
> > >
> > > In that case, you might see some messages in the xrootd logs about
> > > problems starting new threads, for example.
> > >
> > >   BTW, I found the logs in your area: you are doing a reasonable
> > > number of file opens (261k on 20050326, for example). Very nice. The
> > > average rate isn't huge, but presumably there are some peaks as
> > > things start in bunches and whatnot. It also looks like there are
> > > about 2k files being opened, so most of them are probably cached in
> > > memory, too. Is it all pileup for MC? At least the redirects seem
> > > fairly balanced over the servers:
> > >
> > >   30589 s5n01.hep.wisc.edu:1094
> > >   32969 s5n03.hep.wisc.edu:1094
> > >   32960 s5n04.hep.wisc.edu:1094
> > >   32938 s5n05.hep.wisc.edu:1094
> > >   32942 s5n06.hep.wisc.edu:1094
> > >   32936 s5n07.hep.wisc.edu:1094
> > >   32948 s5n08.hep.wisc.edu:1094
> > >   32939 s5n09.hep.wisc.edu:1094
> > >
> > >   I also see from the 20050405 log file that there were something
> > > like 381 different machines connecting from all over campus (and
> > > perhaps some may have had more than one application connecting if
> > > they are dual-cpu). That is still a bit short of 1024 even if I put
> > > in the factor of 2, but it depends on what else is happening on the
> > > machine. You can probably check this from the xrootd log and with
> > > 'lsof'.
> > >
> > >                                    Pete
> > >
> > > On Fri, Apr 08, 2005 at 03:46:39PM -0500, Dan Bradley wrote:
> > >> I am getting occasional crashes of xrootd on the redirector.  I am
> > >> running version 20050328-0656 under Scientific Linux 3.0.4.
> > >>
> > >> The redirector crashes with the following stack dump:
> > >>
> > >> #0  0x080845cf in typeinfo name for XrdXrootdPrepare ()
> > >> #1  0x0807266f in XrdProtocol_Select::Process (this=0x80917d8, lp=0x83df5dc) at XrdProtocol.cc:165
> > >> #2  0x0806d622 in XrdLink::DoIt (this=0x83df5dc) at XrdLink.cc:296
> > >> #3  0x080739bc in XrdScheduler::Run (this=0x8091640) at XrdScheduler.cc:293
> > >> #4  0x08072a1c in XrdStartWorking (carg=0x8091640) at XrdScheduler.cc:82
> > >> #5  0x0807f4be in XrdOucThread_Xeq (myargs=0x839eba0) at XrdOucPthread.cc:80
> > >> #6  0x00f4adec in start_thread () from /lib/tls/libpthread.so.0
> > >> #7  0x0032ea2a in clone () from /lib/tls/libc.so.6
> > >>
> > >> You may find the core file here:
> > >>
> > >> /afs/hep.wisc.edu/cms/sw/xrootd/debug/core.5250
> > >>
> > >> The binary is here:
> > >>
> > >> /afs/hep.wisc.edu/cms/sw/xrootd/bin/xrootd
> > >
>
>
>
> -------------------------------------------------------------------------
> Peter Elmer     E-mail: [log in to unmask]      Phone: +41 (22) 767-4644
> Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 23, Switzerland
> -------------------------------------------------------------------------
>