Excellent.  Thanks Andy!

--Dan

Andrew Hanushevsky wrote:

>Hi Pete & Dan,
>
>No need to send log files; I found the problem. So much for the compiler
>warning you about missing parentheses. The statement (x = y ? x->z : 0) is
>not evaluated in the expected way unless 'x = y' is in parentheses, because
>'=' has lower precedence than '?', which means that the assignment
>actually occurs after the selector is evaluated (strange but true). The
>result is that when a very active system goes idle, an inevitable segv
>will follow at some point. Very active is defined as when more than 25% of
>possible file descriptors are used for incoming connections. That rarely
>happens in most systems. However, based on your configuration, it happens
>often enough to cause problems. You can immediately bypass this problem by:
>
>a) Increasing the maximum number of file descriptors allowed to be 1.25
>times the number of maximum expected clients.
>b) Specifying 'xrd.connections dur 9999h' in the config file. This will
>disable the protocol object optimizer for about 416 days (long enough).
>
>We should have a release with the fix by the end of the week.
>
>Andy
>
>On Tue, 12 Apr 2005, Peter Elmer wrote:
>
>> Hi Dan,
>>
>>On Mon, Apr 11, 2005 at 10:41:13AM -0500, Dan Bradley wrote:
>>
>>>Thanks for looking into this. I have copied the logs into the AFS
>>>directory where you were looking. However, I see nothing reported at
>>>the time of the crash. Perhaps I should turn on more verbose tracing?
>>
>> Andy, do you have any idea what the crash reported below might be? (See
>>the traceback at the very end of this mail.)
>>
>> Perhaps I confused myself with the time of the core file. Did the
>>crash happen earlier than "Apr 6 20:01"? I can see from the 20050405 log
>>that it was restarted around "050405 10:28:20". Is that perhaps the
>>restart after the crash?
>>
>> In any case, the last entry before the restart was at "050404 23:16:53",
>>and it looks like a normal redirect, so there is no extra info (i.e.
>>"last words") there.
>>
>>>The redirector hasn't crashed since my last report. Perhaps this is
>>>related to somewhat decreased load. I've been keeping an eye on open
>>>file descriptors, which are well below 1k, but I'm just glancing now
>>>and then by hand, so I don't yet know how bursty it is.
>>
>> Pete
>>
>>>On Apr 9, 2005, at 2:53 AM, Peter Elmer wrote:
>>>
>>>> Hi Dan,
>>>>
>>>> Andy is travelling today and tomorrow, so he'll probably be able to
>>>>take a look at this on Monday. Were there any messages in the (redirector)
>>>>xrootd log file? I see the core file is time-stamped "Apr 6 20:01", but
>>>>the log for 20050406 isn't in the "logs" area in your afs area.
>>>>
>>>> One thing occurs to me already: how many clients are now hitting your
>>>>redirector? Is it possible that you are hitting the default file
>>>>descriptor limit?
>>>>
>>>> http://xrootd.slac.stanford.edu/hardware_os_config.html
>>>>
>>>>In that case, you might see some messages in the xrootd logs about
>>>>problems starting new threads, for example.
>>>>
>>>> BTW, I found the logs in your area: you are doing a reasonable number
>>>>of file opens (261k on 20050326, for example). Very nice. The average
>>>>rate isn't huge, but presumably there are some peaks as things start in
>>>>bunches and whatnot. It also looks like there are about 2k files being
>>>>opened, so most of them are probably cached in memory, too. Is it all
>>>>pileup for MC?
>>>>At least the redirects seem fairly balanced over the servers:
>>>>
>>>>  30589 s5n01.hep.wisc.edu:1094
>>>>  32969 s5n03.hep.wisc.edu:1094
>>>>  32960 s5n04.hep.wisc.edu:1094
>>>>  32938 s5n05.hep.wisc.edu:1094
>>>>  32942 s5n06.hep.wisc.edu:1094
>>>>  32936 s5n07.hep.wisc.edu:1094
>>>>  32948 s5n08.hep.wisc.edu:1094
>>>>  32939 s5n09.hep.wisc.edu:1094
>>>>
>>>> I also see from the 20050405 log file that there were something like
>>>>381 different machines connecting from all over campus (and perhaps some
>>>>may have had more than one application connecting if they are dual-cpu).
>>>>That is still a bit short of 1024 even if I put in the factor of 2, but
>>>>it depends on what else is happening on the machine. You can probably
>>>>check this from the xrootd log and with 'lsof'.
>>>>
>>>> Pete
>>>>
>>>>On Fri, Apr 08, 2005 at 03:46:39PM -0500, Dan Bradley wrote:
>>>>
>>>>>I am getting occasional crashes of xrootd on the redirector. I am
>>>>>running version 20050328-0656 under Scientific Linux 3.0.4.
>>>>>
>>>>>The redirector crashes with the following stack dump:
>>>>>
>>>>>#0  0x080845cf in typeinfo name for XrdXrootdPrepare ()
>>>>>#1  0x0807266f in XrdProtocol_Select::Process (this=0x80917d8, lp=0x83df5dc)
>>>>>    at XrdProtocol.cc:165
>>>>>#2  0x0806d622 in XrdLink::DoIt (this=0x83df5dc) at XrdLink.cc:296
>>>>>#3  0x080739bc in XrdScheduler::Run (this=0x8091640) at XrdScheduler.cc:293
>>>>>#4  0x08072a1c in XrdStartWorking (carg=0x8091640) at XrdScheduler.cc:82
>>>>>#5  0x0807f4be in XrdOucThread_Xeq (myargs=0x839eba0) at XrdOucPthread.cc:80
>>>>>#6  0x00f4adec in start_thread () from /lib/tls/libpthread.so.0
>>>>>#7  0x0032ea2a in clone () from /lib/tls/libc.so.6
>>>>>
>>>>>You may find the core file here:
>>>>>
>>>>>/afs/hep.wisc.edu/cms/sw/xrootd/debug/core.5250
>>>>>
>>>>>The binary is here:
>>>>>
>>>>>/afs/hep.wisc.edu/cms/sw/xrootd/bin/xrootd
>>
>>-------------------------------------------------------------------------
>>Peter Elmer    E-mail: [log in to unmask]    Phone: +41 (22) 767-4644
>>Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 23, Switzerland
>>-------------------------------------------------------------------------
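
For anyone puzzling over the precedence pitfall Andy describes above, a minimal
standalone sketch follows. The names x and y are taken from his message, not
from the xrootd source, and plain ints stand in for the real pointer so that
both spellings compile and the parsing difference is visible:

    // Minimal standalone sketch of the '=' vs '?:' precedence pitfall.
    // Hypothetical names; this is not the actual xrootd code.
    #include <cassert>

    int main() {
        int x = 0, y = 7;

        // Intended reading: assign first, then let the NEW x feed the true arm.
        int a = ((x = y) ? x + 1 : 0);   // x becomes 7, so a == 8

        // Without the inner parentheses, '=' binds more loosely than '?:',
        // so  x = y ? x + 1 : 0  parses as  x = (y ? x + 1 : 0):
        // the OLD x (still 0) is used in the true arm before the assignment.
        x = 0;
        int b = (x = y ? x + 1 : 0);     // x becomes 1, not 8

        assert(a == 8 && b == 1);

        // In the xrootd case the true arm dereferenced a pointer, so
        // evaluating it with the old (possibly stale) value caused the segv.
        return 0;
    }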
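
Since the thread keeps returning to the default file-descriptor limit, here is
also a rough sketch of how a process can print the ceiling it actually sees
(the same number 'ulimit -n' reports), useful for cross-checking against the
client counts discussed above. Standard POSIX calls only; nothing here is
specific to xrootd:

    // Print the soft and hard file-descriptor limits for this process.
    #include <cstdio>
    #include <sys/resource.h>

    int main() {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
            std::perror("getrlimit");
            return 1;
        }
        std::printf("soft fd limit: %lu\nhard fd limit: %lu\n",
                    (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);
        return 0;
    }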