Excellent. Thanks Andy!
--Dan
Andrew Hanushevsky wrote:
>Hi Pete & Dan,
>
>No need to send log files; I found the problem. So much for the compiler
>warning you about missing parentheses. The statement (x = y ? x->z : 0) is
>not evaluated in the expected way unless 'x = y' is in parentheses, because
>'=' has lower precedence than '?', which means that the assignment
>actually occurs after the selector is evaluated (strange but true). The
>result is that when a very active system goes idle, an inevitable segv
>will follow at some point. Very active is defined as when more than 25% of
>possible file descriptors are used for incoming connections. That rarely
>happens in most systems. However, based on your configuration, it happens
>often enough to cause problems. You can immediately bypass this problem
>by:
>
>a) Increasing the maximum number of file descriptors allowed to more than
>1.25 times the maximum expected number of clients.
>b) Specifying 'xrd.connections dur 9999h' in the config file. This will
>disable the protocol object optimizer for about 416.6 days (long enough).
>
>We should have a release with the fix by the end of the week.
>
>Andy
>On Tue, 12 Apr 2005, Peter Elmer wrote:
>
>
>
>> Hi Dan,
>>
>>On Mon, Apr 11, 2005 at 10:41:13AM -0500, Dan Bradley wrote:
>>
>>
>>>Thanks for looking into this. I have copied the logs into the AFS
>>>directory where you were looking. However, I see nothing reported at
>>>the time of the crash. Perhaps I should turn on more verbose tracing?
>>>
>>>
>> Andy, do you have any idea what the crash reported below might be? (See
>>traceback at the very end of this mail.)
>>
>> Perhaps I confused myself with the time of the core file. Did the
>>crash happen earlier than "Apr 6 20:01"? I can see from the 20050405 log
>>that it was restarted around "050405 10:28:20". Is that perhaps the
>>restart after the crash?
>>
>> In any case the last entry before the restart was at "050404 23:16:53"
>>and it looks like a normal redirect, so there is no extra info (i.e.
>>"last words") there.
>>
>>
>>
>>>The redirector hasn't crashed since my last report. Perhaps this is
>>>related to somewhat decreased load. I've been keeping an eye on open
>>>file descriptors, which are well below 1k, but I'm just glancing now
>>>and then by hand, so I don't yet know how bursty it is.
>>>
>>>
>> Pete
>>
>>
>>
>>
>>>On Apr 9, 2005, at 2:53 AM, Peter Elmer wrote:
>>>
>>>
>>>
>>>> Hi Dan,
>>>>
>>>> Andy is travelling today and tomorrow, so he'll probably be able to
>>>>take
>>>>a look at this on Monday. Were there any messages in the (redirector)
>>>>xrootd
>>>>log file? I see the core file is time-stamped "Apr 6 20:01", but the
>>>>log for
>>>>20050406 isn't in the "logs" area in your afs area.
>>>>
>>>> One thing occurs to me already: how many clients are now hitting your
>>>>redirector? Is it possible that you are hitting the default file
>>>>descriptor
>>>>limit?
>>>>
>>>> http://xrootd.slac.stanford.edu/hardware_os_config.html
>>>>
>>>>In that case, you might see some messages in the xrootd logs about
>>>>problems
>>>>starting new threads, for example.
>>>>
>>>> BTW, I found the logs in your area: you are doing a reasonable
>>>>number of
>>>>file opens. (261k on 20050326, for example.) Very nice. The average
>>>>rate
>>>>isn't huge, but presumably there are some peaks as things start in
>>>>bunches
>>>>and whatnot. It also looks like there are about 2k files being opened,
>>>>so
>>>>most of them are probably cached in memory, too. Is it all pileup for
>>>>MC? At
>>>>least the redirects seem fairly balanced over the servers:
>>>>
>>>> 30589 s5n01.hep.wisc.edu:1094
>>>> 32969 s5n03.hep.wisc.edu:1094
>>>> 32960 s5n04.hep.wisc.edu:1094
>>>> 32938 s5n05.hep.wisc.edu:1094
>>>> 32942 s5n06.hep.wisc.edu:1094
>>>> 32936 s5n07.hep.wisc.edu:1094
>>>> 32948 s5n08.hep.wisc.edu:1094
>>>> 32939 s5n09.hep.wisc.edu:1094
>>>>
>>>> I also see from the 20050405 log file that there were something like
>>>>381
>>>>different machines connecting from all over campus (and perhaps some
>>>>may have
>>>>had more than one application connecting if they are dual-cpu). That
>>>>is still
>>>>a bit short of 1024 even if I put in the factor of 2, but it depends
>>>>on what
>>>>else is happening on the machine. You can probably check this from the
>>>>xrootd
>>>>log and with 'lsof'.
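[Editor's note: a quick sketch of the check Pete suggests. The process name 'xrootd' and the /proc layout are assumptions, and /proc/PID/fd is Linux-specific.]

```shell
# Soft per-process file-descriptor limit in this environment.
ulimit -n

# Count descriptors currently open by the redirector, if it is running.
pid=$(pgrep -o xrootd 2>/dev/null)
if [ -n "$pid" ]; then
    echo "fds in use: $(ls /proc/$pid/fd 2>/dev/null | wc -l)"
    # Portable alternative: lsof -p "$pid" | wc -l
fi
```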
>>>>
>>>> Pete
>>>>
>>>>On Fri, Apr 08, 2005 at 03:46:39PM -0500, Dan Bradley wrote:
>>>>
>>>>
>>>>>I am getting occasional crashes of xrootd on the redirector. I am
>>>>>running version 20050328-0656 under Scientific Linux 3.0.4.
>>>>>
>>>>>The redirector crashes with the following stack dump:
>>>>>
>>>>>#0 0x080845cf in typeinfo name for XrdXrootdPrepare ()
>>>>>#1 0x0807266f in XrdProtocol_Select::Process (this=0x80917d8,
>>>>>lp=0x83df5dc)
>>>>> at XrdProtocol.cc:165
>>>>>#2 0x0806d622 in XrdLink::DoIt (this=0x83df5dc) at XrdLink.cc:296
>>>>>#3 0x080739bc in XrdScheduler::Run (this=0x8091640) at
>>>>>XrdScheduler.cc:293
>>>>>#4 0x08072a1c in XrdStartWorking (carg=0x8091640) at
>>>>>XrdScheduler.cc:82
>>>>>#5 0x0807f4be in XrdOucThread_Xeq (myargs=0x839eba0) at
>>>>>XrdOucPthread.cc:80
>>>>>#6 0x00f4adec in start_thread () from /lib/tls/libpthread.so.0
>>>>>#7 0x0032ea2a in clone () from /lib/tls/libc.so.6
>>>>>
>>>>>You may find the core file here:
>>>>>
>>>>>/afs/hep.wisc.edu/cms/sw/xrootd/debug/core.5250
>>>>>
>>>>>The binary is here:
>>>>>
>>>>>/afs/hep.wisc.edu/cms/sw/xrootd/bin/xrootd
>>>>>
>>>>>
>>
>>-------------------------------------------------------------------------
>>Peter Elmer E-mail: [log in to unmask] Phone: +41 (22) 767-4644
>>Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 23, Switzerland
>>-------------------------------------------------------------------------
>>
>>
>>