Excellent.  Thanks Andy!

--Dan

Andrew Hanushevsky wrote:

>Hi Pete & Dan,
>
>No need to send log files; I found the problem. So much for the compiler
>warning you about missing parentheses. The statement (x = y ? x->z : 0) is
>not evaluated in the expected way unless 'x = y' is in parentheses, because
>'=' has lower precedence than '?'. This means that the assignment actually
>occurs after the selector is evaluated (strange but true; see the sketch
>below). The result is that when a very active system goes idle, an
>inevitable segv will follow at some point. Very active is defined as when
>more than 25% of the possible file descriptors are used for incoming
>connections. That rarely happens in most systems. However, based on your
>configuration, it happens often enough to cause problems. You can
>immediately bypass this problem by:
>
>a) Increasing the maximum number of allowed file descriptors to more than
>1.25 times the maximum number of expected clients.
>b) Specifying 'xrd.connections dur 9999h' in the config file. This will
>disable the protocol object optimizer for roughly 416 days (long enough).
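>
>To make the misparse concrete, here is a minimal sketch (hypothetical
>names, not the actual xrootd source):
>
>   struct Link  { };
>   struct Proto { Proto *Match(Link *lp) {return this;} };
>
>   Proto *select(Proto *&curProt, Proto *newProt, Link *lp)
>   {
>   // Intended: assign first, then dereference the fresh pointer:
>   //    return (curProt = newProt) ? curProt->Match(lp) : 0;
>   // Actual parse without the parentheses:
>   //    curProt = (newProt ? curProt->Match(lp) : 0);
>   // so the OLD curProt is dereferenced before the assignment happens;
>   // if that object is stale, this is the segv.
>      return curProt = newProt ? curProt->Match(lp) : 0;
>   }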
>
>We should have a release with the fix by the end of the week.
>
>Andy
>On Tue, 12 Apr 2005, Peter Elmer wrote:
>
>  
>
>>  Hi Dan,
>>
>>On Mon, Apr 11, 2005 at 10:41:13AM -0500, Dan Bradley wrote:
>>    
>>
>>>Thanks for looking into this.  I have copied the logs into the AFS
>>>directory where you were looking.  However, I see nothing reported at
>>>the time of the crash.  Perhaps I should turn on more verbose tracing?
>>>      
>>>
>>  Andy, do you have any idea what the crash reported below might be? (See
>>traceback at the very end of this mail.)
>>
>>  Perhaps I confused myself with the time of the core file. Did the
>>crash happen earlier than "Apr  6 20:01"? I can see from the 20050405 log
>>that it was restarted around "050405 10:28:20". Is that perhaps the
>>restart after the crash?
>>
>>  In any case the last entry before the restart was at "050404 23:16:53"
>>and it looks like a normal redirect, so there is no extra info (i.e.
>>"last words") there.
>>
>>    
>>
>>>The redirector hasn't crashed since my last report.  Perhaps this is
>>>related to somewhat decreased load.  I've been keeping an eye on open
>>>file descriptors, which are well below 1k, but I'm just glancing now
>>>and then by hand, so I don't yet know how bursty it is.
>>>      
>>>
>>                                   Pete
>>
>>
>>    
>>
>>>On Apr 9, 2005, at 2:53 AM, Peter Elmer wrote:
>>>
>>>      
>>>
>>>>  Hi Dan,
>>>>
>>>>  Andy is travelling today and tomorrow, so he'll probably be able to
>>>>take a look at this on Monday. Were there any messages in the
>>>>(redirector) xrootd log file? I see the core file is time-stamped
>>>>"Apr  6 20:01", but the log for 20050406 isn't in the "logs" area in
>>>>your afs area.
>>>>
>>>>  One thing occurs to me already: how many clients are now hitting your
>>>>redirector? Is it possible that you are hitting the default file
>>>>descriptor limit?
>>>>
>>>>  http://xrootd.slac.stanford.edu/hardware_os_config.html
>>>>
>>>>In that case, you might see some messages in the xrootd logs about
>>>>problems starting new threads, for example.
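>>>>
>>>>  If you want to check what the process is actually allowed, one quick
>>>>way (just a generic sketch using the standard getrlimit() call, nothing
>>>>xrootd-specific) is:
>>>>
>>>>   #include <cstdio>
>>>>   #include <sys/resource.h>
>>>>
>>>>   int main()
>>>>   {
>>>>      struct rlimit rl;
>>>>      // RLIMIT_NOFILE is the per-process open file descriptor limit
>>>>      if (getrlimit(RLIMIT_NOFILE, &rl)) {perror("getrlimit"); return 1;}
>>>>      printf("fd limit: soft=%llu hard=%llu\n",
>>>>             (unsigned long long)rl.rlim_cur,
>>>>             (unsigned long long)rl.rlim_max);
>>>>      return 0;
>>>>   }
>>>>
>>>>('ulimit -n' in the shell shows the same soft limit for anything you
>>>>start from that shell.)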
>>>>
>>>>  BTW, I found the logs in your area: you are doing a reasonable number
>>>>of file opens. (261k on 20050326, for example.) Very nice. The average
>>>>rate isn't huge, but presumably there are some peaks as things start in
>>>>bunches and whatnot. It also looks like there are about 2k files being
>>>>opened, so most of them are probably cached in memory, too. Is it all
>>>>pileup for MC? At least the redirects seem fairly balanced over the
>>>>servers:
>>>>
>>>>  30589 s5n01.hep.wisc.edu:1094
>>>>  32969 s5n03.hep.wisc.edu:1094
>>>>  32960 s5n04.hep.wisc.edu:1094
>>>>  32938 s5n05.hep.wisc.edu:1094
>>>>  32942 s5n06.hep.wisc.edu:1094
>>>>  32936 s5n07.hep.wisc.edu:1094
>>>>  32948 s5n08.hep.wisc.edu:1094
>>>>  32939 s5n09.hep.wisc.edu:1094
>>>>
>>>>  I also see from the 20050405 log file that there were something like
>>>>381 different machines connecting from all over campus (and perhaps
>>>>some may have had more than one application connecting if they are
>>>>dual-cpu). That is still a bit short of 1024 even if I put in the
>>>>factor of 2, but it depends on what else is happening on the machine.
>>>>You can probably check this from the xrootd log and with 'lsof'.
>>>>
>>>>                                   Pete
>>>>
>>>>On Fri, Apr 08, 2005 at 03:46:39PM -0500, Dan Bradley wrote:
>>>>        
>>>>
>>>>>I am getting occasional crashes of xrootd on the redirector.  I am
>>>>>running version 20050328-0656 under Scientific Linux 3.0.4.
>>>>>
>>>>>The redirector crashes with the following stack dump:
>>>>>
>>>>>#0  0x080845cf in typeinfo name for XrdXrootdPrepare ()
>>>>>#1  0x0807266f in XrdProtocol_Select::Process (this=0x80917d8, lp=0x83df5dc) at XrdProtocol.cc:165
>>>>>#2  0x0806d622 in XrdLink::DoIt (this=0x83df5dc) at XrdLink.cc:296
>>>>>#3  0x080739bc in XrdScheduler::Run (this=0x8091640) at XrdScheduler.cc:293
>>>>>#4  0x08072a1c in XrdStartWorking (carg=0x8091640) at XrdScheduler.cc:82
>>>>>#5  0x0807f4be in XrdOucThread_Xeq (myargs=0x839eba0) at XrdOucPthread.cc:80
>>>>>#6  0x00f4adec in start_thread () from /lib/tls/libpthread.so.0
>>>>>#7  0x0032ea2a in clone () from /lib/tls/libc.so.6
>>>>>
>>>>>You may find the core file here:
>>>>>
>>>>>/afs/hep.wisc.edu/cms/sw/xrootd/debug/core.5250
>>>>>
>>>>>The binary is here:
>>>>>
>>>>>/afs/hep.wisc.edu/cms/sw/xrootd/bin/xrootd
>>>>>          
>>>>>
>>
>>-------------------------------------------------------------------------
>>Peter Elmer     E-mail: [log in to unmask]      Phone: +41 (22) 767-4644
>>Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 23, Switzerland
>>-------------------------------------------------------------------------
>>
>>    
>>