Pete,

Thanks for looking into this.  I have copied the logs into the AFS  
directory where you were looking.  However, I see nothing reported at  
the time of the crash.  Perhaps I should turn on more verbose tracing?

The redirector hasn't crashed since my last report.  Perhaps this is  
related to somewhat decreased load.  I've been keeping an eye on open  
file descriptors, which are well below 1k, but I'm just glancing now  
and then by hand, so I don't yet know how bursty it is.
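
The by-hand check described above could be automated with a small sampler.
What follows is a minimal sketch, not something that was run on the
redirector: it assumes a Linux host where /proc/<pid>/fd is readable, and the
function names, the sampling interval, and the 1024 limit (the default figure
mentioned elsewhere in this thread) are illustrative assumptions.

```python
import os
import time


def count_open_fds(pid):
    """Count the open file descriptors of a process (Linux only).

    Each entry in /proc/<pid>/fd is one open descriptor. Reading another
    user's process normally requires running as that user or as root.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def sample_fds(pid, samples, interval_s):
    """Take `samples` readings, `interval_s` seconds apart.

    Returns a list of (unix_timestamp, fd_count) tuples.
    """
    readings = []
    for i in range(samples):
        readings.append((time.time(), count_open_fds(pid)))
        if i < samples - 1:
            time.sleep(interval_s)
    return readings


def summarize(readings, limit=1024):
    """Report min, peak, and remaining headroom against an assumed limit."""
    counts = [count for _, count in readings]
    return {
        "min": min(counts),
        "max": max(counts),
        "headroom": limit - max(counts),
    }
```

For example, summarize(sample_fds(pid, samples=60, interval_s=1.0)) would
report the peak over a minute of one-second samples, where pid is the xrootd
process ID. Shorter intervals would be needed to catch brief bursts.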

--Dan

On Apr 9, 2005, at 2:53 AM, Peter Elmer wrote:

>   Hi Dan,
>
>   Andy is travelling today and tomorrow, so he'll probably be able to take
> a look at this on Monday. Were there any messages in the (redirector) xrootd
> log file? I see the core file is time-stamped "Apr  6 20:01", but the log for
> 20050406 isn't in the "logs" area in your afs area.
>
>   One thing occurs to me already: how many clients are now hitting your
> redirector? Is it possible that you are hitting the default file descriptor
> limit?
>
>   http://xrootd.slac.stanford.edu/hardware_os_config.html
>
> In that case, you might see some messages in the xrootd logs about problems
> starting new threads, for example.
>
>   BTW, I found the logs in your area: you are doing a reasonable number of
> file opens. (261k on 20050326, for example.) Very nice. The average rate
> isn't huge, but presumably there are some peaks as things start in bunches
> and whatnot. It also looks like there are about 2k files being opened, so
> most of them are probably cached in memory, too. Is it all pileup for MC? At
> least the redirects seem fairly balanced over the servers:
>
>   30589 s5n01.hep.wisc.edu:1094
>   32969 s5n03.hep.wisc.edu:1094
>   32960 s5n04.hep.wisc.edu:1094
>   32938 s5n05.hep.wisc.edu:1094
>   32942 s5n06.hep.wisc.edu:1094
>   32936 s5n07.hep.wisc.edu:1094
>   32948 s5n08.hep.wisc.edu:1094
>   32939 s5n09.hep.wisc.edu:1094
>
>   I also see from the 20050405 log file that there were something like 381
> different machines connecting from all over campus (and perhaps some may have
> had more than one application connecting if they are dual-cpu). That is still
> a bit short of 1024 even if I put in the factor of 2, but it depends on what
> else is happening on the machine. You can probably check this from the xrootd
> log and with 'lsof'.
>
>                                    Pete
>
> On Fri, Apr 08, 2005 at 03:46:39PM -0500, Dan Bradley wrote:
>> I am getting occasional crashes of xrootd on the redirector.  I am
>> running version 20050328-0656 under Scientific Linux 3.0.4.
>>
>> The redirector crashes with the following stack dump:
>>
>> #0  0x080845cf in typeinfo name for XrdXrootdPrepare ()
>> #1  0x0807266f in XrdProtocol_Select::Process (this=0x80917d8, lp=0x83df5dc)
>>    at XrdProtocol.cc:165
>> #2  0x0806d622 in XrdLink::DoIt (this=0x83df5dc) at XrdLink.cc:296
>> #3  0x080739bc in XrdScheduler::Run (this=0x8091640) at XrdScheduler.cc:293
>> #4  0x08072a1c in XrdStartWorking (carg=0x8091640) at XrdScheduler.cc:82
>> #5  0x0807f4be in XrdOucThread_Xeq (myargs=0x839eba0) at XrdOucPthread.cc:80
>> #6  0x00f4adec in start_thread () from /lib/tls/libpthread.so.0
>> #7  0x0032ea2a in clone () from /lib/tls/libc.so.6
>>
>> You may find the core file here:
>>
>> /afs/hep.wisc.edu/cms/sw/xrootd/debug/core.5250
>>
>> The binary is here:
>>
>> /afs/hep.wisc.edu/cms/sw/xrootd/bin/xrootd
>
>
>
> -------------------------------------------------------------------------
> Peter Elmer     E-mail: [log in to unmask]      Phone: +41 (22) 767-4644
> Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 23, Switzerland
> -------------------------------------------------------------------------