Hi Pete & Dan,

I looked through the core file. Everything appears to be normal (except,
perhaps, for the number of existing threads). The core file shows that
user ?:[log in to unmask] was connecting (note that FD=161, so we didn't
exceed the file descriptor limit). Everything went fine until the xrootd
protocol processing step was executed. The core file indicates that a
wild branch was taken at that point (or shortly after it). The vptr in
the protocol pointer is set to 0x8409940 instead of the expected
0x80902c8. That would indicate a corrupted object (name your favorite
cause). It also appears that the count of protocol objects is wrong.

I notice that some tracing is turned on, so I will need the log file to
understand what may be going on here. Please give me a pointer to it.

Andy

On Tue, 12 Apr 2005, Peter Elmer wrote:

> Hi Dan,
>
> On Mon, Apr 11, 2005 at 10:41:13AM -0500, Dan Bradley wrote:
> > Thanks for looking into this. I have copied the logs into the AFS
> > directory where you were looking. However, I see nothing reported at
> > the time of the crash. Perhaps I should turn on more verbose tracing?
>
> Andy, do you have any idea what the crash reported below might be? (See
> traceback at the very end of this mail.)
>
> Perhaps I confused myself with the time of the core file. Did the
> crash happen earlier than "Apr 6 20:01"? I can see from the 20050405 log
> that it was restarted around "050405 10:28:20". Is that perhaps the
> restart after the crash?
>
> In any case the last entry before the restart was at "050404 23:16:53"
> and it looks like a normal redirect, so there is no extra info (i.e.
> "last words") there.
>
> > The redirector hasn't crashed since my last report. Perhaps this is
> > related to somewhat decreased load. I've been keeping an eye on open
> > file descriptors, which are well below 1k, but I'm just glancing now
> > and then by hand, so I don't yet know how bursty it is.
>
> Pete
> >
> > On Apr 9, 2005, at 2:53 AM, Peter Elmer wrote:
> >
> > > Hi Dan,
> > >
> > > Andy is travelling today and tomorrow, so he'll probably be able to
> > > take a look at this on Monday. Were there any messages in the
> > > (redirector) xrootd log file? I see the core file is time-stamped
> > > "Apr 6 20:01", but the log for 20050406 isn't in the "logs" area in
> > > your afs area.
> > >
> > > One thing occurs to me already: how many clients are now hitting your
> > > redirector? Is it possible that you are hitting the default file
> > > descriptor limit?
> > >
> > > http://xrootd.slac.stanford.edu/hardware_os_config.html
> > >
> > > In that case, you might see some messages in the xrootd logs about
> > > problems starting new threads, for example.
> > >
> > > BTW, I found the logs in your area: you are doing a reasonable
> > > number of file opens. (261k on 20050326, for example.) Very nice. The
> > > average rate isn't huge, but presumably there are some peaks as
> > > things start in bunches and whatnot. It also looks like there are
> > > about 2k files being opened, so most of them are probably cached in
> > > memory, too. Is it all pileup for MC? At least the redirects seem
> > > fairly balanced over the servers:
> > >
> > > 30589 s5n01.hep.wisc.edu:1094
> > > 32969 s5n03.hep.wisc.edu:1094
> > > 32960 s5n04.hep.wisc.edu:1094
> > > 32938 s5n05.hep.wisc.edu:1094
> > > 32942 s5n06.hep.wisc.edu:1094
> > > 32936 s5n07.hep.wisc.edu:1094
> > > 32948 s5n08.hep.wisc.edu:1094
> > > 32939 s5n09.hep.wisc.edu:1094
> > >
> > > I also see from the 20050405 log file that there were something like
> > > 381 different machines connecting from all over campus (and perhaps
> > > some may have had more than one application connecting if they are
> > > dual-cpu). That is still a bit short of 1024 even if I put in the
> > > factor of 2, but it depends on what else is happening on the machine.
> > > You can probably check this from the xrootd log and with 'lsof'.
> > >
> > > Pete
> > >
> > > On Fri, Apr 08, 2005 at 03:46:39PM -0500, Dan Bradley wrote:
> > >> I am getting occasional crashes of xrootd on the redirector. I am
> > >> running version 20050328-0656 under Scientific Linux 3.0.4.
> > >>
> > >> The redirector crashes with the following stack dump:
> > >>
> > >> #0  0x080845cf in typeinfo name for XrdXrootdPrepare ()
> > >> #1  0x0807266f in XrdProtocol_Select::Process (this=0x80917d8, lp=0x83df5dc)
> > >>     at XrdProtocol.cc:165
> > >> #2  0x0806d622 in XrdLink::DoIt (this=0x83df5dc) at XrdLink.cc:296
> > >> #3  0x080739bc in XrdScheduler::Run (this=0x8091640)
> > >>     at XrdScheduler.cc:293
> > >> #4  0x08072a1c in XrdStartWorking (carg=0x8091640)
> > >>     at XrdScheduler.cc:82
> > >> #5  0x0807f4be in XrdOucThread_Xeq (myargs=0x839eba0)
> > >>     at XrdOucPthread.cc:80
> > >> #6  0x00f4adec in start_thread () from /lib/tls/libpthread.so.0
> > >> #7  0x0032ea2a in clone () from /lib/tls/libc.so.6
> > >>
> > >> You may find the core file here:
> > >>
> > >> /afs/hep.wisc.edu/cms/sw/xrootd/debug/core.5250
> > >>
> > >> The binary is here:
> > >>
> > >> /afs/hep.wisc.edu/cms/sw/xrootd/bin/xrootd
> > >
> >
>
> -------------------------------------------------------------------------
> Peter Elmer     E-mail: [log in to unmask]      Phone: +41 (22) 767-4644
> Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 23, Switzerland
> -------------------------------------------------------------------------
>
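
The file descriptor question raised in the thread (FD=161 in the core file, 381 client machines, a limit of 1024) could be checked programmatically rather than by glancing at 'lsof' by hand. The following is only a rough sketch, not anything taken from xrootd itself: the function name `fd_headroom` is made up for illustration, the count relies on the Linux-specific `/proc/self/fd` directory, and it inspects the process that runs the script rather than a running xrootd daemon.

```python
import os
import resource


def fd_headroom():
    """Return (open_fds, soft_limit) for the current process.

    Hypothetical helper for illustration only. open_fds is None when
    /proc is not available (for example on non-Linux systems).
    """
    # Soft and hard limits on the number of open file descriptors.
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # Linux-specific: one directory entry per open descriptor.
        # The listing itself briefly holds one descriptor open, so the
        # count may be one higher than the steady-state value.
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        open_fds = None
    return open_fds, soft


if __name__ == "__main__":
    open_fds, soft = fd_headroom()
    print(f"open descriptors: {open_fds}, soft limit: {soft}")
```

With the figures quoted in the thread, 381 machines with a factor of 2 gives 762 connections, which is below 1024, and the FD=161 seen in the core file is well under that as well. Sampling such a count periodically would show how bursty the usage is, which a by-hand check does not.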