Hello Andy,

I ran gcore on the top xrootd (7040) and olbd (7041) and put the core files
at SLAC:

  ~schott/core.704*

The executable and shared libraries I have been using are

  ~schott/xrootd-20050623-0016.i386_linux24.tgz

You might find additional needed information on the web page:

  http://www.slac.stanford.edu/~schott/internal/gridka/xrootd.html

Please let me know if I can kill and restart xrootd now or if you need
something more regarding this issue.

Cheers,
Gregory

On Fri, 8 Jul 2005, Peter Elmer wrote:

> Hi Gregory,
>
> Regarding which process to gcore, it is probably 7040, but try running
> 'pstree' and take the one at the top of the tree.
>
> (Andy will have to respond regarding stopping/starting and additional
> tests.)
>
> Pete
>
> On Thu, Jul 07, 2005 at 07:10:15PM +0200, Gregory Schott wrote:
>> Hello Andy,
>>
>> Just to let you know, I couldn't yet do the gcore as it was not installed
>> on these NAS boxes. The GridKa admins were apparently busy... I found out
>> today that gcore exists only since Red Hat 8.0 and these NAS boxes are Red Hat
>> 7.3. I tried to compile from source but I am still missing some dependencies.
>>
>> I have two questions:
>> - on which of the processes should I run gcore (there are many running)
>>
>> 000 S xrootd 7040 1 0 85 10 - 4780 schedu Jul04 ? 00:00:01 /home/xrootd/software/current/bin/xrootd -p 1094 -l /tmp/f01-014-106.xrdlog -c c
>> 000 S xrootd 7041 1 0 85 10 - 3417 schedu Jul04 ? 00:00:00 /home/xrootd/software/current/bin/olbd -s -l /tmp/f01-014-106.olblog -c config/d
>> 040 S xrootd 7042 7041 0 85 10 - 3417 schedu Jul04 ? 00:00:00 /home/xrootd/software/current/bin/olbd -s -l /tmp/f01-014-106.olblog -c config/d
>> 040 S xrootd 7043 7042 0 85 10 - 3417 schedu Jul04 ? 00:00:00 /home/xrootd/software/current/bin/olbd -s -l /tmp/f01-014-106.olblog -c config/d
>> 040 S xrootd 7044 7040 0 85 10 - 4780 schedu Jul04 ?
00:00:00 /home/xrootd/software/current/bin/xrootd -p 1094 -l /tmp/f01-014-106.xrdlog -c c
>> 040 S xrootd 7045 7044 0 85 10 - 4780 schedu Jul04 ? 00:00:00 /home/xrootd/software/current/bin/xrootd -p 1094 -l /tmp/f01-014-106.xrdlog -c c
>> 040 S xrootd 7046 7044 0 85 10 - 4780 schedu Jul04 ? 00:00:00 /home/xrootd/software/current/bin/xrootd -p 1094 -l /tmp/f01-014-106.xrdlog -c c
>> 040 S xrootd 7048 7044 0 85 10 - 4780 schedu Jul04 ? 00:00:00 /home/xrootd/software/current/bin/xrootd -p 1094 -l /tmp/f01-014-106.xrdlog -c c
>> 040 S xrootd 7049 7044 0 85 10 - 4780 schedu Jul04 ? 00:00:00 /home/xrootd/software/current/bin/xrootd -p 1094 -l /tmp/f01-014-106.xrdlog -c c
>> 040 S xrootd 7050 7044 0 85 10 - 4780 schedu Jul04 ? 00:00:00 /home/xrootd/software/current/bin/xrootd -p 1094 -l /tmp/f01-014-106.xrdlog -c c
>> 040 S xrootd 7051 7044 0 85 10 - 4780 schedu Jul04 ? 00:00:00 /home/xrootd/software/current/bin/xrootd -p 1094 -l /tmp/f01-014-106.xrdlog -c c
>> 040 S xrootd 7052 7044 0 90 10 - 4780 schedu Jul04 ? 00:00:00 /home/xrootd/software/current/bin/xrootd -p 1094 -l /tmp/f01-014-106.xrdlog -c c
>> 040 S xrootd 7053 7044 0 90 10 - 4780 schedu Jul04 ? 00:00:00 /home/xrootd/software/current/bin/xrootd -p 1094 -l /tmp/f01-014-106.xrdlog -c c
>> 040 S xrootd 7054 7044 0 85 10 - 4780 schedu Jul04 ? 00:00:00 /home/xrootd/software/current/bin/xrootd -p 1094 -l /tmp/f01-014-106.xrdlog -c c
>> 040 S xrootd 7055 7044 0 85 10 - 4780 schedu Jul04 ? 00:00:00 /home/xrootd/software/current/bin/xrootd -p 1094 -l /tmp/f01-014-106.xrdlog -c c
>> 040 S xrootd 7056 7042 0 85 10 - 3417 schedu Jul04 ? 00:00:00 /home/xrootd/software/current/bin/olbd -s -l /tmp/f01-014-106.olblog -c config/d
>> 040 S xrootd 7057 7042 0 90 10 - 3417 schedu Jul04 ? 00:00:00 /home/xrootd/software/current/bin/olbd -s -l /tmp/f01-014-106.olblog -c config/d
>> 040 S xrootd 7058 7042 0 85 10 - 3417 schedu Jul04 ?
00:00:00 /home/xrootd/software/current/bin/olbd -s -l /tmp/f01-014-106.olblog -c config/d
>> 040 S xrootd 7062 7042 0 85 10 - 3417 schedu Jul04 ? 00:00:00 /home/xrootd/software/current/bin/olbd -s -l /tmp/f01-014-106.olblog -c config/d
>> 040 S xrootd 25073 7044 0 85 10 - 4780 rt_sig Jul06 ? 00:01:34 /home/xrootd/software/current/bin/xrootd -p 1094 -l /tmp/f01-014-106.xrdlog -c c
>> 040 S xrootd 28474 7044 0 85 10 - 4780 rt_sig Jul06 ? 00:01:30 /home/xrootd/software/current/bin/xrootd -p 1094 -l /tmp/f01-014-106.xrdlog -c c
>> 040 S xrootd 6991 7044 0 85 10 - 4780 rt_sig Jul06 ? 00:00:31 /home/xrootd/software/current/bin/xrootd -p 1094 -l /tmp/f01-014-106.xrdlog -c c
>> 040 S xrootd 9026 7044 0 85 10 - 4780 rt_sig Jul06 ? 00:00:10 /home/xrootd/software/current/bin/xrootd -p 1094 -l /tmp/f01-014-106.xrdlog -c c
>> 040 S xrootd 9486 7044 0 85 10 - 4780 rt_sig Jul06 ? 00:00:07 /home/xrootd/software/current/bin/xrootd -p 1094 -l /tmp/f01-014-106.xrdlog -c c
>>
>> - can I stop and restart xrootd after the gcore or do you need some
>> additional tests?
>>
>> I'll send you the gcore once I have it.
>>
>> Cheers,
>> Gregory
>>
>>
>> On Thu, 30 Jun 2005, Andy Hanushevsky wrote:
>>
>>> Hi Gregory,
>>>
>>> A simple gcore on the one you kept is sufficient. I see that you are running
>>> 20050328. I believe we had some fixes in that area since that release. We are
>>> close to certifying a new production release that may very well be better to
>>> run overall. Anyway, if you can, put the gcore in afs along with the
>>> executable and shared library that you are using; thanks.
>>>
>>> Andy
>>>
>>> ----- Original Message ----- From: "Gregory Schott" <[log in to unmask]>
>>> To: "xrootd mailing list" <[log in to unmask]>
>>> Sent: Thursday, June 30, 2005 2:29 AM
>>> Subject: Manager appears to be dead
>>>
>>>
>>>>
>>>> Hello,
>>>>
>>>> Doing some checks on the NAS boxes at GridKa I noticed from the log files
>>>> that 3/7 file servers were stalled.
I restarted 2 of these and kept one
>>>> stalled in case you want me to do some tests. I appended below the logfiles
>>>> for the relevant period. After that time, the "Manager appears to be dead"
>>>> message continues till June 30th.
>>>>
>>>> I have another question: what is the status of xrd monitoring? Where can I
>>>> find documentation on how to use it and would it have detected that the
>>>> dataserver was stalled?
>>>>
>>>> --------
>>>>
>>>> dataserver (f01-016-106): seems to have been stalled since 25/06/2005; I kept
>>>> the server stalled in case xrd experts ask me to run any tests to
>>>> understand this issue
>>>>
>>>> 050625 00:00:00 1124 olb_Config: (c) 2004 SLAC olbd version
>>>> 20050328-0656_dbg executing as Server
>>>> 050625 07:26:53 1130 olb_Manager: Manager l01-001-122.gridka.de appears to
>>>> be dead.
>>>> 050625 07:29:06 1130 olb_Manager: Manager appears to be dead.
>>>>
>>>> redirector l01-001-122.gridka.de: old log for this period (nothing in xrd
>>>> log):
>>>>
>>>> 050625 07:07:29 14906 olb_Server: f01-014-103.gridka.de:1094 load=0; cpu=0
>>>> i/o=0 inq=0 mem=0 pag=0 dsk=0 tot=0
>>>> 050625 07:27:30 14906 olb_Manager: f01-016-108.gridka.de:1094 scheduled for
>>>> removal; not responding
>>>> 050625 07:27:30 14906 olb_Manager: f01-014-107.gridka.de:1094 scheduled for
>>>> removal; not responding
>>>> 050625 07:27:30 14906 olb_Manager: f01-016-106.gridka.de:1094 scheduled for
>>>> removal; not responding
>>>> 050625 07:27:30 14906 olb_Manager: f01-014-106.gridka.de:1094 scheduled for
>>>> removal; not responding
>>>> 050625 07:37:30 14906 olb_Server: f01-016-108.gridka.de:1094 dropped.
>>>> 050625 07:37:30 14906 olb_Server: f01-014-107.gridka.de:1094 dropped.
>>>> 050625 07:37:30 14906 olb_Server: f01-016-106.gridka.de:1094 dropped.
>>>> 050625 07:37:30 14906 olb_Server: f01-014-106.gridka.de:1094 dropped.
>>>> 050625 07:56:30 14906 olb_GetLine: Unable to read request; connection timed
>>>> out
>>>> 050625 07:56:30 14906 olb_GetLine: Unable to read request; connection timed
>>>> out
>>>> 050625 07:56:30 14906 olb_GetLine: Unable to read request; connection timed
>>>> out
>>>> 050625 07:56:30 14906 olb_Manager: server f01-016-106.gridka.de:1094 forced
>>>> out.
>>>> 050625 07:56:30 14906 olb_Manager: server f01-014-107.gridka.de:1094 forced
>>>> out.
>>>> 050625 07:56:30 14906 olb_Manager: server f01-016-108.gridka.de:1094 forced
>>>> out.
>>>> 050625 07:56:39 14906 olb_GetLine: Unable to read request; connection timed
>>>> out
>>>> 050625 07:56:39 14906 olb_Manager: server f01-014-106.gridka.de:1094 forced
>>>> out.
>>>> 050625 08:57:30 14906 olb_Server: 10.65.5.115:1094 load=0; cpu=0 i/o=0
>>>> inq=0 mem=0 pag=0 dsk=0 tot=0
>>>> 050625 08:57:30 14906 olb_Server: f01-016-109.gridka.de:1094 load=0; cpu=0
>>>> i/o=0 inq=0 mem=0 pag=0 dsk=0 tot=0
>>>>
>>>> --------
>>>>
>>>> Cheers,
>>>> Gregory
>>>>
>>>
>
>
>
> -------------------------------------------------------------------------
> Peter Elmer E-mail: [log in to unmask] Phone: +41 (22) 767-4644
> Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 23, Switzerland
> -------------------------------------------------------------------------
>
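
P.S. For anyone who hits the same "which process do I gcore?" question on a box without pstree: Pete's advice (take the one at the top of the tree) can be approximated from the ps listing quoted above by looking for processes whose parent is not itself in the listing. The following is only a minimal sketch in Python. The function name top_of_tree and the shortened sample rows are made up for illustration, and it assumes the column layout shown in the listing (PID in the 4th column, PPID in the 5th, command in the 15th); it is not part of xrootd.

```python
import os

# A few rows copied from the ps listing quoted above, with the command
# lines shortened. Assumed layout: F S UID PID PPID C PRI NI ADDR SZ
# WCHAN STIME TTY TIME CMD.
SAMPLE_ROWS = [
    "000 S xrootd 7040 1 0 85 10 - 4780 schedu Jul04 ? 00:00:01 /home/xrootd/software/current/bin/xrootd -p 1094",
    "000 S xrootd 7041 1 0 85 10 - 3417 schedu Jul04 ? 00:00:00 /home/xrootd/software/current/bin/olbd -s",
    "040 S xrootd 7042 7041 0 85 10 - 3417 schedu Jul04 ? 00:00:00 /home/xrootd/software/current/bin/olbd -s",
    "040 S xrootd 7044 7040 0 85 10 - 4780 schedu Jul04 ? 00:00:00 /home/xrootd/software/current/bin/xrootd -p 1094",
    "040 S xrootd 7045 7044 0 85 10 - 4780 schedu Jul04 ? 00:00:00 /home/xrootd/software/current/bin/xrootd -p 1094",
]


def top_of_tree(rows):
    """Map each command name to the PIDs whose parent is not in the listing."""
    parsed = []
    for row in rows:
        fields = row.split()
        # Skip header lines and anything that does not match the assumed layout.
        if len(fields) < 15 or not (fields[3].isdigit() and fields[4].isdigit()):
            continue
        pid, ppid = int(fields[3]), int(fields[4])
        parsed.append((pid, ppid, os.path.basename(fields[14])))

    known_pids = {pid for pid, _, _ in parsed}
    roots = {}
    for pid, ppid, name in parsed:
        # A process whose parent is outside the listing sits at the top of its tree.
        if ppid not in known_pids:
            roots.setdefault(name, []).append(pid)
    return roots


if __name__ == "__main__":
    for name, pids in top_of_tree(SAMPLE_ROWS).items():
        print(name, pids)
```

On the sample rows this picks 7040 for xrootd and 7041 for olbd, which matches the two processes the core files were taken from.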