Hi Andrew, > I assume that you started xrootd with the -d option as the XrdCnsd is > producing a full trace. Yes? yes. > The only "problem" here that I see is that it claims it lost the connection > to the xrootd server. Did you kill the server or restart it at > 091009 22:23:31? No. I haven't kill the xrootd server. I think it received a event "" which caused it lost connection. Wen > > On Tue, 13 Oct 2009, wen guan wrote: > >> Hi Andrew, >> >> When using xrootd-20090904, XrdCnsd still has some problem. >> Sometimes(2 times in 3 days are found) xrdcnsd prints some errors >> below and then no response. >> (*) >> 091009 21:20:07 15685 XrdCnsdoEvents: Event: >> 'haifeng.32505:22@pcuwsun30 create 664 >> >> /xrootd/users/russlan/mc_112911_ttbar_mcnlo_10TeV_ee.e111_a86_dmp95_tid000908/DUMP/DUM.000908._00570005.pool.root.4?' >> 091009 21:20:07 15685 XrdCnsXeq: thread started >> 091009 21:20:07 15685 XrdCnsdoEvents: Event: >> 'haifeng.32505:22@pcuwsun30 closew >> >> /xrootd/users/russlan/mc_112911_ttbar_mcnlo_10TeV_ee.e111_a86_dmp95_tid000908/DUMP/DUM.000908._00570005.pool.root.4 >> 12317450' >> 091009 21:29:52 15685 XrdCnsdoEvents: Event: >> 'haifeng.6132:20@pcuwsun20 create 664 >> >> /xrootd/users/russlan/mc_112911_ttbar_mcnlo_10TeV_ee.e111_a86_dmp95_tid000908/AOD/AOD.000908._00570003.pool.root.4?' >> 091009 21:29:52 15685 XrdCnsXeq: thread started >> 091009 21:29:52 15685 XrdCnsdoEvents: Event: >> 'haifeng.6132:20@pcuwsun20 closew >> >> /xrootd/users/russlan/mc_112911_ttbar_mcnlo_10TeV_ee.e111_a86_dmp95_tid000908/AOD/AOD.000908._00570003.pool.root.4 >> 47660354' >> 091009 21:32:07 15685 XrdCnsSched: running underused thread monitor inq=0 >> 091009 21:32:07 15685 XrdCnsSched: 2 threads; 1 idle >> 091009 21:32:07 15685 XrdCnsSched: scheduling underused thread monitor >> in 780 seconds >> 091009 21:33:03 15685 XrdCnsdoEvents: Event: >> 'haifeng.11089:22@pcuwsun22 create 664 >> >> /xrootd/users/russlan/mc_112911_ttbar_mcnlo_10TeV_ee.e111_a86_dmp95_tid000908/DUMP/DUM.000908._00530003.pool.root.4?' >> 091009 21:33:03 15685 XrdCnsdoEvents: Event: >> 'haifeng.11089:22@pcuwsun22 closew >> >> /xrootd/users/russlan/mc_112911_ttbar_mcnlo_10TeV_ee.e111_a86_dmp95_tid000908/DUMP/DUM.000908._00530003.pool.root.4 >> 12482563' >> 091009 21:37:25 15685 XrdCnsdoEvents: Event: >> 'haifeng.11685:20@pcuwsun22 create 664 >> >> /xrootd/users/russlan/mc_112911_ttbar_mcnlo_10TeV_ee.e111_a86_dmp95_tid000908/LOG/sim.000908._00560015.log.4?' >> 091009 21:37:25 15685 XrdCnsdoEvents: Event: >> 'haifeng.11685:20@pcuwsun22 closew >> >> /xrootd/users/russlan/mc_112911_ttbar_mcnlo_10TeV_ee.e111_a86_dmp95_tid000908/LOG/sim.000908._00560015.log.4 >> 4568199' >> 091009 21:45:07 15685 XrdCnsSched: running underused thread monitor inq=0 >> 091009 21:45:07 15685 XrdCnsSched: 2 threads; 1 idle >> 091009 21:45:07 15685 XrdCnsSched: scheduling underused thread monitor >> in 780 seconds >> 091009 21:54:13 15685 XrdCnsdoEvents: Event: >> 'haifeng.11106:22@pcuwsun31 create 664 >> >> /xrootd/users/russlan/mc_112911_ttbar_mcnlo_10TeV_ee.e111_a86_dmp95_tid000908/DUMP/DUM.000908._00550005.pool.root.4?' >> 091009 21:54:13 15685 XrdCnsXeq: thread started >> 091009 21:54:13 15685 XrdCnsdoEvents: Event: >> 'haifeng.11106:22@pcuwsun31 closew >> >> /xrootd/users/russlan/mc_112911_ttbar_mcnlo_10TeV_ee.e111_a86_dmp95_tid000908/DUMP/DUM.000908._00550005.pool.root.4 >> 12460586' >> 091009 21:58:07 15685 XrdCnsSched: running underused thread monitor inq=0 >> 091009 21:58:07 15685 XrdCnsSched: 2 threads; 1 idle >> 091009 21:58:07 15685 XrdCnsSched: scheduling underused thread monitor >> in 780 seconds >> 091009 22:11:07 15685 XrdCnsSched: running underused thread monitor inq=0 >> 091009 22:11:07 15685 XrdCnsSched: 2 threads; 1 idle >> 091009 22:11:07 15685 XrdCnsSched: scheduling underused thread monitor >> in 780 seconds >> 091009 22:23:31 15685 XrdCnsdoEvents: Event: '' >> 091009 22:23:31 15685 XrdCnsdoEvents: traceid missing in event ? >> 091009 22:23:31 15685 XrdCnsdoEvents: Exiting; lost event connection to >> xrootd! >> >> >> Wen >> >> On Sun, Oct 11, 2009 at 4:23 PM, Andrew Hanushevsky >> <[log in to unmask]> wrote: >>> >>> Hi Wen, >>> >>> Could be but I will ask the client developer about it. For now, let's let >>> it >>> run idle for a while and hen see if it gets itself into that state. >>> >>> Andy >>> >>> On Sat, 10 Oct 2009, wen guan wrote: >>> >>>> Hi andrew, >>>> >>>> I only found it happened in slc5 machines with version >>>> xrootd-20090623(x86_64_linux_26,g++ (GCC) 3.4.6 20060404 (Red Hat >>>> 3.4.6-9) binary version) . Sometimes I even cannot kill "XrdCnsd" with >>>> "Kill" command. I need to use "Kill -9 " to kill it. >>>> Last night I downloaded xrootd-20090904 src version and compiled it >>>> locally. Until now I haven't found this problem again. Maybe the >>>> problem is related to gcc. >>>> >>>> Thanks >>>> Wen >>>> >>>> On Sat, Oct 10, 2009 at 3:17 AM, Andrew Hanushevsky >>>> <[log in to unmask]> wrote: >>>>> >>>>> Hi Wen, >>>>> >>>>> OK. Does this only happen in SLC5 or does it happen in all kernels. >>>>> >>>>> Andy >>>>> >>>>> On Fri, 9 Oct 2009, wen guan wrote: >>>>> >>>>>> Hi Andrew, >>>>>> >>>>>> For newly started XrdCnsd, when doing a xrdcp, the >>>>>> disappeared "poll" thread will appear again and an shadow will be >>>>>> created in cns server. But for old Xrdcnsd, when doing xrdcp, no >>>>>> "poll" thread will be started. no shadow will be created in cns >>>>>> server. >>>>>> >>>>>> Wen >>>>>> >>>>>> On Fri, Oct 9, 2009 at 12:58 AM, Andrew Hanushevsky <[log in to unmask]> >>>>>> wrote: >>>>>>> >>>>>>> Hi Wen, >>>>>>> >>>>>>> I guess I don't understand the actual problem. When you say 'not >>>>>>> respond" >>>>>>> what do you actually see to come to that conclusion. As for the >>>>>>> analysis, >>>>>>> and stack traces, these are all perfectly normal. The "poll" thread >>>>>>> need >>>>>>> not >>>>>>> exist at all times and, in fact, will disappear after a long period >>>>>>> of >>>>>>> inactivity. I will look some more but I need to know what "no >>>>>>> response" >>>>>>> appears like. >>>>>>> >>>>>>> Andy >>>>>>> >>>>>>> ----- Original Message ----- From: "wen guan" >>>>>>> <[log in to unmask]> >>>>>>> To: <[log in to unmask]> >>>>>>> Sent: Tuesday, October 06, 2009 12:46 PM >>>>>>> Subject: XrdCnsd problem in slc5 with version xrootd-20090623 >>>>>>> >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> The XrdCnsd in xrootd-20090623 failed to create cns frequestly. >>>>>>>> When starting XrdCnsd, it works fine. But when trying it next day, >>>>>>>> it >>>>>>>> will not response. Below is the test result. >>>>>>>> >>>>>>>> It seems the "poll" thread cannot be started for old XrdCnsd(started >>>>>>>> more than one day and there is no active in long time(maybe 780s)). >>>>>>>> For (*), Process 10816 was an old XrdCnsd(started more than one day) >>>>>>>> process which didn't response. After killed it, a new process 13665 >>>>>>>> was created which worked fine. From here we can see process 13665 >>>>>>>> had >>>>>>>> a thread "poll"(thread 2) but process 10816 didn't have it. >>>>>>>> For (**), When there was no files copied to xrootd for 10 minutes, >>>>>>>> the "poll" thread in Process 13665 disappeared. But when I did a >>>>>>>> xrdcp to this server. We will saw this "poll" thread again. But for >>>>>>>> process which has on active for long time, the "poll" thread cannot >>>>>>>> be >>>>>>>> restart again. >>>>>>>> >>>>>>>> (*) >>>>>>>> [root@pcuwdata12 ~]# ps -ef|grep xrootd-2 >>>>>>>> xrootd 10798 1 0 Oct02 ? 00:21:21 >>>>>>>> /opt/xrootd-20090623/bin//xrootd -b -d -R xrootd -l >>>>>>>> /tmp/log/xrootd/xrd.log -c /opt/xrootd-20090623/conf/xrdcluster.cfg >>>>>>>> xrootd 10809 1 0 Oct02 ? 00:00:05 >>>>>>>> /opt/xrootd-20090623/bin//cmsd -b -d -R xrootd -l >>>>>>>> /tmp/log/xrootd/cms.log -c /opt/xrootd-20090623/conf/xrdcluster.cfg >>>>>>>> xrootd 10816 10798 0 Oct02 ? 00:00:00 >>>>>>>> /opt/xrootd-20090623/bin/XrdCnsd -d -l /tmp/log/xrootd/cnsd.log >>>>>>>> root://pcuwsun04.cern.ch:1095 >>>>>>>> root 13452 13340 0 20:59 pts/1 00:00:00 grep xrootd-2 >>>>>>>> [root@pcuwdata12 ~]# pstack 10816 >>>>>>>> Thread 7 (Thread 0x410cf940 (LWP 10826)): >>>>>>>> #0 0x000000370e40ae00 in pthread_cond_timedwait@@GLIBC_2.3.2 () >>>>>>>> #1 0x0000000000479707 in XrdSysCondVar::Wait () >>>>>>>> #2 0x000000000047b8e9 in XrdScheduler::TimeSched () >>>>>>>> #3 0x000000000047a21d in XrdStartTSched () >>>>>>>> #4 0x0000000000479637 in XrdSysThread_Xeq () >>>>>>>> #5 0x000000370e4064a7 in start_thread () from >>>>>>>> /lib64/libpthread.so.0 >>>>>>>> #6 0x000000370d8d3c2d in clone () from /lib64/libc.so.6 >>>>>>>> Thread 6 (Thread 0x416ed940 (LWP 10827)): >>>>>>>> #0 0x000000370e40c9b1 in sem_wait () from /lib64/libpthread.so.0 >>>>>>>> #1 0x000000000041ffb7 in XrdSysSemaphore::Wait () >>>>>>>> #2 0x000000000047ae11 in XrdScheduler::Run () >>>>>>>> #3 0x000000000047a241 in XrdStartWorking () >>>>>>>> #4 0x0000000000479637 in XrdSysThread_Xeq () >>>>>>>> #5 0x000000370e4064a7 in start_thread () from >>>>>>>> /lib64/libpthread.so.0 >>>>>>>> #6 0x000000370d8d3c2d in clone () from /lib64/libc.so.6 >>>>>>>> Thread 5 (Thread 0x417ee940 (LWP 10828)): >>>>>>>> #0 0x000000370e40c9b1 in sem_wait () from /lib64/libpthread.so.0 >>>>>>>> #1 0x000000000041ffb7 in XrdSysSemaphore::Wait () >>>>>>>> #2 0x000000000047ae11 in XrdScheduler::Run () >>>>>>>> #3 0x000000000047a241 in XrdStartWorking () >>>>>>>> #4 0x0000000000479637 in XrdSysThread_Xeq () >>>>>>>> #5 0x000000370e4064a7 in start_thread () from >>>>>>>> /lib64/libpthread.so.0 >>>>>>>> #6 0x000000370d8d3c2d in clone () from /lib64/libc.so.6 >>>>>>>> Thread 4 (Thread 0x418ef940 (LWP 10834)): >>>>>>>> #0 0x000000370e40d5cb in read () from /lib64/libpthread.so.0 >>>>>>>> #1 0x00000000004736b9 in XrdOucStream::GetLine () >>>>>>>> #2 0x000000000041eb06 in XrdCnsDaemon::getEvents () >>>>>>>> #3 0x000000000041dcfc in CnsEvents () >>>>>>>> #4 0x0000000000479637 in XrdSysThread_Xeq () >>>>>>>> #5 0x000000370e4064a7 in start_thread () from >>>>>>>> /lib64/libpthread.so.0 >>>>>>>> #6 0x000000370d8d3c2d in clone () from /lib64/libc.so.6 >>>>>>>> Thread 3 (Thread 0x419f0940 (LWP 10835)): >>>>>>>> #0 0x000000370e40d174 in __lll_lock_wait () from >>>>>>>> /lib64/libpthread.so.0 >>>>>>>> #1 0x000000370e408aca in _L_lock_1034 () from >>>>>>>> /lib64/libpthread.so.0 >>>>>>>> #2 0x000000370e40898c in pthread_mutex_lock () from >>>>>>>> /lib64/libpthread.so.0 >>>>>>>> #3 0x000000000041ff5b in XrdSysMutex::Lock () >>>>>>>> #4 0x000000000041f6d4 in XrdCnsEvent::Queue () >>>>>>>> #5 0x000000000041ee43 in XrdCnsDaemon::getEvents () >>>>>>>> #6 0x000000000041dcfc in CnsEvents () >>>>>>>> #7 0x0000000000479637 in XrdSysThread_Xeq () >>>>>>>> #8 0x000000370e4064a7 in start_thread () from >>>>>>>> /lib64/libpthread.so.0 >>>>>>>> #9 0x000000370d8d3c2d in clone () from /lib64/libc.so.6 >>>>>>>> Thread 2 (Thread 0x41af1940 (LWP 11659)): >>>>>>>> #0 0x000000370d89a0b1 in nanosleep () from /lib64/libc.so.6 >>>>>>>> #1 0x000000370d899ed4 in sleep () from /lib64/libc.so.6 >>>>>>>> #2 0x000000000044117a in GarbageCollectorThread () >>>>>>>> #3 0x000000000045413a in XrdClientThreadDispatcher () >>>>>>>> #4 0x0000000000479637 in XrdSysThread_Xeq () >>>>>>>> #5 0x000000370e4064a7 in start_thread () from >>>>>>>> /lib64/libpthread.so.0 >>>>>>>> #6 0x000000370d8d3c2d in clone () from /lib64/libc.so.6 >>>>>>>> Thread 1 (Thread 0x2b53e8698db0 (LWP 10816)): >>>>>>>> #0 0x000000370e40c9b1 in sem_wait () from /lib64/libpthread.so.0 >>>>>>>> #1 0x000000000041ffb7 in XrdSysSemaphore::Wait () >>>>>>>> #2 0x000000000041f8bd in XrdCnsEvent::Remove () >>>>>>>> #3 0x000000000041ea05 in XrdCnsDaemon::doRequests () >>>>>>>> #4 0x00000000004202d1 in main () >>>>>>>> [root@pcuwdata12 ~]# kill 10816 >>>>>>>> [root@pcuwdata12 ~]# kill 10816 >>>>>>>> -bash: kill: (10816) - No such process >>>>>>>> [root@pcuwdata12 ~]# ps -ef|grep xrootd-2 >>>>>>>> xrootd 10798 1 0 Oct02 ? 00:21:21 >>>>>>>> /opt/xrootd-20090623/bin//xrootd -b -d -R xrootd -l >>>>>>>> /tmp/log/xrootd/xrd.log -c /opt/xrootd-20090623/conf/xrdcluster.cfg >>>>>>>> xrootd 10809 1 0 Oct02 ? 00:00:05 >>>>>>>> /opt/xrootd-20090623/bin//cmsd -b -d -R xrootd -l >>>>>>>> /tmp/log/xrootd/cms.log -c /opt/xrootd-20090623/conf/xrdcluster.cfg >>>>>>>> xrootd 13665 10798 0 21:03 ? 00:00:00 >>>>>>>> /opt/xrootd-20090623/bin/XrdCnsd -d -l /tmp/log/xrootd/cnsd.log >>>>>>>> root://pcuwsun04.cern.ch:1095 >>>>>>>> root 13678 13340 0 21:03 pts/1 00:00:00 grep xrootd-2 >>>>>>>> [root@pcuwdata12 ~]# pstack 13665 >>>>>>>> Thread 8 (Thread 0x4074b940 (LWP 13666)): >>>>>>>> #0 0x000000370e40ae00 in pthread_cond_timedwait@@GLIBC_2.3.2 () >>>>>>>> #1 0x0000000000479707 in XrdSysCondVar::Wait () >>>>>>>> #2 0x000000000047b8e9 in XrdScheduler::TimeSched () >>>>>>>> #3 0x000000000047a21d in XrdStartTSched () >>>>>>>> #4 0x0000000000479637 in XrdSysThread_Xeq () >>>>>>>> #5 0x000000370e4064a7 in start_thread () from >>>>>>>> /lib64/libpthread.so.0 >>>>>>>> #6 0x000000370d8d3c2d in clone () from /lib64/libc.so.6 >>>>>>>> Thread 7 (Thread 0x41b0f940 (LWP 13667)): >>>>>>>> #0 0x000000370e40c9b1 in sem_wait () from /lib64/libpthread.so.0 >>>>>>>> #1 0x000000000041ffb7 in XrdSysSemaphore::Wait () >>>>>>>> #2 0x000000000047ae11 in XrdScheduler::Run () >>>>>>>> #3 0x000000000047a241 in XrdStartWorking () >>>>>>>> #4 0x0000000000479637 in XrdSysThread_Xeq () >>>>>>>> #5 0x000000370e4064a7 in start_thread () from >>>>>>>> /lib64/libpthread.so.0 >>>>>>>> #6 0x000000370d8d3c2d in clone () from /lib64/libc.so.6 >>>>>>>> Thread 6 (Thread 0x41d8c940 (LWP 13668)): >>>>>>>> #0 0x000000370e40c9b1 in sem_wait () from /lib64/libpthread.so.0 >>>>>>>> #1 0x000000000041ffb7 in XrdSysSemaphore::Wait () >>>>>>>> #2 0x000000000047ae11 in XrdScheduler::Run () >>>>>>>> #3 0x000000000047a241 in XrdStartWorking () >>>>>>>> #4 0x0000000000479637 in XrdSysThread_Xeq () >>>>>>>> #5 0x000000370e4064a7 in start_thread () from >>>>>>>> /lib64/libpthread.so.0 >>>>>>>> #6 0x000000370d8d3c2d in clone () from /lib64/libc.so.6 >>>>>>>> Thread 5 (Thread 0x41e8d940 (LWP 13669)): >>>>>>>> #0 0x000000370e40d5cb in read () from /lib64/libpthread.so.0 >>>>>>>> #1 0x00000000004736b9 in XrdOucStream::GetLine () >>>>>>>> #2 0x000000000041eb06 in XrdCnsDaemon::getEvents () >>>>>>>> #3 0x000000000041dcfc in CnsEvents () >>>>>>>> #4 0x0000000000479637 in XrdSysThread_Xeq () >>>>>>>> #5 0x000000370e4064a7 in start_thread () from >>>>>>>> /lib64/libpthread.so.0 >>>>>>>> #6 0x000000370d8d3c2d in clone () from /lib64/libc.so.6 >>>>>>>> Thread 4 (Thread 0x41f8e940 (LWP 13670)): >>>>>>>> #0 0x000000370e40d5cb in read () from /lib64/libpthread.so.0 >>>>>>>> #1 0x00000000004736b9 in XrdOucStream::GetLine () >>>>>>>> #2 0x000000000041eb06 in XrdCnsDaemon::getEvents () >>>>>>>> #3 0x000000000041dcfc in CnsEvents () >>>>>>>> #4 0x0000000000479637 in XrdSysThread_Xeq () >>>>>>>> #5 0x000000370e4064a7 in start_thread () from >>>>>>>> /lib64/libpthread.so.0 >>>>>>>> #6 0x000000370d8d3c2d in clone () from /lib64/libc.so.6 >>>>>>>> Thread 3 (Thread 0x4208f940 (LWP 13671)): >>>>>>>> #0 0x000000370d89a0b1 in nanosleep () from /lib64/libc.so.6 >>>>>>>> #1 0x000000370d899ed4 in sleep () from /lib64/libc.so.6 >>>>>>>> #2 0x000000000044117a in GarbageCollectorThread () >>>>>>>> #3 0x000000000045413a in XrdClientThreadDispatcher () >>>>>>>> #4 0x0000000000479637 in XrdSysThread_Xeq () >>>>>>>> #5 0x000000370e4064a7 in start_thread () from >>>>>>>> /lib64/libpthread.so.0 >>>>>>>> #6 0x000000370d8d3c2d in clone () from /lib64/libc.so.6 >>>>>>>> Thread 2 (Thread 0x41279940 (LWP 13672)): >>>>>>>> #0 0x000000370d8cae46 in poll () from /lib64/libc.so.6 >>>>>>>> #1 0x000000000046b945 in XrdClientSock::RecvRaw () >>>>>>>> #2 0x000000000044e08c in XrdClientPhyConnection::ReadRaw () >>>>>>>> #3 0x000000000044ac3e in XrdClientMessage::ReadRaw () >>>>>>>> #4 0x000000000044e84c in XrdClientPhyConnection::BuildMessage () >>>>>>>> #5 0x000000000044b730 in SocketReaderThread () >>>>>>>> #6 0x000000000045413a in XrdClientThreadDispatcher () >>>>>>>> #7 0x0000000000479637 in XrdSysThread_Xeq () >>>>>>>> #8 0x000000370e4064a7 in start_thread () from >>>>>>>> /lib64/libpthread.so.0 >>>>>>>> #9 0x000000370d8d3c2d in clone () from /lib64/libc.so.6 >>>>>>>> Thread 1 (Thread 0x2b06bb478db0 (LWP 13665)): >>>>>>>> #0 0x000000370e40c9b1 in sem_wait () from /lib64/libpthread.so.0 >>>>>>>> #1 0x000000000041ffb7 in XrdSysSemaphore::Wait () >>>>>>>> #2 0x000000000041f8bd in XrdCnsEvent::Remove () >>>>>>>> #3 0x000000000041ea05 in XrdCnsDaemon::doRequests () >>>>>>>> #4 0x00000000004202d1 in main () >>>>>>>> [root@pcuwdata12 ~]# >>>>>>>> [root@pcuwdata12 ~]# >>>>>>>> >>>>>>>> >>>>>>>> (***) cnsd.log >>>>>>>> 091006 20:56:10 1828 XrdCnsSched: running underused thread monitor >>>>>>>> inq=0 >>>>>>>> 091006 20:56:10 1828 XrdCnsSched: 2 threads; 1 idle >>>>>>>> 091006 20:56:10 1828 XrdCnsSched: scheduling underused thread >>>>>>>> monitor >>>>>>>> in 780 seconds >>>>>>>> >>>>>>>> >>>>>>>> Thanks >>>>>>>> Wen >>>>>>>> >>>>>>> >>>>>>> >>>>> >>> >