ciao marian, in my case the crash was when 2 very specific cmsd were connecting to the redirector cmsd. one was IFCA, I think the second was in Russia (maybe Jinr???). No real idea why that happened, though .... probably Andy has ;) in the case of US redirectors, maybe you were lucky and did not stumble on any problematic site? tom On Tue, Aug 25, 2015 at 5:18 PM, Marian Zvada <[log in to unmask]> wrote: > Hi Andy, Tom, > > I see commit in src/XrdCms/XrdCmsProtocol.cc, not that I'm lazy to read > changes but I'm afraid I won't understand details. So my naive questions > are: > > 1) What are conditions tigger this segfaults, do we know? > 2) I'm asking for 1) because we've quite couple of redirectors and servers > running 4.2.2. which do the work flawlessly. > 3) Do people running 4.2.2 do critical downgrade at this point? > > So what's the problem here, please? We simply don't want to shout now on > everyone don't upgrade to latest release 4.2.2 as we are fairly pushing > about the upgrade since a months ago. Especially those who still run 3.3.x > etc. > > Could you advise, please? > > Thanks, > Marian > > > > On 8/25/15 12:55 AM, Andrew Hanushevsky wrote: > >> Hi Tommaso, >> >> Thanks. It's what I thougt it was. I have already pushed a fix into git >> head. It's a long-standing issue that never was encountered before. We >> should ahve this ix in 4.2.3 which is comming soon. >> >> Andy >> >> On Tue, 25 Aug 2015, Tommaso Boccali wrote: >> >> ciao, andrew, I asked the spanish guys for the cmsd version. >>> in the meantime i replicated a machine with the 4.2.2 setup, and >>> analyzed a >>> core file. >>> >>> I get: >>> >>> [root@xrootd-redic-bck xrootd]# gdb cmsd core.63160 >>> GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6) >>> Copyright (C) 2010 Free Software Foundation, Inc. >>> License GPLv3+: GNU GPL version 3 or later >>> <http://gnu.org/licenses/gpl.html >>> >>>> >>>> This is free software: you are free to change and redistribute it. >>> There is NO WARRANTY, to the extent permitted by law. Type "show >>> copying" >>> and "show warranty" for details. >>> This GDB was configured as "x86_64-redhat-linux-gnu". >>> For bug reporting instructions, please see: >>> <http://www.gnu.org/software/gdb/bugs/>... >>> Reading symbols from /usr/bin/cmsd...Reading symbols from >>> /usr/lib/debug/usr/bin/cmsd.debug...done. >>> done. >>> [New Thread 63535] >>> ... >>> ... >>> ... >>> Missing separate debuginfo for >>> Try: yum --disablerepo='*' --enablerepo='*-debug*' install >>> /usr/lib/debug/.build-id/9e/e239647f77340975f782611a1fa728c355ecda >>> Reading symbols from /usr/lib64/libXrdServer.so.2.0.0...Reading symbols >>> from /usr/lib/debug/usr/lib64/libXrdServer.so.2.0.0.debug...done. >>> done. >>> Loaded symbols for /usr/lib64/libXrdServer.so.2.0.0 >>> Reading symbols from /usr/lib64/libXrdUtils.so.2.0.0...Reading symbols >>> from >>> /usr/lib/debug/usr/lib64/libXrdUtils.so.2.0.0.debug...done. >>> done. >>> Loaded symbols for /usr/lib64/libXrdUtils.so.2.0.0 >>> Reading symbols from /lib64/libpthread-2.12.so...Reading symbols from >>> /usr/lib/debug/lib64/libpthread-2.12.so.debug...done. >>> [Thread debugging using libthread_db enabled] >>> done. >>> Loaded symbols for /lib64/libpthread-2.12.so >>> Reading symbols from /lib64/librt-2.12.so...Reading symbols from >>> /usr/lib/debug/lib64/librt-2.12.so.debug...done. >>> done. >>> Loaded symbols for /lib64/librt-2.12.so >>> Reading symbols from /usr/lib64/libstdc++.so.6.0.13...Reading symbols >>> from >>> /usr/lib/debug/usr/lib64/libstdc++.so.6.0.13.debug...done. >>> done. >>> Loaded symbols for /usr/lib64/libstdc++.so.6.0.13 >>> Reading symbols from /lib64/libm-2.12.so...Reading symbols from >>> /usr/lib/debug/lib64/libm-2.12.so.debug...done. >>> done. >>> Loaded symbols for /lib64/libm-2.12.so >>> Reading symbols from /lib64/libgcc_s-4.4.6-20120305.so.1...Reading >>> symbols >>> from /usr/lib/debug/lib64/libgcc_s-4.4.6-20120305.so.1.debug...done. >>> done. >>> Loaded symbols for /lib64/libgcc_s-4.4.6-20120305.so.1 >>> Reading symbols from /lib64/libc-2.12.so...Reading symbols from >>> /usr/lib/debug/lib64/libc-2.12.so.debug...done. >>> done. >>> Loaded symbols for /lib64/libc-2.12.so >>> Reading symbols from /lib64/libdl-2.12.so...Reading symbols from >>> /usr/lib/debug/lib64/libdl-2.12.so.debug...done. >>> done. >>> Loaded symbols for /lib64/libdl-2.12.so >>> Reading symbols from /lib64/ld-2.12.so...Reading symbols from >>> /usr/lib/debug/lib64/ld-2.12.so.debug...done. >>> done. >>> Loaded symbols for /lib64/ld-2.12.so >>> Reading symbols from /lib64/libnss_files-2.12.so...Reading symbols from >>> /usr/lib/debug/lib64/libnss_files-2.12.so.debug...done. >>> done. >>> Loaded symbols for /lib64/libnss_files-2.12.so >>> Reading symbols from /lib64/libnss_dns-2.12.so...Reading symbols from >>> /usr/lib/debug/lib64/libnss_dns-2.12.so.debug...done. >>> done. >>> Loaded symbols for /lib64/libnss_dns-2.12.so >>> Reading symbols from /lib64/libresolv-2.12.so...Reading symbols from >>> /usr/lib/debug/lib64/libresolv-2.12.so.debug...done. >>> done. >>> Loaded symbols for /lib64/libresolv-2.12.so >>> Core was generated by `/usr/bin/cmsd -l /var/log/xrootd/cmsd.log -c >>> /etc/xrootd/xrootd-redir-cms.cfg -'. >>> Program terminated with signal 11, Segmentation fault. >>> #0 __pthread_mutex_lock (mutex=0x30) at pthread_mutex_lock.c:50 >>> 50 unsigned int type = PTHREAD_MUTEX_TYPE (mutex); >>> (gdb) where >>> #0 __pthread_mutex_lock (mutex=0x30) at pthread_mutex_lock.c:50 >>> #1 0x0000000000437519 in Lock (this=0x7f3038002160, lp=<value optimized >>> out>) >>> at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:149 >>> #2 Lock (this=0x7f3038002160, lp=<value optimized out>) at >>> /usr/src/debug/xrootd/xrootd/src/XrdCms/XrdCmsNode.hh:143 >>> #3 XrdCmsProtocol::Process (this=0x7f3038002160, lp=<value optimized >>> out>) >>> at /usr/src/debug/xrootd/xrootd/src/XrdCms/XrdCmsProtocol.cc:480 >>> #4 0x00007f30fa1e6149 in XrdLink::DoIt (this=0x7f3038000b98) at >>> /usr/src/debug/xrootd/xrootd/src/Xrd/XrdLink.cc:397 >>> #5 0x00007f30fa1e9625 in XrdScheduler::Run (this=0x647478) at >>> /usr/src/debug/xrootd/xrootd/src/Xrd/XrdScheduler.cc:333 >>> #6 0x00007f30fa1e9819 in XrdStartWorking (carg=<value optimized out>) at >>> /usr/src/debug/xrootd/xrootd/src/Xrd/XrdScheduler.cc:85 >>> #7 0x00007f30fa1ae3af in XrdSysThread_Xeq (myargs=0x7f30340008c0) at >>> /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.cc:86 >>> #8 0x00007f30f9f6f9d1 in start_thread (arg=0x7f30f0aca700) at >>> pthread_create.c:301 >>> #9 0x00007f30f9308b6d in clone () at >>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 >>> >>> >>> >>> does it help? >>> >>> I have many core files, they should be all ~ equivalent. >>> >>> I have put one on >>> https://www.dropbox.com/s/6eqyf2th3oxmk4b/core.64945?dl=0 >>> >>> >>> >>> >>> tom >>> >>> On Mon, Aug 24, 2015 at 10:14 PM, Andrew Hanushevsky >>> <[log in to unmask]> >>> wrote: >>> >>> Hi Tommaso, >>>> >>>> OK, this is now not critical as I seem to have found the problem. >>>> However, >>>> a backtrace would still be good to have as assurance. Yes, it is a >>>> bug when >>>> invalid login data is encountered. Now we have tto find out why this >>>> actually happened. >>>> >>>> What is wngw.ifca.es actually running (i.e. what cmsd version). >>>> >>>> >>>> Andy >>>> >>>> On Mon, 24 Aug 2015, Tommaso Boccali wrote: >>>> >>>> ciao andrew, tomorrow I can try but the fact is taht today in the end >>>> I had >>>> >>>>> to downgrade, since it is a production server. >>>>> >>>>> so i have to reupgrade, take the snapshot and go back as fast as >>>>> possible >>>>> :( >>>>> >>>>> ciao ciao >>>>> >>>>> tom >>>>> >>>>> On Mon, Aug 24, 2015 at 9:47 PM, Andrew Hanushevsky < >>>>> [log in to unmask]> >>>>> wrote: >>>>> >>>>> Hi Tommaso, >>>>> >>>>>> >>>>>> Both daemons (xrootd and cmsd) will exit if you attempt to run them as >>>>>> root. This is a security feature. You can run them as root but only >>>>>> after >>>>>> specifically confirming this via command line options (i.e. you accept >>>>>> the >>>>>> risks). As for the SEGV, that's clearly a bug. Is it possible to get a >>>>>> stack trace of the thread that got the SEGV? Please make sure to >>>>>> install >>>>>> the debug RPM so we can get actual line numbers. >>>>>> >>>>>> Andy >>>>>> >>>>>> >>>>>> On Mon, 24 Aug 2015, Tommaso Boccali wrote: >>>>>> >>>>>> uhm, >>>>>> >>>>>> >>>>>>> - the strace line was just my fault, I was trying running as root >>>>>>> on the >>>>>>> command line >>>>>>> - when I retried with user xrootd, I get instead the lines below, >>>>>>> which >>>>>>> terminate with a segv (*) >>>>>>> >>>>>>> so the last message is consistent with the one in the logs: >>>>>>> >>>>>>> [pid 65395] writev(2, [{"150824 10:09:53 65395 ", 22}, {"Pup", 3}, >>>>>>> {": >>>>>>> ", >>>>>>> 2}, {"buffer overrun unpacking", 24}, {" ", 1}, {"short arg 0: >>>>>>> ident.", >>>>>>> 19}, {"\n", 1}], 7) = 72 >>>>>>> [pid 65395] gettid() = 65395 >>>>>>> [pid 65395] writev(2, [{"150824 10:09:53 65395 ", 22}, {"Login", >>>>>>> 5}, {": >>>>>>> ", >>>>>>> 2}, {"wngw.ifca.es", 12}, {" ", 1}, {"login failed;", 13}, {" ", 1}, >>>>>>> {"invalid login data", 18}, {"\n", 1}], 9) = 75 >>>>>>> >>>>>>> >>>>>>> *: >>>>>>> [pid 65395] <... gettid resumed> ) = 65395 >>>>>>> [pid 65395] write(2, "150824 10:09:53 65395 ", 22) = 22 >>>>>>> [pid 65395] write(2, "Xrd", 3) = 3 >>>>>>> [pid 65395] write(2, "Inet", 4) = 4 >>>>>>> [pid 65395] write(2, ": ", 2) = 2 >>>>>>> [pid 65395] write(2, "Accepted connection from ", 25) = 25 >>>>>>> [pid 65395] write(2, "23", 2) = 2 >>>>>>> [pid 65395] write(2, "@", 1) = 1 >>>>>>> [pid 65395] write(2, "wngw.ifca.es", 12) = 12 >>>>>>> [pid 65395] write(2, "\n", 1) = 1 >>>>>>> [pid 65395] futex(0x6473c8, FUTEX_WAKE_PRIVATE, 1) = 0 >>>>>>> [pid 65395] poll([{fd=23, events=POLLIN|POLLRDNORM}], 1, 1000) = 1 >>>>>>> ([{fd=23, revents=POLLIN|POLLRDNORM}]) >>>>>>> [pid 65395] recvfrom(23, "\0\0\0\0\0\0\0\0", 8, MSG_PEEK, NULL, >>>>>>> NULL) = >>>>>>> 8 >>>>>>> [pid 65395] gettid() = 65395 >>>>>>> [pid 65395] write(2, "150824 10:09:53 65395 ", 22) = 22 >>>>>>> [pid 65395] write(2, "Xrd", 3) = 3 >>>>>>> [pid 65395] write(2, "Protocol", 8) = 8 >>>>>>> [pid 65395] write(2, ": ", 2) = 2 >>>>>>> [pid 65395] write(2, "matched protocol ", 17) = 17 >>>>>>> [pid 65395] write(2, "cmsd", 4) = 4 >>>>>>> [pid 65395] write(2, "\n", 1) = 1 >>>>>>> [pid 65395] epoll_ctl(12, EPOLL_CTL_ADD, 23, {0, {u32=4160757352, >>>>>>> u64=140071634214504}}) = 0 >>>>>>> [pid 65395] gettid() = 65395 >>>>>>> [pid 65395] write(2, "150824 10:09:53 65395 ", 22) = 22 >>>>>>> [pid 65395] write(2, "?:[log in to unmask]", 17) = 17 >>>>>>> [pid 65395] write(2, " ", 1) = 1 >>>>>>> [pid 65395] write(2, "Xrd", 3) = 3 >>>>>>> [pid 65395] write(2, "Poll", 4) = 4 >>>>>>> [pid 65395] write(2, ": ", 2) = 2 >>>>>>> [pid 65395] write(2, "FD ", 3) = 3 >>>>>>> [pid 65395] write(2, "23", 2) = 2 >>>>>>> [pid 65395] write(2, " attached to poller ", 20) = 20 >>>>>>> [pid 65395] write(2, "2", 1) = 1 >>>>>>> [pid 65395] write(2, "; num=", 6) = 6 >>>>>>> [pid 65395] write(2, "1", 1) = 1 >>>>>>> [pid 65395] write(2, "\n", 1) = 1 >>>>>>> [pid 65395] poll([{fd=23, events=POLLIN|POLLRDNORM}], 1, 5000) = 1 >>>>>>> ([{fd=23, revents=POLLIN|POLLRDNORM}]) >>>>>>> [pid 65395] recvfrom(23, "\0\0\0\0\0\0\0\0", 8, 0, NULL, NULL) = 8 >>>>>>> [pid 65395] gettid() = 65395 >>>>>>> [pid 65395] writev(2, [{"150824 10:09:53 65395 ", 22}, {"Pup", 3}, >>>>>>> {": >>>>>>> ", >>>>>>> 2}, {"buffer overrun unpacking", 24}, {" ", 1}, {"short arg 0: >>>>>>> ident.", >>>>>>> 19}, {"\n", 1}], 7) = 72 >>>>>>> [pid 65395] gettid() = 65395 >>>>>>> [pid 65395] writev(2, [{"150824 10:09:53 65395 ", 22}, {"Login", >>>>>>> 5}, {": >>>>>>> ", >>>>>>> 2}, {"wngw.ifca.es", 12}, {" ", 1}, {"login failed;", 13}, {" ", 1}, >>>>>>> {"invalid login data", 18}, {"\n", 1}], 9) = 75 >>>>>>> [pid 65395] --- SIGSEGV (Segmentation fault) @ 0 (0) --- >>>>>>> Process 65395 detached >>>>>>> [pid 65394] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65389] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65397] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65396] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65379] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65383] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65391] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65392] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65393] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65390] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65386] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65385] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65388] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65387] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65384] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65382] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65381] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65380] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65378] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65377] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65376] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> [pid 65375] +++ killed by SIGSEGV (core dumped) +++ >>>>>>> +++ killed by SIGSEGV (core dumped) +++ >>>>>>> >>>>>>> On Mon, Aug 24, 2015 at 10:02 AM, Jan Iven <[log in to unmask]> wrote: >>>>>>> >>>>>>> On 08/24/2015 09:51 AM, Tommaso Boccali wrote: >>>>>>> >>>>>>> >>>>>>>> Ciao, I am trying to upgrade one of the CMS EU redirectors from >>>>>>>> 3.3.6 >>>>>>>> to >>>>>>>> >>>>>>>> 4.2.2 (with no configuration changed at first approx) >>>>>>>>> >>>>>>>>> The problem is that the main cmsd seems to die soon after start. No >>>>>>>>> real >>>>>>>>> message in the logs, but with strace I see a very suspect >>>>>>>>> >>>>>>>>> writev(2, [{"Copr. 2007 Stanford University/"..., 42}, {"\n", 1}], >>>>>>>>> 2) = >>>>>>>>> 43 >>>>>>>>> geteuid() = 0 >>>>>>>>> gettid() = 53572 >>>>>>>>> writev(2, [{"150824 09:47:27 53572 ", 22}, {"Config", 6}, {": ", >>>>>>>>> 2}, >>>>>>>>> {"Security reasons prohibit cmsd r"..., 73}, {"\n", 1}], 5) = 104 >>>>>>>>> >>>>>>>>> >>>>>>>>> Well, that line ought to end up either on STDERR somewhere or in >>>>>>>>> some >>>>>>>>> >>>>>>>> log >>>>>>>> file. Alternatively, suggest "strace -s 1024" to get the full error >>>>>>>> message.. >>>>>>>> >>>>>>>> Cheers >>>>>>>> jan >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>> Tommaso Boccali >>>>>>> INFN Pisa >>>>>>> >>>>>>> >>>>>>> ######################################################################## >>>>>>> >>>>>>> Use REPLY-ALL to reply to list >>>>>>> >>>>>>> To unsubscribe from the XROOTD-L list, click the following link: >>>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1 >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>> -- >>>>> Tommaso Boccali >>>>> INFN Pisa >>>>> >>>>> >>>>> >>> >>> -- >>> Tommaso Boccali >>> INFN Pisa >>> >>> >> ######################################################################## >> Use REPLY-ALL to reply to list >> >> To unsubscribe from the XROOTD-L list, click the following link: >> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1 >> > -- Tommaso Boccali INFN Pisa ######################################################################## Use REPLY-ALL to reply to list To unsubscribe from the XROOTD-L list, click the following link: https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1