Ciao Marian, in my case the crash happened when two very specific cmsd
instances were connecting to the redirector cmsd.

One was IFCA; I think the second was in Russia (maybe JINR?).
I have no real idea why that happened, though... Andy probably does ;)

In the case of the US redirectors, maybe you were lucky and did not stumble on
any problematic site?

tom

On Tue, Aug 25, 2015 at 5:18 PM, Marian Zvada <[log in to unmask]> wrote:

> Hi Andy, Tom,
>
> I see the commit in src/XrdCms/XrdCmsProtocol.cc. It's not that I'm too lazy
> to read the changes, but I'm afraid I won't understand the details. So my
> naive questions are:
>
> 1) What conditions trigger these segfaults, do we know?
> 2) I'm asking 1) because we have quite a few redirectors and servers
> running 4.2.2 which do the work flawlessly.
> 3) Should people running 4.2.2 do a critical downgrade at this point?
>
> So what's the problem here, please? We simply don't want to shout at
> everyone now not to upgrade to the latest release 4.2.2, as we have been
> pushing the upgrade fairly hard for months now, especially for those who
> still run 3.3.x etc.
>
> Could you advise, please?
>
> Thanks,
> Marian
>
>
>
> On 8/25/15 12:55 AM, Andrew Hanushevsky wrote:
>
>> Hi Tommaso,
>>
>> Thanks. It's what I thought it was. I have already pushed a fix into git
>> head. It's a long-standing issue that was never encountered before. We
>> should have this fix in 4.2.3, which is coming soon.
>>
>> Andy
>>
>> On Tue, 25 Aug 2015, Tommaso Boccali wrote:
>>
>>> Ciao Andrew, I asked the Spanish guys for the cmsd version.
>>> In the meantime I replicated a machine with the 4.2.2 setup and analyzed a
>>> core file.
>>>
>>> I get:
>>>
>>> [root@xrootd-redic-bck xrootd]# gdb cmsd core.63160
>>> GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6)
>>> Copyright (C) 2010 Free Software Foundation, Inc.
>>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>>> This is free software: you are free to change and redistribute it.
>>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>>> and "show warranty" for details.
>>> This GDB was configured as "x86_64-redhat-linux-gnu".
>>> For bug reporting instructions, please see:
>>> <http://www.gnu.org/software/gdb/bugs/>...
>>> Reading symbols from /usr/bin/cmsd...Reading symbols from
>>> /usr/lib/debug/usr/bin/cmsd.debug...done.
>>> done.
>>> [New Thread 63535]
>>> ...
>>> ...
>>> ...
>>> Missing separate debuginfo for
>>> Try: yum --disablerepo='*' --enablerepo='*-debug*' install
>>> /usr/lib/debug/.build-id/9e/e239647f77340975f782611a1fa728c355ecda
>>> Reading symbols from /usr/lib64/libXrdServer.so.2.0.0...Reading symbols
>>> from /usr/lib/debug/usr/lib64/libXrdServer.so.2.0.0.debug...done.
>>> done.
>>> Loaded symbols for /usr/lib64/libXrdServer.so.2.0.0
>>> Reading symbols from /usr/lib64/libXrdUtils.so.2.0.0...Reading symbols
>>> from
>>> /usr/lib/debug/usr/lib64/libXrdUtils.so.2.0.0.debug...done.
>>> done.
>>> Loaded symbols for /usr/lib64/libXrdUtils.so.2.0.0
>>> Reading symbols from /lib64/libpthread-2.12.so...Reading symbols from
>>> /usr/lib/debug/lib64/libpthread-2.12.so.debug...done.
>>> [Thread debugging using libthread_db enabled]
>>> done.
>>> Loaded symbols for /lib64/libpthread-2.12.so
>>> Reading symbols from /lib64/librt-2.12.so...Reading symbols from
>>> /usr/lib/debug/lib64/librt-2.12.so.debug...done.
>>> done.
>>> Loaded symbols for /lib64/librt-2.12.so
>>> Reading symbols from /usr/lib64/libstdc++.so.6.0.13...Reading symbols
>>> from
>>> /usr/lib/debug/usr/lib64/libstdc++.so.6.0.13.debug...done.
>>> done.
>>> Loaded symbols for /usr/lib64/libstdc++.so.6.0.13
>>> Reading symbols from /lib64/libm-2.12.so...Reading symbols from
>>> /usr/lib/debug/lib64/libm-2.12.so.debug...done.
>>> done.
>>> Loaded symbols for /lib64/libm-2.12.so
>>> Reading symbols from /lib64/libgcc_s-4.4.6-20120305.so.1...Reading
>>> symbols
>>> from /usr/lib/debug/lib64/libgcc_s-4.4.6-20120305.so.1.debug...done.
>>> done.
>>> Loaded symbols for /lib64/libgcc_s-4.4.6-20120305.so.1
>>> Reading symbols from /lib64/libc-2.12.so...Reading symbols from
>>> /usr/lib/debug/lib64/libc-2.12.so.debug...done.
>>> done.
>>> Loaded symbols for /lib64/libc-2.12.so
>>> Reading symbols from /lib64/libdl-2.12.so...Reading symbols from
>>> /usr/lib/debug/lib64/libdl-2.12.so.debug...done.
>>> done.
>>> Loaded symbols for /lib64/libdl-2.12.so
>>> Reading symbols from /lib64/ld-2.12.so...Reading symbols from
>>> /usr/lib/debug/lib64/ld-2.12.so.debug...done.
>>> done.
>>> Loaded symbols for /lib64/ld-2.12.so
>>> Reading symbols from /lib64/libnss_files-2.12.so...Reading symbols from
>>> /usr/lib/debug/lib64/libnss_files-2.12.so.debug...done.
>>> done.
>>> Loaded symbols for /lib64/libnss_files-2.12.so
>>> Reading symbols from /lib64/libnss_dns-2.12.so...Reading symbols from
>>> /usr/lib/debug/lib64/libnss_dns-2.12.so.debug...done.
>>> done.
>>> Loaded symbols for /lib64/libnss_dns-2.12.so
>>> Reading symbols from /lib64/libresolv-2.12.so...Reading symbols from
>>> /usr/lib/debug/lib64/libresolv-2.12.so.debug...done.
>>> done.
>>> Loaded symbols for /lib64/libresolv-2.12.so
>>> Core was generated by `/usr/bin/cmsd -l /var/log/xrootd/cmsd.log -c
>>> /etc/xrootd/xrootd-redir-cms.cfg -'.
>>> Program terminated with signal 11, Segmentation fault.
>>> #0  __pthread_mutex_lock (mutex=0x30) at pthread_mutex_lock.c:50
>>> 50  unsigned int type = PTHREAD_MUTEX_TYPE (mutex);
>>> (gdb) where
>>> #0  __pthread_mutex_lock (mutex=0x30) at pthread_mutex_lock.c:50
>>> #1  0x0000000000437519 in Lock (this=0x7f3038002160, lp=<value optimized out>)
>>>     at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:149
>>> #2  Lock (this=0x7f3038002160, lp=<value optimized out>)
>>>     at /usr/src/debug/xrootd/xrootd/src/XrdCms/XrdCmsNode.hh:143
>>> #3  XrdCmsProtocol::Process (this=0x7f3038002160, lp=<value optimized out>)
>>>     at /usr/src/debug/xrootd/xrootd/src/XrdCms/XrdCmsProtocol.cc:480
>>> #4  0x00007f30fa1e6149 in XrdLink::DoIt (this=0x7f3038000b98)
>>>     at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdLink.cc:397
>>> #5  0x00007f30fa1e9625 in XrdScheduler::Run (this=0x647478)
>>>     at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdScheduler.cc:333
>>> #6  0x00007f30fa1e9819 in XrdStartWorking (carg=<value optimized out>)
>>>     at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdScheduler.cc:85
>>> #7  0x00007f30fa1ae3af in XrdSysThread_Xeq (myargs=0x7f30340008c0)
>>>     at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.cc:86
>>> #8  0x00007f30f9f6f9d1 in start_thread (arg=0x7f30f0aca700)
>>>     at pthread_create.c:301
>>> #9  0x00007f30f9308b6d in clone ()
>>>     at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
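A note on frame #0 of the backtrace: `mutex=0x30` is a very small address. That is what one would typically see when a mutex member is reached through a null object pointer, because the address becomes the null base plus the member's offset. The thread does not confirm which pointer was null, so the following is only a minimal sketch of the arithmetic. The `HypotheticalNode` layout and the `member_address` helper are invented for illustration and are not taken from the real XrdCmsNode class.

```python
import ctypes


class HypotheticalNode(ctypes.Structure):
    # Invented layout: 0x30 bytes of other members followed by a lock.
    # This is NOT the real XrdCmsNode layout; it only mimics the offset
    # seen in the backtrace.
    _fields_ = [
        ("other_members", ctypes.c_ubyte * 0x30),
        ("lock", ctypes.c_ubyte * 40),  # stand-in for a pthread_mutex_t
    ]


def member_address(base_address, struct_type, member_name):
    """Return the address a compiler would compute for base->member."""
    return base_address + getattr(struct_type, member_name).offset


# A null object pointer plus the member offset gives the tiny address
# that would then be handed to pthread_mutex_lock().
null_base = 0
lock_address = member_address(null_base, HypotheticalNode, "lock")
print(hex(lock_address))
```

Under this assumed layout, locking through a null pointer would pass 0x30 to `pthread_mutex_lock`, which is consistent with the segmentation fault reported at pthread_mutex_lock.c:50.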
>>>
>>>
>>>
>>> does it help?
>>>
>>> I have many core files, they should be all ~ equivalent.
>>>
>>> I have put one on
>>> https://www.dropbox.com/s/6eqyf2th3oxmk4b/core.64945?dl=0
>>>
>>>
>>>
>>>
>>> tom
>>>
>>> On Mon, Aug 24, 2015 at 10:14 PM, Andrew Hanushevsky
>>> <[log in to unmask]>
>>> wrote:
>>>
>>>> Hi Tommaso,
>>>>
>>>> OK, this is now not critical as I seem to have found the problem. However,
>>>> a backtrace would still be good to have as assurance. Yes, it is a bug when
>>>> invalid login data is encountered. Now we have to find out why this
>>>> actually happened.
>>>>
>>>> What is wngw.ifca.es actually running (i.e. what cmsd version)?
>>>>
>>>>
>>>> Andy
>>>>
>>>> On Mon, 24 Aug 2015, Tommaso Boccali wrote:
>>>>
>>>>> Ciao Andrew, tomorrow I can try, but the fact is that today in the end I
>>>>> had to downgrade, since it is a production server.
>>>>>
>>>>> So I have to re-upgrade, take the snapshot, and go back as fast as
>>>>> possible :(
>>>>>
>>>>> ciao ciao
>>>>>
>>>>> tom
>>>>>
>>>>> On Mon, Aug 24, 2015 at 9:47 PM, Andrew Hanushevsky <
>>>>> [log in to unmask]>
>>>>> wrote:
>>>>>
>>>>>> Hi Tommaso,
>>>>>>
>>>>>> Both daemons (xrootd and cmsd) will exit if you attempt to run them as
>>>>>> root. This is a security feature. You can run them as root but only after
>>>>>> specifically confirming this via command line options (i.e. you accept the
>>>>>> risks). As for the SEGV, that's clearly a bug. Is it possible to get a
>>>>>> stack trace of the thread that got the SEGV? Please make sure to install
>>>>>> the debug RPM so we can get actual line numbers.
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>>
>>>>>> On Mon, 24 Aug 2015, Tommaso Boccali wrote:
>>>>>>
>>>>>>> uhm,
>>>>>>>
>>>>>>> - the strace line was just my fault, I was trying to run as root on the
>>>>>>>   command line
>>>>>>> - when I retried with user xrootd, I got instead the lines below, which
>>>>>>>   terminate with a segv (*)
>>>>>>>
>>>>>>> so the last message is consistent with the one in the logs:
>>>>>>>
>>>>>>> [pid 65395] writev(2, [{"150824 10:09:53 65395 ", 22}, {"Pup", 3}, {": ", 2}, {"buffer overrun unpacking", 24}, {" ", 1}, {"short arg 0: ident.", 19}, {"\n", 1}], 7) = 72
>>>>>>> [pid 65395] gettid()                    = 65395
>>>>>>> [pid 65395] writev(2, [{"150824 10:09:53 65395 ", 22}, {"Login", 5}, {": ", 2}, {"wngw.ifca.es", 12}, {" ", 1}, {"login failed;", 13}, {" ", 1}, {"invalid login data", 18}, {"\n", 1}], 9) = 75
>>>>>>>
>>>>>>>
>>>>>>> *:
>>>>>>> [pid 65395] <... gettid resumed> )      = 65395
>>>>>>> [pid 65395] write(2, "150824 10:09:53 65395 ", 22) = 22
>>>>>>> [pid 65395] write(2, "Xrd", 3)          = 3
>>>>>>> [pid 65395] write(2, "Inet", 4)         = 4
>>>>>>> [pid 65395] write(2, ": ", 2)           = 2
>>>>>>> [pid 65395] write(2, "Accepted connection from ", 25) = 25
>>>>>>> [pid 65395] write(2, "23", 2)           = 2
>>>>>>> [pid 65395] write(2, "@", 1)            = 1
>>>>>>> [pid 65395] write(2, "wngw.ifca.es", 12) = 12
>>>>>>> [pid 65395] write(2, "\n", 1)           = 1
>>>>>>> [pid 65395] futex(0x6473c8, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>>>> [pid 65395] poll([{fd=23, events=POLLIN|POLLRDNORM}], 1, 1000) = 1 ([{fd=23, revents=POLLIN|POLLRDNORM}])
>>>>>>> [pid 65395] recvfrom(23, "\0\0\0\0\0\0\0\0", 8, MSG_PEEK, NULL, NULL) = 8
>>>>>>> [pid 65395] gettid()                    = 65395
>>>>>>> [pid 65395] write(2, "150824 10:09:53 65395 ", 22) = 22
>>>>>>> [pid 65395] write(2, "Xrd", 3)          = 3
>>>>>>> [pid 65395] write(2, "Protocol", 8)     = 8
>>>>>>> [pid 65395] write(2, ": ", 2)           = 2
>>>>>>> [pid 65395] write(2, "matched protocol ", 17) = 17
>>>>>>> [pid 65395] write(2, "cmsd", 4)         = 4
>>>>>>> [pid 65395] write(2, "\n", 1)           = 1
>>>>>>> [pid 65395] epoll_ctl(12, EPOLL_CTL_ADD, 23, {0, {u32=4160757352, u64=140071634214504}}) = 0
>>>>>>> [pid 65395] gettid()                    = 65395
>>>>>>> [pid 65395] write(2, "150824 10:09:53 65395 ", 22) = 22
>>>>>>> [pid 65395] write(2, "?:[log in to unmask]", 17) = 17
>>>>>>> [pid 65395] write(2, " ", 1)            = 1
>>>>>>> [pid 65395] write(2, "Xrd", 3)          = 3
>>>>>>> [pid 65395] write(2, "Poll", 4)         = 4
>>>>>>> [pid 65395] write(2, ": ", 2)           = 2
>>>>>>> [pid 65395] write(2, "FD ", 3)          = 3
>>>>>>> [pid 65395] write(2, "23", 2)           = 2
>>>>>>> [pid 65395] write(2, " attached to poller ", 20) = 20
>>>>>>> [pid 65395] write(2, "2", 1)            = 1
>>>>>>> [pid 65395] write(2, "; num=", 6)       = 6
>>>>>>> [pid 65395] write(2, "1", 1)            = 1
>>>>>>> [pid 65395] write(2, "\n", 1)           = 1
>>>>>>> [pid 65395] poll([{fd=23, events=POLLIN|POLLRDNORM}], 1, 5000) = 1 ([{fd=23, revents=POLLIN|POLLRDNORM}])
>>>>>>> [pid 65395] recvfrom(23, "\0\0\0\0\0\0\0\0", 8, 0, NULL, NULL) = 8
>>>>>>> [pid 65395] gettid()                    = 65395
>>>>>>> [pid 65395] writev(2, [{"150824 10:09:53 65395 ", 22}, {"Pup", 3}, {": ", 2}, {"buffer overrun unpacking", 24}, {" ", 1}, {"short arg 0: ident.", 19}, {"\n", 1}], 7) = 72
>>>>>>> [pid 65395] gettid()                    = 65395
>>>>>>> [pid 65395] writev(2, [{"150824 10:09:53 65395 ", 22}, {"Login", 5}, {": ", 2}, {"wngw.ifca.es", 12}, {" ", 1}, {"login failed;", 13}, {" ", 1}, {"invalid login data", 18}, {"\n", 1}], 9) = 75
>>>>>>> [pid 65395] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>>>>>>> Process 65395 detached
>>>>>>> [pid 65394] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65389] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65397] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65396] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65379] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65383] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65391] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65392] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65393] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65390] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65386] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65385] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65388] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65387] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65384] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65382] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65381] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65380] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65378] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65377] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65376] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> [pid 65375] +++ killed by SIGSEGV (core dumped) +++
>>>>>>> +++ killed by SIGSEGV (core dumped) +++
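The two writev() calls just before the SIGSEGV are easier to read once their fragments are joined into the log lines the daemon actually wrote to stderr. The following is a minimal sketch of that reassembly. The fragment strings and lengths are copied from the strace output above, while the `reassemble` helper is a hypothetical name introduced only for this illustration. It also checks that each fragment matches the length strace reported and that the total matches the writev() return value.

```python
# Fragments copied from the two writev() calls in the strace output:
# each pair is (string, length reported by strace).
pup_fragments = [
    ("150824 10:09:53 65395 ", 22), ("Pup", 3), (": ", 2),
    ("buffer overrun unpacking", 24), (" ", 1),
    ("short arg 0: ident.", 19), ("\n", 1),
]
login_fragments = [
    ("150824 10:09:53 65395 ", 22), ("Login", 5), (": ", 2),
    ("wngw.ifca.es", 12), (" ", 1), ("login failed;", 13), (" ", 1),
    ("invalid login data", 18), ("\n", 1),
]


def reassemble(fragments):
    """Join writev fragments and return (message, total_bytes)."""
    for text, length in fragments:
        if len(text) != length:
            raise ValueError(f"length mismatch for {text!r}")
    message = "".join(text for text, _ in fragments)
    return message, len(message)


pup_message, pup_total = reassemble(pup_fragments)
login_message, login_total = reassemble(login_fragments)

# The totals should equal the writev() return values (72 and 75).
print(pup_total, repr(pup_message))
print(login_total, repr(login_message))
```

Read this way, the sequence is: the login payload from wngw.ifca.es fails to unpack, the login is rejected as invalid, and the segmentation fault follows immediately afterwards in the same thread.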
>>>>>>>
>>>>>>> On Mon, Aug 24, 2015 at 10:02 AM, Jan Iven <[log in to unmask]> wrote:
>>>>>>>
>>>>>>>> On 08/24/2015 09:51 AM, Tommaso Boccali wrote:
>>>>>>>>
>>>>>>>>> Ciao, I am trying to upgrade one of the CMS EU redirectors from 3.3.6
>>>>>>>>> to 4.2.2 (with no configuration changed at first approx)
>>>>>>>>>
>>>>>>>>> The problem is that the main cmsd seems to die soon after start. No
>>>>>>>>> real message in the logs, but with strace I see a very suspect
>>>>>>>>>
>>>>>>>>> writev(2, [{"Copr.  2007 Stanford University/"..., 42}, {"\n", 1}], 2) = 43
>>>>>>>>> geteuid()                               = 0
>>>>>>>>> gettid()                                = 53572
>>>>>>>>> writev(2, [{"150824 09:47:27 53572 ", 22}, {"Config", 6}, {": ", 2}, {"Security reasons prohibit cmsd r"..., 73}, {"\n", 1}], 5) = 104
>>>>>>>>
>>>>>>>> Well, that line ought to end up either on STDERR somewhere or in some
>>>>>>>> log file. Alternatively, suggest "strace -s 1024" to get the full error
>>>>>>>> message.
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>> jan
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>> Tommaso Boccali
>>>>>>> INFN Pisa
>>>>>>>
>>>>>>>
>>>>>>> ########################################################################
>>>>>>>
>>>>>>> Use REPLY-ALL to reply to list
>>>>>>>
>>>>>>> To unsubscribe from the XROOTD-L list, click the following link:
>>>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>> --
>>>>> Tommaso Boccali
>>>>> INFN Pisa
>>>>>
>>>>>
>>>>>
>>>
>>> --
>>> Tommaso Boccali
>>> INFN Pisa
>>>
>>>
>>
>


-- 
Tommaso Boccali
INFN Pisa
