Print

Print


.... and here is the logging output of

thread apply all bt full

https://www.dropbox.com/s/imdvj8ba5knlvbg/andy.log?dl=0

On Tue, Aug 25, 2015 at 7:24 AM, Tommaso Boccali <[log in to unmask]>
wrote:

> ciao, andrew, I asked the spanish guys for the cmsd version.
> in the meantime i replicated a machine with the 4.2.2 setup, and analyzed
> a core file.
>
> I get:
>
> [root@xrootd-redic-bck xrootd]# gdb cmsd core.63160
> GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6)
> Copyright (C) 2010 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <
> http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /usr/bin/cmsd...Reading symbols from
> /usr/lib/debug/usr/bin/cmsd.debug...done.
> done.
> [New Thread 63535]
> ...
> ...
> ...
> Missing separate debuginfo for
> Try: yum --disablerepo='*' --enablerepo='*-debug*' install
> /usr/lib/debug/.build-id/9e/e239647f77340975f782611a1fa728c355ecda
> Reading symbols from /usr/lib64/libXrdServer.so.2.0.0...Reading symbols
> from /usr/lib/debug/usr/lib64/libXrdServer.so.2.0.0.debug...done.
> done.
> Loaded symbols for /usr/lib64/libXrdServer.so.2.0.0
> Reading symbols from /usr/lib64/libXrdUtils.so.2.0.0...Reading symbols
> from /usr/lib/debug/usr/lib64/libXrdUtils.so.2.0.0.debug...done.
> done.
> Loaded symbols for /usr/lib64/libXrdUtils.so.2.0.0
> Reading symbols from /lib64/libpthread-2.12.so...Reading symbols from
> /usr/lib/debug/lib64/libpthread-2.12.so.debug...done.
> [Thread debugging using libthread_db enabled]
> done.
> Loaded symbols for /lib64/libpthread-2.12.so
> Reading symbols from /lib64/librt-2.12.so...Reading symbols from
> /usr/lib/debug/lib64/librt-2.12.so.debug...done.
> done.
> Loaded symbols for /lib64/librt-2.12.so
> Reading symbols from /usr/lib64/libstdc++.so.6.0.13...Reading symbols from
> /usr/lib/debug/usr/lib64/libstdc++.so.6.0.13.debug...done.
> done.
> Loaded symbols for /usr/lib64/libstdc++.so.6.0.13
> Reading symbols from /lib64/libm-2.12.so...Reading symbols from
> /usr/lib/debug/lib64/libm-2.12.so.debug...done.
> done.
> Loaded symbols for /lib64/libm-2.12.so
> Reading symbols from /lib64/libgcc_s-4.4.6-20120305.so.1...Reading symbols
> from /usr/lib/debug/lib64/libgcc_s-4.4.6-20120305.so.1.debug...done.
> done.
> Loaded symbols for /lib64/libgcc_s-4.4.6-20120305.so.1
> Reading symbols from /lib64/libc-2.12.so...Reading symbols from
> /usr/lib/debug/lib64/libc-2.12.so.debug...done.
> done.
> Loaded symbols for /lib64/libc-2.12.so
> Reading symbols from /lib64/libdl-2.12.so...Reading symbols from
> /usr/lib/debug/lib64/libdl-2.12.so.debug...done.
> done.
> Loaded symbols for /lib64/libdl-2.12.so
> Reading symbols from /lib64/ld-2.12.so...Reading symbols from
> /usr/lib/debug/lib64/ld-2.12.so.debug...done.
> done.
> Loaded symbols for /lib64/ld-2.12.so
> Reading symbols from /lib64/libnss_files-2.12.so...Reading symbols from
> /usr/lib/debug/lib64/libnss_files-2.12.so.debug...done.
> done.
> Loaded symbols for /lib64/libnss_files-2.12.so
> Reading symbols from /lib64/libnss_dns-2.12.so...Reading symbols from
> /usr/lib/debug/lib64/libnss_dns-2.12.so.debug...done.
> done.
> Loaded symbols for /lib64/libnss_dns-2.12.so
> Reading symbols from /lib64/libresolv-2.12.so...Reading symbols from
> /usr/lib/debug/lib64/libresolv-2.12.so.debug...done.
> done.
> Loaded symbols for /lib64/libresolv-2.12.so
> Core was generated by `/usr/bin/cmsd -l /var/log/xrootd/cmsd.log -c
> /etc/xrootd/xrootd-redir-cms.cfg -'.
> Program terminated with signal 11, Segmentation fault.
> #0  __pthread_mutex_lock (mutex=0x30) at pthread_mutex_lock.c:50
> 50  unsigned int type = PTHREAD_MUTEX_TYPE (mutex);
> (gdb) where
> #0  __pthread_mutex_lock (mutex=0x30) at pthread_mutex_lock.c:50
> #1  0x0000000000437519 in Lock (this=0x7f3038002160, lp=<value optimized
> out>)
>     at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:149
> #2  Lock (this=0x7f3038002160, lp=<value optimized out>) at
> /usr/src/debug/xrootd/xrootd/src/XrdCms/XrdCmsNode.hh:143
> #3  XrdCmsProtocol::Process (this=0x7f3038002160, lp=<value optimized
> out>) at /usr/src/debug/xrootd/xrootd/src/XrdCms/XrdCmsProtocol.cc:480
> #4  0x00007f30fa1e6149 in XrdLink::DoIt (this=0x7f3038000b98) at
> /usr/src/debug/xrootd/xrootd/src/Xrd/XrdLink.cc:397
> #5  0x00007f30fa1e9625 in XrdScheduler::Run (this=0x647478) at
> /usr/src/debug/xrootd/xrootd/src/Xrd/XrdScheduler.cc:333
> #6  0x00007f30fa1e9819 in XrdStartWorking (carg=<value optimized out>) at
> /usr/src/debug/xrootd/xrootd/src/Xrd/XrdScheduler.cc:85
> #7  0x00007f30fa1ae3af in XrdSysThread_Xeq (myargs=0x7f30340008c0) at
> /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.cc:86
> #8  0x00007f30f9f6f9d1 in start_thread (arg=0x7f30f0aca700) at
> pthread_create.c:301
> #9  0x00007f30f9308b6d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
>
>
>
> does it help?
>
> I have many core files, they should be all ~ equivalent.
>
> I have put one on
> https://www.dropbox.com/s/6eqyf2th3oxmk4b/core.64945?dl=0
>
>
>
>
> tom
>
> On Mon, Aug 24, 2015 at 10:14 PM, Andrew Hanushevsky <
> [log in to unmask]> wrote:
>
>> Hi Tommaso,
>>
>> OK, this is now not critical as I seem to have found the problem.
>> However, a backtrace would still be good to have as assurance. Yes, it is a
>> bug when invalid login data is encountered. Now we have tto find out why
>> this actually happened.
>>
>> What is wngw.ifca.es actually running (i.e. what cmsd version).
>>
>>
>> Andy
>>
>> On Mon, 24 Aug 2015, Tommaso Boccali wrote:
>>
>> ciao andrew, tomorrow I can try but the fact is taht today in the end I
>>> had
>>> to downgrade, since it is a production server.
>>>
>>> so i have to reupgrade, take the snapshot and go back as fast as
>>> possible :(
>>>
>>> ciao ciao
>>>
>>> tom
>>>
>>> On Mon, Aug 24, 2015 at 9:47 PM, Andrew Hanushevsky <
>>> [log in to unmask]>
>>> wrote:
>>>
>>> Hi Tommaso,
>>>>
>>>> Both daemons (xrootd and cmsd) will exit if you attempt to run them as
>>>> root. This is a security feature. You can run them as root but only
>>>> after
>>>> specifically confirming this via command line options (i.e. you accept
>>>> the
>>>> risks). As for the SEGV, that's clearly a bug. Is it possible to get a
>>>> stack trace of the thread that got the SEGV? Please make sure to install
>>>> the debug RPM so we can get actual line numbers.
>>>>
>>>> Andy
>>>>
>>>>
>>>> On Mon, 24 Aug 2015, Tommaso Boccali wrote:
>>>>
>>>> uhm,
>>>>
>>>>>
>>>>> - the strace line was just my fault, I was trying running as root on
>>>>> the
>>>>> command line
>>>>> - when I retried with user xrootd, I get instead the lines below, which
>>>>> terminate with a segv (*)
>>>>>
>>>>> so the last message is consistent with the one in the logs:
>>>>>
>>>>> [pid 65395] writev(2, [{"150824 10:09:53 65395 ", 22}, {"Pup", 3}, {":
>>>>> ",
>>>>> 2}, {"buffer overrun unpacking", 24}, {" ", 1}, {"short arg 0: ident.",
>>>>> 19}, {"\n", 1}], 7) = 72
>>>>> [pid 65395] gettid()                    = 65395
>>>>> [pid 65395] writev(2, [{"150824 10:09:53 65395 ", 22}, {"Login", 5},
>>>>> {":
>>>>> ",
>>>>> 2}, {"wngw.ifca.es", 12}, {" ", 1}, {"login failed;", 13}, {" ", 1},
>>>>> {"invalid login data", 18}, {"\n", 1}], 9) = 75
>>>>>
>>>>>
>>>>> *:
>>>>> [pid 65395] <... gettid resumed> )      = 65395
>>>>> [pid 65395] write(2, "150824 10:09:53 65395 ", 22) = 22
>>>>> [pid 65395] write(2, "Xrd", 3)          = 3
>>>>> [pid 65395] write(2, "Inet", 4)         = 4
>>>>> [pid 65395] write(2, ": ", 2)           = 2
>>>>> [pid 65395] write(2, "Accepted connection from ", 25) = 25
>>>>> [pid 65395] write(2, "23", 2)           = 2
>>>>> [pid 65395] write(2, "@", 1)            = 1
>>>>> [pid 65395] write(2, "wngw.ifca.es", 12) = 12
>>>>> [pid 65395] write(2, "\n", 1)           = 1
>>>>> [pid 65395] futex(0x6473c8, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>> [pid 65395] poll([{fd=23, events=POLLIN|POLLRDNORM}], 1, 1000) = 1
>>>>> ([{fd=23, revents=POLLIN|POLLRDNORM}])
>>>>> [pid 65395] recvfrom(23, "\0\0\0\0\0\0\0\0", 8, MSG_PEEK, NULL, NULL)
>>>>> = 8
>>>>> [pid 65395] gettid()                    = 65395
>>>>> [pid 65395] write(2, "150824 10:09:53 65395 ", 22) = 22
>>>>> [pid 65395] write(2, "Xrd", 3)          = 3
>>>>> [pid 65395] write(2, "Protocol", 8)     = 8
>>>>> [pid 65395] write(2, ": ", 2)           = 2
>>>>> [pid 65395] write(2, "matched protocol ", 17) = 17
>>>>> [pid 65395] write(2, "cmsd", 4)         = 4
>>>>> [pid 65395] write(2, "\n", 1)           = 1
>>>>> [pid 65395] epoll_ctl(12, EPOLL_CTL_ADD, 23, {0, {u32=4160757352,
>>>>> u64=140071634214504}}) = 0
>>>>> [pid 65395] gettid()                    = 65395
>>>>> [pid 65395] write(2, "150824 10:09:53 65395 ", 22) = 22
>>>>> [pid 65395] write(2, "?:[log in to unmask]", 17) = 17
>>>>> [pid 65395] write(2, " ", 1)            = 1
>>>>> [pid 65395] write(2, "Xrd", 3)          = 3
>>>>> [pid 65395] write(2, "Poll", 4)         = 4
>>>>> [pid 65395] write(2, ": ", 2)           = 2
>>>>> [pid 65395] write(2, "FD ", 3)          = 3
>>>>> [pid 65395] write(2, "23", 2)           = 2
>>>>> [pid 65395] write(2, " attached to poller ", 20) = 20
>>>>> [pid 65395] write(2, "2", 1)            = 1
>>>>> [pid 65395] write(2, "; num=", 6)       = 6
>>>>> [pid 65395] write(2, "1", 1)            = 1
>>>>> [pid 65395] write(2, "\n", 1)           = 1
>>>>> [pid 65395] poll([{fd=23, events=POLLIN|POLLRDNORM}], 1, 5000) = 1
>>>>> ([{fd=23, revents=POLLIN|POLLRDNORM}])
>>>>> [pid 65395] recvfrom(23, "\0\0\0\0\0\0\0\0", 8, 0, NULL, NULL) = 8
>>>>> [pid 65395] gettid()                    = 65395
>>>>> [pid 65395] writev(2, [{"150824 10:09:53 65395 ", 22}, {"Pup", 3}, {":
>>>>> ",
>>>>> 2}, {"buffer overrun unpacking", 24}, {" ", 1}, {"short arg 0: ident.",
>>>>> 19}, {"\n", 1}], 7) = 72
>>>>> [pid 65395] gettid()                    = 65395
>>>>> [pid 65395] writev(2, [{"150824 10:09:53 65395 ", 22}, {"Login", 5},
>>>>> {":
>>>>> ",
>>>>> 2}, {"wngw.ifca.es", 12}, {" ", 1}, {"login failed;", 13}, {" ", 1},
>>>>> {"invalid login data", 18}, {"\n", 1}], 9) = 75
>>>>> [pid 65395] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>>>>> Process 65395 detached
>>>>> [pid 65394] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65389] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65397] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65396] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65379] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65383] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65391] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65392] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65393] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65390] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65386] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65385] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65388] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65387] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65384] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65382] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65381] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65380] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65378] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65377] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65376] +++ killed by SIGSEGV (core dumped) +++
>>>>> [pid 65375] +++ killed by SIGSEGV (core dumped) +++
>>>>> +++ killed by SIGSEGV (core dumped) +++
>>>>>
>>>>> On Mon, Aug 24, 2015 at 10:02 AM, Jan Iven <[log in to unmask]> wrote:
>>>>>
>>>>> On 08/24/2015 09:51 AM, Tommaso Boccali wrote:
>>>>>
>>>>>>
>>>>>> Ciao, I am trying to upgrade one of the CMS EU redirectors from 3.3.6
>>>>>> to
>>>>>>
>>>>>>> 4.2.2 (with no configuration changed at first approx)
>>>>>>>
>>>>>>> The problem is that the main cmsd seems to die soon after start. No
>>>>>>> real
>>>>>>> message in the logs, but with strace I see a very suspect
>>>>>>>
>>>>>>> writev(2, [{"Copr.  2007 Stanford University/"..., 42}, {"\n", 1}],
>>>>>>> 2) =
>>>>>>> 43
>>>>>>> geteuid()                               = 0
>>>>>>> gettid()                                = 53572
>>>>>>> writev(2, [{"150824 09:47:27 53572 ", 22}, {"Config", 6}, {": ", 2},
>>>>>>> {"Security reasons prohibit cmsd r"..., 73}, {"\n", 1}], 5) = 104
>>>>>>>
>>>>>>>
>>>>>>> Well, that line ought to end up either on STDERR somewhere or in
>>>>>> some log
>>>>>> file. Alternatively, suggest "strace -s 1024" to get the full error
>>>>>> message..
>>>>>>
>>>>>> Cheers
>>>>>> jan
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> --
>>>>> Tommaso Boccali
>>>>> INFN Pisa
>>>>>
>>>>>
>>>>> ########################################################################
>>>>> Use REPLY-ALL to reply to list
>>>>>
>>>>> To unsubscribe from the XROOTD-L list, click the following link:
>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
>>>>>
>>>>>
>>>>>
>>>
>>> --
>>> Tommaso Boccali
>>> INFN Pisa
>>>
>>>
>
>
> --
> Tommaso Boccali
> INFN Pisa
>



-- 
Tommaso Boccali
INFN Pisa

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1