Hello,

A segfault with the following stack trace was recently reported on an EOS disk server node at CERN (running a custom xrootd build with EOS plugins, approximately equivalent to xrootd 5.5.2):

Core was generated by `/opt/eos/xrootd/bin/xrootd -n fst -c /etc/xrd.cf.fst -l /var/log/eos/xrdlog.fst'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f8a36b66a4b in XrdXrootdMonFile::GetSlot (slotSZ=<optimized out>) at /usr/include/bits/byteswap.h:47
[...]
(gdb) where
#0  0x00007f8a36b66a4b in XrdXrootdMonFile::GetSlot (slotSZ=<optimized out>) at /usr/include/bits/byteswap.h:47
#1  0x00007f8a36b66bd3 in XrdXrootdMonFile::Close (fsP=<optimized out>, isDisc=<optimized out>)
    at /usr/src/debug/xrootd-5.5.5/src/XrdXrootd/XrdXrootdMonFile.cc:169
#2  0x00007f8a36b6409b in XrdXrootdFileTable::Del (this=0x7f82a27db520, monP=<optimized out>, 
    fnum=<optimized out>, fnum@entry=0, dodel=<optimized out>)
    at /usr/src/debug/xrootd-5.5.5/src/XrdXrootd/XrdXrootdFile.cc:309
#3  0x00007f8a36b79e88 in XrdXrootdProtocol::do_Close (this=this@entry=0x7f8792088500)
    at /usr/src/debug/xrootd-5.5.5/src/XrdXrootd/XrdXrootdMonitor.hh:192
#4  0x00007f8a36b6fffe in XrdXrootdProtocol::Process2 (this=0x7f8792088500)
    at /usr/src/debug/xrootd-5.5.5/src/XrdXrootd/XrdXrootdProtocol.cc:492
#5  0x00007f8a368af8c0 in XrdLinkXeq::DoIt (this=<optimized out>)
    at /usr/src/debug/xrootd-5.5.5/src/Xrd/XrdLinkXeq.cc:320
#6  XrdLinkXeq::DoIt (this=0x7f8a0f8d0c50) at /usr/src/debug/xrootd-5.5.5/src/Xrd/XrdLinkXeq.cc:308
#7  0x00007f8a368b2867 in XrdScheduler::Run (this=0x615e80 <XrdGlobal::Sched>)
    at /usr/src/debug/xrootd-5.5.5/src/Xrd/XrdScheduler.cc:406
#8  0x00007f8a368b2989 in XrdStartWorking (carg=<optimized out>)
    at /usr/src/debug/xrootd-5.5.5/src/Xrd/XrdScheduler.cc:89
#9  0x00007f8a36842e27 in XrdSysThread_Xeq (myargs=0x7f847b7897a0)
    at /usr/src/debug/xrootd-5.5.5/src/XrdSys/XrdSysPthread.cc:86
#10 0x00007f8a359a7ea5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f8a356d0b0d in clone () from /lib64/libc.so.6

It's suspected to be a race condition inside the xrootd server during a delayed file close(). I'm attaching a diff to build a slightly modified server (on top of the current master branch) that seems to reproduce the issue. I've also prepared a possible fix, but it would be great to hear other opinions about the best approach. If wanted, I can open a pull request with the fix and have the discussion there; or, if e.g. Andy would like to look into it, I can leave it with only this ticket in the meantime.

I will attach the reproducer modification, some instructions, and the possible fix shortly.


Message ID: <xrootd/xrootd/issues/1898@github.com>

