Hello,
A segfault with the following stack trace was recently reported on an EOS disk server node at CERN (running a custom xrootd build with EOS plugins, approximately equivalent to xrootd 5.5.2):
Core was generated by `/opt/eos/xrootd/bin/xrootd -n fst -c /etc/xrd.cf.fst -l /var/log/eos/xrdlog.fst'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f8a36b66a4b in XrdXrootdMonFile::GetSlot (slotSZ=<optimized out>) at /usr/include/bits/byteswap.h:47
[...]
(gdb) where
#0 0x00007f8a36b66a4b in XrdXrootdMonFile::GetSlot (slotSZ=<optimized out>) at /usr/include/bits/byteswap.h:47
#1 0x00007f8a36b66bd3 in XrdXrootdMonFile::Close (fsP=<optimized out>, isDisc=<optimized out>)
at /usr/src/debug/xrootd-5.5.5/src/XrdXrootd/XrdXrootdMonFile.cc:169
#2 0x00007f8a36b6409b in XrdXrootdFileTable::Del (this=0x7f82a27db520, monP=<optimized out>,
fnum=<optimized out>, fnum@entry=0, dodel=<optimized out>)
at /usr/src/debug/xrootd-5.5.5/src/XrdXrootd/XrdXrootdFile.cc:309
#3 0x00007f8a36b79e88 in XrdXrootdProtocol::do_Close (this=this@entry=0x7f8792088500)
at /usr/src/debug/xrootd-5.5.5/src/XrdXrootd/XrdXrootdMonitor.hh:192
#4 0x00007f8a36b6fffe in XrdXrootdProtocol::Process2 (this=0x7f8792088500)
at /usr/src/debug/xrootd-5.5.5/src/XrdXrootd/XrdXrootdProtocol.cc:492
#5 0x00007f8a368af8c0 in XrdLinkXeq::DoIt (this=<optimized out>)
at /usr/src/debug/xrootd-5.5.5/src/Xrd/XrdLinkXeq.cc:320
#6 XrdLinkXeq::DoIt (this=0x7f8a0f8d0c50) at /usr/src/debug/xrootd-5.5.5/src/Xrd/XrdLinkXeq.cc:308
#7 0x00007f8a368b2867 in XrdScheduler::Run (this=0x615e80 <XrdGlobal::Sched>)
at /usr/src/debug/xrootd-5.5.5/src/Xrd/XrdScheduler.cc:406
#8 0x00007f8a368b2989 in XrdStartWorking (carg=<optimized out>)
at /usr/src/debug/xrootd-5.5.5/src/Xrd/XrdScheduler.cc:89
#9 0x00007f8a36842e27 in XrdSysThread_Xeq (myargs=0x7f847b7897a0)
at /usr/src/debug/xrootd-5.5.5/src/XrdSys/XrdSysPthread.cc:86
#10 0x00007f8a359a7ea5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f8a356d0b0d in clone () from /lib64/libc.so.6
It is suspected to be a race condition inside the xrootd server during a delayed file close(). I'm attaching a diff that builds a slightly modified server (on top of the current master branch) that seems to reproduce the issue. I've also prepared a possible fix, but it would be great to hear other opinions on the best approach. If wanted, I can open a pull request with the fix and have the discussion there; or, if e.g. Andy would like to look into it, I can leave it as just this ticket in the meantime.
I will attach the reproducer modification, some instructions, and the possible fix shortly.