We started to move file stageout from GridFTP to XrdHttp and overnight began to see segfaults. The backtrace looks like this:
#6 <signal handler called>
#7 getErrText (ecode=@0x7f6d1cffec7c: 32621, this=0x8) at /usr/src/debug/xrootd/xrootd/src/XrdOuc/XrdOucErrInfo.hh:273
#8 XrdXrootdProtocol::fsError (this=0x7f6d4484d808, rc=-1, opC=0 '\000', myError=..., Path=0x0, Cgi=0x0) at /usr/src/debug/xrootd/xrootd/src/XrdXrootd/XrdXrootdXeq.cc:3185
#9 0x00007f6d635304e0 in XrdXrootdTransit::Process (this=0x7f6d4484d800, lp=0x7f6d27082028) at /usr/src/debug/xrootd/xrootd/src/XrdXrootd/XrdXrootdTransit.cc:370
#10 0x00007f6d632b2e19 in XrdLink::DoIt (this=0x7f6d27082028) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdLink.cc:441
#11 0x00007f6d632b61cf in XrdScheduler::Run (this=0x610e78 <XrdMain::Config+440>) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdScheduler.cc:357
#12 0x00007f6d632b6319 in XrdStartWorking (carg=<optimized out>) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdScheduler.cc:87
#13 0x00007f6d63274947 in XrdSysThread_Xeq (myargs=0x7f6d1d00e040) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.cc:86
#14 0x00007f6d62e30e25 in start_thread (arg=0x7f6d1cfff700) at pthread_create.c:308
#15 0x00007f6d62133bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
So, the crash is in the core xrootd code. Note that inlining is going on, so the frames shown don't map one-to-one onto the source. The key insight is this frame:
#7 getErrText (ecode=@0x7f6d1cffec7c: 32621, this=0x8) at /usr/src/debug/xrootd/xrootd/src/XrdOuc/XrdOucErrInfo.hh:273
That indicates the XrdOucErrInfo object is at an 8-byte offset from a null pointer. That is, something is doing foo->myError where foo is unexpectedly NULL. Guessing further that this isn't in the read code paths, we get the following code:
https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdXeq.cc#L2846
or
https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdXeq.cc#L547
Looking at the myFile object, I can confirm that XrdSfsp is null:
(gdb) p myFile->XrdSfsp
$6 = (XrdSfsFile *) 0x0
and, searching the log for the path corresponding to myFile:
180910 08:24:20 19163 acc_Audit: http grant uscmsPool001@[::ffff:XXX.YYY.ZZZ.AAA] create /store/.../BBbar_JpsiFilter_SoftQCD_GEN_SIM_2441.root
Resulting PFN: /user/uscms01/pnfs/unl.edu/data4/cms/store/.../BBbar_JpsiFilter_SoftQCD_GEN_SIM_2441.root
Resulting PFN: /user/uscms01/pnfs/unl.edu/data4/cms/store/.../BBbar_JpsiFilter_SoftQCD_GEN_SIM_2441.root
File we will access: /user/uscms01/pnfs/unl.edu/data4/cms/store/.../BBbar_JpsiFilter_SoftQCD_GEN_SIM_2441.root
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/uscms01/pnfs/unl.edu/data4/cms/store/.../BBbar_JpsiFilter_SoftQCD_GEN_SIM_2441.root could only be replicated to 0 nodes instead of minReplication (=1). There are 207 datanode(s) running and no node(s) are excluded in this operation.
RemoteException: File /user/uscms01/pnfs/unl.edu/data4/cms/store/.../BBbar_JpsiFilter_SoftQCD_GEN_SIM_2441.root could only be replicated to 0 nodes instead of minReplication (=1). There are 207 datanode(s) running and no node(s) are excluded in this operation.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/uscms01/pnfs/unl.edu/data4/cms/store/.../BBbar_JpsiFilter_SoftQCD_GEN_SIM_2441.root could only be replicated to 0 nodes instead of minReplication (=1). There are 207 datanode(s) running and no node(s) are excluded in this operation.
180910 08:24:20 4341 hdfs_close: Unable to close /user/uscms01/pnfs/unl.edu/data4/cms/store/.../BBbar_JpsiFilter_SoftQCD_GEN_SIM_2441.root; Unknown error 255
(log snippet lightly edited to remove user name and IP addresses)
So, the last error message is about hdfs_close. Hence, I think the failure is in the close code.
Indeed, here we delete the object:
https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdXeq.cc#L541
That eventually calls XrdXrootdFile's destructor, which indeed zeros out fp->XrdSfsp, causing the segfault.