We started to move file stageout from GridFTP to XrdHttp and overnight began to see segfaults.  The traceback looks like this:

```
#6  <signal handler called>
#7  getErrText (ecode=@0x7f6d1cffec7c: 32621, this=0x8) at /usr/src/debug/xrootd/xrootd/src/XrdOuc/XrdOucErrInfo.hh:273
#8  XrdXrootdProtocol::fsError (this=0x7f6d4484d808, rc=-1, opC=0 '\000', myError=..., Path=0x0, Cgi=0x0) at /usr/src/debug/xrootd/xrootd/src/XrdXrootd/XrdXrootdXeq.cc:3185
#9  0x00007f6d635304e0 in XrdXrootdTransit::Process (this=0x7f6d4484d800, lp=0x7f6d27082028) at /usr/src/debug/xrootd/xrootd/src/XrdXrootd/XrdXrootdTransit.cc:370
#10 0x00007f6d632b2e19 in XrdLink::DoIt (this=0x7f6d27082028) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdLink.cc:441
#11 0x00007f6d632b61cf in XrdScheduler::Run (this=0x610e78 <XrdMain::Config+440>) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdScheduler.cc:357
#12 0x00007f6d632b6319 in XrdStartWorking (carg=<optimized out>) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdScheduler.cc:87
#13 0x00007f6d63274947 in XrdSysThread_Xeq (myargs=0x7f6d1d00e040) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.cc:86
#14 0x00007f6d62e30e25 in start_thread (arg=0x7f6d1cfff700) at pthread_create.c:308
#15 0x00007f6d62133bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
```

So the crash is in the core xrootd code. Note that the compiler has inlined several calls, so the frames don't map one-to-one onto the source. The key frame is this one:

```
#7  getErrText (ecode=@0x7f6d1cffec7c: 32621, this=0x8) at /usr/src/debug/xrootd/xrootd/src/XrdOuc/XrdOucErrInfo.hh:273
```

That indicates the `XrdOucErrInfo` object sits at an 8-byte offset from a null pointer (`this=0x8`). That is, something is doing `foo->myError` where `foo` is unexpectedly `NULL`. Guessing further that this isn't in the read code paths, that narrows it down to the following code:

https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdXeq.cc#L2846

or

https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdXeq.cc#L547

Looking at the `myFile` object in gdb, I can confirm that `XrdSfsp` is null:

```
(gdb) p myFile->XrdSfsp
$6 = (XrdSfsFile *) 0x0
```

and searching the log for the path corresponding to `myFile` turns up:

```
180910 08:24:20 19163 acc_Audit: http grant  uscmsPool001@[::ffff:XXX.YYY.ZZZ.AAA] create /store/.../BBbar_JpsiFilter_SoftQCD_GEN_SIM_2441.root
Resulting PFN: /user/uscms01/pnfs/unl.edu/data4/cms/store/.../BBbar_JpsiFilter_SoftQCD_GEN_SIM_2441.root
Resulting PFN: /user/uscms01/pnfs/unl.edu/data4/cms/store/.../BBbar_JpsiFilter_SoftQCD_GEN_SIM_2441.root
File we will access: /user/uscms01/pnfs/unl.edu/data4/cms/store/.../BBbar_JpsiFilter_SoftQCD_GEN_SIM_2441.root
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/uscms01/pnfs/unl.edu/data4/cms/store/.../BBbar_JpsiFilter_SoftQCD_GEN_SIM_2441.root could only be replicated to 0 nodes instead of minReplication (=1).  There are 207 datanode(s) running and no node(s) are excluded in this operation.
RemoteException: File /user/uscms01/pnfs/unl.edu/data4/cms/store/.../BBbar_JpsiFilter_SoftQCD_GEN_SIM_2441.root could only be replicated to 0 nodes instead of minReplication (=1).  There are 207 datanode(s) running and no node(s) are excluded in this operation.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/uscms01/pnfs/unl.edu/data4/cms/store/.../BBbar_JpsiFilter_SoftQCD_GEN_SIM_2441.root could only be replicated to 0 nodes instead of minReplication (=1).  There are 207 datanode(s) running and no node(s) are excluded in this operation.
180910 08:24:20 4341 hdfs_close: Unable to close /user/uscms01/pnfs/unl.edu/data4/cms/store/.../BBbar_JpsiFilter_SoftQCD_GEN_SIM_2441.root; Unknown error 255
```
(log snippet lightly edited to remove user name and IP addresses)

So the last error message comes from `hdfs_close`, which suggests the failure is in the close path.

Indeed, here we delete the object:

https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdXeq.cc#L541

That eventually calls `XrdXrootdFile`'s destructor, which zeros out `fp->XrdSfsp`; a subsequent use of the file object then dereferences the null pointer, causing the segfault.

https://github.com/xrootd/xrootd/issues/818