
I have an OFS plugin that replies with SFS_STARTED to a close request for files bigger than a certain size. There seems to be a race condition that makes the server crash from time to time, although in 99% of the cases this works fine. I believe the problem comes from the early deletion of the XrdOfsFile object here:

https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdCallBack.cc#L206

Later on, we still access the callback object that was created specifically for the close operation. This happens a few lines after the call to DoClose, here:

https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdCallBack.cc#L169

In my code the lifetime of the XrdOucCallBack object for the close response is tied to the lifetime of the XrdOfsFile, which kind of makes sense to me. For completeness here is the code that I use:

```
int
XrdFstOfsFile::close()
{
  // Reset the error.getErrInfo() value to 0 since this was hijacked by the
  // XrdXrootdFile object to store the actual file descriptor corresponding to
  // the current object. This was confusing when logging the error.getErrInfo()
  // value at the end of the close.
  error.setErrCode(0);

  // Close happening in the same XRootD thread
  if (viaDelete || mWrDelete || mIsDevNull || (mIsRW == false) ||
      (mIsRW && (mMaxOffsetWritten <= msMinSizeAsyncClose))) {
    return _close();
  }

  // Delegate close to a different thread while the client is waiting for the
  // callback (SFS_STARTED). This only happens for written files with size
  // bigger than msMinSizeAsyncClose (2GB).
  eos_info("msg=\"close delegated to async thread \" fxid=%08llx "
           "ns_path=\"%s\" fs_path=\"%s\"", mFileId, mNsPath.c_str(),
           mFstPath.c_str());
  // Create a close callback and put the client in waiting mode
  mCloseCb.reset(new XrdOucCallBack());
  mCloseCb->Init(&error);
  error.setErrInfo(1800, "delay client up to 30 minutes");
  gOFS.mCloseThreadPool.PushTask<void>([&]() -> void {
    eos_info("msg=\"doing close in the async thread\" fxid=%08llx", mFileId);
    int rc = _close();
    int reply_rc = mCloseCb->Reply(rc, (rc ? error.getErrInfo() : 0),
                                   (rc ? error.getErrText() : ""));

    if (reply_rc == 0) {
      eos_err("msg=\"callback reply failed\" fid=%llu", mFileId);
    }
  });
  return SFS_STARTED;
}
```

All variables starting with m... are member variables attached to the XrdOfsFile object. In particular, the mCloseCb object is deleted at the same time as the file object.

Therefore, I believe that accessing this callback object after the file object has been deleted is the source of the race (the outcome depends on whether the freed memory has already been overwritten) and explains the crashes that we see.

For example, a stack trace of such a crash is below:
```
(gdb) bt
#0  0x00007fb1bbd005a3 in XrdXrootdCBJob::DoIt (this=0x7fb168019b00) at /usr/src/debug/xrootd-4.11.2/src/XrdXrootd/XrdXrootdCallBack.cc:169
#1  0x00007fb1bba94def in XrdScheduler::Run (this=0x610e78 <XrdMain::Config+440>) at /usr/src/debug/xrootd-4.11.2/src/Xrd/XrdScheduler.cc:357
#2  0x00007fb1bba94f39 in XrdStartWorking (carg=<optimized out>) at /usr/src/debug/xrootd-4.11.2/src/Xrd/XrdScheduler.cc:87
#3  0x00007fb1bba5aa67 in XrdSysThread_Xeq (myargs=0x7fb172167120) at /usr/src/debug/xrootd-4.11.2/src/XrdSys/XrdSysPthread.cc:86
#4  0x00007fb1bb60ee65 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fb1ba91088d in clone () from /lib64/libc.so.6
(gdb) f 0
#0  0x00007fb1bbd005a3 in XrdXrootdCBJob::DoIt (this=0x7fb168019b00) at /usr/src/debug/xrootd-4.11.2/src/XrdXrootd/XrdXrootdCallBack.cc:169
169	   if (eInfo->getErrCB()) eInfo->getErrCB()->Done(Result, eInfo);
(gdb) p eInfo
$7 = (XrdOucErrInfo *) 0x7fb1a7ff8360
(gdb) p *(eInfo)
$8 = {_vptr.XrdOucErrInfo = 0x7fb1bbf86d50 <vtable for XrdOucErrInfo+16>, ErrInfo = {static Max_Error_Len = 2048, static Path_Offset = 1024, user = 0x7fb16b8b00c0 "", ucap = 0, code = 0, 
    message = "\000\000\000\000\261\177\000\000Њ\377\247\261\177", '\000' <repeats 774 times>..., static uVMask = 65535, static uAsync = -2147483648, static uUrlOK = 1073741824, static uMProt = 536870912, static uReadR = 268435456, static uIPv4 = 134217728, 
    static uIPv64 = 67108864, static uPrip = 33554432, static uLclF = 16777216, static u48pls = 8388608}, ErrCB = 0x7fb16b8b0080, {ErrCBarg = 281904473442169, ErrEnv = 0x1006400000779}, mID = 0, dOff = -1, reserved = 0, dataBuff = 0x0}
(gdb) p *(eInfo->ErrCB)
$9 = {_vptr.XrdOucEICB = 0x0}
(gdb) 
```
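
To make the suspected lifetime problem easier to discuss, here is a minimal standalone sketch of the pattern, not XRootD or EOS code. `CloseCallback`, `File` and `callbackSurvivesFileDeletion` are hypothetical stand-ins that I made up for illustration; they do not mirror the real `XrdOucCallBack` or `XrdOfsFile` interfaces. The sketch only shows that if the callback is held through a `std::shared_ptr` and a second owner keeps a copy, the callback outlives the deletion of the file object. Whether something along these lines is applicable inside the plugin or the framework is an assumption on my side.

```cpp
#include <iostream>
#include <memory>

// Hypothetical stand-in for a callback object (NOT the XrdOucCallBack API).
// It records its own destruction in an external flag so we can observe it.
struct CloseCallback {
  bool* destroyedFlag;
  explicit CloseCallback(bool* flag) : destroyedFlag(flag) {}
  ~CloseCallback() { *destroyedFlag = true; }
  void Done(int rc) { std::cout << "callback done rc=" << rc << "\n"; }
};

// Hypothetical stand-in for the file object that owns the callback.
struct File {
  std::shared_ptr<CloseCallback> closeCb;
  explicit File(bool* flag)
    : closeCb(std::make_shared<CloseCallback>(flag)) {}
};

// Simulates the sequence from the report: the file object is deleted while
// handling the close, and the callback is used afterwards. Because a second
// shared_ptr owner exists, the callback is still valid at that point.
bool callbackSurvivesFileDeletion() {
  bool destroyed = false;
  File* file = new File(&destroyed);
  std::shared_ptr<CloseCallback> keepAlive = file->closeCb; // second owner
  delete file;                           // file object goes away first
  bool aliveAfterDelete = !destroyed;    // callback must still exist here
  keepAlive->Done(0);                    // safe: keepAlive owns the callback
  keepAlive.reset();                     // last owner released, now destroyed
  return aliveAfterDelete && destroyed;
}
```

Calling `callbackSurvivesFileDeletion()` from a `main` function exercises the sequence. With a `std::unique_ptr` member and a raw pointer held by the second party instead, the `Done` call after `delete file` would be a use-after-free, which is what I suspect happens in the trace above.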


Reply to this email directly or view it on GitHub:
https://github.com/xrootd/xrootd/issues/1148