Print

Print


XRootD client fails (repeatedly) with the following back trace in eos/pps/storage:

#0 0x00007fa7ee07f250 in ?? ()
#1 0x000000335a65e477 in ~XRootDMsgHandler (this=0x7fa7c5038800, __in_chrg=) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClXRootDMsgHandler.hh:103
#2 XrdCl::XRootDMsgHandler::~XRootDMsgHandler (this=0x7fa7c5038800, __in_chrg=) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClXRootDMsgHandler.hh:108
#3 0x000000335a65b49a in XrdCl::XRootDMsgHandler::HandleResponse (this=0x7fa7c5038800) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClXRootDMsgHandler.cc:1075
#4 0x000000335a65bcd3 in XrdCl::XRootDMsgHandler::HandleError (this=0x7fa7c5038800, status=..., msg=) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClXRootDMsgHandler.cc:1799
#5 0x000000335a65c209 in XrdCl::XRootDMsgHandler::OnStatusReady (this=0x7fa7c5038800, message=, status=...) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClXRootDMsgHandler.cc:965
#6 0x000000335a64e079 in XrdCl::OutQueue::Report (this=0x7fa7f44fbc00, status=...) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClOutQueue.cc:79
#7 0x000000335a6433d4 in XrdCl::Stream::OnFatalError (this=0x7fa7f7a01c80, subStream=, status=..., lock=...) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClStream.cc:814
#8 0x000000335a6448a5 in XrdCl::Stream::OnConnectError (this=0x7fa7f7a01c80, subStream=0, status=) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClStream.cc:684
#9 0x000000335a6862d7 in XrdCl::AsyncSocketHandler::OnFaultWhileHandshaking (this=0x7fa7f7a18500, st=...) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClAsyncSocketHandler.cc:736
#10 0x000000335a63db03 in (anonymous namespace)::SocketCallBack::Event (this=0x7fa7f09d70a0, chP=, cbArg=, evFlags=) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClPollerBuiltIn.cc:82
#11 0x000000335a22aaa8 in XrdSys::IOEvents::Poller::CbkXeq (this=0x7fa7f79e48e0, cP=0x7fa7c1453060, events=8, eNum=0, eTxt=0x0) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEvents.cc:692
#12 0x000000335a22ad37 in XrdSys::IOEvents::Poller::CbkTMO (this=0x7fa7f79e48e0) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEvents.cc:597
#13 0x000000335a22ae42 in XrdSys::IOEvents::PollE::Begin (this=0x7fa7f79e48e0, syncsem=, retcode=, eTxt=) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEventsPollE.icc:217
#14 0x000000335a2296fd in XrdSys::IOEvents::BootStrap::Start (parg=0x7ffe5495db60) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEvents.cc:131
#15 0x000000335a2276ff in XrdSysThread_Xeq (myargs=0x7fa7f7a5efa0) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.cc:86
#16 0x0000003a64207aa1 in start_thread () from /lib64/libpthread.so.0
#17 0x0000003a63ee893d in clone () from /lib64/libc.so.6

After investigating the problem I came up with the following scenario (which I was able to reproduce with GDB):

  1. client issues an open
  2. a XRootDMsgHandler object is created and holds a pointer to a Message object (open request) in pRequest
  3. the handshake is successful (AsyncSocketHandler)
  4. the AsyncSocketHandler::OnWrite is being called in order to send the open request
  5. lets assume the connections was dropped
  6. a read timeout occurs, which in turn triggers a reconnect and as a result the pHandShakeDone flag is being set to false
  7. a write timeout occurs, since the pHandShakeDone is false the AsyncSocketHandler::OnTimeoutWhileHandshaking routine is being called, which in turn calls AsyncSocketHandler::OnFaultWhileHandshaking, now it becomes interesting because the pOutgoing pointer that still points to the open request (a Message obejct, the same one that is being pointed at by pRequest in XRootDMsgHandler) is being deleted
  8. after a series of calls (Stream::OnConnectError -> Stream::OnFatalError -> OutQueue::Report -> XRootDMsgHandler::OnStatusReady -> XRootDMsgHandler::HandleError -> XRootDMsgHandler::HandleResponse) we end up in XRootDMsgHandler::HandleResponse where the object deletes itself
  9. since open does not have a session ID, in the destructor ~XRootDMsgHandler(), the pRequest (open request) pointer is being deleted, meaning that the same object is being deleted for the second time (exactly the same address that has been deleted in 7.)

This is very difficult to reproduce because various timeouts need to happen in a very specific order.
(Also, the theory has a weak point: a segfault could also happen earlier when the open request is being unmarshalled (XrdClXRootDMsgHandler.cc:1043), which for some reason never happens in eos/pps/storage.)

I can see two possible solutions:


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub



Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-DEV list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1