Hi Brian,

>Is it possible something racy is happening with the tBound member? I.e.,
>can two threads enter the if (tBound) section at once?

In theory, Linux signal delivery is thread-safe, though I have heard of some
problems in that area. That's why the call is protected by a mutex. So, the
cerr addition would be good.

Since this has occurred under an extremely heavy load, it may be that Linux
is dropping signals. More likely, I suspect that Linux is now enforcing the
POSIX semantics of SIGSTOP and applying it to the whole process instead of
the thread in question (that code was added all the way back in RHEL3). So,
if the process is slow enough (which it appears to be), the STOP is handled
before the CONT can be sent.

The only immediate choice you have is to kill that section of code. That may
result in some sockets getting stuck in a semi-permanent open state, but
that's better than having the whole process get stopped.

I will research alternate Linux-specific ways of getting around this problem.
Perhaps I won't even need to if current versions of Linux do not place a
socket operation in pending state when the socket is closed. I'll find out
when you comment out the code :-)

Andy

-----Original Message-----
> From: Brian Bockelman
> Sent: Tuesday, August 23, 2011 1:26 PM
> To: xrootd-dev
> Subject: xrootd redirector repeatedly "crashing"
>
> Hi,
>
> Our global redirector stops responding every 30 minutes or so; it's
> actually not crashing, but appears to be getting SIGSTOP.
>
> There's nothing on the system that would be sending this signal. However,
> I see the following code in XrdLink:
>
>   if (tBound)
>      {tBound = 0;
> #ifdef __linux__
>       if (!XrdSysThread::Same(curTID, XrdSysThread::ID()))
>          {XrdSysThread::Signal(curTID, SIGSTOP);
>           XrdSysThread::Signal(curTID, SIGCONT);
>          }
> #endif
>      }
>
> Are we 100% sure that's the right thing, and there's no way that SIGSTOP
> is delivered to the wrong thread?
>
> Brian
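
For illustration only, here is a minimal sketch of one common way to wake a single blocked thread without a stop signal. It is an assumption, not what XrdLink does and not necessarily the alternative Andy has in mind. It sends a thread-directed SIGUSR1 with pthread_kill() to a thread blocked in read(); because the handler is installed without SA_RESTART, the read() should return with errno set to EINTR, and no other thread in the process is affected. The names wakePipe, wakeHandler, blockedReader and interruptBlockedRead are hypothetical, and a pipe stands in for the stuck socket.

```cpp
#include <atomic>
#include <cerrno>
#include <csignal>
#include <pthread.h>
#include <unistd.h>

// Hypothetical stand-in for a socket that a thread is stuck reading from.
static int wakePipe[2];
static std::atomic<bool> readerDone{false};
static std::atomic<int>  readerErrno{0};

// No-op handler: its only purpose is to make the blocking call return.
extern "C" void wakeHandler(int) {}

// The "bound" thread: blocks in read() until it is interrupted.
static void *blockedReader(void *) {
    char c;
    ssize_t n = read(wakePipe[0], &c, 1);   // nothing is ever written
    readerErrno = (n < 0) ? errno : 0;
    readerDone  = true;
    return nullptr;
}

// Returns the errno seen by the blocked read(), or -1 on setup failure.
int interruptBlockedRead() {
    struct sigaction sa {};
    sa.sa_handler = wakeHandler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                        // deliberately no SA_RESTART
    if (sigaction(SIGUSR1, &sa, nullptr) != 0) return -1;
    if (pipe(wakePipe) != 0) return -1;

    pthread_t tid;
    if (pthread_create(&tid, nullptr, blockedReader, nullptr) != 0) return -1;

    // Resend until the thread reports that it left read(); a single signal
    // could arrive before the thread has entered the call and be wasted.
    while (!readerDone.load()) {
        pthread_kill(tid, SIGUSR1);         // thread-directed, unlike SIGSTOP
        usleep(1000);
    }
    pthread_join(tid, nullptr);
    close(wakePipe[0]);
    close(wakePipe[1]);
    return readerErrno.load();
}
```

Unlike SIGSTOP, which cannot be caught and acts on the whole process, SIGUSR1 with a handler only disturbs the targeted thread. The trade-off is that the process must reserve a signal number for this purpose and every blocking call in the target thread must tolerate EINTR.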