Hi Brian,

>Is it possible something racy is happening with the tBound member?  I.e., 
>can two threads enter the if (tBound) section at once?
In theory, Linux signal delivery is thread-safe, though I have heard of some 
problems in that area. That's why the call is protected by a mutex. So, the 
cerr addition would be good. Since this has occurred under an extremely 
heavy load, it may be that Linux is dropping signals. More likely, I suspect 
that Linux is now enforcing the POSIX semantics of SIGSTOP and applying it to 
the whole process instead of the thread in question (that code was added all 
the way back in RHEL3). So, if the process is slow enough (which it appears 
to be) the STOP is handled before the CONT can be sent. At that point the 
sending thread is stopped as well, so the CONT never goes out.

The only immediate choice you have is to remove that section of code. That 
may result in some sockets getting stuck in a semi-permanent open state, but 
that's better than having the whole process stopped. I will research 
alternate Linux-specific ways of getting around this problem. Perhaps I 
won't even need to, if current versions of Linux do not leave a socket 
operation in a pending state when the socket is closed. I'll find out when 
you comment out the code :-)
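
For reference, one commonly used alternative is to wake the blocked thread 
with a signal that has a do-nothing handler installed without SA_RESTART, so 
the pending call returns EINTR instead of the process being stopped. The 
following is only a minimal sketch of that idea, not what XrdLink does: the 
helper names (wakeupHandler, installWakeupHandler, 
demoInterruptBlockedThread), the choice of SIGUSR1, and the pipe standing in 
for a socket are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cerrno>
#include <csignal>
#include <pthread.h>
#include <unistd.h>

// Hypothetical names throughout; nothing here comes from the xrootd sources.

// Empty handler: its only purpose is to make a blocked call return EINTR.
static void wakeupHandler(int) {}

// Install the handler for SIGUSR1 (an assumed choice) WITHOUT SA_RESTART,
// so interrupted system calls fail with EINTR rather than being restarted.
bool installWakeupHandler()
{
    struct sigaction sa;
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = wakeupHandler;
    sa.sa_flags   = 0;
    return sigaction(SIGUSR1, &sa, nullptr) == 0;
}

struct BlockedRead
{
    int               fd;
    ssize_t           rc;
    int               err;
    std::atomic<bool> done;
};

// Thread body: blocks in read() until data arrives or a signal interrupts it.
static void *blockingReader(void *arg)
{
    BlockedRead *br = static_cast<BlockedRead *>(arg);
    char buf[16];
    br->rc   = read(br->fd, buf, sizeof(buf));
    br->err  = errno;
    br->done = true;
    return nullptr;
}

// Returns true if the reader thread was released from read() with EINTR.
// A pipe stands in for the socket; no data is written in the normal path.
bool demoInterruptBlockedThread()
{
    if (!installWakeupHandler()) return false;

    int fds[2];
    if (pipe(fds) != 0) return false;

    BlockedRead br;
    br.fd = fds[0]; br.rc = 0; br.err = 0; br.done = false;

    pthread_t tid;
    if (pthread_create(&tid, nullptr, blockingReader, &br) != 0)
       {close(fds[0]); close(fds[1]);
        return false;
       }

    // Re-send the signal until the reader reports completion. A single send
    // could arrive before the thread has entered read() and be wasted.
    for (int i = 0; i < 200 && !br.done; i++)
        {pthread_kill(tid, SIGUSR1);   // targets one thread, not the process
         usleep(10000);
        }

    // Fallback so the join below cannot hang if the signal never worked.
    if (!br.done) { ssize_t n = write(fds[1], "x", 1); (void)n; }

    pthread_join(tid, nullptr);
    close(fds[0]);
    close(fds[1]);
    return br.rc == -1 && br.err == EINTR;
}
```

Unlike SIGSTOP, a handled signal sent with pthread_kill() runs its handler 
only in the targeted thread and leaves the rest of the process running. 
Whether this actually unsticks a socket operation after the socket is closed 
would still need to be checked on the Linux versions in question.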

Andy

> Andy
>
> -----Original Message----- From: Brian Bockelman
> Sent: Tuesday, August 23, 2011 1:26 PM
> To: xrootd-dev
> Subject: xrootd redirector repeatedly "crashing"
>
> Hi,
>
> Our global redirector stops responding every 30 minutes or so; it's 
> actually not crashing, but appears to be getting SIGSTOP.
>
> There's nothing on the system that would be sending this signal.  However, 
> I see the following code in XrdLink:
>
>  if (tBound)
>     {tBound = 0;
> #ifdef __linux__
>      if (!XrdSysThread::Same(curTID, XrdSysThread::ID()))
>         {XrdSysThread::Signal(curTID, SIGSTOP);
>          XrdSysThread::Signal(curTID, SIGCONT);
>         }
> #endif
>     }
>
> Are we 100% sure that's the right thing, and there's no way that SIGSTOP 
> is delivered to the wrong thread?
>
> Brian