On Aug 23, 2011, at 3:41 PM, Andrew Hanushevsky wrote:

> Hi Brian,
> 
> Seems like this just started happening, did it not?

Correct.  One site had its SE dumped from underneath it, and this particular site is the only Heavy-Ion site in the US.  Hence, there are many clients, all requesting files that can't be found.

> If so, what else has changed (e.g. Linux patches)?

Nope, no new versions of anything.

> Anyway, that snippet of code was added to get around a nasty Linux-only "feature". Could you add a cerr there to see if we are actually getting to that code (display the two thread values)? There should be no way that the code stops itself without something else going on.
> 

If I understand the code, it does the SIGSTOP/SIGCONT only if the thread ID is not the current running thread?

Is it possible something racy is happening with the tBound member?  I.e., can two threads enter the if (tBound) section at once?

Is the xrootd signal delivery thread-safe?

> Andy
> 
> -----Original Message----- From: Brian Bockelman
> Sent: Tuesday, August 23, 2011 1:26 PM
> To: xrootd-dev
> Subject: xrootd redirector repeatedly "crashing"
> 
> Hi,
> 
> Our global redirector stops responding every 30 minutes or so; it's actually not crashing, but appears to be getting SIGSTOP.
> 
> There's nothing on the system that would be sending this signal.  However, I see the following code in XrdLink:
> 
>  if (tBound)
>     {tBound = 0;
> #ifdef __linux__
>      if (!XrdSysThread::Same(curTID, XrdSysThread::ID()))
>         {XrdSysThread::Signal(curTID, SIGSTOP);
>          XrdSysThread::Signal(curTID, SIGCONT);
>         }
> #endif
>     }
> 
> Are we 100% sure that's the right thing, and there's no way that SIGSTOP is delivered to the wrong thread?
> 
> Brian