LISTSERV 16.5 - XROOTD-DEV Archives

Hi Brian,

OK, so at the moment there really isn't anything you can do. The message 
looks more ominous than need be. It just means that there is now an internal 
queue of requests building up. So, things aren't as responsive as they could 
be. The message gets repeated every 4K tries of getting a new thread. Would 
be interesting to see how often the message goes out.

The longer term solution is to run more than one global redirector and set 
them up in load-balancing mode.

An even longer term solution is to not run each select in a separate thread 
but simply have a fixed pool of threads that execute that code. This is the 
first time I've seen you over-run the redirector, which means you must be 
getting thousands of requests per second. Can you send me two summary 
statistics taken about 10 seconds apart? Use the xrd command to 
connect to the redirector xrootd and issue "query 1 a".
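
To illustrate the fixed-pool idea mentioned above, here is a minimal sketch in 
Python. It is not the cmsd code; the names run_fixed_pool, handler and 
num_workers are hypothetical and chosen only for this example. Instead of 
starting one thread per request, a fixed number of workers pull requests from 
a shared queue, so a burst of requests lengthens the queue rather than the 
thread count.

```python
import queue
import threading


def run_fixed_pool(requests, handler, num_workers=4):
    """Process every request using a fixed number of worker threads.

    Hypothetical helper for illustration: `handler` stands in for the
    per-request "select" work; results are collected in completion order.
    """
    work = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work for this worker
                break
            outcome = handler(item)
            with results_lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    # Requests queue up here; the number of threads never grows.
    for request in requests:
        work.put(request)

    # One sentinel per worker so that each of them exits its loop.
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return results
```

With this shape, an overload shows up as a longer queue and slower responses, 
while the number of threads stays at num_workers.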

Andy

-----Original Message----- 
From: Brian Bockelman
Sent: Wednesday, July 06, 2011 3:34 PM
To: Andrew Hanushevsky
Cc: xrootd-dev
Subject: Re: Hitting thread limits?

Yes - sorry, I missed that question the first time.


On Jul 6, 2011, at 5:23 PM, Andrew Hanushevsky wrote:

> Hi Brian,
> So, this is the global redirector, yes?
>
> Andy
>
> -----Original Message-----
> From: Brian Bockelman
> Sent: Wednesday, July 06, 2011 3:19 PM
> To: Andrew Hanushevsky
> Cc: xrootd-dev
> Subject: Re: Hitting thread limits?
>
>
> On Jul 6, 2011, at 5:14 PM, Andrew Hanushevsky wrote:
>
>> Hi Brian,
>>
>> Hmmm, are you specifying the xrd.sched maxt directive? If so, shame on 
>> you and immediately remove it!
>>
>
> No, actually.
>
>> If not, is your OS limit set to 500? It shouldn't be; typically it should 
>> be at least 1K and usually 2K. Is the message coming from the xrootd or the 
>> cmsd? It makes a big difference. For the xrootd, the limit can be reached 
>> depending on how fast one can turn around a transaction. Internally, it's 
>> set to no less than 5 seconds to avoid rescheduling if the client has 
>> another request in the queue. For the redirector that may be longer than 
>> need be. If it's the cmsd then we need to look where the requests are 
>> coming from.  This is just a local redirector, yes? Or is this the global 
>> one?
>>
>
> This is from the cmsd: it turns out that one T2 has a completely 
> broken-down storage, and all requests were going to the redirector. 
> Unfortunately, the broken T2 is the only site in the US with heavy-ion 
> data... meaning the redirector searched pointlessly for files for the 500 
> clients, easily hitting 2048 threads.
>
> Not sure what we can do about this?
>
> Brian