Hi Patrick,

Judging from the fixes, it took a long time for the server to recycle a
socket. The slower or more heavily loaded the system, the longer it took.
There were two problems here: a) the code was optimized for opening new
connections, not closing old ones, and b) a bug in the scheduler made
matters even worse. So I am not surprised. Additionally, once the server
gets into this situation, the client goes into error recovery mode, which
causes even more "stale" connections, so everything spirals downward as the
overloaded machine gets slower and slower.

Andy
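
The CLOSE_WAIT build-up described in this thread can be watched over time.
What follows is a minimal Python sketch, not part of xrootd, that tallies
TCP socket states from text in the Linux /proc/net/tcp format. The function
names and the optional port filter are illustrative assumptions, and the
counts cover all IPv4 sockets on the host, not only those held by xrootd.

```python
from collections import Counter

# Hex state codes used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def read_proc_net_tcp(path="/proc/net/tcp"):
    """Return the raw socket table text (Linux only, IPv4 only)."""
    with open(path) as handle:
        return handle.read()


def count_tcp_states(table_text, local_port=None):
    """Count sockets per TCP state.

    table_text: contents in /proc/net/tcp format (first line is a header).
    local_port: hypothetical filter; when given, only sockets bound to
                this local port are counted.
    """
    counts = Counter()
    for line in table_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated rows
        # fields[1] is "HEXADDR:HEXPORT" for the local end of the socket.
        port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port is not None and port != local_port:
            continue
        state = TCP_STATES.get(fields[3].upper(), fields[3])
        counts[state] += 1
    return counts
```

On a Linux host, `count_tcp_states(read_proc_net_tcp())["CLOSE_WAIT"]`
gives the current number of sockets waiting for the local application to
close them; a value that keeps growing toward the FD limit would match the
symptom reported here.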

----- Original Message ----- 
From: "Patrick McGuigan" <[log in to unmask]>
To: "Andrew Hanushevsky" <[log in to unmask]>
Cc: "Tanya Levshina" <[log in to unmask]>; <[log in to unmask]>; 
<[log in to unmask]>
Sent: Monday, April 12, 2010 3:45 PM
Subject: Re: Overloaded Xrootd dataserver?


> Hi Andy,
>
> We are using Linux (2.6.18-164.2.1.el5).  I will look at rolling out a 
> newer version of Xrootd and increasing the FD limit.
>
> In the current configuration the data being written into the system is 
> sent to only two dataservers.  The second data server (bigger disks, more 
> memory, more cores) is not having the same problem.  Any suspicions on why 
> only one data server is getting crushed?
>
> Patrick
>
>
> Andrew Hanushevsky wrote:
>> Yes, I would recommend upgrading to 20100315-1007, as it fixes a couple
>> of issues in this area that allowed sockets for closed connections to
>> remain open far longer than they should. The issue was very pronounced
>> on Solaris, not as much on Linux (which OS are you using?). In any case,
>> *please* increase the FD hard limit to at least 8-16K (32K would be
>> best).
>>
>> Andy
>>
>> ----- Original Message -----
>> From: "Tanya Levshina" <[log in to unmask]>
>> To: "'Andrew Hanushevsky'" <[log in to unmask]>; "'Patrick McGuigan'" 
>> <[log in to unmask]>; <[log in to unmask]>
>> Cc: <[log in to unmask]>
>> Sent: Monday, April 12, 2010 3:11 PM
>> Subject: RE: Overloaded Xrootd dataserver?
>>
>>
>>> Hi,
>>>
>>> We should add these recommendations to the OSG Release Documentation.
>>> Patrick, if increasing the number of open files does not help and if the
>>> "CLOSE_WAIT" problem has been solved in the xrootd 20100315-1007 release,
>>> you can probably upgrade xrootd from the ITB cache.
>>>
>>> Thanks,
>>> Tanya
>>>
>>>
>>> -----Original Message-----
>>> From: [log in to unmask]
>>> [mailto:[log in to unmask]] On Behalf Of Andrew
>>> Hanushevsky
>>> Sent: Monday, April 12, 2010 4:58 PM
>>> To: Patrick McGuigan; [log in to unmask]
>>> Cc: [log in to unmask]
>>> Subject: Re: Overloaded Xrootd dataserver?
>>>
>>> Hi Patrick,
>>>
>>> Please tell me the release you are running. We did put in a CLOSE_WAIT
>>> fix recently. That aside, we always recommend setting the FD limit as
>>> high as practical for your OS (at least 8K and preferably 16K to 32K).
>>> 1K is not recommended and will likely lead to problems regardless of any
>>> extant bugs.
>>>
>>> Andy
>>>
>>> ----- Original Message -----
>>> From: "Patrick McGuigan" <[log in to unmask]>
>>> To: <[log in to unmask]>
>>> Cc: <[log in to unmask]>
>>> Sent: Monday, April 12, 2010 2:08 PM
>>> Subject: Overloaded Xrootd dataserver?
>>>
>>>
>>>> Hi,
>>>>
>>>> I am having an issue with one of our data servers and it may be getting
>>>> overloaded with requests from clients.
>>>>
>>>> The symptoms are that the load on the SRM machine will get very large
>>>> because threads there are talking through XrootdFS for various
>>>> connections to the dataserver.  Various activities related to Xrootd
>>>> will fail (SRM gets hung, gridftp servers won't send data).
>>>>
>>>> When logged into the dataserver and running strace on the xrootd
>>>> service, I see that it has a problem in accept() because of too many
>>>> open files.
>>>>
>>>> If I do a netstat I see that xrootd is holding a large number of
>>>> sockets in a CLOSE_WAIT state.
>>>>
>>>> I am trying to understand whether the problems that I am seeing are
>>>> because the limits (1024 open FDs) given to xrootd are too small, or
>>>> whether the service is too overloaded and this is causing xrootd to
>>>> hang on to too many sockets.
>>>>
>>>> Regards,
>>>>
>>>> Patrick
>>>>
>>>
>>>
>>
>
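
For reference, the descriptor limit discussed above (1024, versus the
recommended 8K to 32K) can be inspected from inside a process. This is a
hedged sketch using Python's standard resource module on a Unix-like
system; it is not how xrootd itself sets its limits, and the helper names
and the 8K default are assumptions taken from the figures quoted in the
thread. An unprivileged process can only raise its soft limit up to the
hard limit; raising the hard limit is an administrator task outside this
sketch.

```python
import resource

MIN_RECOMMENDED = 8 * 1024  # "at least 8K", as suggested in the thread


def effective_target(soft, hard, target):
    """Pick the soft limit to request.

    Never lowers the current soft limit and never exceeds the hard limit.
    """
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited:
        return soft  # already unlimited, nothing to raise
    capped = target if hard == unlimited else min(target, hard)
    return max(soft, capped)


def raise_soft_nofile(target=MIN_RECOMMENDED):
    """Try to raise this process's soft open-file limit toward target.

    Returns the (soft, hard) pair in effect after the attempt.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = effective_target(soft, hard, target)
    if new_soft != soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError):
            pass  # the platform refused; keep the existing limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = raise_soft_nofile()
    print("open-file limit: soft=%s hard=%s" % (soft, hard))
    if hard != resource.RLIM_INFINITY and hard < MIN_RECOMMENDED:
        print("hard limit is below 8K; an administrator must raise it")
```

If the reported hard limit is still 1024, the soft limit cannot go any
higher from within the process, which is the situation described in the
original report.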