Hi Patrick,

Based on the fixes, it took a long time for the server to recycle a
socket. The slower or more highly loaded the system, the longer it took.
There were two problems here: a) the code was optimized for opening new
connections, not closing old ones, and b) a bug in the scheduler made
matters even worse. So, I am not surprised. Additionally, once the server
gets into this situation, the client goes into error recovery mode, which
causes even more "stale" connections; so everything spirals downwards as
the machine gets slower and slower as it overloads.

Andy

----- Original Message -----
From: "Patrick McGuigan" <[log in to unmask]>
To: "Andrew Hanushevsky" <[log in to unmask]>
Cc: "Tanya Levshina" <[log in to unmask]>; <[log in to unmask]>; <[log in to unmask]>
Sent: Monday, April 12, 2010 3:45 PM
Subject: Re: Overloaded Xrootd dataserver?

> Hi Andy,
>
> We are using Linux (2.6.18-164.2.1.el5). I will look at rolling out a
> newer version of Xrootd and increasing the FD limit.
>
> In the current configuration the data being written into the system is
> sent to only two dataservers. The second data server (bigger disks, more
> memory, more cores) is not having the same problem. Any suspicions on why
> only one data server is getting crushed?
>
> Patrick
>
>
> Andrew Hanushevsky wrote:
>> Yes, I would recommend upgrading to 20100315-1007 as it fixes a couple of
>> issues in this area which allowed sockets for closed connections to
>> remain open far longer than they should. The issue was very pronounced
>> in Solaris, not as much in Linux (which OS are you using?). In any case,
>> *please* increase the FD hard limit to at least 8-16K (32K would be
>> best).
>>
>> Andy
>>
>> ----- Original Message -----
>> From: "Tanya Levshina" <[log in to unmask]>
>> To: "'Andrew Hanushevsky'" <[log in to unmask]>; "'Patrick McGuigan'"
>> <[log in to unmask]>; <[log in to unmask]>
>> Cc: <[log in to unmask]>
>> Sent: Monday, April 12, 2010 3:11 PM
>> Subject: RE: Overloaded Xrootd dataserver?
>>
>>
>>> Hi,
>>>
>>> We should add these recommendations to the OSG Release Documentation.
>>> Patrick, if increasing the number of open files does not help and if
>>> the "CLOSE_WAIT" problem has been solved in the xrootd 20100315-1007
>>> release, you can probably upgrade xrootd from the ITB cache.
>>>
>>> Thanks,
>>> Tanya
>>>
>>>
>>> -----Original Message-----
>>> From: [log in to unmask]
>>> [mailto:[log in to unmask]] On Behalf Of Andrew Hanushevsky
>>> Sent: Monday, April 12, 2010 4:58 PM
>>> To: Patrick McGuigan; [log in to unmask]
>>> Cc: [log in to unmask]
>>> Subject: Re: Overloaded Xrootd dataserver?
>>>
>>> Hi Patrick,
>>>
>>> Please tell me the release you are running. We did put in a CLOSE_WAIT
>>> fix recently. That aside, we always recommend setting the FD limit as
>>> high as practical for your OS (at least 8K and preferably 16K to 32K).
>>> 1K is not recommended and will likely lead to problems regardless of
>>> any extant bugs.
>>>
>>> Andy
>>>
>>> ----- Original Message -----
>>> From: "Patrick McGuigan" <[log in to unmask]>
>>> To: <[log in to unmask]>
>>> Cc: <[log in to unmask]>
>>> Sent: Monday, April 12, 2010 2:08 PM
>>> Subject: Overloaded Xrootd dataserver?
>>>
>>>
>>>> Hi,
>>>>
>>>> I am having an issue with one of our data servers and it may be getting
>>>> overloaded with requests from clients.
>>>>
>>>> The symptoms are that the load on the SRM machine will get very large
>>>> because threads there are talking through XrootdFS for various
>>>> connections to the dataserver. Various activities related to Xrootd
>>>> will fail (SRM gets hung, gridftp servers won't send data).
>>>>
>>>> When logged into the dataserver and running strace on the xrootd
>>>> service, I see that it has a problem in accept() because of too many
>>>> open files.
>>>>
>>>> If I do a netstat, I see that xrootd is holding a large number of
>>>> sockets in a CLOSE_WAIT state.
>>>>
>>>> I am trying to understand whether the problems that I am seeing are
>>>> because the limits (1024 open FDs) given to xrootd are too small, or
>>>> whether the service is too overloaded and this is causing xrootd to
>>>> hang on to too many sockets.
>>>>
>>>> Regards,
>>>>
>>>> Patrick
>>>>
>>>
>>>
>>
>
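
The two checks discussed in this thread (the open-file limit and the number of sockets stuck in CLOSE_WAIT) can be sketched in Python roughly as follows. This is only an illustration and is not part of xrootd: the function names are hypothetical, fd_limits() reports the limits of the process that runs the script (not of a running xrootd daemon), and the socket count is system-wide for IPv4 on Linux rather than per-process as "netstat -p" would show.

```python
import resource
from collections import Counter

# TCP state codes used in the "st" column of Linux /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def fd_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state, given the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated rows
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts


def read_proc_net_tcp(path="/proc/net/tcp"):
    """Read the kernel's IPv4 TCP socket table (Linux only)."""
    with open(path) as fh:
        return fh.read()
```

On the data server, count_tcp_states(read_proc_net_tcp())["CLOSE_WAIT"] would give the system-wide CLOSE_WAIT count. If that number approaches the soft limit reported by fd_limits() in the daemon's environment, accept() failing with "too many open files" is the expected symptom.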