Hi Andrew,

The unresponsive server problem has been fixed by updating the lcmaps
scas client plugin package. The strace output was showing a ton of
repeated attempts at lcmaps logging, and we noticed that the package
wasn't at its newest revision.

We were running this version:
lcmaps-plugins-scas-client-0.3.4-1.3.osg.el6.x86_64

We updated to this version:
lcmaps-plugins-scas-client-0.4.0-1.2.osg32.el6.x86_64

And things have been running fine for 2 days now with no issues.

We will implement the idle option to see if we can get around the
close_wait issue and let you know.

Thanks for the help,
-Erik

On Fri, 2014-05-02 at 10:37 -0700, Andrew Hanushevsky wrote:
> Hi Erik,
>
> I immediately see the problem from lsof. You have run out of file
> descriptors. This is caused by thousands of TCP connections going into
> close-wait state. That means that the client closed the TCP connection but
> the server was never notified that the connection closed. We've seen this
> problem in certain versions of Linux and we have definitely seen this
> problem when clients run in certain virtual machines. There is an
> immediate bypass for this problem, that is to enable connection timeouts
> in the server. You can find out how to set it here:
>
> http://xrootd.org/doc/prod/xrd_config.htm#_Toc310725348
>
> specifically, the "idle" option. Usually 1 hour is good enough but it
> really depends on how much load you get. The log file for that day would
> be extremely helpful in deciding.
>
> So, two other questions:
>
> 1) What version of Linux (please include uname -a output)?
> 2) Are your clients running in virtual machines?
>
> Andy
>
> On Fri, 2 May 2014, Matevz Tadel wrote:
>
> > Hi Erik,
> >
> > We had problems with CRLs at UCSD, also affecting xrootd, last couple of days
> > but it never caused the servers to lock up.
> >
> > I see you are running 3.3.6 on all servers other than
> > cms-a026.rcac.purdue.edu that is running 3.3.1:
> > http://xrootd.t2.ucsd.edu/dump_cache.jsp?pred=%25%2FCMS%3A%3APurdue%3A%3AXrdReport%2F%25%2Fver&submit=Filter
> >
> > When did this start? Is it correlated to an upgrade of some sort?
> >
> > You say servers stop logging anything and the only solution is to restart
> > them ... does that mean the state is unrecoverable? Do you see any process
> > activity at all?
> >
> > The thing that would really help is output of gcore.
> >
> > Cheers,
> > Matevz
> >
> > On 05/01/14 08:38, Erik Gough wrote:
> >> Hello,
> >>
> >> We have been experiencing an issue at Purdue where our redirector and
> >> servers will stop responding to client requests to read data. It looks
> >> like the issue happens during the authentication process. The xrootd
> >> process stops logging anything during this time. The only solution I
> >> have found is to restart the xrootd process. After that, things start
> >> working normally again.
> >>
> >> I attached output from strace, netstat, lsof and limits. Strace shows a
> >> bunch of read/writes for what looks like lcmaps logging. Netstat shows
> >> a ton of connections in a CLOSE_WAIT state, but not to the point where
> >> the process is going to run out of FDs.
> >>
> >> Also I attached two attempts at xrdcp from an unresponsive server. One
> >> when getting redirected from our redirector and one when copying
> >> directly from the server.
> >>
> >> Other processes on the same server like gridftp are still authenticating
> >> with gums properly during this time.
> >>
> >> Can you help? We are facing servers becoming unresponsive on a daily
> >> basis. Please let me know if you need more information.
> >>
> >> There is also an OSG ticket on this issue:
> >> https://ticket.grid.iu.edu/20867
> >>
> >> Thanks,
> >> -Erik
> >>
> >>
> >> ########################################################################
> >> Use REPLY-ALL to reply to list
> >>
> >> To unsubscribe from the XROOTD-L list, click the following link:
> >> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
> >>

-- 
Erik Gough
IT Systems Specialist
Bindley Bioscience Center
Discovery Park at Purdue University
(765) 496-3975
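
For anyone who lands on this thread with similar symptoms: one way to keep an eye on the CLOSE_WAIT buildup described above is to tally the state column of `netstat -tan` output. The following is only a minimal sketch and was not used by anyone in this thread. The function name `count_tcp_states` and the sample addresses are made up for illustration, and it assumes the usual six-column netstat layout with the state in the last column.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Tally the state column of `netstat -tan`-style lines.

    Hypothetical helper: assumes rows of the form
    proto recv-q send-q local-address foreign-address state
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a tcp/tcp6 row.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


# Made-up sample output; the addresses and ports are illustrative only.
sample = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:1094            0.0.0.0:*               LISTEN
tcp        1      0 10.0.0.5:1094           10.0.0.9:41234          CLOSE_WAIT
tcp        1      0 10.0.0.5:1094           10.0.0.9:41235          CLOSE_WAIT
tcp        0      0 10.0.0.5:1094           10.0.0.7:50122          ESTABLISHED
"""

if __name__ == "__main__":
    for state, n in count_tcp_states(sample).most_common():
        print(state, n)
```

Comparing the CLOSE_WAIT count against the process's open-file limit (the "limits" output mentioned in the original report) gives a rough idea of how close the server is to running out of descriptors.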