Hi Erik,

Good to know! Ah, these plugins are always a thorn, sigh.

Andy

On Mon, 5 May 2014, Erik Gough wrote:

> Hi Andrew,
>
> The unresponsive server problem has been fixed by updating the lcmaps
> scas client plugin package. The strace output was showing a ton of
> repeated attempts at lcmaps logging and we noticed that the package
> wasn't at its newest revision.
>
> We were running this version:
> lcmaps-plugins-scas-client-0.3.4-1.3.osg.el6.x86_64
>
> We updated to this version:
> lcmaps-plugins-scas-client-0.4.0-1.2.osg32.el6.x86_64
>
> And things have been running fine for 2 days now with no issues.
>
> We will implement the idle option to see if we can get around the
> close_wait issue and let you know.
>
> Thanks for the help,
> -Erik
>
> On Fri, 2014-05-02 at 10:37 -0700, Andrew Hanushevsky wrote:
>> Hi Erik,
>>
>> I immediately see the problem from lsof. You have run out of file
>> descriptors. This is caused by thousands of TCP connections going into
>> close-wait state. That means that the client closed the TCP connection but
>> the server was never notified that the connection closed. We've seen this
>> problem in certain versions of Linux and we have definitely seen this
>> problem when clients run in certain virtual machines. There is an
>> immediate bypass for this problem, that is to enable connection timeouts
>> in the server. You can find out how to set it here:
>>
>> http://xrootd.org/doc/prod/xrd_config.htm#_Toc310725348
>>
>> specifically, the "idle" option. Usually 1 hour is good enough but it
>> really depends on how much load you get. The log file for that day would
>> be extremely helpful in deciding.
>>
>> So, two other questions:
>>
>> 1) What version of Linux (please include uname -a output),
>> 2) Are your clients running in virtual machines?
>>
>> Andy
>>
>> On Fri, 2 May 2014, Matevz Tadel wrote:
>>
>>> Hi Erik,
>>>
>>> We had problems with CRLs at UCSD, also affecting xrootd, over the last
>>> couple of days but it never caused the servers to lock up.
>>>
>>> I see you are running 3.3.6 on all servers other than
>>> cms-a026.rcac.purdue.edu, which is running 3.3.1:
>>> http://xrootd.t2.ucsd.edu/dump_cache.jsp?pred=%25%2FCMS%3A%3APurdue%3A%3AXrdReport%2F%25%2Fver&submit=Filter
>>>
>>> When did this start? Is it correlated to an upgrade of some sort?
>>>
>>> You say servers stop logging anything and the only solution is to restart
>>> them ... does that mean the state is unrecoverable? Do you see any process
>>> activity at all?
>>>
>>> The thing that would really help is the output of gcore.
>>>
>>> Cheers,
>>> Matevz
>>>
>>> On 05/01/14 08:38, Erik Gough wrote:
>>>> Hello,
>>>>
>>>> We have been experiencing an issue at Purdue where our redirector and
>>>> servers will stop responding to client requests to read data. It looks
>>>> like the issue happens during the authentication process. The xrootd
>>>> process stops logging anything during this time. The only solution I
>>>> have found is to restart the xrootd process. After that, things start
>>>> working normally again.
>>>>
>>>> I attached output from strace, netstat, lsof and limits. Strace shows a
>>>> bunch of read/writes for what looks like lcmaps logging. Netstat shows
>>>> a ton of connections in a CLOSE_WAIT state, but not to the point where
>>>> the process is going to run out of FDs.
>>>>
>>>> Also I attached two attempts at xrdcp from an unresponsive server. One
>>>> when getting redirected from our redirector and one when copying
>>>> directly from the server.
>>>>
>>>> Other processes on the same server like gridftp are still authenticating
>>>> with gums properly during this time.
>>>>
>>>> Can you help? We are facing servers becoming unresponsive on a daily
>>>> basis. Please let me know if you need more information.
>>>>
>>>> There is also an OSG ticket on this issue:
>>>> https://ticket.grid.iu.edu/20867
>>>>
>>>> Thanks,
>>>> -Erik
>>>>
>>>>
>>>> ########################################################################
>>>> Use REPLY-ALL to reply to list
>>>>
>>>> To unsubscribe from the XROOTD-L list, click the following link:
>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
>>>>
>
> --
> Erik Gough
> IT Systems Specialist
> Bindley Bioscience Center
> Discovery Park at Purdue University
> (765) 496-3975
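
As a rough way to watch for the close-wait buildup discussed in this thread, the sketch below tallies TCP sockets by state from Linux's /proc/net/tcp and /proc/net/tcp6. It is an illustration only and was not posted by anyone in the thread: the function names count_states and read_proc_tables are made up here, the counts are system-wide rather than limited to the xrootd process, and it assumes a Linux host.

```python
#!/usr/bin/env python3
"""Tally TCP sockets by state from /proc/net/tcp-style text (Linux only)."""
from collections import Counter

# Hex state codes from the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp_text):
    """Return a Counter of state name -> socket count (hypothetical helper)."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated rows
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts


def read_proc_tables(paths=("/proc/net/tcp", "/proc/net/tcp6")):
    """Sum the per-state counts over the IPv4 and IPv6 tables that exist."""
    total = Counter()
    for path in paths:
        try:
            with open(path) as fh:
                total += count_states(fh.read())
        except OSError:
            pass  # table not present (e.g. IPv6 disabled or non-Linux host)
    return total


if __name__ == "__main__":
    for state, n in sorted(read_proc_tables().items()):
        print(f"{state:12s} {n}")
```

Running it periodically and comparing the CLOSE_WAIT count against the process's open-file limit gives an early warning before the server runs out of descriptors; for a per-process view, lsof on the xrootd PID remains the more direct tool.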