Hi Erik,

I can see the problem immediately from the lsof output. You have run out 
of file descriptors. This is caused by thousands of TCP connections going 
into the close-wait state. That means that the client closed the TCP 
connection but the server was never notified that the connection closed. 
We've seen this problem in certain versions of Linux and we have definitely 
seen it when clients run in certain virtual machines. There is an immediate 
workaround for this problem: enable connection timeouts in the server. You 
can find out how to set them here:

http://xrootd.org/doc/prod/xrd_config.htm#_Toc310725348

specifically, the "idle" option. Usually 1 hour is good enough but it 
really depends on how much load you get.  The log file for that day would 
be extremely helpful in deciding.
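
To get a feel for how quickly close-wait connections pile up between restarts, something like the following could be run periodically against the output of netstat -tan. This is only a rough sketch in Python, not part of xrootd; the function name count_tcp_states and the sample text (documentation addresses, with port 1094 assumed as the xrootd port) are made up for illustration.

```python
from collections import Counter


def count_tcp_states(netstat_text):
    """Count TCP connection states in `netstat -tan` style output."""
    states = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Connection lines have six columns:
        # proto, recv-q, send-q, local address, foreign address, state
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


# Illustrative sample only; real input would come from `netstat -tan`.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:1094            0.0.0.0:*               LISTEN
tcp        1      0 192.0.2.5:1094          192.0.2.21:40112        CLOSE_WAIT
tcp        1      0 192.0.2.5:1094          192.0.2.22:51514        CLOSE_WAIT
tcp        0      0 192.0.2.5:1094          192.0.2.23:38020        ESTABLISHED
"""

if __name__ == "__main__":
    counts = count_tcp_states(SAMPLE)
    total = sum(counts.values())
    print(counts["CLOSE_WAIT"], "of", total, "TCP sockets are in CLOSE_WAIT")
```

Comparing the CLOSE_WAIT count against the process's open-file limit over a day should show roughly how long the server can run before descriptors are exhausted, which in turn suggests how short the idle timeout needs to be.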

So, two other questions:

1) What version of Linux are you running (please include uname -a output)?
2) Are your clients running in virtual machines?

Andy

On Fri, 2 May 2014, Matevz Tadel wrote:

> Hi Erik,
>
> We had problems with CRLs at UCSD, also affecting xrootd, last couple of days 
> but it never caused the servers to lock up.
>
> I see you are running 3.3.6 on all servers other than 
> cms-a026.rcac.purdue.edu that is running 3.3.1:
> http://xrootd.t2.ucsd.edu/dump_cache.jsp?pred=%25%2FCMS%3A%3APurdue%3A%3AXrdReport%2F%25%2Fver&submit=Filter
>
> When did this start? Is it correlated to an upgrade of some sort?
>
> You say servers stop logging anything and the only solution is to restart 
> them ... does that mean the state is unrecoverable? Do you see any process 
> activity at all?
>
> The thing that would really help is output of gcore.
>
> Cheers,
> Matevz
>
> On 05/01/14 08:38, Erik Gough wrote:
>> Hello,
>> 
>> We have been experiencing an issue at Purdue where our redirector and
>> servers will stop responding to client requests to read data.  It looks
>> like the issue happens during the authentication process.  The xrootd
>> process stops logging anything during this time.  The only solution I
>> have found is to restart the xrootd process.  After that, things start
>> working normally again.
>> 
>> I attached output from strace, netstat, lsof and limits.  Strace shows a
>> bunch of read/writes for what looks like lcmaps logging.  Netstat shows
>> a ton of connections in a CLOSE_WAIT state, but not to the point where
>> the process is going to run out of FDs.
>> 
>> Also I attached two attempts at xrdcp from an unresponsive server.  One
>> when getting redirected from our redirector and one when copying
>> directly from the server.
>> 
>> Other processes on the same server like gridftp are still authenticating
>> with gums properly during this time.
>> 
>> Can you help?  We are facing servers becoming unresponsive on a daily
>> basis.  Please let me know if you need more information.
>> 
>> There is also an OSG ticket on this issue:
>> https://ticket.grid.iu.edu/20867
>> 
>> Thanks,
>> -Erik
>> 
>> 
>> ########################################################################
>> Use REPLY-ALL to reply to list
>> 
>> To unsubscribe from the XROOTD-L list, click the following link:
>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
>> 