Hi Andrew,

I agree the CLOSE_WAIT connections are an issue, which is what led us
to raise the FD limit a while back.  The limits for the xrootd process
are now configured with max open files set quite high, ~60k, and I
don't think we are hitting that limit anymore.  When we did hit the max
open file descriptor limit, we would still see an error printed in the
log.  Is there another FD limit you think we might be hitting?
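
For reference, this is roughly how I have been checking the limit and
the current usage on a live server (a quick sketch using the usual
/proc interfaces; adjust the pid lookup if your setup names the
process differently):

  # effective soft/hard max-open-files for the running xrootd
  XRD_PID=$(pgrep -x xrootd | head -1)
  grep 'Max open files' /proc/$XRD_PID/limits

  # descriptors currently held by the process
  ls /proc/$XRD_PID/fd | wc -l

  # connections owned by xrootd stuck in CLOSE_WAIT
  netstat -antp 2>/dev/null | grep xrootd | grep -c CLOSE_WAIT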

We can implement the 1h idle timeout config change and see what happens
with the CLOSE_WAIT connections.  I attached a log from yesterday for
one of our servers in case you can use it to determine a better value.
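
To make sure we are talking about the same change, here is the config
line I am planning to add, assuming the "idle" option from the page you
linked takes a value in seconds (please correct me if the syntax is
different):

  # drop connections that have been idle for 1 hour (3600 seconds)
  xrd.timeout idle 3600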

Linux version: 
Linux cms-g004.rcac.purdue.edu 2.6.32-431.11.2.el6.x86_64 #1 SMP Mon Mar
3 13:32:45 EST 2014 x86_64 x86_64 x86_64 GNU/Linux

We are not running any VMs.

Matevz, for your questions: 

Fetch-crl is set to run every 6 hours on our xrootd servers, and I
don't see any errors when it runs.  We saw the lockups on both the
3.3.1 and 3.3.6 versions, and I can't correlate them to a single event.
Maybe it is a load issue?  The number of xrootd servers we have keeps
getting smaller as we retire old nodes; in the past year we have gone
from 180 servers to around 90.

There is activity on the process, but clients can't interact with the
service.  The xrootd CPU usage jumps to 100%+.  I don't know if they
will ever come back from this state.  When I check in the morning I
notice some servers that have not logged anything for 5+ hours.
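
In case it is useful, this is more or less how I have been spotting the
stuck servers in the morning (a rough sketch; it assumes the stock log
location /var/log/xrootd/xrootd.log, so adjust for your layout):

  # flag a server whose xrootd log has not been written in over 5 hours
  find /var/log/xrootd/xrootd.log -mmin +300 -print

  # per-thread CPU for the xrootd process, to see what is spinning at 100%
  top -H -b -n 1 -p $(pgrep -x xrootd | head -1) | head -20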

I put a core dump here:
srm://srm.rcac.purdue.edu:8443/srm/v2/server?SFN=/mnt/hadoop/store/user/goughes/xrootd_gcore.out.20902
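
If it helps, the per-thread backtraces can be pulled out of that core
with something like the following (assuming gdb and the xrootd
debuginfo package are installed, and that the binary is the stock
/usr/bin/xrootd):

  # dump a backtrace for every thread in the core to a text file
  gdb -batch -ex 'thread apply all bt' /usr/bin/xrootd \
      xrootd_gcore.out.20902 > xrootd_threads.txt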

-Erik

On Fri, 2014-05-02 at 10:37 -0700, Andrew Hanushevsky wrote:
> Hi Erik,
> 
> I immediately see the problem from lsof. You have run out of file
> descriptors. This is caused by thousands of TCP connections going into
> close-wait state. That means that the client closed the TCP connection but
> the server was never notified that the connection closed. We've seen this
> problem in certain versions of Linux and we have definitely seen this
> problem when clients run in certain virtual machines. There is an
> immediate bypass for this problem, which is to enable connection timeouts
> in the server. You can find out how to set it here:
> 
> http://xrootd.org/doc/prod/xrd_config.htm#_Toc310725348
> 
> Specifically, the "idle" option. Usually 1 hour is good enough, but it
> really depends on how much load you get.  The log file for that day would
> be extremely helpful in deciding.
> 
> So, two other questions:
> 
> 1) What version of Linux (please include uname -a output),
> 2) Are your clients running in virtual machines?
> 
> Andy
> 
> On Fri, 2 May 2014, Matevz Tadel wrote:
> 
> > Hi Erik,
> >
> > We had problems with CRLs at UCSD over the last couple of days, also
> > affecting xrootd, but it never caused the servers to lock up.
> >
> > I see you are running 3.3.6 on all servers other than
> > cms-a026.rcac.purdue.edu, which is running 3.3.1:
> > http://xrootd.t2.ucsd.edu/dump_cache.jsp?pred=%25%2FCMS%3A%3APurdue%3A%3AXrdReport%2F%25%2Fver&submit=Filter
> >
> > When did this start? Is it correlated to an upgrade of some sort?
> >
> > You say servers stop logging anything and the only solution is to restart
> > them ... does that mean the state is unrecoverable? Do you see any process
> > activity at all?
> >
> > The thing that would really help is output of gcore.
> >
> > Cheers,
> > Matevz
> >
> > On 05/01/14 08:38, Erik Gough wrote:
> >> Hello,
> >> 
> >> We have been experiencing an issue at Purdue where our redirector and
> >> servers will stop responding to client requests to read data.  It looks
> >> like the issue happens during the authentication process.  The xrootd
> >> process stops logging anything during this time.  The only solution I
> >> have found is to restart the xrootd process.  After that, things start
> >> working normally again.
> >> 
> >> I attached output from strace, netstat, lsof and limits.  Strace shows a
> >> bunch of read/writes for what looks like lcmaps logging.  Netstat shows
> >> a ton of connections in a CLOSE_WAIT state, but not to the point where
> >> the process is going to run out of FDs.
> >> 
> >> Also I attached two attempts at xrdcp from an unresponsive server.  One
> >> when getting redirected from our redirector and one when copying
> >> directly from the server.
> >> 
> >> Other processes on the same server, like gridftp, are still authenticating
> >> with GUMS properly during this time.
> >> 
> >> Can you help?  We are facing servers becoming unresponsive on a daily
> >> basis.  Please let me know if you need more information.
> >> 
> >> There is also an OSG ticket on this issue:
> >> https://ticket.grid.iu.edu/20867
> >> 
> >> Thanks,
> >> -Erik
> >> 
> >> 



########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1