Hi Andrew,

The unresponsive server problem has been fixed by updating the lcmaps
scas client plugin package.  The strace output was showing a ton of
repeated attempts at lcmaps logging and we noticed that the package
wasn't at its newest revision.

We were running this version:
lcmaps-plugins-scas-client-0.3.4-1.3.osg.el6.x86_64

We updated to this version:
lcmaps-plugins-scas-client-0.4.0-1.2.osg32.el6.x86_64

And things have been running fine for 2 days now with no issues.
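
In case it is useful for watching whether the CLOSE_WAIT pile-up from the netstat/lsof output comes back, here is a rough Python sketch that counts sockets per TCP state by reading /proc/net/tcp on Linux. The function names are made up for illustration, and the port filter of 1094 is an assumption (the usual xrootd port); adjust it to whatever the server actually listens on.

```python
from collections import Counter
from pathlib import Path

# TCP state codes as they appear in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(lines, local_port=None):
    """Count sockets per TCP state from /proc/net/tcp-style lines.

    If local_port is given, only sockets on that local port are counted.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip the header row and anything too short to be a socket entry.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        # The local address looks like HEXADDR:HEXPORT.
        port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port is not None and port != local_port:
            continue
        counts[TCP_STATES.get(fields[3].upper(), fields[3])] += 1
    return counts


def read_proc_tcp():
    """Yield lines from the IPv4 and IPv6 TCP tables, where present."""
    for name in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(name)
        if path.exists():
            yield from path.read_text().splitlines()


if __name__ == "__main__":
    # Assumption: 1094 is the port the xrootd server listens on.
    states = count_tcp_states(read_proc_tcp(), local_port=1094)
    for state, n in states.most_common():
        print(f"{state:12s} {n}")
```

Run from cron or by hand, a steadily growing CLOSE_WAIT count would be the thing to look for; comparing it against the process's open-file limit shows how close the server is to running out of descriptors.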

We will implement the idle option to see if we can get around the
CLOSE_WAIT issue and let you know.
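
For reference, a sketch of what that setting might look like in the xrootd configuration file. It assumes the "idle" option Andy points to is the one on the xrd.timeout directive and that the value is given in seconds; the 1 hour figure is only his suggested starting point, not a tuned value.

```
# Assumption: close connections that have been idle for 1 hour (3600 s).
xrd.timeout idle 3600
```

The xrootd daemon would need a restart to pick up the change.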

Thanks for the help,
-Erik

On Fri, 2014-05-02 at 10:37 -0700, Andrew Hanushevsky wrote:
> Hi Erik,
> 
> I immediately see the problem from lsof. You have run out of file 
> descriptors. This is caused by thousands of TCP connections going into 
> the close-wait state. That means that the client closed the TCP connection but 
> the server was never notified that the connection closed. We've seen this 
> problem in certain versions of Linux and we have definitely seen this 
> problem when clients run in certain virtual machines. There is an 
> immediate bypass for this problem, which is to enable connection timeouts 
> in the server. You can find out how to set it here:
> 
> http://xrootd.org/doc/prod/xrd_config.htm#_Toc310725348
> 
> specifically, the "idle" option. Usually 1 hour is good enough but it 
> really depends on how much load you get.  The log file for that day would 
> be extremely helpful in deciding.
> 
> So, two other questions:
> 
> 1) What version of Linux are you running (please include uname -a output)?
> 2) Are your clients running in virtual machines?
> 
> Andy
> 
> On Fri, 2 May 2014, Matevz Tadel wrote:
> 
> > Hi Erik,
> >
> > We had problems with CRLs at UCSD, also affecting xrootd, over the last couple 
> > of days, but they never caused the servers to lock up.
> >
> > I see you are running 3.3.6 on all servers other than 
> > cms-a026.rcac.purdue.edu that is running 3.3.1:
> > http://xrootd.t2.ucsd.edu/dump_cache.jsp?pred=%25%2FCMS%3A%3APurdue%3A%3AXrdReport%2F%25%2Fver&submit=Filter
> >
> > When did this start? Is it correlated to an upgrade of some sort?
> >
> > You say servers stop logging anything and the only solution is to restart 
> > them ... does that mean the state is unrecoverable? Do you see any process 
> > activity at all?
> >
> > The thing that would really help is output of gcore.
> >
> > Cheers,
> > Matevz
> >
> > On 05/01/14 08:38, Erik Gough wrote:
> >> Hello,
> >> 
> >> We have been experiencing an issue at Purdue where our redirector and
> >> servers will stop responding to client requests to read data.  It looks
> >> like the issue happens during the authentication process.  The xrootd
> >> process stops logging anything during this time.  The only solution I
> >> have found is to restart the xrootd process.  After that, things start
> >> working normally again.
> >> 
> >> I attached output from strace, netstat, lsof and limits.  Strace shows a
> >> bunch of read/writes for what looks like lcmaps logging.  Netstat shows
> >> a ton of connections in a CLOSE_WAIT state, but not to the point where
> >> the process is going to run out of FDs.
> >> 
> >> Also I attached two attempts at xrdcp from an unresponsive server.  One
> >> when getting redirected from our redirector and one when copying
> >> directly from the server.
> >> 
> >> Other processes on the same server like gridftp are still authenticating
> >> with gums properly during this time.
> >> 
> >> Can you help?  We are facing servers becoming unresponsive on a daily
> >> basis.  Please let me know if you need more information.
> >> 
> >> There is also an OSG ticket on this issue:
> >> https://ticket.grid.iu.edu/20867
> >> 
> >> Thanks,
> >> -Erik
> >> 
> >> 
> >> ########################################################################
> >> Use REPLY-ALL to reply to list
> >> 
> >> To unsubscribe from the XROOTD-L list, click the following link:
> >> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
> >> 

-- 
Erik Gough
IT Systems Specialist
Bindley Bioscience Center
Discovery Park at Purdue University
(765) 496-3975
