Hi Andrew,
The unresponsive server problem has been fixed by updating the lcmaps
scas client plugin package. The strace output was showing a ton of
repeated attempts at lcmaps logging and we noticed that the package
wasn't at its newest revision.
We were running this version:
lcmaps-plugins-scas-client-0.3.4-1.3.osg.el6.x86_64
We updated to this version:
lcmaps-plugins-scas-client-0.4.0-1.2.osg32.el6.x86_64
And things have been running fine for 2 days now with no issues.
We will implement the idle option to see if we can get around the
CLOSE_WAIT issue and let you know.
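To judge whether the idle option actually helps, the number of sockets per
TCP state can be tracked over time. Below is a minimal Python sketch that
counts states in `netstat -tan` style text. The function name
count_tcp_states and the sample rows are made up for illustration; they are
not taken from the attachments in this thread.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP sockets per state from `netstat -tan` style text."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with "tcp" or "tcp6"; the state is the 6th column.
        # Header lines ("Active ...", "Proto ...") are skipped by this check.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


# Hypothetical sample rows (addresses invented for the example).
sample = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:1094            0.0.0.0:*               LISTEN
tcp        1      0 10.0.0.5:1094           10.0.0.9:40112          CLOSE_WAIT
tcp        1      0 10.0.0.5:1094           10.0.0.9:40113          CLOSE_WAIT
tcp        0      0 10.0.0.5:1094           10.0.0.7:51234          ESTABLISHED
"""

counts = count_tcp_states(sample)
print(counts["CLOSE_WAIT"])
```

On a live server the same function could be fed the captured output of
`netstat -tan` at regular intervals, so a steadily growing CLOSE_WAIT count
is noticed before the process gets close to its file descriptor limit.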
Thanks for the help,
-Erik
On Fri, 2014-05-02 at 10:37 -0700, Andrew Hanushevsky wrote:
> Hi Erik,
>
> I immediately see the problem from lsof. You have run out of file
> descriptors. This is caused by thousands of TCP connections going into
> close-wait state. That means that the client closed the TCP connection but
> the server was never notified that the connection closed. We've seen this
> problem in certain versions of Linux and we have definitely seen this
> problem when clients run in certain virtual machines. There is an
> immediate bypass for this problem, that is to enable connection timeouts
> in the server. You can find out how to set it here:
>
> http://xrootd.org/doc/prod/xrd_config.htm#_Toc310725348
>
> specifically, the "idle" option. Usually 1 hour is good enough but it
> really depends on how much load you get. The log file for that day would
> be extremely helpful in deciding.
>
> So, two other questions:
>
> 1) What version of Linux are you running (please include uname -a output)?
> 2) Are your clients running in virtual machines?
>
> Andy
>
> On Fri, 2 May 2014, Matevz Tadel wrote:
>
> > Hi Erik,
> >
> > We had problems with CRLs at UCSD, also affecting xrootd, over the last
> > couple of days, but it never caused the servers to lock up.
> >
> > I see you are running 3.3.6 on all servers other than
> > cms-a026.rcac.purdue.edu that is running 3.3.1:
> > http://xrootd.t2.ucsd.edu/dump_cache.jsp?pred=%25%2FCMS%3A%3APurdue%3A%3AXrdReport%2F%25%2Fver&submit=Filter
> >
> > When did this start? Is it correlated to an upgrade of some sort?
> >
> > You say servers stop logging anything and the only solution is to restart
> > them ... does that mean the state is unrecoverable? Do you see any process
> > activity at all?
> >
> > The thing that would really help is output of gcore.
> >
> > Cheers,
> > Matevz
> >
> > On 05/01/14 08:38, Erik Gough wrote:
> >> Hello,
> >>
> >> We have been experiencing an issue at Purdue where our redirector and
> >> servers will stop responding to client requests to read data. It looks
> >> like the issue happens during the authentication process. The xrootd
> >> process stops logging anything during this time. The only solution I
> >> have found is to restart the xrootd process. After that, things start
> >> working normally again.
> >>
> >> I attached output from strace, netstat, lsof and limits. Strace shows a
> >> bunch of read/writes for what looks like lcmaps logging. Netstat shows
> >> a ton of connections in a CLOSE_WAIT state, but not to the point where
> >> the process is going to run out of FDs.
> >>
> >> Also I attached two attempts at xrdcp from an unresponsive server. One
> >> when getting redirected from our redirector and one when copying
> >> directly from the server.
> >>
> >> Other processes on the same server like gridftp are still authenticating
> >> with gums properly during this time.
> >>
> >> Can you help? We are facing servers becoming unresponsive on a daily
> >> basis. Please let me know if you need more information.
> >>
> >> There is also an OSG ticket on this issue:
> >> https://ticket.grid.iu.edu/20867
> >>
> >> Thanks,
> >> -Erik
> >>
> >>
> >> ########################################################################
> >> Use REPLY-ALL to reply to list
> >>
> >> To unsubscribe from the XROOTD-L list, click the following link:
> >> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
> >>
> >
--
Erik Gough
IT Systems Specialist
Bindley Bioscience Center
Discovery Park at Purdue University
(765) 496-3975