Hi Erik,

Good to know! Ah, these plugins are always a thorn, sigh.

Andy

On Mon, 5 May 2014, Erik Gough wrote:

> Hi Andrew,
>
> The unresponsive server problem has been fixed by updating the lcmaps
> scas client plugin package.  The strace output was showing a ton of
> repeated attempts at lcmaps logging and we noticed that the package
> wasn't at its newest revision.
>
> We were running this version:
> lcmaps-plugins-scas-client-0.3.4-1.3.osg.el6.x86_64
>
> We updated to this version:
> lcmaps-plugins-scas-client-0.4.0-1.2.osg32.el6.x86_64
>
> And things have been running fine for 2 days now with no issues.
>
> We will implement the idle option to see if we can get around the
> CLOSE_WAIT issue and let you know.
>
> Thanks for the help,
> -Erik
>
> On Fri, 2014-05-02 at 10:37 -0700, Andrew Hanushevsky wrote:
>> Hi Erik,
>>
>> I immediately see the problem from lsof. You have run out of file
>> descriptors. This is caused by thousands of TCP connections going into
>> the CLOSE_WAIT state. That means that the client closed the TCP connection
>> but the server was never notified that the connection closed. We've seen
>> this problem in certain versions of Linux and we have definitely seen this
>> problem when clients run in certain virtual machines. There is an
>> immediate bypass for this problem: enable connection timeouts
>> in the server. You can find out how to set it here:
>>
>> http://xrootd.org/doc/prod/xrd_config.htm#_Toc310725348
>>
>> Specifically, see the "idle" option. Usually 1 hour is good enough but it
>> really depends on how much load you get.  The log file for that day would
>> be extremely helpful in deciding.
>>
>> So, two other questions:
>>
>> 1) What version of Linux are you running (please include uname -a output)?
>> 2) Are your clients running in virtual machines?
>>
>> Andy
>>
>> On Fri, 2 May 2014, Matevz Tadel wrote:
>>
>>> Hi Erik,
>>>
>>> We had problems with CRLs at UCSD, also affecting xrootd, over the last
>>> couple of days, but they never caused the servers to lock up.
>>>
>>> I see you are running 3.3.6 on all servers other than
>>> cms-a026.rcac.purdue.edu, which is running 3.3.1:
>>> http://xrootd.t2.ucsd.edu/dump_cache.jsp?pred=%25%2FCMS%3A%3APurdue%3A%3AXrdReport%2F%25%2Fver&submit=Filter
>>>
>>> When did this start? Is it correlated to an upgrade of some sort?
>>>
>>> You say servers stop logging anything and the only solution is to restart
>>> them ... does that mean the state is unrecoverable? Do you see any process
>>> activity at all?
>>>
>>> The thing that would really help is the output of gcore.
>>>
>>> Cheers,
>>> Matevz
>>>
>>> On 05/01/14 08:38, Erik Gough wrote:
>>>> Hello,
>>>>
>>>> We have been experiencing an issue at Purdue where our redirector and
>>>> servers will stop responding to client requests to read data.  It looks
>>>> like the issue happens during the authentication process.  The xrootd
>>>> process stops logging anything during this time.  The only solution I
>>>> have found is to restart the xrootd process.  After that, things start
>>>> working normally again.
>>>>
>>>> I attached output from strace, netstat, lsof and limits.  Strace shows a
>>>> bunch of read/writes for what looks like lcmaps logging.  Netstat shows
>>>> a ton of connections in a CLOSE_WAIT state, but not to the point where
>>>> the process is going to run out of FDs.
>>>>
>>>> Also I attached two attempts at xrdcp from an unresponsive server.  One
>>>> when getting redirected from our redirector and one when copying
>>>> directly from the server.
>>>>
>>>> Other processes on the same server like gridftp are still authenticating
>>>> with gums properly during this time.
>>>>
>>>> Can you help?  We are facing servers becoming unresponsive on a daily
>>>> basis.  Please let me know if you need more information.
>>>>
>>>> There is also an OSG ticket on this issue:
>>>> https://ticket.grid.iu.edu/20867
>>>>
>>>> Thanks,
>>>> -Erik
>>>>
>>>>
>>>> ########################################################################
>>>> Use REPLY-ALL to reply to list
>>>>
>>>> To unsubscribe from the XROOTD-L list, click the following link:
>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
>>>>
>>>
>
> -- 
> Erik Gough
> IT Systems Specialist
> Bindley Bioscience Center
> Discovery Park at Purdue University
> (765) 496-3975
>
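
A note on the diagnosis quoted above: the CLOSE_WAIT buildup Andy spotted in the lsof output can also be tallied from `netstat -tan` style text. The following is only a minimal Python sketch and is not part of the original thread. The function name `count_tcp_states`, the sample rows, and the addresses (taken from documentation ranges) are all made up for illustration; the only assumption about the input is the usual Linux column layout with the state in the sixth column.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Tally the State column of `netstat -tan` style output."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows look like: proto recv-q send-q local foreign state.
        # Header lines do not start with "tcp" and are skipped.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


# Hypothetical sample input, NOT taken from the attachments in this thread.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:1094            0.0.0.0:*               LISTEN
tcp        1      0 192.0.2.10:1094         198.51.100.7:40112      CLOSE_WAIT
tcp        1      0 192.0.2.10:1094         198.51.100.8:51544      CLOSE_WAIT
tcp        0      0 192.0.2.10:1094         198.51.100.9:38210      ESTABLISHED
"""

if __name__ == "__main__":
    # Print states from most to least frequent.
    for state, n in count_tcp_states(SAMPLE).most_common():
        print(f"{state:12} {n}")
```

Comparing the CLOSE_WAIT count against the process's open-file limit (the "limits" output Erik attached) gives a rough idea of how close the server is to running out of descriptors.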
