Why don't you just set a very conservative idle timeout on the server by default? It does no real harm and cleans up stale connections from vanished VMs, right?

xrd.timeout idle 604800   # 604800 s = one week

Cheers, Andreas.





On Tue, Sep 2, 2014 at 3:59 PM, Brian Bockelman <[log in to unmask]> wrote:
On Sep 2, 2014, at 2:34 AM, Andrew Hanushevsky <[log in to unmask]> wrote:

> I guess I don't get what keepalives would solve relative to the client, other than somewhat faster recovery in the rare case that a server goes away. A lot of work for handling a <10% problem. The bigger problem is clients going away and the server not being told that this has happened. This is particularly bad when the client is a virtual machine, as some hypervisors handle this correctly and some do not. Firewalls and NAT boxes make this even more problematic.
>
> I see the point of enabling keepalive by default. However, as a practical measure, this is actually a big change, as one would need to implement a way to turn it off (the current implementation simply allows you to turn it on), let alone allow a keepalive time specification.
>
> Additionally, I am not at all convinced that, at scale, it would actually solve the problem. Brian, are you always running with keepalive on, and does it actually solve all of your vaporizing client issues?

Well, saying it solves "all" is a big claim (and HTCondor doesn't provide enough statistics for me to back up the claim anyway).  It does, however, mitigate this to the point where we haven't had to spend time on the issue for several months (since we deployed the relevant version).  When the problem was originally fixed, we did collect enough statistics to say this "solved" things at problem sites.

*Note* that this doesn't solve the problem of an overloaded site network - it just helps the server to not have to track broken connections.  If the network device is overloaded, detecting and re-establishing a TCP connection will not help.

I agree the client-side change is mostly just allowing a quicker recovery.  However, I think the server-side change is worth the hassle to clear up dead connections.

Since dead connections only cause problems in aggregate (i.e., we don't need to tune keepalive down to 1 minute), why not:

a) Always turn keepalive on; remove this as an option, and
b) Provide no mechanism for specifying a keepalive time.

Seems simpler and I can't think of any large downsides (although maybe that's because I've only had 1 cup of coffee today).

Brian

>
> Andy
>
> P.S. I agree that the keepalive mechanism in TCP won't cause a scalability issue. This is a particular issue with proxies and NAT boxes that can't track all of the connections in real time. In this case you may get a false indication that the client is dead. As I said, in the xroot world that shouldn't matter, as the client would simply reconnect.
>
> On Thu, 28 Aug 2014, Brian Bockelman wrote:
>
>> If you're going to enable keepalive in the client -
>>
>> You might want to think about manually tuning the keepalive timeouts down from the defaults (2 hours).  I recently adjusted it down to around 5 minutes in HTCondor because 2 hours was "too late" to detect the disconnect to recover the jobs.
>>
>> There's a socket option to do this in Linux (which travels under a different name in Mac OS X... not sure about Solaris).  Again, we've not seen any kernel scalability issues from doing this.
>>
>> Brian
>>
>> On Aug 28, 2014, at 5:09 AM, Lukasz Janyst <[log in to unmask]> wrote:
>>
>>> Hi Brian,
>>>
>>>   for the server-side, it is Andy's call.
>>>
>>>   We have seen silent disconnection problems with ALICE sites in the past; this is why I set up the keepalive functionality for sockets in the old client. I will do the same for the new one as well.
>>>
>>> Cheers,
>>>  Lukasz
>>>
>>> On 08/25/2014 02:52 PM, Brian Bockelman wrote:
>>>> Hi Lukasz, all,
>>>>
>>>> Can we enable keepalive by default?  I don't look forward to the task of asking every site for a configuration change.
>>>>
>>>> At least on the Linux platform, we have observed the kernel is able to handle tens of thousands of sockets with keepalive enabled; it doesn't appear to be a scalability issue.  There don't appear to be any built-in protocol features we could use on the server side (although this doesn't appear to be needed on the client side).
>>>>
>>>> Brian
>>>>
>>>> On Aug 25, 2014, at 2:08 AM, Lukasz Janyst <[log in to unmask]> wrote:
>>>>
>>>>> On 08/22/2014 06:59 PM, Matevz Tadel wrote:
>>>>>>> Does the Xrootd server at least enable TCP keepalive?  That'll close
>>>>>>> out dead connections after 2 hours.
>>>>>>
>>>>>> I don't think so ... I see things hanging up to 24 hours easily (when
>>>>>> collector decides to give up on the session). Can this timeout be set at
>>>>>> socket creation time?
>>>>>
>>>>>  Typically, this is handled by the TCP stack, but the routers/firewalls along the way often mess things up. To enable the OS keepalive for xrootd sockets you need to ask for it: http://xrootd.org/doc/prod/xrd_config.htm#_Toc310725344
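For anyone who doesn't want to chase the link: as I remember it, the server-side directive looks like the fragment below (please verify the exact spelling against the linked xrd_config manual):

```
# Enable OS-level TCP keepalive on xrd sockets
xrd.network keepalive
```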
>>>>>
>>>>> Cheers,
>>>>>  Lukasz
>>>>>
>>>>> ########################################################################
>>>>> Use REPLY-ALL to reply to list
>>>>>
>>>>> To unsubscribe from the XROOTD-DEV list, click the following link:
>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1
>>>>
>>
>>



