Yes, that's what we do for the BaBar farm at SLAC. It runs VMs, and the way
they stop a job is by issuing a kill -9 on the VM. The hypervisor doesn't
clean up the connection in that case. The timeout then closes idle
connections. The suggested value is a bit too large; I think we use
something like a day, as most jobs don't sit on an idle connection for that
long. That said, I think Brian just doesn't want to deal with another
config change.
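
For reference, a one-day setting with the directive Andreas suggests below
would look something like this (86400 seconds is simply 24 hours spelled
out; tune the value to your workload):

    xrd.timeout idle 86400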

Andy

On Tue, 2 Sep 2014, Andreas-Joachim Peters wrote:

> Why don't you just set a very conservative idle timeout on the server by
> default? It does not really do any harm and it cleans up stale connections
> from VMs, right?
>
> xrd.timeout idle 604800
>
> Cheers Andreas.
>
>
>
>
>
> On Tue, Sep 2, 2014 at 3:59 PM, Brian Bockelman <[log in to unmask]>
> wrote:
>
>> On Sep 2, 2014, at 2:34 AM, Andrew Hanushevsky <[log in to unmask]>
>> wrote:
>>
>>> I guess I don't get what keepalives would solve relative to the client
>>> other than somewhat faster recovery in the rare case that a server goes
>>> away. A lot of work for handling a <10% problem. The bigger problem is
>>> clients going away and the server not being told that this has happened.
>>> This is particularly bad when the client is a virtual machine, as some
>>> hypervisors handle this correctly and some do not. Firewalls and NAT boxes
>>> make this even more problematic.
>>>
>>> I see the point of enabling keepalive by default. However, as a practical
>>> measure, this actually is a big change, as one would need to implement a
>>> way to turn it off (the current implementation simply allows you to turn
>>> it on), let alone allow a keepalive time specification.
>>>
>>> Additionally, I am not at all convinced that, at scale, it would
>>> actually solve the problem. Brian, are you always running with keepalive
>>> on, and does it actually solve all of your vaporizing client issues?
>>
>> Well, saying it solves "all" is a big claim (and HTCondor doesn't provide
>> enough statistics for me to back up the claim anyway).  It does, however,
>> mitigate this to the point where we haven't had to spend time on the issue
>> for several months (since we deployed the relevant version).  When the
>> problem was originally fixed, we did collect enough statistics to say this
>> "solved" things at problem sites.
>>
>> *Note* that this doesn't solve the problem of an overloaded site network -
>> it just helps the server to not have to track broken connections.  If the
>> network device is overloaded, detecting and re-establishing a TCP
>> connection will not help.
>>
>> I agree the client-side change is mostly just allowing a quicker
>> recovery.  However, I think the server-side change is worth the hassle to
>> clear up dead connections.
>>
>> Since dead connections only cause problems in aggregate (i.e., we don't
>> need to tune keepalive down to 1 minute), why not:
>>
>> a) Always turn keepalive on; remove this as an option, and
>> b) Provide no mechanism for specifying a keepalive time.
>>
>> Seems simpler and I can't think of any large downsides (although maybe
>> that's because I've only had 1 cup of coffee today).
>>
>> Brian
>>
>>>
>>> Andy
>>>
>>> P.S. I agree that the keepalive mechanism in TCP won't cause a
>>> scalability issue. It is, however, a particular issue with proxies and
>>> NAT boxes that can't track all of the connections in real time; in that
>>> case you may get a false indication that the client is dead. As I said,
>>> in the xroot world that shouldn't matter, as the client would simply
>>> reconnect.
>>>
>>> On Thu, 28 Aug 2014, Brian Bockelman wrote:
>>>
>>>> If you're going to enable keepalive in the client -
>>>>
>>>> You might want to think about manually tuning the keepalive timeouts
>>>> down from the defaults (2 hours).  I recently adjusted it down to around
>>>> 5 minutes in HTCondor because 2 hours was "too late" to detect the
>>>> disconnect and recover the jobs.
>>>>
>>>> There's a socket option to do this in Linux (which travels under a
>>>> different name in Mac OS X... not sure about Solaris).  Again, we've not
>>>> seen any kernel scalability issues from doing this.
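
For concreteness, the socket-level tuning described above looks roughly like
the sketch below. This is an illustrative C fragment, not code taken from
HTCondor or xrootd; the 5-minute/30-second/4-probe values are assumptions
chosen only to match the "around 5 minutes" figure mentioned above.

    /* Enable TCP keepalive on a connected socket and shorten the probe
     * timing from the usual 2-hour default to roughly 5 minutes.        */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    static int set_keepalive(int fd)
    {
        int on    = 1;
        int idle  = 300;   /* idle seconds before the first probe         */
        int intvl = 30;    /* seconds between unanswered probes           */
        int cnt   = 4;     /* probes before the peer is declared dead     */

        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
            return -1;
    #ifdef TCP_KEEPIDLE                /* Linux name for the idle time    */
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
    #elif defined(TCP_KEEPALIVE)       /* Mac OS X name for the same knob */
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPALIVE, &idle, sizeof(idle));
    #endif
    #ifdef TCP_KEEPINTVL
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
    #endif
    #ifdef TCP_KEEPCNT
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
    #endif
        return 0;
    }
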
>>>>
>>>> Brian
>>>>
>>>> On Aug 28, 2014, at 5:09 AM, Lukasz Janyst <[log in to unmask]> wrote:
>>>>
>>>>> Hi Brian,
>>>>>
>>>>>   for the server-side, it is Andy's call.
>>>>>
>>>>>   We have seen silent disconnection problems with ALICE sites in the
>>>>> past; this is why I set up the keepalive functionality for sockets in
>>>>> the old client. I will do the same for the new one as well.
>>>>>
>>>>> Cheers,
>>>>>  Lukasz
>>>>>
>>>>> On 08/25/2014 02:52 PM, Brian Bockelman wrote:
>>>>>> Hi Lukasz, all,
>>>>>>
>>>>>> Can we enable keepalive by default?  I don't look forward to the task
>>>>>> of asking every site for a configuration change.
>>>>>>
>>>>>> At least on the Linux platform, we have observed the kernel is able
>>>>>> to handle tens-of-thousands of sockets with keepalive enabled; it
>>>>>> doesn't appear to be a scalability issue.  There don't appear to be
>>>>>> any built-in protocol features we could use on the server side
>>>>>> (although this doesn't appear to be needed on the client side).
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>> On Aug 25, 2014, at 2:08 AM, Lukasz Janyst <[log in to unmask]> wrote:
>>>>>>
>>>>>>> On 08/22/2014 06:59 PM, Matevz Tadel wrote:
>>>>>>>>> Does the Xrootd server at least enable TCP keepalive?  That'll
>>>>>>>>> close out dead connections after 2 hours.
>>>>>>>>
>>>>>>>> I don't think so ... I see things hanging up to 24 hours easily
>>>>>>>> (when the collector decides to give up on the session). Can this
>>>>>>>> timeout be set at socket creation time?
>>>>>>>
>>>>>>>  Typically, this is handled by the TCP stack, but the
>>>>>>> routers/firewalls on the way often mess things up. To enable the OS
>>>>>>> keepalive for xrootd sockets, you need to ask for it:
>>>>>>> http://xrootd.org/doc/prod/xrd_config.htm#_Toc310725344
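
In the configuration reference linked above, this comes down to a single
server-side line; the exact directive name here is quoted from memory, so
check the linked manual for the form used by your release:

    xrd.network keepalive
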
>>>>>>>
>>>>>>> Cheers,
>>>>>>>  Lukasz
>>>>>>>
>>>>>>
>>>>
>>
>

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-DEV list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1