Yes, that's what we do for the BaBar farm at SLAC. It runs VMs, and the way
they stop a job is by issuing a kill -9 on the VM. The hypervisor doesn't
clean up the connection in that case, and the timeout then closes the idle
connections. The suggested value is a bit too large; I think we use something
like a day, as most jobs don't sit on an idle connection for that long. That
said, I think Brian just doesn't want to deal with another config change.

Andy

On Tue, 2 Sep 2014, Andreas-Joachim Peters wrote:

> Why don't you just set a very conservative idle timeout on the server by
> default? This does not really do any harm and cleans up stale connections
> of VMs, right?
>
> xrd.timeout idle 604800
>
> Cheers Andreas.
>
> On Tue, Sep 2, 2014 at 3:59 PM, Brian Bockelman <[log in to unmask]> wrote:
>
>> On Sep 2, 2014, at 2:34 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>
>>> I guess I don't get what keepalives would solve relative to the client
>>> other than somewhat faster recovery in the rare case that a server goes
>>> away. A lot of work for handling a <10% problem. The bigger problem is
>>> clients going away and the server not being told that this has happened.
>>> This is particularly bad when the client is a virtual machine, as some
>>> hypervisors handle this correctly and some do not. Firewalls and NAT
>>> boxes make this even more problematic.
>>>
>>> I see the point of enabling keepalive by default. However, as a
>>> practical measure, this actually is a big change, as one would need to
>>> implement a way to turn it off (the current implementation simply allows
>>> you to turn it on), let alone allow a keepalive time specification.
>>>
>>> Additionally, I am not at all convinced that, at scale, it would
>>> actually solve the problem. Brian, are you always running with keepalive
>>> on, and does it actually solve all of your vaporizing client issues?
>>
>> Well, saying it solves "all" is a big claim (and HTCondor doesn't provide
>> enough statistics for me to back up the claim anyway). It does, however,
>> mitigate this to the point where we haven't had to spend time on the
>> issue for several months (since we deployed the relevant version). When
>> the problem was originally fixed, we did collect enough statistics to say
>> this "solved" things at problem sites.
>>
>> *Note* that this doesn't solve the problem of an overloaded site network -
>> it just helps the server to not have to track broken connections. If the
>> network device is overloaded, detecting and re-establishing a TCP
>> connection will not help.
>>
>> I agree the client-side change is mostly just allowing a quicker
>> recovery. However, I think the server-side change is worth the hassle to
>> clear up dead connections.
>>
>> Since dead connections only cause problems in aggregate (i.e., we don't
>> need to tune keepalive down to 1 minute), why not:
>>
>> a) always turn keepalive on and remove it as an option, and
>> b) provide no mechanism for a keepalive time specification?
>>
>> Seems simpler, and I can't think of any large downsides (although maybe
>> that's because I've only had 1 cup of coffee today).
>>
>> Brian
>>
>>> Andy
>>>
>>> P.S. I agree that the keepalive mechanism in TCP won't cause a
>>> scalability issue. It is, however, a particular issue for proxies and
>>> NAT boxes that can't track all of the connections in real time; in that
>>> case you may get a false indication that the client is dead. As I said,
>>> in the xroot world that shouldn't matter, as the client would simply
>>> reconnect.
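
For concreteness, a minimal sketch of the server-side configuration under
discussion. The one-day idle value follows Andy's "something like a day"
remark rather than the week suggested above, and xrd.network keepalive is the
directive name assumed here for the OS keepalive option Lukasz points to
further down the thread:

    # close connections that have sat idle for roughly a day (seconds)
    xrd.timeout idle 86400

    # ask the OS to probe idle sockets with TCP keepalive (assumed directive)
    xrd.network keepalive
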
>>>
>>> On Thu, 28 Aug 2014, Brian Bockelman wrote:
>>>
>>>> If you're going to enable keepalive in the client:
>>>>
>>>> You might want to think about manually tuning the keepalive timeouts
>>>> down from the defaults (2 hours). I recently adjusted it down to around
>>>> 5 minutes in HTCondor because 2 hours was "too late" to detect the
>>>> disconnect and recover the jobs.
>>>>
>>>> There's a socket option to do this in Linux (which travels under a
>>>> different name in Mac OS X... not sure about Solaris). Again, we've not
>>>> seen any kernel scalability issues from doing this.
>>>>
>>>> Brian
>>>>
>>>> On Aug 28, 2014, at 5:09 AM, Lukasz Janyst <[log in to unmask]> wrote:
>>>>
>>>>> Hi Brian,
>>>>>
>>>>> for the server side, it is Andy's call.
>>>>>
>>>>> We have seen silent disconnection problems with ALICE sites in the
>>>>> past, which is why I set up the keepalive functionality for sockets in
>>>>> the old client. I will do the same for the new one as well.
>>>>>
>>>>> Cheers,
>>>>> Lukasz
>>>>>
>>>>> On 08/25/2014 02:52 PM, Brian Bockelman wrote:
>>>>>> Hi Lukasz, all,
>>>>>>
>>>>>> Can we enable keepalive by default? I don't look forward to the task
>>>>>> of asking every site for a configuration change.
>>>>>>
>>>>>> At least on the Linux platform, we have observed that the kernel is
>>>>>> able to handle tens of thousands of sockets with keepalive enabled;
>>>>>> it doesn't appear to be a scalability issue. There don't appear to be
>>>>>> any built-in protocol features we could use on the server side
>>>>>> (although such a feature doesn't appear to be needed on the client
>>>>>> side).
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>> On Aug 25, 2014, at 2:08 AM, Lukasz Janyst <[log in to unmask]> wrote:
>>>>>>
>>>>>>> On 08/22/2014 06:59 PM, Matevz Tadel wrote:
>>>>>>>>> Does the Xrootd server at least enable TCP keepalive? That'll
>>>>>>>>> close out dead connections after 2 hours.
>>>>>>>>
>>>>>>>> I don't think so ... I see things hanging up to 24 hours easily
>>>>>>>> (when the collector decides to give up on the session). Can this
>>>>>>>> timeout be set at socket creation time?
>>>>>>>
>>>>>>> Typically, this is handled by the TCP stack, but the routers and
>>>>>>> firewalls on the way often mess things up. To enable the OS
>>>>>>> keepalive for xrootd sockets you need to ask for it:
>>>>>>> http://xrootd.org/doc/prod/xrd_config.htm#_Toc310725344
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Lukasz
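
To make the keepalive tuning above concrete, here is a minimal sketch of what
shortening the probe timers looks like at the socket level on Linux. The
helper name and the specific values (5 minutes idle, 30-second probe
interval, 4 probes) are illustrative, not what any of the projects mentioned
actually ship; on Mac OS X the idle time is set via TCP_KEEPALIVE rather than
TCP_KEEPIDLE, which is presumably the "different name" Brian refers to.

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Illustrative helper: enable keepalive on a connected socket and
     * shorten the Linux probe timers from the 2-hour kernel default. */
    int enable_fast_keepalive(int fd)
    {
        int on       = 1;   /* turn keepalive on for this socket            */
        int idle     = 300; /* seconds of idleness before the first probe   */
        int interval = 30;  /* seconds between unanswered probes            */
        int count    = 4;   /* unanswered probes before the peer is dropped */

        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
            return -1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
            return -1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval,
                       sizeof(interval)) < 0)
            return -1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) < 0)
            return -1;
        return 0;
    }

With numbers like these, a vanished peer is declared dead after roughly
300 + 4 x 30 seconds of silence rather than the two-plus hours the kernel
defaults allow, which is the effect Brian describes getting from HTCondor's
adjustment.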