On 08/22/14 09:53, Brian Bockelman wrote:
> On Aug 22, 2014, at 10:45 AM, Matevz Tadel wrote:
>
>> Hi everybody,
>>
>> As things scale up there are more and more cases where users' jobs get -9 killed by the batch system (or by the user realizing they did something stupid).
>>
>> Servers know nothing about this as xrootd never checks the sockets to see if there's anybody still at the other end. Consequently, monitoring thinks the file is still open ... the inactivity cutoff I have in the XrdMon collector is 1 day! Whatever happens, the close time is wildly off.
>>
>> At the moment I have like 80% of open files on the collector in this state ... close to 10,000 coming just from EOS at FNAL. Grrr, etc.
>>
>> Does it make sense to add a configuration option to make servers perform aliveness checks on connected clients?
>>
>> I know, client applications should be shut down properly ...
>
> Hi Matevz,
>
> I'm not sure this explanation completely makes sense. "kill -9" of a client just kills the process; the TCP socket is closed on the network level by the OS. So, there should no longer be a valid TCP socket at the server.
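
The point above can be checked locally. The following is a minimal sketch, not anything taken from Xrootd: a listener accepts a connection from a child process, the child is killed without ever closing its socket, and the server side then reads from the connection. The names `CHILD_CODE` and `recv_after_sigkill` are made up for this illustration. It only covers the case where the client host stays up and nothing in between interferes with the connection; if the host itself disappears or a stateful firewall drops the connection state, no FIN reaches the server.

```python
import socket
import subprocess
import sys

# Child process: connect to the given port, then idle without closing the socket.
CHILD_CODE = (
    "import socket, sys, time\n"
    "s = socket.create_connection(('127.0.0.1', int(sys.argv[1])))\n"
    "time.sleep(60)\n"
)


def recv_after_sigkill():
    """Kill a connected client process and return what the server then reads."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        listener.settimeout(10)
        port = listener.getsockname()[1]

        child = subprocess.Popen([sys.executable, "-c", CHILD_CODE, str(port)])
        try:
            conn, _ = listener.accept()
            with conn:
                child.kill()  # SIGKILL on POSIX; the process gets no chance to clean up
                child.wait()
                conn.settimeout(10)
                # The OS closes the dead process's socket, so the server sees
                # end-of-stream (an empty read) instead of blocking forever.
                return conn.recv(1)
        finally:
            if child.poll() is None:
                child.kill()
                child.wait()


if __name__ == "__main__":
    print(repr(recv_after_sigkill()))
```

An empty read here means the server was told about the close, so a server that is reading from (or polling) the socket would notice the dead client in this scenario.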
Thanks Brian! Yes, I admit I don't know -- that's why I asked. But this also
happens for Carl's tests on all sites, at least it used to, and he told me he
forcefully kills the jobs after the test is done. We talked about it and he said
he'll try to make shutdown gentler, not sure what's happened there.
(For those of you in CMS VOMS:
https://sentry.t2.ucsd.edu:4242/xuser/?fqhn&user_re=Vuosalo
)
> However - if there's a stateful firewall in the way (which doesn't apply to FNAL FWIW), it may do all sorts of screwy things to the connection state.
>
> Does the Xrootd server at least enable TCP keepalive? That'll close out dead connections after 2 hours.
I don't think so ... I see things hanging up to 24 hours easily (when collector
decides to give up on the session). Can this timeout be set at socket creation time?
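
On the per-socket question: TCP keepalive is switched on per socket with `SO_KEEPALIVE`, and on Linux the timing can also be overridden per socket, right after the socket is created or accepted. The 2 hours mentioned above corresponds to the Linux system default idle time of 7200 seconds. Below is a minimal sketch in Python, assuming a Linux-like platform; the helper name `enable_keepalive` and the chosen values are made up for illustration, and nothing in this thread establishes what the Xrootd server actually sets.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Enable TCP keepalive on sock and, where supported, tune its timing.

    idle_s:     seconds of idle time before the first probe is sent
    interval_s: seconds between unanswered probes
    probes:     unanswered probes before the connection is dropped
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The timing options are platform specific (present on Linux), so each
    # one is applied only if this platform's socket module exposes it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

With these illustrative values, a peer that has silently vanished would be dropped after roughly idle_s + interval_s * probes seconds (about 110 s) of silence, instead of the system-wide default.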
Matevz
> Brian
>
> ########################################################################
> Use REPLY-ALL to reply to list
>
> To unsubscribe from the XROOTD-DEV list, click the following link:
> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1
>