OK, for reference, here are links to the relevant MonALISA pages:

http://xrootd.t2.ucsd.edu?49
http://xrootd.t2.ucsd.edu?50
http://xrootd.t2.ucsd.edu?51
http://xrootd.t2.ucsd.edu?52

Cheers,
Matevz

On 06/27/14 11:01, [log in to unmask] wrote:
>
> I've just enabled the monitoring on all the proxy machines.
>
> Regards,
> Andrew.
>
> ________________________________________
> From: Matevz Tadel [[log in to unmask]]
> Sent: Friday, June 27, 2014 6:29 PM
> To: Andrew Hanushevsky; Matevz Tadel
> Cc: xrootd-dev; Lahiff, Andrew (STFC,RAL,PPD)
> Subject: Re: Stalls at outgoing proxy
>
> Thanks Andy,
>
> So we have to figure out whether this is really proxy overload or the networking
> infrastructure.
>
> Andrew, how does this look at your machine monitoring level? You could also
> enable summary monitoring on all proxy machines; then we'll also be able to see
> the actual CPU usage by the xrootd and cmsd processes in MonALISA (I only see
> these hosts now: heplnx229, lcgclsf02, lcgvo03).
>
> xrd.report xrootd.t2.ucsd.edu:9931 every 30s all sync
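>
> As a minimal sketch, that directive would just go into the existing config file
> each proxy xrootd (and cmsd) already reads; the comments are my reading of the
> options and worth double-checking against the docs:
>
>    # report summary statistics to the collector at xrootd.t2.ucsd.edu:9931
>    #   every 30s - reporting interval
>    #   all       - include all statistics categories in the report
>    #   sync      - synchronize statistics gathering when a report is produced
>    xrd.report xrootd.t2.ucsd.edu:9931 every 30s all sync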
>
> Matevz
>
> On 06/27/14 10:12, Andrew Hanushevsky wrote:
>> Hi Matevz,
>>
>> OK, then all this probably means that client A goes after file X and the proxy
>> is very heavily loaded, so it takes a bit of time to actually open the file at
>> the remote location. While the open is taking place, client B tries to open the
>> same file, so client B is delayed until client A finishes opening the file.
>> Nothing particularly wrong here.
>>
>> Andy
>>
>> On Fri, 27 Jun 2014, Matevz Tadel wrote:
>>
>>> Thanks Andy,
>>>
>>> This is actually the standard proxy; RAL was running 4.0.0-rc1 the last time
>>> we talked about it.
>>>
>>> Andrew, have you upgraded to 4.0.0 yet?
>>>
>>> Matevz
>>>
>>> On 06/27/14 10:01, Andrew Hanushevsky wrote:
>>>> Hi Matevz,
>>>>
>>>> No need to turn on debugging here. This particular stall occurs because a file
>>>> is being opened and the OFS has found that the file is already open, or being
>>>> opened, by another client. So, it tries to piggy-back the new open on that handle
>>>> to avoid actually doing another physical open. The problem is that the other
>>>> client has not yet released the handle for use; it is likely hung up in the
>>>> proxy code trying to do the open, or perhaps a close. I thought the latter
>>>> problem was solved in the disk caching proxy by doing the closes in the
>>>> background, to avoid holding the handle lock for long periods of time.
>>>>
>>>> This is not a fatal problem; the client will eventually open the file. The OFS
>>>> layer uses this as congestion control when there is a lot of open/close
>>>> contention for the same file. I suppose you can trace opens and closes to get a
>>>> better feeling for how long this takes:
>>>>
>>>> ofs.trace open close
>>>>
>>>> Assuming this is a disk caching proxy, there may also be tracing options for it
>>>> that show what happens during the open/close sequence.
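>>>>
>>>> Pulling together the trace directives mentioned in this thread, a reasonable
>>>> starting point for the proxy config might be (a sketch only; the exact trace
>>>> flags should be checked against the version in use):
>>>>
>>>>    # time file opens and closes at the OFS layer, where the stall is issued
>>>>    ofs.trace open close
>>>>    # log redirect handling for clients at the protocol layer
>>>>    xrootd.trace redirect
>>>>    # log connection setup and teardown on the proxy
>>>>    xrd.trace conn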
>>>>
>>>> Andy
>>>>
>>>>    On Fri, 27 Jun 2014, Matevz Tadel wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> At RAL, they see the following on their outgoing proxy servers (repeating for
>>>>> about a minute before the file open times out at the application side):<<FNORD
>>>>>
>>>>> When our xrootd proxy cluster is busy, there are sometimes messages like this
>>>>> in the logs:
>>>>>
>>>>> 140626 16:53:25 24465  ofs_Stall: Stall 3: File
>>>>> 2EF5AF84-D65A-E311-AB3F-02163E00A0E1.root is being staged; estimated time to
>>>>> completion 3 seconds for
>>>>> /store/mc/Fall13/QCD_Pt-5to10_Tune4C_13TeV_pythia8/GEN-SIM/POSTLS162_V1_castor-v1/10000/2EF5AF84-D65A-E311-AB3F-02163E00A0E1.root
>>>>>
>>>>>
>>>>> 140626 16:53:25 24465 pcms054.6545:147@lcg1353 XrootdProtocol: stalling client
>>>>> for 3 sec
>>>>> 140626 16:53:25 24465 pcms054.6545:147@lcg1353 ofs_close: use=0 fn=dummy
>>>>>
>>>>> FNORD
>>>>>
>>>>> This probably means that the remote file cannot be opened for some reason
>>>>> (like being delayed by an external redirector/server)? Would there be a special
>>>>> error if the socket cannot be opened (due to fd or firewall limits ... or
>>>>> some other internal limit)? Note that this only happens when the proxies are
>>>>> already under heavy load.
>>>>>
>>>>> What options should they set to debug this?
>>>>>
>>>>> pss.memcache debug ???
>>>>> xrd.trace    conn
>>>>> xrootd.trace redirect
>>>>>
>>>>> Matevz
>>>>>
>>>
>>
>

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-DEV list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1