Hi Matevz,

OK, then all this probably means is that client a goes after file x 
while the proxy is very heavily loaded, so it takes a bit of time to 
actually open the file at the remote location. While the open is 
taking place, client b tries to open the same file, so client b is 
delayed until client a finishes opening the file. Nothing particularly 
wrong here.
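As a rough illustration of that handle sharing (a conceptual sketch of 
the idea only, not the actual OFS code; all names and the fixed 
3-second stall are illustrative), the first client to ask for a path 
does the physical open, and any client arriving while that open is 
still in flight gets a stall instead of a second physical open:

    // Conceptual sketch; not taken from the XRootD sources.
    #include <map>
    #include <memory>
    #include <mutex>
    #include <string>

    struct FileHandle { bool openDone = false; };

    std::mutex tableLock;
    std::map<std::string, std::shared_ptr<FileHandle>> openTable;

    // Returns 0 if the caller may proceed, or a stall time in seconds
    // when another client still holds the handle mid-open.
    int Open(const std::string& path)
    {
        std::shared_ptr<FileHandle> fh;
        {
            std::lock_guard<std::mutex> g(tableLock);
            auto it = openTable.find(path);
            if (it != openTable.end()) {
                // Handle exists: piggy-back if fully open, else stall.
                return it->second->openDone ? 0 : 3;
            }
            fh = std::make_shared<FileHandle>();
            openTable.emplace(path, fh);
        }
        // The slow physical open against the remote site happens here;
        // under heavy proxy load this is the step that delays everyone.
        std::lock_guard<std::mutex> g(tableLock);
        fh->openDone = true;
        return 0;
    }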

Andy

On Fri, 27 Jun 2014, Matevz Tadel wrote:

> Thanks Andy,
>
> This is actually the standard proxy; RAL was running 4.0.0-rc1 the 
> last time we talked about it.
>
> Andrew, have you upgraded to 4.0.0 yet?
>
> Matevz
>
> On 06/27/14 10:01, Andrew Hanushevsky wrote:
>> Hi Matevz,
>> 
>> No need to turn on debugging here. This particular stall occurs 
>> because a file is being opened and the OFS has found that the file 
>> is already open or being opened by another client. So, it tries to 
>> piggy-back the new open on that handle to avoid actually doing 
>> another physical open. The problem is that the other client has not 
>> yet released the handle for use; it is likely hung up in the proxy 
>> code trying to do the open or perhaps a close. The latter problem I 
>> thought was solved by the disk caching proxy by doing the closes in 
>> the background to avoid holding on to the handle lock for long 
>> periods of time.
>> 
>> This is not a fatal problem; the client will eventually open the 
>> file. The ofs layer uses this as congestion control when there is a 
>> lot of open/close contention for the same file. I suppose you can 
>> trace opens and closes to get a better feeling of how long this 
>> takes:
>> 
>> ofs.trace open close
>> 
>> Assuming this is a disk caching proxy, there may be tracing options 
>> for that to see what happens during the open/close sequence.
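>> For reference, the tracing directives mentioned in this thread 
>> could go into the proxy config together (only directives already 
>> named here; adjust the set to taste):
>> 
>>     ofs.trace open close
>>     xrd.trace conn
>>     xrootd.trace redirect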
>> 
>> Andy
>>
>> On Fri, 27 Jun 2014, Matevz Tadel wrote:
>> 
>>> Hi,
>>> 
>>> At RAL, they see the following on their outgoing proxy servers 
>>> (repeating for about a minute before the file open times out at 
>>> the application side):
>>> 
>>> <<FNORD
>>> 
>>> When our xrootd proxy cluster is busy, there are sometimes 
>>> messages like this in the logs:
>>> 
>>> 140626 16:53:25 24465  ofs_Stall: Stall 3: File 2EF5AF84-D65A-E311-AB3F-02163E00A0E1.root is being staged; estimated time to completion 3 seconds for /store/mc/Fall13/QCD_Pt-5to10_Tune4C_13TeV_pythia8/GEN-SIM/POSTLS162_V1_castor-v1/10000/2EF5AF84-D65A-E311-AB3F-02163E00A0E1.root
>>> 
>>> 140626 16:53:25 24465 pcms054.6545:147@lcg1353 XrootdProtocol: stalling client for 3 sec
>>> 140626 16:53:25 24465 pcms054.6545:147@lcg1353 ofs_close: use=0 fn=dummy
>>> 
>>> FNORD
>>> 
>>> This probably means that the remote file cannot be opened for some 
>>> reason (like being delayed by an external redirector/server)? 
>>> Would there be a special error if the socket cannot be opened (due 
>>> to fd or firewall limits ... or some other internal limits)? Note 
>>> that this only happens when the proxies are already under heavy 
>>> load.
>>> 
>>> What options should they set to debug this?
>>> 
>>> pss.memcache debug ???
>>> xrd.trace    conn
>>> xrootd.trace redirect
>>> 
>>> Matevz
>>> 
>
