OK, for reference, here are links to relevant ML pages:

http://xrootd.t2.ucsd.edu?49
http://xrootd.t2.ucsd.edu?50
http://xrootd.t2.ucsd.edu?51
http://xrootd.t2.ucsd.edu?52

Cheers,
Matevz

On 06/27/14 11:01, [log in to unmask] wrote:
>
> I've just enabled the monitoring on all the proxy machines.
>
> Regards,
> Andrew.
>
> ________________________________________
> From: Matevz Tadel [[log in to unmask]]
> Sent: Friday, June 27, 2014 6:29 PM
> To: Andrew Hanushevsky; Matevz Tadel
> Cc: xrootd-dev; Lahiff, Andrew (STFC,RAL,PPD)
> Subject: Re: Stalls at outgoing proxy
>
> Thanks Andy,
>
> So we have to figure out whether this is really the proxy being overloaded or
> it's the networking infrastructure.
>
> Andrew, how does this look at your machine monitoring level? You could also
> enable summary monitoring on all proxy machines; then we'll also be able to see
> the actual CPU usage by the xrootd and cmsd processes in MonALISA (I only see
> these hosts now: heplnx229 lcgclsf02 lcgvo03).
>
>    xrd.report xrootd.t2.ucsd.edu:9931 every 30s all sync
>
> Matevz
>
> On 06/27/14 10:12, Andrew Hanushevsky wrote:
>> Hi Matevz,
>>
>> OK, then all this probably means is that client a goes after file x and the
>> proxy is very heavily loaded, so it takes a bit of time to actually open the
>> file at the remote location. While the open is taking place, client b tries to
>> open the same file. So, client b is delayed until client a finishes opening
>> the file. Nothing particularly wrong here.
>>
>> Andy
>>
>> On Fri, 27 Jun 2014, Matevz Tadel wrote:
>>
>>> Thanks Andy,
>>>
>>> This is actually the standard proxy; RAL was running 4.0.0-rc1 the last time
>>> we talked about it.
>>>
>>> Andrew, have you upgraded to 4.0.0 yet?
>>>
>>> Matevz
>>>
>>> On 06/27/14 10:01, Andrew Hanushevsky wrote:
>>>> Hi Matevz,
>>>>
>>>> No need to turn on debugging here.
>>>> This particular stall occurs because a file
>>>> is being opened and the OFS has found that the file is already open, or
>>>> being opened, by another client. So, it tries to piggy-back the new open on
>>>> that handle to avoid actually doing another physical open. The problem is
>>>> that the other client has not yet released the handle for use; it is likely
>>>> hung up in the proxy code trying to do the open, or perhaps a close. The
>>>> latter problem I thought was solved by the disk caching proxy by doing the
>>>> closes in the background, to avoid holding the handle lock for long periods
>>>> of time.
>>>>
>>>> This is not a fatal problem; the client will eventually open the file. The
>>>> ofs layer uses this as congestion control when there is a lot of open/close
>>>> contention for the same file. I suppose you can trace opens and closes to
>>>> get a better feeling of how long this takes:
>>>>
>>>>    ofs.trace open close
>>>>
>>>> Assuming this is a disk caching proxy, there may be tracing options for it
>>>> to see what happens during the open/close sequence.
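[Editor's note: the handle piggy-back behaviour described above can be sketched as a toy model. This is an illustrative Python model only, not the actual OFS C++ code; the class, method names, and the fixed 3-second stall estimate are all invented for the example.]

```python
import threading

STALL_SECONDS = 3  # hypothetical stall estimate returned to the waiting client


class HandleTable:
    """Toy model of an OFS-style file-handle table with piggy-backed opens."""

    def __init__(self):
        self._lock = threading.Lock()
        self._handles = {}  # path -> {"ready": bool}

    def open(self, path):
        """Return ("physical-open", path), ("stall", seconds), or ("ok", handle)."""
        with self._lock:
            handle = self._handles.get(path)
            if handle is None:
                # First client: register an in-flight open; the caller now
                # performs the (slow) physical open at the remote site.
                self._handles[path] = {"ready": False}
                return ("physical-open", path)
            if not handle["ready"]:
                # Another client holds the handle but has not finished the
                # physical open: piggy-back by stalling this client instead
                # of issuing a second physical open.
                return ("stall", STALL_SECONDS)
            # Handle is ready: reuse it directly.
            return ("ok", handle)

    def open_done(self, path):
        """Mark the physical open as finished, releasing waiters on next try."""
        with self._lock:
            self._handles[path]["ready"] = True
```

In this model the second client is simply told to come back after a few seconds, which is what the `ofs_Stall` messages further down in the thread correspond to; once the first open completes, subsequent opens reuse the handle immediately.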
>>>>
>>>> Andy
>>>>
>>>> On Fri, 27 Jun 2014, Matevz Tadel wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> At RAL, they see the following on their outgoing proxy servers (repeating
>>>>> for about a minute before the file open times out on the application
>>>>> side):
>>>>>
>>>>> <<FNORD
>>>>>
>>>>> When our xrootd proxy cluster is busy, there are sometimes messages like
>>>>> this in the logs:
>>>>>
>>>>> 140626 16:53:25 24465 ofs_Stall: Stall 3: File
>>>>> 2EF5AF84-D65A-E311-AB3F-02163E00A0E1.root is being staged; estimated time to
>>>>> completion 3 seconds for
>>>>> /store/mc/Fall13/QCD_Pt-5to10_Tune4C_13TeV_pythia8/GEN-SIM/POSTLS162_V1_castor-v1/10000/2EF5AF84-D65A-E311-AB3F-02163E00A0E1.root
>>>>>
>>>>> 140626 16:53:25 24465 pcms054.6545:147@lcg1353 XrootdProtocol: stalling client
>>>>> for 3 sec
>>>>> 140626 16:53:25 24465 pcms054.6545:147@lcg1353 ofs_close: use=0 fn=dummy
>>>>>
>>>>> FNORD
>>>>>
>>>>> This probably means that the remote file cannot be opened for some reason
>>>>> (like being delayed by an external redirector/server)? Would there be a
>>>>> special error if the socket cannot be opened (due to fd or firewall
>>>>> limits ... or some other internal limits)? Note that this only happens
>>>>> when the proxies are already under heavy load.
>>>>>
>>>>> What options should they set to debug this?
>>>>>
>>>>>    pss.memcache debug ???
>>>>>    xrd.trace conn
>>>>>    xrootd.trace redirect
>>>>>
>>>>> Matevz
>>>>>
>>>>> ########################################################################
>>>>> Use REPLY-ALL to reply to list
>>>>>
>>>>> To unsubscribe from the XROOTD-DEV list, click the following link:
>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1
>>>>>
>>>
>>
>
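[Editor's note: when grepping proxy logs for the stalls discussed in this thread, a short script can summarize how often each file is being stalled and for how long in total. This is a sketch; the regular expression assumes the exact `ofs_Stall` message format quoted above, and the function name is invented.]

```python
import re
from collections import Counter

# Matches ofs_Stall lines of the form quoted in the thread, e.g.
# "140626 16:53:25 24465 ofs_Stall: Stall 3: File XYZ.root is being staged; ..."
STALL_RE = re.compile(
    r"^(\d{6}) (\d\d:\d\d:\d\d) \d+ ofs_Stall: Stall (\d+): File (\S+)"
)


def summarize_stalls(log_lines):
    """Count stall messages per file and sum the requested stall seconds."""
    per_file = Counter()
    total_seconds = 0
    for line in log_lines:
        m = STALL_RE.match(line)
        if m:
            _date, _time, seconds, fname = m.groups()
            per_file[fname] += 1
            total_seconds += int(seconds)
    return per_file, total_seconds
```

Feeding it a proxy log (one log record per line) gives a quick picture of whether the stalls concentrate on a few hot files, which is what the piggy-back/congestion-control explanation earlier in the thread would predict.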