On 2020-01-08 05:54, Bockelman, Brian wrote:
> Why though?  There's certainly a cost to every new option we add and I was 
> curious why someone might want to turn it off.

The option would be used to toggle between triedrc=resel and triedrc=reseg ... 
and then, thinking about this, I got a bit confused and asked Andy to confirm 
what reseg actually does: it seems to turn on the multi-src consideration now, 
but the static redirect of the client seems to be done in either case (resel and 
reseg) when the server is configured for no-multisource and for static redirect.
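
For concreteness, here is a minimal sketch of how a client could toggle between
the two values when building the request URL. The helper name, host names and
file path are hypothetical, and the comma-separated tried= list is my assumption
about the opaque format, not something taken from the XrdAdaptor code:

```python
def add_tried(url, tried_hosts, triedrc):
    """Append tried= and triedrc= opaque parameters to an XRootD URL.

    Hypothetical helper: url is the plain root:// URL, tried_hosts the
    servers the client already used, triedrc the reason ("resel" or "reseg").
    """
    # Use '&' if the URL already carries opaque data, '?' otherwise.
    sep = "&" if "?" in url else "?"
    return f"{url}{sep}tried={','.join(tried_hosts)}&triedrc={triedrc}"


# Hypothetical redirector, cache server and file path.
base = "root://redirector.example.org//store/data/file.root"
print(add_tried(base, ["cache01.example.org"], "resel"))
print(add_tried(base, ["cache01.example.org"], "reseg"))
```

The only difference between the two requests is the triedrc value, which is
what the proposed option would control.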

This is the commit that added resel / reseg separation:
https://github.com/xrootd/xrootd/commit/cfdf448f4d3d93103f8f1a39244cd8e60936f67f

When the maximum number of retries is reached, the static redirection is always 
done, if configured (assuming I'm reading the code right):

https://github.com/xrootd/xrootd/blob/master/src/XrdCms/XrdCmsNode.cc#L1080


I was sort of expecting (but don't see the code doing it):

a) no-multi-src and triedrc=resel -> error
b) no-multi-src and triedrc=reseg -> static redirect

c) multi-src and triedrc=resel and max-count-reached -> error
d) multi-src and triedrc=reseg and max-count-reached -> static redirect

e) triedrc=server-error                       -> local reselection
f) triedrc=server-error and max-count-reached -> static redirect

I think in cases a) and c) the code actually does static redirect (if configured).
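
To make the expectation in a) through f) explicit, here is a rough sketch as a
Python decision function. It models what I expected, not what XrdCmsNode.cc
does; the function name and return strings are hypothetical, "server-error" is
just the label used in the list above, and the last branch (multi-src with the
count not yet reached) is not in the list and is my assumption:

```python
def expected_action(triedrc, multi_src, max_count_reached):
    """Return the expected outcome for cases a) through f)."""
    if triedrc == "server-error":
        # f) retries exhausted -> static redirect; e) otherwise reselect locally
        return "static redirect" if max_count_reached else "local reselection"

    if triedrc not in ("resel", "reseg"):
        raise ValueError(f"unexpected triedrc value: {triedrc!r}")

    # No further source may be handed out: either multi-source is disabled
    # or the configured maximum count has been reached.
    if not multi_src or max_count_reached:
        # a), c) resel -> error; b), d) reseg -> static redirect
        return "error" if triedrc == "resel" else "static redirect"

    # Not listed above (assumption): an additional source is still allowed.
    return "local reselection" if triedrc == "resel" else "global reselection"


# Cases a) and c), where I think the code does a static redirect instead:
print(expected_action("resel", multi_src=False, max_count_reached=False))
print(expected_action("resel", multi_src=True, max_count_reached=True))
```

Every "static redirect" outcome here presupposes that a static redirect is
configured on the server.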

I'm bringing all this up as it's not entirely clear to me how we are going to 
fix the FNAL problem with what we have in the code now.

Matevz


>> On Jan 7, 2020, at 11:29 PM, Justas Balcas <[log in to unmask] 
>> <mailto:[log in to unmask]>> wrote:
>>
>> Hi,
>>
>> As I wrote on slack, we could add this to site-local conf storage.xml without 
>> code change.
>>
>> On Tue, 7 Jan 2020 at 18:13, Bockelman, Brian <[log in to unmask] 
>> <mailto:[log in to unmask]>> wrote:
>>
>>     Hi Matevz,
>>
>>     It's been a while since I read this thread.  Upon re-reading, is it
>>     necessary to really expose this as an option?  If CMSSW correctly manages
>>     when to add 'reseg' (and when _not_ to add this), could it ever reasonably
>>     be harmful?
>>
>>     Brian
>>
>>     > On Jan 7, 2020, at 3:37 PM, Matevz Tadel <[log in to unmask]
>>     <mailto:[log in to unmask]>> wrote:
>>     >
>>     > Hi Andy, Brian,
>>     >
>>     > I'm trying to create a ticket for CMSSW XrdAdaptor to use triedrc=.
>>     >
>>     > For xcache redirector, it is clear one should use triedrc=resel (local
>>     reselection).
>>     >
>>     > With XrdAdaptor this will become the default for other multi-source
>>     requests ... and will thus solve FNAL issue, if the jobs talk to the local
>>     FNAL redirector.
>>     >
>>     > 1. Brian: how should we introduce the option for triedrc=reseg (global
>>     reselection) into XrdAdaptor? Should this come from outside in some
>>     fashion, like an env var?
>>     >
>>     > 2. Andy: I was tracing through the code to remember how this is all done
>>     and noticed that kYR_tryRSEG never gets used in the code (other than for
>>     definition and for setting it when reseg is given). Is there something
>>     missing here or I just don't get the whole global reselection thing (again)?
>>     >
>>     > Matevz
>>     >
>>     > On 2019-05-09 15:36, Andrew Hanushevsky wrote:
>>     >> Hi Brian,
>>     >> OK, you got your wish along with a way to bypass it (the default being
>>     "keep it local"). Now, are you going to be at the XRootD Workshop? If not,
>>     can someone be there to give a presentation on the latest developments on
>>     SciTokens et al?
>>     >> Andy
>>     >> -----Original Message----- From: Bockelman, Brian
>>     >> Sent: Monday, May 06, 2019 1:39 AM
>>     >> To: Andrew Hanushevsky
>>     >> Cc: Matevz Tadel ; Michal Kamil Simon ; xrootd-dev
>>     >> Subject: Re: Proposal for new opaque URL parameter using= complementing
>>     tried=
>>     >> Hi Andy,
>>     >> In this case - there's good reason to not send clients offsite (even if
>>     the offsite server is providing better performance, the WAN costs more...)
>>     when there's a perfectly good copy onsite.  We can be sure to drop the
>>     "resel" from the "triedrc" when the job is looking for an additional
>>     source because of an error instead of wanting faster sources.
>>     >> I think it would be useful to keep the "file not found" code to only
>>     trigger when the file is actually not found.
>>     >> Brian
>>     >>> On May 6, 2019, at 7:35 AM, Andrew Hanushevsky <[log in to unmask]
>>     <mailto:[log in to unmask]>> wrote:
>>     >>>
>>     >>> Well, isn't the point of reselection to find the best possible site
>>     which could be offsite? We could give you an option to keep it local but
>>     you would need to add that to the config file.
>>     >>>
>>     >>> On Sun, 5 May 2019, Bockelman, Brian wrote:
>>     >>>
>>     >>>> Hi Matevz,
>>     >>>>
>>     >>>> The other thing that should go into the cmsd is to avoid doing a
>>     "redirect on file not found" for reselection.
>>     >>>>
>>     >>>> This would help immensely in cases like FNAL which uses this for all
>>     jobs, causing the multi source CMSSW to pull data that is onsite, from
>>     offsite, due to reselection.
>>     >>>>
>>     >>>> (After telling them for 5 years to change, I guess we can tweak the
>>     software ;) )
>>     >>>>
>>     >>>> Brian
>>     >>>>
>>     >>>> Sent from my iPhone
>>     >>>>
>>     >>>>> On May 3, 2019, at 6:57 PM, Matevz Tadel <[log in to unmask]
>>     <mailto:[log in to unmask]>> wrote:
>>     >>>>>
>>     >>>>> Hi,
>>     >>>>>
>>     >>>>> Andy realized that an option for this already exists -- triedrc=resel
>>     >>>>>
>>     >>>>> Andy implemented a change in cmsd that allows disabling opening of a
>>     new file on reselection, goes under cmsd.sched nomultisrc.
>>     >>>>>
>>     >>>>> Brian, now we have to propagate this into XrdAdaptor.
>>     >>>>>
>>     >>>>> Matevz
>>     >>>>>
>>     >>>>>> On 4/18/19 10:45 AM, Andrew Hanushevsky wrote:
>>     >>>>>> After some discussion with Matevz, we decided to simplify this. So,
>>     it won't be exactly what was outlined but will be functionally the same.
>>     This requires some development in the cmsd. That said, an issue should be cut.
>>     >>>>>> Andy
>>     >>>>>>> On Thu, 18 Apr 2019, Michal Kamil Simon wrote:
>>     >>>>>>> Hi,
>>     >>>>>>>
>>     >>>>>>> It sounds reasonable to me :-)
>>     >>>>>>>
>>     >>>>>>> Matevz: could you create an issue in github so we don't lose
>>     >>>>>>> track of this topic? ;-)
>>     >>>>>>>
>>     >>>>>>> Cheers,
>>     >>>>>>> Michal
>>     >>>>>>> ________________________________________
>>     >>>>>>> From: [log in to unmask]
>>     <mailto:[log in to unmask]> [[log in to unmask]
>>     <mailto:[log in to unmask]>] on behalf of Bockelman, Brian
>>     [[log in to unmask] <mailto:[log in to unmask]>]
>>     >>>>>>> Sent: 17 April 2019 03:59
>>     >>>>>>> To: [log in to unmask] <mailto:[log in to unmask]>
>>     >>>>>>> Cc: xrootd-dev
>>     >>>>>>> Subject: Re: Proposal for new opaque URL parameter using=
>>     complementing tried=
>>     >>>>>>>
>>     >>>>>>> Yes!  We definitely could benefit from this on the CMS side!
>>     >>>>>>>
>>     >>>>>>> Sent from my iPhone
>>     >>>>>>>
>>     >>>>>>>> On Apr 15, 2019, at 5:21 PM, Matevz Tadel <[log in to unmask]
>>     <mailto:[log in to unmask]>> wrote:
>>     >>>>>>>>
>>     >>>>>>>> Hi,
>>     >>>>>>>>
>>     >>>>>>>> [This is mostly for Andy, Brian, and Michal.]
>>     >>>>>>>>
>>     >>>>>>>> In the context of XCache cluster used by CMSSW multi-source jobs
>>     there is an issue with cmssw jobs requesting opening of a second source on
>>     the cache cluster using the tried= opaque parameter to point to cache
>>     server already in use. This leads to creation of another replica of the
>>     same file in the cache cluster.
>>     >>>>>>>>
>>     >>>>>>>> The cache still needs to honor tried= in case there is a problem
>>     with the existing server. However, asking for a new "extra" server in the
>>     context of cache does not make much sense.
>>     >>>>>>>>
>>     >>>>>>>> To distinguish these two conditions I propose to introduce a new
>>     opaque directive, "using=", used to signal to the redirector that the
>>     client is already using the listed servers.
>>     >>>>>>>>
>>     >>>>>>>> On cmsd side this would be accompanied with a cms.dfs multisource
>>     count ("sister" option to cms.dfs retries). These two would then control
>>     how many errors and parallel accesses are allowed for a client session.
>>     >>>>>>>>
>>     >>>>>>>> Does this make sense?
>>     >>>>>>>>
>>     >>>>>>>> Matevz
>>     >>>>>>>>
>>     >>>>>>>>
>>     ########################################################################
>>     >>>>>>>> Use REPLY-ALL to reply to list
>>     >>>>>>>>
>>     >>>>>>>> To unsubscribe from the XROOTD-DEV list, click the following link:
>>     >>>>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1
>>     >>>>>>>
>>     >>>>>
>>     >>>>
>>     >
>>
>>
>>
>> -- 
>> ---------------------------------------
>> Justas Balcas
>> Caltech CMS Group
>> CIT Downs-Lauritsen 239
>> CERN B32/3-A09 (72531)
>>
> 
