Hi Olivier,

It sure is a coincidence. The cmsd code base has not materially changed
since 5.4.0. What release did you upgrade from?

Andy


On Fri, 20 May 2022, Olivier Devroede wrote:

> Hi Andrew,
>
> Thanks a lot for your answer.
>
> I finally had to move far away from the defaults and have set this:
>   xrd.sched mint 8 maxt 102 avlt 25 idle 39
>
> And now it is stable and the ETF tests run. I'll experiment with
> cms.sched later, but I want to have some stability for the time being and
> see whether everyone gets served.
>
> That being said, we are not new to the federation. We have been part of it
> for many years and only occasionally had the cmsd go crazy like this. And
> in the past, waiting for a few hours always solved the problem. But not
> this time.
> It might be a coincidence, but this happened 1.5 days after upgrading to
> xrootd 5.4.2.
>
>
> Kind regards,
>
> Olivier.
>
> On Fri, 20 May 2022 at 01:55, Andrew Hanushevsky <[log in to unmask]>
> wrote:
>
>> Hi Olivier,
>>
>> This is to be expected when you join a federation and all of a sudden have
>> to deal with tens of thousands of new requests. There are several
>> mitigations, each with its own set of drawbacks. I would suggest you
>> consult with the CMS experts on some of these.
>>
>> a) You can artificially restrict the number of threads in the cmsd. The
>> drawback here is that there is no distinction between local requests and
>> federated requests, so you may wind up crowding out local requests. Use
>> the xrd.sched directive for this:
>> https://xrootd.slac.stanford.edu/doc/dev55/xrd_config.htm#_Toc88513978
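>>
>> As an illustrative sketch only (the numbers below are placeholders, not a
>> recommendation; size them for your own hardware), such a cap is a single
>> configuration line, for example:
>>
>>   xrd.sched mint 8 maxt 128 avlt 32 idle 60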
>>
>> b) You can specify the percentage of federated requests you are willing to
>> handle. This is not as precise as it seems, as the global redirector might
>> not have any choice but to violate your request if your site is the only
>> source of a file or when every other site has reached its global share.
>> Use the cms.sched directive for this:
>> https://xrootd.slac.stanford.edu/doc/dev55/xrd_config.htm#_Toc88513978
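>>
>> A hypothetical sketch (the gshr option name and the value are assumptions
>> that should be verified against the cms.sched documentation linked above),
>> limiting the share of federated work to roughly 20 percent:
>>
>>   cms.sched gshr 20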
>>
>> c) Scale up to meet the demand. Nothing stops you from running a cluster
>> of redirectors for the federation. By using the all.manager all option,
>> the load is split equally amongst all of the available sub-redirectors.
>> The drawback is that you need more hardware, but given the load you are
>> experiencing, that is the only solution available without constraining
>> the local resources. See the all.manager directive:
>> https://xrootd.slac.stanford.edu/doc/dev54/cms_config.htm#_Toc53611061
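>>
>> As a sketch only (the host names below are placeholders), each subscribing
>> server's configuration would then name every redirector in the cluster,
>> for example:
>>
>>   all.manager all redir1.example.org:1213
>>   all.manager all redir2.example.org:1213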
>>
>> The issue here is that your current setup is insufficient for the number
>> of requests the federation is trying to handle. I would suggest talking to
>> the CMS federation managers to see whether you can employ one of the above
>> options or whether they have other alternatives.
>>
>> Andy
>>
>>
>> On Thu, 19 May 2022, Olivier Devroede wrote:
>>
>>> Dear xrootd experts,
>>>
>>> we upgraded to version 5.4.2 of xrootd two days ago.
>>>
>>> It worked flawlessly for 1.5 days, but now cmsd spawns thousands of
>>> threads. This causes a huge load (up to 30,000) on the machine.
>>> Restarting the daemon does not solve the problem.
>>>
>>> Extra info: we are part of the xrootd federation of the cms experiment.
>>>
>>> Do you have any idea how we can fix/debug this problem? The logs do not
>>> tell us a lot: it is mostly requests for files in the cmsd log [1] and
>>> nothing special in the xrootd log [2].
>>>
>>> Any help is greatly appreciated.
>>>
>>> Olivier.
>>>
>>> [1] cmsd.log
>>>
>>> 220519 15:42:40 22203 cms_Dispatch: manager.0:[log in to unmask] for state dlen=156
>>> 220519 15:42:40 6266 manager.0:[log in to unmask] cms_do_State: /store/user/nshadski/customNano/QCD_Pt_470to600_TuneCP5_13TeV_pythia8/KIT_CustomNanoV9_MC_2016postVFP/211229_123611/0000/MC_2016postVFP_NanoAODv9_1-32.root
>>> 220519 15:42:40 22265 cms_Dispatch: manager.0:[log in to unmask] for state dlen=156
>>> 220519 15:42:40 6267 manager.0:[log in to unmask] cms_do_State: /store/user/nshadski/customNano/QCD_Pt_470to600_TuneCP5_13TeV_pythia8/KIT_CustomNanoV9_MC_2016postVFP/211229_123611/0000/MC_2016postVFP_NanoAODv9_1-32.root
>>> 220519 15:42:40 22265 cms_Dispatch: manager.0:[log in to unmask] for state dlen=105
>>> 220519 15:42:40 6268 manager.0:[log in to unmask] cms_do_State: /store/data/Run2018D/EGamma/MINIAOD/12Nov2019_UL2018-v4/260000/E8C2D279-4422-E743-904A-4233F0BF230E.root
>>> 220519 15:42:40 22266 cms_Dispatch: manager.0:[log in to unmask] for state dlen=156
>>> 220519 15:42:40 6269 manager.0:[log in to unmask] cms_do_State: /store/user/nshadski/customNano/QCD_Pt_470to600_TuneCP5_13TeV_pythia8/KIT_CustomNanoV9_MC_2016postVFP/211229_123611/0000/MC_2016postVFP_NanoAODv9_1-32.root
>>>
>>> [2] xrootd.log
>>> 220519 15:42:30 22178 sysThrottleManager: Current IO counter is 0; total IO wait time is 0ms.
>>> 220519 15:42:31 22178 sysThrottleManager: Round ops allocation -1
>>>
>>
>

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1