Hi Olivier,

This is to be expected when you join a federation and all of a sudden have 
to deal with tens of thousands of new requests. There are several 
mitigations, each with its own set of drawbacks. I would suggest you 
consult with CMS experts on some of these.

a) You can artificially restrict the number of threads in the cmsd. The 
drawback here is that there is no distinction between local requests and 
federated requests, so you may wind up crowding out local requests. Use 
the xrd.sched directive for this:
https://xrootd.slac.stanford.edu/doc/dev55/xrd_config.htm#_Toc88513978
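
As a rough sketch, such a cap might look like the following in the cmsd 
configuration (the thread counts are illustrative assumptions, not tuned 
recommendations; pick values that match your hardware):

   # Keep at least 8 scheduler threads, never grow past 512, and retire
   # threads idle for more than 780 seconds (all values are placeholders).
   xrd.sched mint 8 maxt 512 idle 780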

b) You can specify the percentage of federated requests you are willing 
to handle. This is not as precise as it seems, as the global redirector 
might have no choice but to violate your limit if your site is the only 
source of a file or when every other site has reached its global share. 
Use the cms.sched directive for this:
https://xrootd.slac.stanford.edu/doc/dev55/xrd_config.htm#_Toc88513978
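
A minimal sketch of such a share limit (the gshr option and the 
percentage are as I recall them; double-check against the docs):

   # Let global (federation) requests consume at most 25% of this
   # cluster's capacity; 25 is an illustrative placeholder.
   cms.sched gshr 25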

c) Scale up to meet the demand. Nothing stops you from running a cluster 
of redirectors for the federation. By using the all option of the 
all.manager directive, the load is split equally amongst all of the 
available sub-redirectors. The drawback is that you need more hardware, 
but given the load you are experiencing, that is the only solution that 
does not constrain local resources. See the all.manager directive:
https://xrootd.slac.stanford.edu/doc/dev54/cms_config.htm#_Toc53611061
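
A sketch of what the subscription side might look like, with hypothetical 
redirector hostnames and the default cmsd port:

   # Subscribe to all listed redirectors so load is spread across them,
   # rather than treating them as failover (any) alternatives.
   # The hostnames below are hypothetical placeholders.
   all.manager all rdr1.example.org:1213
   all.manager all rdr2.example.org:1213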

The issue here is that your current setup is insufficient for the number 
of requests the federation is trying to handle. I would suggest talking 
to the CMS federation managers to see whether you can employ one of the 
above options or whether they have other alternatives.

Andy


On Thu, 19 May 2022, Olivier Devroede wrote:

> Dear xrootd experts,
>
> We upgraded to version 5.4.2 of xrootd two days ago.
>
> It worked flawlessly for 1.5 days, but now cmsd spawns thousands of
> threads. This causes huge load averages (up to 30,000) on the machine.
> Restarting the daemon does not solve the problem.
>
> Extra info: we are part of the xrootd federation of the cms experiment.
>
> Do you have any idea how we can fix/debug this problem? The logs do not
> tell us a lot. It is mostly requests for files in the cmsd logs [1] and
> nothing special in the xrootd logs [2].
>
> Any help is greatly appreciated.
>
> Olivier.
>
> [1] cmsd.log
>
> 220519 15:42:40 22203 cms_Dispatch: manager.0:[log in to unmask] for state dlen=156
> 220519 15:42:40 6266 manager.0:[log in to unmask] cms_do_State: /store/user/nshadski/customNano/QCD_Pt_470to600_TuneCP5_13TeV_pythia8/KIT_CustomNanoV9_MC_2016postVFP/211229_123611/0000/MC_2016postVFP_NanoAODv9_1-32.root
> 220519 15:42:40 22265 cms_Dispatch: manager.0:[log in to unmask] for state dlen=156
> 220519 15:42:40 6267 manager.0:[log in to unmask] cms_do_State: /store/user/nshadski/customNano/QCD_Pt_470to600_TuneCP5_13TeV_pythia8/KIT_CustomNanoV9_MC_2016postVFP/211229_123611/0000/MC_2016postVFP_NanoAODv9_1-32.root
> 220519 15:42:40 22265 cms_Dispatch: manager.0:[log in to unmask] for state dlen=105
> 220519 15:42:40 6268 manager.0:[log in to unmask] cms_do_State: /store/data/Run2018D/EGamma/MINIAOD/12Nov2019_UL2018-v4/260000/E8C2D279-4422-E743-904A-4233F0BF230E.root
> 220519 15:42:40 22266 cms_Dispatch: manager.0:[log in to unmask] for state dlen=156
> 220519 15:42:40 6269 manager.0:[log in to unmask] cms_do_State: /store/user/nshadski/customNano/QCD_Pt_470to600_TuneCP5_13TeV_pythia8/KIT_CustomNanoV9_MC_2016postVFP/211229_123611/0000/MC_2016postVFP_NanoAODv9_1-32.root
>
> [2] xrootd.log
> 220519 15:42:30 22178 sysThrottleManager: Current IO counter is 0; total IO
> wait time is 0ms.
> 220519 15:42:31 22178 sysThrottleManager: Round ops allocation -1
>
