Hi Andrew,

Thanks a lot for your answer.

I finally had to move quite far from the defaults and have set this:
   xrd.sched mint 8 maxt 102 avlt 25 idle 39

And now it is stable and the ETF tests run. I'll experiment with cms.sched later, but for the time being I want to have some stability and see whether everyone gets served.

That being said, we are not new to the federation. We have been part of it for many years and only occasionally had the cmsd go crazy like this. And in the past, waiting for a few hours always solved the problem. But not this time.
It might be a coincidence, but this happened 1.5 days after upgrading to xrootd 5.4.2.


Kind regards,

Olivier.

On Fri, 20 May 2022 at 01:55, Andrew Hanushevsky <[log in to unmask]> wrote:
Hi Olivier,

This is to be expected when you join a federation and all of a sudden have
to deal with tens of thousands of new requests. There are several
mitigations, each with its own set of drawbacks. I would suggest you
consult with cms experts on some of these.

a) You can artificially restrict the number of threads in the cmsd. The
drawback here is that there is no distinction between local requests and
federated requests, so you may wind up crowding out local requests. Use
the xrd.sched directive for this:
https://xrootd.slac.stanford.edu/doc/dev55/xrd_config.htm#_Toc88513978
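
For illustration, the kind of line involved would look something like the
following (the values are placeholders to show the syntax, not tuned
recommendations):

   xrd.sched mint 8 maxt 128 avlt 16 idle 300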

b) You can specify the percentage of federated requests you are willing to
handle. This is not as precise as it seems, as the global redirector might
have no choice but to violate your request if your site is the only source
of a file or when every other site has reached its global share. Use the
cms.sched directive for this:
https://xrootd.slac.stanford.edu/doc/dev55/xrd_config.htm#_Toc88513978
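
Purely as an illustration, and assuming the gshr (global share) option of
cms.sched is the right knob for this (please double check against the
reference), it would look something like this, with 50 just a placeholder
percentage:

   cms.sched gshr 50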

c) Scale up to meet the demand. Nothing stops you from running a cluster
of redirectors for the federation. By using the all.manager all option,
the load is split equally amongst all of the available sub-redirectors.
The drawback is that you need more hardware to do this, but given the load
you are experiencing, that's the only solution available without
constraining the local resource. See the all.manager directive:
https://xrootd.slac.stanford.edu/doc/dev54/cms_config.htm#_Toc53611061
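
As a sketch, assuming two sub-redirector hosts (the host names below are
made up and the port is just the usual default), the servers would list
both of them with the all option so that subscriptions, and hence the
load, get spread across them:

   all.manager all fed-rdr1.example.org:1213
   all.manager all fed-rdr2.example.org:1213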

The issue here is that your current setup is insufficient for the number of
requests the federation is trying to handle. I would suggest talking to
the cms federation managers to see if you can employ one of the above
options or whether they have other alternatives.

Andy


On Thu, 19 May 2022, Olivier Devroede wrote:

> Dear xrootd experts,
>
> we upgraded to version 5.4.2 of xrootd two days ago.
>
> It worked flawlessly for 1.5 days, but now cmsd spawns thousands of
> threads. This causes huge loads (up to 30,000) on the machine.
> Restarting the daemon does not solve the problem.
>
> Extra info: we are part of the xrootd federation of the cms experiment.
>
> Do you have any idea how we can fix/debug this problem? The logs do not
> tell us a lot. It's mostly requests for files in the cmsd logs [1] and
> nothing special in the xrootd logs [2]
>
> Any help is greatly appreciated.
>
> Olivier.
>
> [1] cmsd.log
>
> 220519 15:42:40 22203 cms_Dispatch: manager.0:[log in to unmask] for state dlen=156
> 220519 15:42:40 6266 manager.0:[log in to unmask] cms_do_State: /store/user/nshadski/customNano/QCD_Pt_470to600_TuneCP5_13TeV_pythia8/KIT_CustomNanoV9_MC_2016postVFP/211229_123611/0000/MC_2016postVFP_NanoAODv9_1-32.root
> 220519 15:42:40 22265 cms_Dispatch: manager.0:[log in to unmask] for state dlen=156
> 220519 15:42:40 6267 manager.0:[log in to unmask] cms_do_State: /store/user/nshadski/customNano/QCD_Pt_470to600_TuneCP5_13TeV_pythia8/KIT_CustomNanoV9_MC_2016postVFP/211229_123611/0000/MC_2016postVFP_NanoAODv9_1-32.root
> 220519 15:42:40 22265 cms_Dispatch: manager.0:[log in to unmask] for state dlen=105
> 220519 15:42:40 6268 manager.0:[log in to unmask] cms_do_State: /store/data/Run2018D/EGamma/MINIAOD/12Nov2019_UL2018-v4/260000/E8C2D279-4422-E743-904A-4233F0BF230E.root
> 220519 15:42:40 22266 cms_Dispatch: manager.0:[log in to unmask] for state dlen=156
> 220519 15:42:40 6269 manager.0:[log in to unmask] cms_do_State: /store/user/nshadski/customNano/QCD_Pt_470to600_TuneCP5_13TeV_pythia8/KIT_CustomNanoV9_MC_2016postVFP/211229_123611/0000/MC_2016postVFP_NanoAODv9_1-32.root
>
> [2] xrootd.log
> 220519 15:42:30 22178 sysThrottleManager: Current IO counter is 0; total IO wait time is 0ms.
> 220519 15:42:31 22178 sysThrottleManager: Round ops allocation -1
>

