Hi Olivier,

It sure is a coincidence. The cmsd code base has not materially changed
since 5.4.0. What release did you upgrade from?

Andy

On Fri, 20 May 2022, Olivier Devroede wrote:

> Hi Andrew,
>
> Thanks a lot for your answer.
>
> I finally had to go far away from the default and have set this:
> xrd.sched mint 8 maxt 102 avlt 25 idle 39
>
> And now it is stable, and the ETF tests run. I'll experiment with
> cms.sched later, but I want to have some stability for the time being,
> and to see if everyone gets served.
>
> That being said, we are not new to the federation. We have been part
> of it for many years and only occasionally had the cmsd go crazy like
> this. In the past, waiting for a few hours always solved the problem,
> but not this time. It might be a coincidence, but this happened 1.5
> days after upgrading to xrootd 5.4.2.
>
> Kind regards,
>
> Olivier.
>
> On Fri, 20 May 2022 at 01:55, Andrew Hanushevsky <[log in to unmask]>
> wrote:
>
>> Hi Olivier,
>>
>> This is to be expected when you join a federation and all of a sudden
>> have to deal with tens of thousands of new requests. There are several
>> mitigations, each with its own set of drawbacks. I would suggest you
>> consult with CMS experts on some of these.
>>
>> a) You can artificially restrict the number of threads in the cmsd.
>> The drawback here is that there is no distinction between local
>> requests and federated requests, so you may wind up crowding out
>> local requests. Use the xrd.sched directive for this:
>> https://xrootd.slac.stanford.edu/doc/dev55/xrd_config.htm#_Toc88513978
>>
>> b) You can specify the percentage of federated requests you are
>> willing to handle. This is not as precise as it seems, as the global
>> redirector might have no choice but to violate your request if your
>> site is the only source of a file or when every other site has
>> reached its global share.
>> Use the cms.sched directive for this:
>> https://xrootd.slac.stanford.edu/doc/dev55/xrd_config.htm#_Toc88513978
>>
>> c) Scale up to meet the demand. Nothing stops you from running a
>> cluster of redirectors for the federation. By using the all.manager
>> all option, the load is split equally among all of the available
>> sub-redirectors. The drawback is that you need more hardware to do
>> this, but given the load you are experiencing, that is the only
>> solution available without constraining the local resource. See the
>> all.manager directive:
>> https://xrootd.slac.stanford.edu/doc/dev54/cms_config.htm#_Toc53611061
>>
>> The issue here is that your current setup is insufficient for the
>> number of requests the federation is trying to handle. I would
>> suggest talking to the CMS federation managers to see if you can
>> employ one of the above options or whether they have other
>> alternatives.
>>
>> Andy
>>
>>
>> On Thu, 19 May 2022, Olivier Devroede wrote:
>>
>>> Dear xrootd experts,
>>>
>>> we upgraded to version 5.4.2 of xrootd two days ago.
>>>
>>> It worked flawlessly for 1.5 days, but now cmsd spawns thousands of
>>> threads. This causes huge loads (up to 30,000) on the machine.
>>> Restarting the daemon does not solve the problem.
>>>
>>> Extra info: we are part of the xrootd federation of the CMS
>>> experiment.
>>>
>>> Do you have any idea how we can fix/debug this problem? The logs do
>>> not tell us a lot: it's mostly requests for files in the cmsd logs
>>> [1] and nothing special in the xrootd logs [2].
>>>
>>> Any help is greatly appreciated.
>>>
>>> Olivier.
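[Editor's note: the directives Andy mentions above could be sketched in a
cmsd configuration roughly as follows. The xrd.sched values are the ones
Olivier reports in this thread; the all.manager hostnames and port are
hypothetical placeholders, and the exact option names for the federated
share belong to cms.sched, whose documentation should be consulted. This
is an illustrative sketch, not a configuration taken from this thread.]

```
# (a) Cap cmsd worker threads (values Olivier settled on in this thread):
#     mint/maxt bound the thread pool, avlt is the available-thread
#     threshold, idle is how long an idle thread lingers before exiting.
xrd.sched mint 8 maxt 102 avlt 25 idle 39

# (b) To limit the share of federated (meta-manager) requests, see the
#     cms.sched directive in the cms configuration reference; option
#     names are not reproduced here.

# (c) Run a cluster of federation redirectors and list them with "all"
#     so the load is split across them (hostnames/port hypothetical):
all.manager all fed-redirector1.example.org:1213
all.manager all fed-redirector2.example.org:1213
```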
>>>
>>> [1] cmsd.log
>>>
>>> 220519 15:42:40 22203 cms_Dispatch: manager.0:[log in to unmask] for state dlen=156
>>> 220519 15:42:40 6266 manager.0:[log in to unmask] cms_do_State: /store/user/nshadski/customNano/QCD_Pt_470to600_TuneCP5_13TeV_pythia8/KIT_CustomNanoV9_MC_2016postVFP/211229_123611/0000/MC_2016postVFP_NanoAODv9_1-32.root
>>> 220519 15:42:40 22265 cms_Dispatch: manager.0:[log in to unmask] for state dlen=156
>>> 220519 15:42:40 6267 manager.0:[log in to unmask] cms_do_State: /store/user/nshadski/customNano/QCD_Pt_470to600_TuneCP5_13TeV_pythia8/KIT_CustomNanoV9_MC_2016postVFP/211229_123611/0000/MC_2016postVFP_NanoAODv9_1-32.root
>>> 220519 15:42:40 22265 cms_Dispatch: manager.0:[log in to unmask] for state dlen=105
>>> 220519 15:42:40 6268 manager.0:[log in to unmask] cms_do_State: /store/data/Run2018D/EGamma/MINIAOD/12Nov2019_UL2018-v4/260000/E8C2D279-4422-E743-904A-4233F0BF230E.root
>>> 220519 15:42:40 22266 cms_Dispatch: manager.0:[log in to unmask] for state dlen=156
>>> 220519 15:42:40 6269 manager.0:[log in to unmask] cms_do_State: /store/user/nshadski/customNano/QCD_Pt_470to600_TuneCP5_13TeV_pythia8/KIT_CustomNanoV9_MC_2016postVFP/211229_123611/0000/MC_2016postVFP_NanoAODv9_1-32.root
>>>
>>> [2] xrootd.log
>>> 220519 15:42:30 22178 sysThrottleManager: Current IO counter is 0; total IO wait time is 0ms.
>>> 220519 15:42:31 22178 sysThrottleManager: Round ops allocation -1

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1