Hi Andrew,

Thanks a lot for your answer.

I finally had to move far away from the defaults and have set this:
   xrd.sched mint 8 maxt 102 avlt 25 idle 39

Now it is stable and the ETF tests run. I'll experiment with cms.sched
later, but for the time being I want some stability, and to see whether
everyone gets served.

That being said, we are not new to the federation. We have been part of it
for many years and only occasionally had the cmsd go crazy like this. And
in the past, waiting for a few hours always solved the problem. But not
this time.
It might be a coincidence, but this happened 1.5 days after upgrading to
xrootd 5.4.2.


Kind regards,

Olivier.

On Fri, 20 May 2022 at 01:55, Andrew Hanushevsky <[log in to unmask]>
wrote:

> Hi Olivier,
>
> This is to be expected when you join a federation and all of a sudden have
> to deal with tens of thousands of new requests. There are several
> mitigations, each with its own set of drawbacks. I would suggest you
> consult with cms experts on some of these.
>
> a) You can artificially restrict the number of threads in the cmsd. The
> drawback here is that there is no distinction between local requests and
> federated requests, so you may wind up crowding out local requests. Use
> the xrd.sched directive for this:
> https://xrootd.slac.stanford.edu/doc/dev55/xrd_config.htm#_Toc88513978
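>
> For example, a sketch only (the values below are illustrative, not
> recommendations; tune them to your own thread budget):
>
>    xrd.sched mint 8 maxt 128 avlt 16 idle 60
>
> Here maxt caps the total number of threads, mint and avlt control how
> many are created up front and kept available, and idle sets how long a
> surplus thread may sit unused before it is retired.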
>
> b) You can specify the percentage of federated requests you are willing to
> handle. This is not as precise as it seems, as the global redirector might
> not have any choice but to violate your request if your site is the
> only source of a file or when every other site has reached its global
> share. Use the cms.sched directive for this:
> https://xrootd.slac.stanford.edu/doc/dev55/xrd_config.htm#_Toc88513978
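>
> A hypothetical sketch, assuming the gshr option of cms.sched is what sets
> the global (federated) share:
>
>    cms.sched gshr 30
>
> which would advertise that roughly 30% of the site's capacity may be used
> for federated requests, subject to the caveats above.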
>
> c) Scale up to meet the demand. Nothing stops you from running a cluster
> of redirectors for the federation. By using the all.manager all option
> the load is equally split amongst all of the available sub-redirectors.
> The drawback is that you need more hardware to do this, but given the load
> you are experiencing that's the only solution available without
> constraining the local resource. See the all.manager directive:
> https://xrootd.slac.stanford.edu/doc/dev54/cms_config.htm#_Toc53611061
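>
> A sketch of such a setup (the hostname is a placeholder): the data servers
> and sub-redirectors all reference the same manager specification, e.g.
>
>    all.manager all rdr.example.org+:1213
>
> where the trailing + expands the hostname to every address registered for
> it, so the load is spread across all of the sub-redirector instances.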
>
> The issue here is that your current setup is insufficient for the number of
> requests the federation is trying to handle. I would suggest talking to
> the cms federation managers to see whether you can employ one of the above
> options or whether they have other alternatives.
>
> Andy
>
>
> On Thu, 19 May 2022, Olivier Devroede wrote:
>
> > Dear xrootd experts,
> >
> > we upgraded to version 5.4.2 of xrootd two days ago.
> >
> > It worked flawlessly for 1.5 days, but now cmsd spawns thousands of
> > threads. This causes huge loads (up to 30,000) on the machine.
> > Restarting the daemon does not solve the problem.
> >
> > Extra info: we are part of the xrootd federation of the cms experiment.
> >
> > Do you have any idea how we can fix/debug this problem? The logs do not
> > tell us a lot. It's mostly requests for files in the cmsd logs [1] and
> > nothing special in the xrootd logs [2]
> >
> > Any help is greatly appreciated.
> >
> > Olivier.
> >
> > [1] cmsd.log
> >
> > 220519 15:42:40 22203 cms_Dispatch: manager.0:[log in to unmask] for state dlen=156
> > 220519 15:42:40 6266 manager.0:[log in to unmask] cms_do_State: /store/user/nshadski/customNano/QCD_Pt_470to600_TuneCP5_13TeV_pythia8/KIT_CustomNanoV9_MC_2016postVFP/211229_123611/0000/MC_2016postVFP_NanoAODv9_1-32.root
> > 220519 15:42:40 22265 cms_Dispatch: manager.0:[log in to unmask] for state dlen=156
> > 220519 15:42:40 6267 manager.0:[log in to unmask] cms_do_State: /store/user/nshadski/customNano/QCD_Pt_470to600_TuneCP5_13TeV_pythia8/KIT_CustomNanoV9_MC_2016postVFP/211229_123611/0000/MC_2016postVFP_NanoAODv9_1-32.root
> > 220519 15:42:40 22265 cms_Dispatch: manager.0:[log in to unmask] for state dlen=105
> > 220519 15:42:40 6268 manager.0:[log in to unmask] cms_do_State: /store/data/Run2018D/EGamma/MINIAOD/12Nov2019_UL2018-v4/260000/E8C2D279-4422-E743-904A-4233F0BF230E.root
> > 220519 15:42:40 22266 cms_Dispatch: manager.0:[log in to unmask] for state dlen=156
> > 220519 15:42:40 6269 manager.0:[log in to unmask] cms_do_State: /store/user/nshadski/customNano/QCD_Pt_470to600_TuneCP5_13TeV_pythia8/KIT_CustomNanoV9_MC_2016postVFP/211229_123611/0000/MC_2016postVFP_NanoAODv9_1-32.root
> >
> > [2] xrootd.log
> > 220519 15:42:30 22178 sysThrottleManager: Current IO counter is 0; total IO wait time is 0ms.
> > 220519 15:42:31 22178 sysThrottleManager: Round ops allocation -1
> >
> > ########################################################################
> > Use REPLY-ALL to reply to list
> >
> > To unsubscribe from the XROOTD-L list, click the following link:
> > https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
> >
>
