LISTSERV 16.5 - XROOTD-L Archives

Are there other messages around the delay message? There should be. Are 
there any corresponding messages in the xrootd log? Usually, there will be 
messages telling you about the events that cause this kind of delay. I 
suspect that the number of available data servers has fallen below the 
continue threshold (depending on your config, this is either a fixed number 
or 20% of the number of current servers). When the cmsd sees a massive 
die-off, it goes into a holding pattern to see whether servers start coming 
back or there is a systemic problem.

Andy
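
A rough sketch of the threshold check described above, in Python. This is
not the cmsd code: the function names, the strict "less than" comparison
and the absence of rounding are all assumptions made for illustration.
Only the two forms of the threshold (a fixed number, or 20% of the current
servers) come from the explanation above.

```python
# Illustrative only: hypothetical helpers, not part of cmsd.

def continue_threshold(current_servers, fixed_count=None, percent=20):
    """Return the minimum number of servers needed to keep scheduling.

    fixed_count: a configured fixed number, if any (assumed to take
                 precedence over the percentage when it is set).
    percent:     share of the current servers used otherwise.
    """
    if fixed_count is not None:
        return float(fixed_count)
    return current_servers * percent / 100


def should_hold(available, current_servers, fixed_count=None, percent=20):
    """True if requests would be delayed under the assumed rule."""
    threshold = continue_threshold(current_servers, fixed_count, percent)
    return available < threshold


if __name__ == "__main__":
    # 10 known servers, only 1 still available: below 20%, so hold.
    print(should_hold(available=1, current_servers=10))
    # 10 known servers, 5 still available: above 20%, so continue.
    print(should_hold(available=5, current_servers=10))
```

If a check like this is what triggers the delay, the per-server load
figures would not explain it; the count of servers logged in at that
moment would.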


----- Original Message ----- 
From: "Jerome LAURET" <[log in to unmask]>
To: <[log in to unmask]>
Sent: Thursday, February 03, 2011 10:56 AM
Subject: Another puzzle with our deployment ...


> While I saw messages like those
> 110203 12:28:50 24300 zer0.597:40@stargrid04 do_Select: delay 15
> /home/starlib/home/starreco/reco/2007ProductionMinBias/FullField/P08ic/2007/120/8120080/st_physics_8120080_raw_1010005.MuDst.root
>
> on our Xrootd redirector cmslog, I could not explain why the requests
> were delayed. Our rule is currently (you can see it in Ofer's posting,
> but I am narrowing it down here)
>
> cms.sched cpu 70 io 20 mem 5 runq 5 fuzz 10 refreset 3600
>
> and the file was held by two dataservers, the first of which
> reported
>
> Report_Usage cpu=63 net=6 xeq=37 mem=79 pag=0 dsk=90 138783
>
> and the second reported
>
> Report_Usage cpu=100 net=0 xeq=64 mem=61 pag=3 dsk=91 58759
>
> Neither would lead to a cutoff at 100% (although I am not sure how
> "runq" is computed, so I left it out of the formula when I checked
> by hand). I have now dropped runq to 4 and refreset to 900.
>
> Does anyone see a problem with those parameters?
> How can I find out the reason behind those "delay 15" messages? [Is
> there any way to get more verbose information showing what the
> decision was based on?]
>
> Thank you,
>
> -- 
>
>             ,,,,,
>            ( o o )
>         --m---U---m--
>             Jerome
>
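
The hand check mentioned in the quoted message can be sketched in Python
as a weighted sum of the reported values. The weights are taken from the
quoted cms.sched line and the inputs from the two Report_Usage lines. The
mapping of weight names to report fields (io to net, runq to xeq) and the
"sum of weight times value, divided by 100" formula are assumptions, not
confirmed cmsd behaviour; the original poster left runq out because he was
unsure how it is computed.

```python
# Weights from the quoted "cms.sched" line, as percentages.
WEIGHTS = {"cpu": 70, "io": 20, "mem": 5, "runq": 5}

# ASSUMPTION: which Report_Usage field each weight applies to.
FIELD_FOR_WEIGHT = {"cpu": "cpu", "io": "net", "mem": "mem", "runq": "xeq"}


def parse_usage(line):
    """Parse 'key=value' tokens from a Report_Usage line into a dict.

    Tokens without '=' (such as the trailing number) are ignored.
    """
    usage = {}
    for token in line.split():
        if "=" in token:
            key, value = token.split("=", 1)
            usage[key] = int(value)
    return usage


def weighted_load(usage, weights=WEIGHTS):
    """Return the assumed combined load: sum(weight * value) / 100."""
    total = sum(weights[w] * usage[FIELD_FOR_WEIGHT[w]] for w in weights)
    return total / 100.0


if __name__ == "__main__":
    first = parse_usage("cpu=63 net=6 xeq=37 mem=79 pag=0 dsk=90 138783")
    second = parse_usage("cpu=100 net=0 xeq=64 mem=61 pag=3 dsk=91 58759")
    print(weighted_load(first))   # 51.1
    print(weighted_load(second))  # 76.25
```

Under these assumptions both servers come out well below 100, which
matches the observation in the quoted message that neither report would
lead to a cutoff.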