Thank you for your prompt answer.
Here is an example of relevant lines from cmslog.
110203 12:28:05 24300 Select seeking
110203 12:28:05 24300 SelNode client defered; eligible servers suspended
110203 12:28:05 24300 zer0.597:40@stargrid04 do_Select: delay 15
> the number of data servers available has fallen below the
> continue threshold
Our config specifies
cms.delay servers 75% service 15 startup 65 suspend 15
What does "fallen below the continue threshold" mean here? According
to my monitoring, all services are usually up and running (at most
1-2% of the dataservers are missing from time to time).
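
To help answer the question about messages around the delay message, here is a minimal Python sketch that pulls the neighbouring lines for each "do_Select: delay" entry out of a cmslog. It assumes the line layout seen in the excerpt above (yymmdd hh:mm:ss, a number, then the message text). The names delay_context and window are my own and are not part of xrootd.

```python
import re

# Assumed line layout, based only on the excerpt above:
#   yymmdd hh:mm:ss <number> <message text>
LINE_RE = re.compile(r"^(\d{6}) (\d{2}:\d{2}:\d{2}) (\d+) (.*)$")
DELAY_RE = re.compile(r"do_Select: delay (\d+)")


def delay_context(lines, window=2):
    """Return (delay_seconds, surrounding_lines) for each delay message.

    `window` (hypothetical parameter) is the number of lines of context
    kept before and after each match.
    """
    lines = [ln.rstrip("\n") for ln in lines]
    results = []
    for i, line in enumerate(lines):
        parsed = LINE_RE.match(line)
        if not parsed:
            continue  # skip anything that does not look like a log line
        hit = DELAY_RE.search(parsed.group(4))
        if hit:
            start = max(0, i - window)
            results.append((int(hit.group(1)), lines[start:i + window + 1]))
    return results


# The three lines quoted above, used as sample input.
sample = [
    "110203 12:28:05 24300 Select seeking",
    "110203 12:28:05 24300 SelNode client defered; eligible servers suspended",
    "110203 12:28:05 24300 zer0.597:40@stargrid04 do_Select: delay 15",
]

for seconds, context in delay_context(sample):
    print(f"delay {seconds}s, context:")
    for entry in context:
        print("   ", entry)
```

For a real log, passing an open file object instead of the sample list should work the same way, since the function only iterates over lines.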
Andrew Hanushevsky wrote:
> Are there other messages around the delay message (should be)? Any
> corresponding messages in the xrootd log? Usually, there will be
> messages telling you about events that would cause this kind of delay. I
> suspect that the number of data servers available has fallen below the
> continue threshold (depending on your config this could be a fixed
> number or is 20% of the number of current servers). When the cmsd sees a
> massive die-off it goes into a holding pattern to see if servers start
> coming back or there is a systemic problem.
> ----- Original Message ----- From: "Jerome LAURET" <[log in to unmask]>
> To: <[log in to unmask]>
> Sent: Thursday, February 03, 2011 10:56 AM
> Subject: Another puzzle with our deployment ...
>> While I saw messages like those
>> 110203 12:28:50 24300 zer0.597:40@stargrid04 do_Select: delay 15
>> on our Xrootd redirector cmslog, I could not explain why the requests
>> were delayed. Our rule is currently (you can see it in Ofer's posting,
>> but I am narrowing it down here)
>> cms.sched cpu 70 io 20 mem 5 runq 5 fuzz 10 refreset 3600
>> and the file was held by two dataservers, the first of which reported
>> Report_Usage cpu=63 net=6 xeq=37 mem=79 pag=0 dsk=90 138783
>> and the second reported
>> Report_Usage cpu=100 net=0 xeq=64 mem=61 pag=3 dsk=91 58759
>> Neither would lead to a cutoff at 100% (although I am not sure how
>> "runq" is computed, so I left it out of the formula when I checked
>> by hand). I have now dropped runq to 4 and refreset to 900.
>> Does anyone see a problem with those parameters?
>> How can I find out the reason behind those "delay 15" messages? [Is
>> there any way to get more verbose information about what the decision
>> was based on?]
>> Thank you,
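
For reference, the hand check described in the original message can be written out as a short Python sketch. It assumes the load is a plain weighted sum of the reported percentages using the cms.sched weights, that the io weight applies to the net field, and that runq applies to xeq. These mappings are my assumption and are not confirmed anywhere in this thread; fuzz and refreset are ignored.

```python
# Weights from the cms.sched line quoted above (percent).
WEIGHTS = {"cpu": 70, "io": 20, "mem": 5, "runq": 5}

# ASSUMED mapping from cms.sched keywords to Report_Usage fields.
FIELD_FOR = {"cpu": "cpu", "io": "net", "mem": "mem", "runq": "xeq"}


def weighted_load(usage, weights=WEIGHTS, skip=()):
    """Weighted sum of usage percentages; keys listed in `skip` are left out."""
    total = 0.0
    for key, weight in weights.items():
        if key in skip:
            continue
        total += weight * usage[FIELD_FOR[key]] / 100.0
    return total


# The two Report_Usage lines quoted above.
server1 = {"cpu": 63, "net": 6, "xeq": 37, "mem": 79, "pag": 0, "dsk": 90}
server2 = {"cpu": 100, "net": 0, "xeq": 64, "mem": 61, "pag": 3, "dsk": 91}

for name, usage in (("server1", server1), ("server2", server2)):
    without_runq = weighted_load(usage, skip=("runq",))
    with_runq = weighted_load(usage)
    print(f"{name}: without runq {without_runq:.2f}, with runq {with_runq:.2f}")
```

Under these assumptions both servers come out below 100 (roughly 49 and 73 without runq), which is consistent with the hand check, so the weighted load alone would not explain the delay.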