Hi Jerome,

Below you see the line that tries to explain why the client is being 
delayed....

client defered; eligible servers suspended

What this means is that either 25% of your servers have died (which, by your 
account, is unlikely) or the server that actually has the file is no longer 
responsive. I suspect the latter: the cmsd knows where the file is, but the 
server that has it isn't responding, so the client keeps getting deferred 
until that server comes back. This is allowed to go on for 10 minutes. After 
that, the client is simply told "no luck".
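
To make the arithmetic concrete, here is a minimal Python sketch. It assumes that "servers 75%" means at least 75% of the known servers must be responsive, and that the client is re-deferred with the same 15-second delay until the 10-minute limit runs out. The cluster size of 100 and both function names are hypothetical, chosen only for illustration; this is not how the cmsd itself is coded.

```python
import math

def min_servers_required(total_servers: int, servers_pct: float) -> int:
    """Smallest server count that still satisfies a percentage threshold."""
    return math.ceil(total_servers * servers_pct / 100)

def max_deferrals(delay_seconds: int, limit_seconds: int) -> int:
    """How many fixed-length deferrals fit inside the overall time limit."""
    return limit_seconds // delay_seconds

# Hypothetical cluster size, used only for illustration.
total = 100
print(min_servers_required(total, 75))  # servers needed to stay at or above 75%
print(max_deferrals(15, 10 * 60))       # 15-second delays within 10 minutes
```

Under these assumptions, losing more than a quarter of the servers would trip the threshold, and a client waiting on a single unresponsive server would see a few dozen "delay 15" responses before being turned away.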

Andy

----- Original Message ----- 
From: "Jerome LAURET" <[log in to unmask]>
To: <[log in to unmask]>
Sent: Friday, February 04, 2011 6:56 AM
Subject: Re: Another puzzle with our deployment ...


>
> Hello Andy,
>
> Thank you for your prompt answer.
> Here is an example of relevant lines from cmslog.
>
> 110203 12:28:05 24300 Select seeking
> /home/starlib/home/starreco/reco/2007ProductionMinBias/FullField/P08ic/2007/120/8120080/st_physics_8120080_raw_1010005.MuDst.root
> 110203 12:28:05 24300 SelNode client defered; eligible servers suspended
> for for
> /home/starlib/home/starreco/reco/2007ProductionMinBias/FullField/P08ic/2007/120/8120080/st_physics_8120080_raw_1010005.MuDst.root
> 110203 12:28:05 24300 zer0.597:40@stargrid04 do_Select: delay 15
> /home/starlib/home/starreco/reco/2007ProductionMinBias/FullField/P08ic/2007/120/8120080/st_physics_8120080_raw_1010005.MuDst.root
>
>
>> the number of data servers available has fallen below the
>> continue threshold
>
> Our config specifies
> cms.delay servers 75% service 15 startup 65 suspend 15
>
> What is the meaning of "fallen below threshold"? I ask because usually
> all services are up and running according to my monitoring (at most
> 1-2% of the dataservers may be missing from time to time).
>
>
>
>
> Andrew Hanushevsky wrote:
>> Are there other messages around the delay message (should be)? Any
>> corresponding messages in the xrootd log? Usually, there will be
>> messages telling you about events that would cause this kind of delay. I
>> suspect that the number of data servers available has fallen below the
>> continue threshold (depending on your config this could be a fixed
>> number or is 20% of the number of current servers). When the cmsd sees a
>> massive die-off it goes into a holding pattern to see if servers start
>> coming back or there is a systemic problem.
>>
>> Andy
>>
>>
>> ----- Original Message ----- From: "Jerome LAURET" <[log in to unmask]>
>> To: <[log in to unmask]>
>> Sent: Thursday, February 03, 2011 10:56 AM
>> Subject: Another puzzle with our deployment ...
>>
>>
>>> While I saw messages like those
>>> 110203 12:28:50 24300 zer0.597:40@stargrid04 do_Select: delay 15
>>> /home/starlib/home/starreco/reco/2007ProductionMinBias/FullField/P08ic/2007/120/8120080/st_physics_8120080_raw_1010005.MuDst.root
>>>
>>>
>>> on our Xrootd redirector cmslog, I could not explain why the requests
>>> were delayed. Our rule is currently (you can see it in Ofer's posting,
>>> but I am narrowing it down here)
>>>
>>> cms.sched cpu 70 io 20 mem 5 runq 5 fuzz 10 refreset 3600
>>>
>>> and the file was held by two dataservers, the first of which
>>> reported
>>>
>>> Report_Usage cpu=63 net=6 xeq=37 mem=79 pag=0 dsk=90 138783
>>>
>>> and the second reported
>>>
>>> Report_Usage cpu=100 net=0 xeq=64 mem=61 pag=3 dsk=91 58759
>>>
>>> Neither would lead to a cutoff at 100% (although I am not sure how
>>> "runq" is computed, so I left it out of the formula while I checked
>>> by hand). I have now dropped runq to 4 and refreset to 900.
>>>
>>> Does anyone see a problem with those parameters?
>>> How can I find the reason behind those "delay 15" messages? [Is there
>>> any way to get more verbose information about what the decision
>>> was based on?]
>>>
>>> Thank you,
>>>
>>> -- 
>>>
>>>             ,,,,,
>>>            ( o o )
>>>         --m---U---m--
>>>             Jerome
>>>
>
> -- 
>
>             ,,,,,
>            ( o o )
>         --m---U---m--
>             Jerome
>
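
The hand check described in the original message (the cms.sched weights applied to the two Report_Usage lines) can be sketched in Python as below. This is only an illustration under stated assumptions: the load is taken to be a simple weighted sum divided by 100, and the mapping from cms.sched names to Report_Usage fields (io to net, runq to xeq) is a guess that the thread does not confirm. The function name and the `skip` parameter are hypothetical.

```python
# Weights from the quoted directive: cms.sched cpu 70 io 20 mem 5 runq 5
WEIGHTS = {"cpu": 70, "io": 20, "mem": 5, "runq": 5}

# ASSUMPTION: which Report_Usage field each weight applies to.
FIELD_FOR = {"cpu": "cpu", "io": "net", "mem": "mem", "runq": "xeq"}

def weighted_load(usage, weights=WEIGHTS, skip=()):
    """Weighted sum of usage percentages, scaled back to a 0-100 range."""
    total = sum(
        weight * usage[FIELD_FOR[name]]
        for name, weight in weights.items()
        if name not in skip
    )
    return total / 100

# The two Report_Usage lines quoted in the original message.
server1 = {"cpu": 63, "net": 6, "xeq": 37, "mem": 79, "pag": 0, "dsk": 90}
server2 = {"cpu": 100, "net": 0, "xeq": 64, "mem": 61, "pag": 3, "dsk": 91}

for name, usage in (("server1", server1), ("server2", server2)):
    print(name, weighted_load(usage), weighted_load(usage, skip=("runq",)))
```

With or without the runq term, both servers stay below 100 under these assumptions, which matches the observation in the original message that neither server should have hit the cutoff.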