Hi Andy,
On 06/25/11 21:34, Andrew Hanushevsky wrote:
> Hi Matevz,
>
> On Sat, 25 Jun 2011, Matevz Tadel wrote:
>
>> Hi Andy,
>>
>> On 06/25/11 17:26, Andrew Hanushevsky wrote:
>>> Hi Matevz,
>>>
>> Thanks, I'll look it up and set it to something more aggressive. How come it
>> didn't recover automatically?
> It can't. By definition when too many servers disconnect it goes into a holding
> pattern until those servers come back. This prevents the system from doing
> stupid things like restaging data on the remaining servers.
OK, understood. That's also what I was hit by ... I had the default 'cms.delay
servers' value (which turns out to be 80%) and four servers ... one went down
... and so the whole thing stopped.
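
To spell out the arithmetic behind that stall, here is a minimal Python sketch of
the percentage check as I understand it from the above. The function name
has_enough_servers is made up for illustration, and the exact comparison and
rounding that cmsd applies are assumptions on my part, not taken from the code.

```python
def has_enough_servers(active: int, known: int, min_percent: int = 80) -> bool:
    """Return True if the share of connected servers meets the threshold.

    Hypothetical helper: models the 'cms.delay servers' percentage as a
    simple ">= min_percent of known servers" test (an assumption).
    """
    if known <= 0:
        return False
    # Integer arithmetic avoids floating-point surprises at the boundary.
    return active * 100 >= min_percent * known


# Four servers, all up: 100% >= 80%, so the cluster keeps serving.
print(has_enough_servers(4, 4))
# Four servers, one down: 75% < 80%, so it goes into the holding pattern.
print(has_enough_servers(3, 4))
```

With only four servers, losing any single one drops the share to 75%, so an 80%
threshold leaves no room for even one failure.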
>> It is also true that the same machines (uaf-X) are used for interactive logon
>> and have been loaded pretty badly for the last couple of weeks.
> That shouldn't cause a huge problem unless you've reached the load limit. In
> that case, clients will be delayed until the load falls back down below the
> threshold.
I don't have 'cms.sched maxload' set ... the default is 100, right? And another
thing, runq percentage -- this pertains to system load average, the first number
reported by the executable started via cms.perf (being 100 * LoadAvg15 / N_cores
in XrdOlbMonPerf, it seems)?
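
In other words, my reading of the formula is roughly the following Python sketch.
This is only my guess at what the number means, not the actual XrdOlbMonPerf
code; the function name runq_percent is made up, and whether the value gets
capped or rounded is something I have not checked.

```python
import os


def runq_percent(load_avg: float, n_cores: int) -> float:
    """Load average expressed as a percentage of the available cores.

    Hypothetical helper implementing 100 * LoadAvg / N_cores as read above.
    """
    if n_cores <= 0:
        raise ValueError("n_cores must be positive")
    return 100.0 * load_avg / n_cores


if __name__ == "__main__" and hasattr(os, "getloadavg"):
    # os.getloadavg() returns the 1-, 5- and 15-minute load averages.
    load15 = os.getloadavg()[2]
    cores = os.cpu_count() or 1
    print(f"runq: {runq_percent(load15, cores):.1f}%")
```

So a 15-minute load average of 4 on an 8-core machine would be reported as 50.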
Cheers,
Matevz
<snip>