Hi Andy,

Thanks for looking into this. I'm pretty sure that sched was 0 for all 5 fields 
during the test. The perl scripts reporting the load are running on the servers 
though.

What I might have added later was fuzz 100.

Anyway, with the current config file (the one on the web) the redirector is 
still doing the same.

Should I try commenting out cms.perf and cms.sched and restart the whole cluster 
-- and then start putting things back in? Will this result in pure round robin 
open request redirection?
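
To judge whether redirection after such a restart is anywhere near round robin, the redirect targets could be tallied per server. This is only a minimal sketch: it assumes the selected host names have already been pulled out of the cmsd log into a plain list (the extraction is not shown), the function names expand_hosts and tally are made up for illustration, and the host ranges are the ones quoted further down in this thread.

```python
from collections import Counter


def expand_hosts():
    """Expand the server ranges quoted in this thread into host names."""
    hosts = [f"uaf-{i}" for i in range(3, 10)]                # uaf-[3-9]
    hosts += [f"cabinet-8-8-{i}" for i in range(0, 9)]        # cabinet-8-8-[0-8]
    hosts += [f"cabinet-8-8-{i}" for i in range(10, 14)]      # cabinet-8-8-[10-13]
    return hosts


def tally(selected, hosts):
    """Count redirects per host and list the hosts that were never chosen.

    `selected` is a list of host names, one entry per redirected open.
    Returns (counts, never_selected, expected_per_host), where the last
    value is the per-host share under an ideal round-robin split.
    """
    counts = Counter(selected)
    never_selected = [h for h in hosts if counts[h] == 0]
    expected_per_host = len(selected) / len(hosts)
    return counts, never_selected, expected_per_host


if __name__ == "__main__":
    hosts = expand_hosts()
    # Made-up sample input, NOT data from the actual test.
    sample = ["uaf-3", "uaf-3", "cabinet-8-8-1", "uaf-6"]
    counts, never_selected, expected = tally(sample, hosts)
    print(f"ideal round-robin share per host: {expected:.2f}")
    for host in hosts:
        print(f"{host:16s} {counts[host]}")
    print("never selected:", ", ".join(never_selected))
```

With pure round robin, every count should sit close to the ideal share and the never-selected list should be empty for a run of ~200 opens.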

Matevz

PS - Our xrootd cluster was in certificate hell yesterday, only 4 out of 10 
machines were accessible -- so it's a good time to have disruptive fun with it 
... both :) and :(

On 3/7/14 5:07 PM, Andrew Hanushevsky wrote:
> Hi Matevz,
>
> OK, based on the log the config file you pointed to is not the one used in the
> associated log. Why? Because a non-zero load is being calculated so that means
> the factors were not zero at the time of the test. Indeed, the redirector will
> avoid heavily loaded servers and that would explain what you saw.
>
> Andy
>
> On Fri, 7 Mar 2014, Matevz Tadel wrote:
>
>> Hi Andy,
>>
>> Is this good enough, or should I prepare something else?
>>
>> Matevz
>>
>> On 02/27/14 10:49, Matevz Tadel wrote:
>>> Hi Andy,
>>>
>>> I had "cms.trace all" all along.
>>>
>>> This is the extract of redirects:
>>>    http://uaf-2.t2.ucsd.edu/~matevz/tmp/cmsd-redirect.txt
>>>
>>> The full log:
>>>    http://uaf-2.t2.ucsd.edu/~matevz/tmp/cmsd.log
>>>
>>> And a sortable table of a set of ~200 files opened with 1 second interval:
>>>    http://uaf-2.t2.ucsd.edu/~matevz/tmp/ucsd-openfiles.html
>>> - you can sort it by open time (similar to redirect extract);
>>> - or by server name to see the distribution over servers.
>>>
>>> Our servers are uaf-[3-9], cabinet-8-8-[0-8], cabinet-8-8-[10-13].
>>>
>>> You'll see that cabinet 0, 2, 3, 7, 8 and 10 do not get selected at all in this
>>> 200 file test and that uaf-4, 5 and 9 are only selected 2 or 3 times. I checked
>>> that there is no weirdness in the xrootd / cmsd logs on the under-used nodes
>>> (and that I can talk to them directly).
>>>
>>> Ah, just noticed ... the cabinet nodes that don't get selected do have a higher
>>> load & cpu usage and the ones that do are not doing anything (which is really
>>> unusual, that's why I didn't even check it at first). So my cms.sched settings
>>> seem to get ignored!
>>>
>>> The full config, redirector is xrootd.t2.ucsd.edu:
>>>    http://uaf-2.t2.ucsd.edu/~matevz/tmp/xrootd.cfg
>>>
>>> Matevz
>>>
>>> On 02/27/14 01:05, Andrew Hanushevsky wrote:
>>>> Hi Matevz,
>>>>
>>>> The only way to find out is to turn on redirect debugging in the cmsd for a
>>>> while and see what the decisions were. We can go from there once we have a
>>>> timeline.
>>>>
>>>> Andy
>>>>
>>>> On Wed, 26 Feb 2014, Matevz Tadel wrote:
>>>>
>>>>> On 02/26/14 09:22, Matevz Tadel wrote:
>>>>>> Hi,
>>>>>>
>>>>>> We have ~20 xrootd servers at UCSD, all of them do something else, too, and
>>>>>> are thus under different load. This led to practically all requests going to a
>>>>>> few servers only so I set cms.sched to do round-robin. But this doesn't help
>>>>>> much, the open requests are still mostly sent to the same few servers.
>>>>>>
>>>>>> Could it be that "cms.dfs lookup distrib" causes the redirector to send the
>>>>>> client to the "fastest to respond" server instead of decoupling verify and
>>>>>> redirect steps?
>>>>>
>>>>> OK, that wasn't it ... I got hdfs configured on our redirector and tried
>>>>> lookup central but it didn't change anything.
>>>>>
>>>>> What could cause the redirector to only redirect to a few servers? I have this
>>>>> now ... so it should be pure round-robin, right?
>>>>>  cms.sched    cpu 0 io 0 mem 0 pag 0 runq 0 space 0 fuzz 100 refreset 3600
>>>>>
>>>>>
>>>>> Matevz
>>>>>
>>>>> ########################################################################
>>>>> Use REPLY-ALL to reply to list
>>>>>
>>>>> To unsubscribe from the XROOTD-DEV list, click the following link:
>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1
>>>>>