Hi Matevz,
Hmmm, no need. I am puzzled: with the values set to zero the load should
be computed as zero, and it is not. That tells me either the code is totally
broken (unlikely) or the zero values are not present for some reason.
Could you give me a gcore for the redirector so I can see what the values
actually are (plus the Linux version and xrootd version you are running)?
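
To make the reasoning above concrete, here is a minimal sketch. It assumes
the load is a plain weighted sum of the reported metrics, with the cms.sched
percentages as weights. The function name, the metric names and the sample
values are hypothetical illustrations, not taken from the cmsd source.

```python
def weighted_load(weights, metrics):
    """Weighted sum of percentage metrics (assumed model, not the cmsd code).

    weights: cms.sched-style percentages (0-100), keyed by metric name.
    metrics: values reported by a server (0-100), keyed by metric name.
    """
    return sum(weights[name] * metrics[name] // 100 for name in weights)


# All factors zero, as in the posted config (hypothetical metric values).
zero_weights = {"cpu": 0, "io": 0, "mem": 0, "pag": 0, "runq": 0}
busy = {"cpu": 95, "io": 80, "mem": 60, "pag": 10, "runq": 70}
idle = {"cpu": 2, "io": 1, "mem": 20, "pag": 0, "runq": 0}

# Under this model both servers come out with the same load of zero,
# so load alone cannot explain a preference for the idle one.
print(weighted_load(zero_weights, busy), weighted_load(zero_weights, idle))
```

Under this assumption a non-zero computed load can only come from at least
one non-zero factor, which is why the values actually in memory matter.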
Andy
On Sat, 8 Mar 2014, Matevz Tadel wrote:
> Hi Andy,
>
> Thanks for looking into this. I'm pretty sure that sched was 0 for all 5
> fields during the test. The perl scripts reporting the load are running on
> the servers though.
>
> What I might have added later was fuzz 100.
>
> Anyway, with the current config file (the one on the web) the redirector is
> still doing the same.
>
> Should I try commenting out cms.perf and cms.sched and restart the whole
> cluster -- and then start putting things back in? Will this result in pure
> round-robin open request redirection?
>
> Matevz
>
> PS - Our xrootd cluster was in certificate hell yesterday, only 4 out of 10
> machines were accessible -- so it's a good time to have disruptive fun with
> it ... both :) and :(
>
> On 3/7/14 5:07 PM, Andrew Hanushevsky wrote:
>> Hi Matevz,
>>
>> OK, based on the log the config file you pointed to is not the one used in
>> the associated log. Why? Because a non-zero load is being calculated so that
>> means the factors were not zero at the time of the test. Indeed, the
>> redirector will avoid heavily loaded servers and that would explain what you
>> saw.
>>
>> Andy
>>
>> On Fri, 7 Mar 2014, Matevz Tadel wrote:
>>
>>> Hi Andy,
>>>
>>> Is this good enough or should I prepare something else?
>>>
>>> Matevz
>>>
>>> On 02/27/14 10:49, Matevz Tadel wrote:
>>>> Hi Andy,
>>>>
>>>> I had "cms.trace all" all along.
>>>>
>>>> This is the extract of redirects:
>>>> http://uaf-2.t2.ucsd.edu/~matevz/tmp/cmsd-redirect.txt
>>>>
>>>> The full log:
>>>> http://uaf-2.t2.ucsd.edu/~matevz/tmp/cmsd.log
>>>>
>>>> And a sortable table of a set of ~200 files opened at 1-second intervals:
>>>> http://uaf-2.t2.ucsd.edu/~matevz/tmp/ucsd-openfiles.html
>>>> - you can sort it by open time (similar to redirect extract);
>>>> - or by server name to see the distribution over servers.
>>>>
>>>> Our servers are uaf-[3-9], cabinet-8-8-[0-8], cabinet-8-8-[10-13].
>>>>
>>>> You'll see that cabinet 0, 2, 3, 7, 8 and 10 do not get selected at all in
>>>> this 200 file test and that uaf-4, 5 and 9 are only selected 2 or 3 times.
>>>> I checked there is no weirdness in the xrootd / cmsd logs on the
>>>> under-provisioned nodes (and that I can talk to them directly).
>>>>
>>>> Ah, just noticed ... the cabinet nodes that don't get selected do have a
>>>> higher load & cpu usage and the ones that do are not doing anything (which
>>>> is really unusual, that's why I didn't even check it at first). So my
>>>> cms.sched settings seem to get ignored!
>>>>
>>>> The full config, redirector is xrootd.t2.ucsd.edu:
>>>> http://uaf-2.t2.ucsd.edu/~matevz/tmp/xrootd.cfg
>>>>
>>>> Matevz
>>>>
>>>> On 02/27/14 01:05, Andrew Hanushevsky wrote:
>>>>> Hi Matevz,
>>>>>
>>>>> The only way to find out is to turn on redirect debugging in the cmsd for
>>>>> a while and see what the decisions were. We can go from there once we have
>>>>> a timeline.
>>>>>
>>>>> Andy
>>>>>
>>>>> On Wed, 26 Feb 2014, Matevz Tadel wrote:
>>>>>
>>>>>> On 02/26/14 09:22, Matevz Tadel wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> We have ~20 xrootd servers at UCSD. All of them do something else, too,
>>>>>>> and are thus under different load. This led to practically all requests
>>>>>>> going to a few servers only, so I set cms.sched to do round-robin. But
>>>>>>> this doesn't help much; the open requests are still mostly sent to the
>>>>>>> same few servers.
>>>>>>>
>>>>>>> Could it be that "cms.dfs lookup distrib" causes the redirector to send
>>>>>>> the client to the "fastest to respond" server instead of decoupling
>>>>>>> verify and redirect steps?
>>>>>>
>>>>>> OK, that wasn't it ... I got hdfs configured on our redirector and tried
>>>>>> lookup central but it didn't change anything.
>>>>>>
>>>>>> What could cause the redirector to only redirect to a few servers? I have
>>>>>> this now ... so it should be pure round-robin, right?
>>>>>> cms.sched cpu 0 io 0 mem 0 pag 0 runq 0 space 0 fuzz 100 refreset 3600
>>>>>>
>>>>>>
>>>>>> Matevz
>>>>>>
>>>>>> ########################################################################
>>>>>> Use REPLY-ALL to reply to list
>>>>>>
>>>>>> To unsubscribe from the XROOTD-DEV list, click the following link:
>>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1
>>>>>>
>