Hi Matevz,

Can you post the config file?  In the last few days, I've shot myself in the foot over silly file format issues; sometimes it helps to have a second set of eyeballs.

Brian

On Mar 8, 2014, at 3:26 PM, Andrew Hanushevsky wrote:

> Hi Matevz,
> 
> Hmmm, no need. I am puzzled: with the values set to zero the load should be computed as zero, and it is not. That tells me either the code is totally broken (unlikely) or the zero values are not present for some reason. Could you give me a gcore for the redirector so I can see what the values actually are (plus the Linux version and xrootd version you are running)?
> 
> Andy
> 
> On Sat, 8 Mar 2014, Matevz Tadel wrote:
> 
>> Hi Andy,
>> 
>> Thanks for looking into this. I'm pretty sure that sched was 0 for all 5 fields during the test. The perl scripts reporting the load are running on the servers though.
>> 
>> What I might have added later was fuzz 100.
>> 
>> Anyway, with the current config file (the one on the web) the redirector is still doing the same.
>> 
>> Should I try commenting out cms.perf and cms.sched and restarting the whole cluster -- and then start putting things back in? Will this result in pure round-robin open request redirection?
>> 
>> Matevz
>> 
>> PS - Our xrootd cluster was in certificate hell yesterday, only 4 out of 10 machines were accessible -- so it's a good time to have disruptive fun with it ... both :) and :(
>> 
>> On 3/7/14 5:07 PM, Andrew Hanushevsky wrote:
>>> Hi Matevz,
>>> OK, based on the log the config file you pointed to is not the one used in the
>>> associated log. Why? Because a non-zero load is being calculated so that means
>>> the factors were not zero at the time of the test. Indeed, the redirector will
>>> avoid heavily loaded servers and that would explain what you saw.
>>> Andy
>>> On Fri, 7 Mar 2014, Matevz Tadel wrote:
>>>> Hi Andy,
>>>> Is this good enough, or should I prepare something else?
>>>> Matevz
>>>> On 02/27/14 10:49, Matevz Tadel wrote:
>>>>> Hi Andy,
>>>>> I had "cms.trace all" all along.
>>>>> This is the extract of redirects:
>>>>>   http://uaf-2.t2.ucsd.edu/~matevz/tmp/cmsd-redirect.txt
>>>>> The full log:
>>>>>   http://uaf-2.t2.ucsd.edu/~matevz/tmp/cmsd.log
>>>>> And a sortable table of a set of ~200 files opened with 1 second interval:
>>>>>   http://uaf-2.t2.ucsd.edu/~matevz/tmp/ucsd-openfiles.html
>>>>> - you can sort it by open time (similar to redirect extract);
>>>>> - or by server name to see the distribution over servers.
>>>>> Our servers are uaf-[3-9], cabinet-8-8-[0-8], cabinet-8-8-[10-13].
>>>>> You'll see that cabinet 0, 2, 3, 7, 8 and 10 do not get selected at all in this
>>>>> 200 file test and that uaf-4, 5 and 9 are only selected 2 or 3 times. I checked
>>>>> there is no weirdness on xrootd / cmsd logs on the under provisioned nodes (and
>>>>> that I can talk to them directly).
>>>>> Ah, just noticed ... the cabinet nodes that don't get selected do have a higher
>>>>> load & cpu usage and the ones that do are not doing anything (which is really
>>>>> unusual, that's why I didn't even check it at first). So my cms.sched settings
>>>>> seem to get ignored!
>>>>> The full config, redirector is xrootd.t2.ucsd.edu:
>>>>>   http://uaf-2.t2.ucsd.edu/~matevz/tmp/xrootd.cfg
>>>>> Matevz
>>>>> On 02/27/14 01:05, Andrew Hanushevsky wrote:
>>>>>> Hi Matevz,
>>>>>> The only way to find out is to turn on redirect debugging in the cmsd for a
>>>>>> while and see what the decisions were. We can go from there once we have a
>>>>>> timeline.
>>>>>> Andy
>>>>>> On Wed, 26 Feb 2014, Matevz Tadel wrote:
>>>>>>> On 02/26/14 09:22, Matevz Tadel wrote:
>>>>>>>> Hi,
>>>>>>>> We have ~20 xrootd servers at UCSD; all of them do something else, too, and
>>>>>>>> are thus under different load. This led to practically all requests going to a
>>>>>>>> few servers only, so I set cms.sched to do round-robin. But this doesn't help
>>>>>>>> much; the open requests are still mostly sent to the same few servers.
>>>>>>>> Could it be that "cms.dfs lookup distrib" causes the redirector to send the
>>>>>>>> client to the "fastest to respond" server instead of decoupling verify and
>>>>>>>> redirect steps?
>>>>>>> OK, that wasn't it ... I got hdfs configured on our redirector and tried
>>>>>>> lookup central but it didn't change anything.
>>>>>>> What could cause the redirector to only redirect to a few servers? I have this
>>>>>>> now ... so it should be pure round-robin, right?
>>>>>>> cms.sched    cpu 0 io 0 mem 0 pag 0 runq 0 space 0 fuzz 100 refreset 3600
>>>>>>> Matevz
>>>>>>> ########################################################################
>>>>>>> Use REPLY-ALL to reply to list
>>>>>>> To unsubscribe from the XROOTD-DEV list, click the following link:
>>>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1
>> 
> 

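Note on the reasoning in this thread: Andy's argument is that if every cms.sched factor is zero, each server's computed load is zero, so load cannot steer the redirector toward or away from any server. The Python sketch below illustrates only that reasoning. The formula (a percentage-weighted sum), the treatment of fuzz as an absolute tolerance, the function names, and all metric values are assumptions made up for illustration; none of this is taken from the cmsd source.

```python
from itertools import cycle

# Weights mirroring the five load factors on the cms.sched line in the thread.
ZERO_WEIGHTS = {"cpu": 0, "io": 0, "mem": 0, "pag": 0, "runq": 0}

def weighted_load(metrics, weights):
    """Combine per-server metrics (0-100) into a single load figure.

    Assumed formula: each metric is scaled by its percentage weight.
    """
    return sum(weights[k] * metrics.get(k, 0) for k in weights) // 100

def candidates(servers, weights, fuzz):
    """Return (sorted) the servers whose load is within `fuzz` of the lowest."""
    loads = {name: weighted_load(m, weights) for name, m in servers.items()}
    lowest = min(loads.values())
    return sorted(name for name, load in loads.items() if load - lowest <= fuzz)

# Hypothetical metric values; only the host naming pattern comes from the thread.
servers = {
    "uaf-3":         {"cpu": 5,  "io": 2,  "mem": 10, "pag": 0, "runq": 1},
    "cabinet-8-8-0": {"cpu": 90, "io": 60, "mem": 70, "pag": 5, "runq": 40},
    "cabinet-8-8-1": {"cpu": 3,  "io": 1,  "mem": 8,  "pag": 0, "runq": 0},
}

# All weights zero: every load is 0, so every server stays in the pool
# and a simple rotation visits each of them in turn.
pool = candidates(servers, ZERO_WEIGHTS, fuzz=100)
picker = cycle(pool)
picks = [next(picker) for _ in range(6)]

# A non-zero cpu weight with a tight fuzz drops the busy server from the pool.
cpu_only = dict(ZERO_WEIGHTS, cpu=100)
busy_excluded = candidates(servers, cpu_only, fuzz=10)
```

In this sketch the zero-weight case keeps all three servers in rotation, while the cpu-weighted case leaves out the busy node. That is the pattern Andy describes: a non-zero computed load implies the factors were not actually zero when the test ran.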