Hi Andrew,

  The logs can be found here. From them you can see that the
atlas-bkp1 manager keeps dropping, again and again, the nodes that try
to connect to it.
  http://higgs03.cs.wisc.edu/wguan/


On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
<[log in to unmask]> wrote:
> Hi Wen, Could you start everything up and provide me a pointer to the
> manager log file, supervisor log file, and one data server logfile all of
> which cover the same time-frame (from start to some point where you think
> things are working or not). That way I can see what is happening. At the
> moment I only see two "bad" things in the config file:
>
> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager but you claim, via
> the all.manager directive, that there are three (bkp2 and bkp3). While it
> should work, the log file will be dense with error messages. Please correct
> this to be consistent and make it easier to see real errors.

This is not really a problem on my side, because that config file is
the one used on the data servers. On the managers I change the
"if atlas-bkp1.cs.wisc.edu" line to atlas-bkp2 or atlas-bkp3 as
appropriate. It is a historical leftover: at first only atlas-bkp1 was
used, and atlas-bkp2 and atlas-bkp3 were added later.
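  For reference, a single config shared by every host, with the role
picked by the if block as Andy suggests, would look roughly like the
sketch below; the port number and exact layout are only illustrative,
not copied from our cfg:

    all.manager atlas-bkp1.cs.wisc.edu:3121
    all.manager atlas-bkp2.cs.wisc.edu:3121
    all.manager atlas-bkp3.cs.wisc.edu:3121

    # every host reads the same file; the if block selects the role
    if atlas-bkp1.cs.wisc.edu atlas-bkp2.cs.wisc.edu atlas-bkp3.cs.wisc.edu
       all.role manager
    else
       all.role server
    fi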

> 2) Please use cms.space not olb.space (for historical reasons the latter is
> still accepted and over-rides the former, but that will soon end), and
> please use only one (the config file uses both directives).
Yes, I should remove that line; cms.space is in fact already in the cfg
as well.
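  So only the single cms.space line would stay, for example something
like the following (the thresholds are just placeholders, not our real
values):

    cms.space min 2% 5%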


Thanks
Wen

> The xrootd has an internal mechanism to connect servers with supervisors to
> allow for maximum reliability. You cannot change that algorithm and there is
> no need to do so. You should *never* tell anyone to directly connect to a
> supervisor. If you do, you will likely get unreachable nodes.
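Understood. So the supervisor just gets its own role branch in the same
shared config and nothing on the data servers points at it directly;
roughly like the sketch below, where atlas-super1 is only a made-up
hostname:

    if atlas-super1.cs.wisc.edu
       all.role supervisor
    else if atlas-bkp1.cs.wisc.edu atlas-bkp2.cs.wisc.edu atlas-bkp3.cs.wisc.edu
       all.role manager
    else
       all.role server
    fi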
>
> As for dropping data servers, it would appear to me, given the flurry of
> such activity, that something either crashed or was restarted. That's why it
> would be good to see the complete log of each one of the entities.
>
> Andy
>
> On Fri, 11 Dec 2009, wen guan wrote:
>
>> Hi Andrew,
>>
>>    I read the document and wrote a config
>> file (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>    With my config I can see the manager dispatching messages to the
>> supervisor, but I cannot see any data server trying to connect to
>> the supervisor. At the same time, in the manager's log, I can see
>> that some data servers are dropped.
>>   How does xrootd decide which data servers will connect to a
>> supervisor? Should I specify some data servers to connect to the
>> supervisor?
>>
>>
>> (*) supervisor log
>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State:
>> /atlas/xrootd/users/wguan/test/test131141
>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find
>> failed for state /atlas/xrootd/users/wguan/test/test131141
>>
>> (*)manager log
>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1
>> FSpace=5693644MB MinFR=57218MB Util=0
>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094
>> do_Space: 5696231MB free; 0% util
>> 091211 04:13:24 15661 Protocol:
>> server.10585:[log in to unmask]:1094 logged in.
>> 091211 04:13:24 001 XrdInet: Accepted connection from
>> [log in to unmask]
>> 091211 04:13:24 15661 XrdSched: running ?:[log in to unmask] inq=0
>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79 attached
>> to poller 2; num=22
>> 091211 04:13:24 15661 Add server.21739:[log in to unmask] bumps
>> server.15905:[log in to unmask]:1094 #63
>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:24 15661 Drop_Node:
>> server.15905:[log in to unmask]:1094 dropped.
>> 091211 04:13:24 15661 Add Shoved
>> server.21739:[log in to unmask]:1094 to cluster; id=63.78; num=64;
>> min=51
>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB NumFS=1
>> FSpace=5721854MB MinFR=57218MB Util=0
>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w /atlas
>> 091211 04:13:24 15661 server.21739:[log in to unmask]:1094
>> do_Space: 5721854MB free; 0% util
>> 091211 04:13:24 15661 Protocol:
>> server.21739:[log in to unmask]:1094 logged in.
>> 091211 04:13:24 15661 XrdLink: Unable to recieve from
>> c187.chtc.wisc.edu; connection reset by peer
>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60 seconds
>> 091211 04:13:24 15661 Remove_Node
>> server.21739:[log in to unmask]:1094 node 63.78
>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask] logged
>> out.
>> 091211 04:13:24 15661 server.21739:[log in to unmask] XrdPoll: FD
>> 79 detached from poller 2; num=21
>> 091211 04:13:27 15661 Dispatch server.24718:[log in to unmask]:1094
>> for status dlen=0
>> 091211 04:13:27 15661 server.24718:[log in to unmask]:1094 do_Status:
>> suspend
>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended
>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c177.chtc.wisc.edu
>> FD=16
>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>> 091211 04:13:27 15661 Remove_Node
>> server.24718:[log in to unmask]:1094 node 0.3
>> 091211 04:13:27 15661 Protocol: server.21656:[log in to unmask] logged
>> out.
>> 091211 04:13:27 15661 server.21656:[log in to unmask] XrdPoll: FD
>> 16 detached from poller 2; num=20
>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c179.chtc.wisc.edu
>> FD=21
>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:27 15661 Remove_Node
>> server.17065:[log in to unmask]:1094 node 1.4
>> 091211 04:13:27 15661 Protocol: server.7978:[log in to unmask] logged
>> out.
>> 091211 04:13:27 15661 server.7978:[log in to unmask] XrdPoll: FD 21
>> detached from poller 1; num=21
>> 091211 04:13:27 15661 State: Status changed to suspended
>> 091211 04:13:27 15661 Send status to redirector.15656:14@atlas-bkp2
>> 091211 04:13:27 15661 Dispatch server.12937:[log in to unmask]:1094
>> for status dlen=0
>> 091211 04:13:27 15661 server.12937:[log in to unmask]:1094 do_Status:
>> suspend
>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended
>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c182.chtc.wisc.edu
>> FD=19
>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>> 091211 04:13:27 15661 Remove_Node
>> server.12937:[log in to unmask]:1094 node 7.10
>> 091211 04:13:27 15661 Protocol: server.26620:[log in to unmask] logged
>> out.
>> 091211 04:13:27 15661 server.26620:[log in to unmask] XrdPoll: FD
>> 19 detached from poller 2; num=19
>> 091211 04:13:27 15661 Dispatch server.10842:[log in to unmask]:1094
>> for status dlen=0
>> 091211 04:13:27 15661 server.10842:[log in to unmask]:1094 do_Status:
>> suspend
>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service suspended
>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c178.chtc.wisc.edu
>> FD=15
>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>> 091211 04:13:27 15661 Remove_Node
>> server.10842:[log in to unmask]:1094 node 9.12
>> 091211 04:13:27 15661 Protocol: server.11901:[log in to unmask] logged
>> out.
>> 091211 04:13:27 15661 server.11901:[log in to unmask] XrdPoll: FD
>> 15 detached from poller 1; num=20
>> 091211 04:13:27 15661 Dispatch server.5535:[log in to unmask]:1094
>> for status dlen=0
>> 091211 04:13:27 15661 server.5535:[log in to unmask]:1094 do_Status:
>> suspend
>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service suspended
>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c181.chtc.wisc.edu
>> FD=17
>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>> 091211 04:13:27 15661 Remove_Node
>> server.5535:[log in to unmask]:1094 node 5.8
>> 091211 04:13:27 15661 Protocol: server.13984:[log in to unmask] logged
>> out.
>> 091211 04:13:27 15661 server.13984:[log in to unmask] XrdPoll: FD
>> 17 detached from poller 0; num=21
>> 091211 04:13:27 15661 Dispatch server.23711:[log in to unmask]:1094
>> for status dlen=0
>> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094 do_Status:
>> suspend
>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended
>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c183.chtc.wisc.edu
>> FD=22
>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>> 091211 04:13:27 15661 Remove_Node
>> server.23711:[log in to unmask]:1094 node 8.11
>> 091211 04:13:27 15661 Protocol: server.27735:[log in to unmask] logged
>> out.
>> 091211 04:13:27 15661 server.27735:[log in to unmask] XrdPoll: FD
>> 22 detached from poller 2; num=18
>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c184.chtc.wisc.edu
>> FD=20
>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:27 15661 Remove_Node
>> server.4131:[log in to unmask]:1094 node 3.6
>> 091211 04:13:27 15661 Protocol: server.26787:[log in to unmask] logged
>> out.
>> 091211 04:13:27 15661 server.26787:[log in to unmask] XrdPoll: FD
>> 20 detached from poller 0; num=20
>> 091211 04:13:27 15661 Dispatch server.10585:[log in to unmask]:1094
>> for status dlen=0
>> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094 do_Status:
>> suspend
>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended
>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c185.chtc.wisc.edu
>> FD=23
>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>> 091211 04:13:27 15661 Remove_Node
>> server.10585:[log in to unmask]:1094 node 6.9
>> 091211 04:13:27 15661 Protocol: server.8524:[log in to unmask] logged
>> out.
>> 091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll: FD 23
>> detached from poller 0; num=19
>> 091211 04:13:27 15661 Dispatch server.20264:[log in to unmask]:1094
>> for status dlen=0
>> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094 do_Status:
>> suspend
>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended
>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c180.chtc.wisc.edu
>> FD=18
>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>> 091211 04:13:27 15661 Remove_Node
>> server.20264:[log in to unmask]:1094 node 4.7
>> 091211 04:13:27 15661 Protocol: server.14636:[log in to unmask] logged
>> out.
>> 091211 04:13:27 15661 server.14636:[log in to unmask] XrdPoll: FD
>> 18 detached from poller 1; num=19
>> 091211 04:13:27 15661 Dispatch server.1656:[log in to unmask]:1094
>> for status dlen=0
>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094 do_Status:
>> suspend
>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c186.chtc.wisc.edu
>> FD=24
>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>> 091211 04:13:27 15661 Remove_Node
>> server.1656:[log in to unmask]:1094 node 2.5
>> 091211 04:13:27 15661 Protocol: server.7849:[log in to unmask] logged
>> out.
>> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll: FD 24
>> detached from poller 1; num=18
>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>
>> Wen
>>
>>
>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky
>> <[log in to unmask]> wrote:
>>>
>>> Hi Wen,
>>>
>>> To go past 64 data servers you will need to set up one or more
>>> supervisors. This does not logically change the current
>>> configuration you have. You only need to configure one or more *new*
>>> servers (or at least xrootd processes) whose role is supervisor.
>>> We'd like them to run on separate machines for reliability purposes,
>>> but they could run on the manager node as long as you give each one
>>> a unique instance name (i.e., the -n option).
>>>
>>> The front part of the cmsd reference explains how to do this.
>>>
>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>
>>> Andy
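A side note on the -n option mentioned above: starting an extra,
supervisor-role instance on one of the manager boxes would be roughly
like this (the instance name, config path and log paths below are
placeholders):

    cmsd -n super1 -c /etc/xrootd/xrdcluster.cfg -l /var/log/xrootd/cmsd.log &
    xrootd -n super1 -c /etc/xrootd/xrdcluster.cfg -l /var/log/xrootd/xrootd.log &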
>>>
>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>>   Is there any way to configure xrootd with more than 65
>>>> machines? I used the configuration below but it doesn't work.
>>>> Should I configure some machines' manager to be a supervisor?
>>>>
>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>
>>>>
>>>> Wen
>>>>
>>>
>