Hi Andrew,

     I read the document and wrote a config file
(http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
     With my config, I can see the manager dispatching messages to the
supervisor, but I cannot see any data server trying to connect to the
supervisor. At the same time, in the manager's log, I can see that some
data servers are dropped.
    How does xrootd decide which data servers will connect to the
supervisor? Should I explicitly specify some data servers to connect to
the supervisor?
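
     Just to check my understanding, is the role selection supposed to
follow the pattern below from the cms_config manual? This is only a
rough sketch, not my actual file (the supervisor host pattern "super*"
is a placeholder, and 3121 is the usual cmsd manager port):

all.manager atlas-bkp2:3121
all.export /atlas

if atlas-bkp2*
   all.role manager
else if super*
   all.role supervisor
else
   all.role server
fi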


(*) supervisor log
091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State:
/atlas/xrootd/users/wguan/test/test131141
091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find
failed for state /atlas/xrootd/users/wguan/test/test131141

(*) manager log
091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1
FSpace=5693644MB MinFR=57218MB Util=0
091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
091211 04:13:24 15661 server.10585:[log in to unmask]:1094
do_Space: 5696231MB free; 0% util
091211 04:13:24 15661 Protocol:
server.10585:[log in to unmask]:1094 logged in.
091211 04:13:24 001 XrdInet: Accepted connection from [log in to unmask]
091211 04:13:24 15661 XrdSched: running ?:[log in to unmask] inq=0
091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79 attached
to poller 2; num=22
091211 04:13:24 15661 Add server.21739:[log in to unmask] bumps
server.15905:[log in to unmask]:1094 #63
091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
091211 04:13:24 15661 Drop_Node:
server.15905:[log in to unmask]:1094 dropped.
091211 04:13:24 15661 Add Shoved
server.21739:[log in to unmask]:1094 to cluster; id=63.78; num=64;
min=51
091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB NumFS=1
FSpace=5721854MB MinFR=57218MB Util=0
091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w /atlas
091211 04:13:24 15661 server.21739:[log in to unmask]:1094
do_Space: 5721854MB free; 0% util
091211 04:13:24 15661 Protocol:
server.21739:[log in to unmask]:1094 logged in.
091211 04:13:24 15661 XrdLink: Unable to recieve from
c187.chtc.wisc.edu; connection reset by peer
091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
091211 04:13:24 15661 XrdSched: scheduling drop node in 60 seconds
091211 04:13:24 15661 Remove_Node
server.21739:[log in to unmask]:1094 node 63.78
091211 04:13:24 15661 Protocol: server.21739:[log in to unmask] logged out.
091211 04:13:24 15661 server.21739:[log in to unmask] XrdPoll: FD
79 detached from poller 2; num=21
091211 04:13:27 15661 Dispatch server.24718:[log in to unmask]:1094
for status dlen=0
091211 04:13:27 15661 server.24718:[log in to unmask]:1094 do_Status: suspend
091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended
091211 04:13:27 15661 XrdLink: No RecvAll() data from c177.chtc.wisc.edu FD=16
091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
091211 04:13:27 15661 Remove_Node
server.24718:[log in to unmask]:1094 node 0.3
091211 04:13:27 15661 Protocol: server.21656:[log in to unmask] logged out.
091211 04:13:27 15661 server.21656:[log in to unmask] XrdPoll: FD
16 detached from poller 2; num=20
091211 04:13:27 15661 XrdLink: No RecvAll() data from c179.chtc.wisc.edu FD=21
091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
091211 04:13:27 15661 Remove_Node
server.17065:[log in to unmask]:1094 node 1.4
091211 04:13:27 15661 Protocol: server.7978:[log in to unmask] logged out.
091211 04:13:27 15661 server.7978:[log in to unmask] XrdPoll: FD 21
detached from poller 1; num=21
091211 04:13:27 15661 State: Status changed to suspended
091211 04:13:27 15661 Send status to redirector.15656:14@atlas-bkp2
091211 04:13:27 15661 Dispatch server.12937:[log in to unmask]:1094
for status dlen=0
091211 04:13:27 15661 server.12937:[log in to unmask]:1094 do_Status: suspend
091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended
091211 04:13:27 15661 XrdLink: No RecvAll() data from c182.chtc.wisc.edu FD=19
091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
091211 04:13:27 15661 Remove_Node
server.12937:[log in to unmask]:1094 node 7.10
091211 04:13:27 15661 Protocol: server.26620:[log in to unmask] logged out.
091211 04:13:27 15661 server.26620:[log in to unmask] XrdPoll: FD
19 detached from poller 2; num=19
091211 04:13:27 15661 Dispatch server.10842:[log in to unmask]:1094
for status dlen=0
091211 04:13:27 15661 server.10842:[log in to unmask]:1094 do_Status: suspend
091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
091211 04:13:27 15661 Node: c178.chtc.wisc.edu service suspended
091211 04:13:27 15661 XrdLink: No RecvAll() data from c178.chtc.wisc.edu FD=15
091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
091211 04:13:27 15661 Remove_Node
server.10842:[log in to unmask]:1094 node 9.12
091211 04:13:27 15661 Protocol: server.11901:[log in to unmask] logged out.
091211 04:13:27 15661 server.11901:[log in to unmask] XrdPoll: FD
15 detached from poller 1; num=20
091211 04:13:27 15661 Dispatch server.5535:[log in to unmask]:1094
for status dlen=0
091211 04:13:27 15661 server.5535:[log in to unmask]:1094 do_Status: suspend
091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
091211 04:13:27 15661 Node: c181.chtc.wisc.edu service suspended
091211 04:13:27 15661 XrdLink: No RecvAll() data from c181.chtc.wisc.edu FD=17
091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
091211 04:13:27 15661 Remove_Node
server.5535:[log in to unmask]:1094 node 5.8
091211 04:13:27 15661 Protocol: server.13984:[log in to unmask] logged out.
091211 04:13:27 15661 server.13984:[log in to unmask] XrdPoll: FD
17 detached from poller 0; num=21
091211 04:13:27 15661 Dispatch server.23711:[log in to unmask]:1094
for status dlen=0
091211 04:13:27 15661 server.23711:[log in to unmask]:1094 do_Status: suspend
091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended
091211 04:13:27 15661 XrdLink: No RecvAll() data from c183.chtc.wisc.edu FD=22
091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
091211 04:13:27 15661 Remove_Node
server.23711:[log in to unmask]:1094 node 8.11
091211 04:13:27 15661 Protocol: server.27735:[log in to unmask] logged out.
091211 04:13:27 15661 server.27735:[log in to unmask] XrdPoll: FD
22 detached from poller 2; num=18
091211 04:13:27 15661 XrdLink: No RecvAll() data from c184.chtc.wisc.edu FD=20
091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
091211 04:13:27 15661 Remove_Node
server.4131:[log in to unmask]:1094 node 3.6
091211 04:13:27 15661 Protocol: server.26787:[log in to unmask] logged out.
091211 04:13:27 15661 server.26787:[log in to unmask] XrdPoll: FD
20 detached from poller 0; num=20
091211 04:13:27 15661 Dispatch server.10585:[log in to unmask]:1094
for status dlen=0
091211 04:13:27 15661 server.10585:[log in to unmask]:1094 do_Status: suspend
091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended
091211 04:13:27 15661 XrdLink: No RecvAll() data from c185.chtc.wisc.edu FD=23
091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
091211 04:13:27 15661 Remove_Node
server.10585:[log in to unmask]:1094 node 6.9
091211 04:13:27 15661 Protocol: server.8524:[log in to unmask] logged out.
091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll: FD 23
detached from poller 0; num=19
091211 04:13:27 15661 Dispatch server.20264:[log in to unmask]:1094
for status dlen=0
091211 04:13:27 15661 server.20264:[log in to unmask]:1094 do_Status: suspend
091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended
091211 04:13:27 15661 XrdLink: No RecvAll() data from c180.chtc.wisc.edu FD=18
091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
091211 04:13:27 15661 Remove_Node
server.20264:[log in to unmask]:1094 node 4.7
091211 04:13:27 15661 Protocol: server.14636:[log in to unmask] logged out.
091211 04:13:27 15661 server.14636:[log in to unmask] XrdPoll: FD
18 detached from poller 1; num=19
091211 04:13:27 15661 Dispatch server.1656:[log in to unmask]:1094
for status dlen=0
091211 04:13:27 15661 server.1656:[log in to unmask]:1094 do_Status: suspend
091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
091211 04:13:27 15661 XrdLink: No RecvAll() data from c186.chtc.wisc.edu FD=24
091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
091211 04:13:27 15661 Remove_Node
server.1656:[log in to unmask]:1094 node 2.5
091211 04:13:27 15661 Protocol: server.7849:[log in to unmask] logged out.
091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll: FD 24
detached from poller 1; num=18
091211 04:14:14 15661 XrdSched: running drop node inq=0
091211 04:14:14 15661 XrdSched: running drop node inq=0
091211 04:14:14 15661 XrdSched: running drop node inq=0
091211 04:14:14 15661 XrdSched: running drop node inq=0
091211 04:14:14 15661 XrdSched: running drop node inq=0
091211 04:14:14 15661 XrdSched: running drop node inq=0
091211 04:14:14 15661 XrdSched: running drop node inq=0
091211 04:14:14 15661 XrdSched: running drop node inq=0
091211 04:14:14 15661 XrdSched: running drop node inq=0
091211 04:14:14 15661 XrdSched: running drop node inq=0
091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
091211 04:14:24 15661 XrdSched: running drop node inq=1
091211 04:14:24 15661 XrdSched: running drop node inq=1
091211 04:14:24 15661 Drop_Node 63.66 cancelled.
091211 04:14:24 15661 XrdSched: running drop node inq=1
091211 04:14:24 15661 Drop_Node 63.68 cancelled.
091211 04:14:24 15661 XrdSched: running drop node inq=1
091211 04:14:24 15661 Drop_Node 63.69 cancelled.
091211 04:14:24 15661 Drop_Node 63.67 cancelled.
091211 04:14:24 15661 XrdSched: running drop node inq=1
091211 04:14:24 15661 Drop_Node 63.70 cancelled.
091211 04:14:24 15661 XrdSched: running drop node inq=1
091211 04:14:24 15661 Drop_Node 63.71 cancelled.
091211 04:14:24 15661 XrdSched: running drop node inq=1
091211 04:14:24 15661 Drop_Node 63.72 cancelled.
091211 04:14:24 15661 XrdSched: running drop node inq=1
091211 04:14:24 15661 Drop_Node 63.73 cancelled.
091211 04:14:24 15661 XrdSched: running drop node inq=1
091211 04:14:24 15661 Drop_Node 63.74 cancelled.
091211 04:14:24 15661 XrdSched: running drop node inq=1
091211 04:14:24 15661 Drop_Node 63.75 cancelled.
091211 04:14:24 15661 XrdSched: running drop node inq=1
091211 04:14:24 15661 Drop_Node 63.76 cancelled.
091211 04:14:24 15661 XrdSched: Now have 68 workers
091211 04:14:24 15661 XrdSched: running drop node inq=0
091211 04:14:24 15661 Drop_Node 63.77 cancelled.
091211 04:14:24 15661 XrdSched: running drop node inq=0

Wen


On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky
<[log in to unmask]> wrote:
> Hi Wen,
>
> To go past 64 data servers you will need to set up one or more supervisors.
> This does not logically change the current configuration you have. You only
> need to configure one or more *new* servers (or at least xrootd processes)
> whose role is supervisor. We'd like them to run on separate machines for
> reliability purposes, but they could run on the manager node as long as you
> give each one a unique instance name (i.e., -n option).
>
> The front part of the cmsd reference explains how to do this.
>
> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>
> Andy
>
> On Fri, 11 Dec 2009, wen guan wrote:
>
>> Hi Andrew,
>>
>>   Is there any way to configure xrootd with more than 65
>> machines? I used the configuration below but it doesn't work.  Should I
>> configure some machines to be supervisors?
>>
>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>
>>
>> Wen
>>
>
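
(Regarding Andy's -n suggestion above: if a supervisor instance is
started on the manager host with its own instance name, e.g. "-n super1",
the role could, as I understand it, be selected in the same config file
with a "named" clause. "super1" is just a placeholder instance name, not
something from Andy's mail:)

if atlas-bkp2* named super1
   all.role supervisor
else if atlas-bkp2*
   all.role manager
else
   all.role server
fi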