Hi Wen,

Could you start everything up and send me pointers to the manager log 
file, the supervisor log file, and one data server log file, all covering 
the same time frame (from startup to some point where you think things 
are or are not working)? That way I can see what is happening. At the 
moment I see only two "bad" things in the config file:

1) Only atlas-bkp1.cs.wisc.edu is designated as a manager, yet the 
all.manager directives claim there are three (bkp2 and bkp3 as well). 
While it should still work, the log file will be dense with error 
messages. Please make this consistent so that real errors are easier to 
see.
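
For illustration, a consistent three-manager setup might look something 
like the sketch below (the port and the if/fi form are placeholders of 
mine; check them against your actual config and the cms reference):

   # every node lists the same set of managers
   all.manager atlas-bkp1.cs.wisc.edu 3121
   all.manager atlas-bkp2.cs.wisc.edu 3121
   all.manager atlas-bkp3.cs.wisc.edu 3121

   # and each of bkp1/bkp2/bkp3 must then actually take the manager role
   if atlas-bkp1.cs.wisc.edu atlas-bkp2.cs.wisc.edu atlas-bkp3.cs.wisc.edu
      all.role manager
   fi

Alternatively, if atlas-bkp1 is meant to be the only manager, keep just 
its all.manager line and drop the other two.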

2) Please use cms.space, not olb.space (for historical reasons the latter 
is still accepted and overrides the former, but that will soon end), and 
please use only one of them (the config file currently has both 
directives).
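
For example (the parameter values below are purely illustrative; carry 
over whatever parameters your current olb.space line has):

   # olb.space min 2% 5%    <- remove: deprecated form, overrides cms.space
   cms.space min 2% 5%      # keep a single cms.space directive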

xrootd has an internal mechanism for connecting data servers to 
supervisors that provides maximum reliability. You cannot change that 
algorithm, and there is no need to do so. You should *never* tell any 
node to connect directly to a supervisor; if you do, you will likely end 
up with unreachable nodes.
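
In config terms, a minimal sketch looks something like this (the single 
manager and the port shown are placeholders): both supervisors and data 
servers subscribe only to the manager, and the manager decides which 
data servers get handed off to which supervisor.

   # on the supervisor node(s)
   all.role supervisor
   all.manager atlas-bkp1.cs.wisc.edu 3121

   # on every data server -- the same manager line; no supervisor is named
   all.role server
   all.manager atlas-bkp1.cs.wisc.edu 3121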

As for the dropped data servers: given the flurry of such activity, it 
would appear that something either crashed or was restarted. That is why 
it would be good to see the complete log from each of the entities.

Andy

On Fri, 11 Dec 2009, wen guan wrote:

> Hi Andrew,
>
>     I read the document and wrote a config
> file (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>     Using my config, I can see the manager dispatching messages to the
> supervisor, but I cannot see any data server trying to connect to the
> supervisor. At the same time, in the manager's log, I can see that some
> data servers are dropped.
>     How does xrootd decide which data servers will connect to the
> supervisor? Should I specify some data servers to connect to the supervisor?
>
>
> (*) supervisor log
> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State:
> /atlas/xrootd/users/wguan/test/test131141
> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find
> failed for state /atlas/xrootd/users/wguan/test/test131141
>
> (*) manager log
> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1
> FSpace=5693644MB MinFR=57218MB Util=0
> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094
> do_Space: 5696231MB free; 0% util
> 091211 04:13:24 15661 Protocol:
> server.10585:[log in to unmask]:1094 logged in.
> 091211 04:13:24 001 XrdInet: Accepted connection from [log in to unmask]
> 091211 04:13:24 15661 XrdSched: running ?:[log in to unmask] inq=0
> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79 attached
> to poller 2; num=22
> 091211 04:13:24 15661 Add server.21739:[log in to unmask] bumps
> server.15905:[log in to unmask]:1094 #63
> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:24 15661 Drop_Node:
> server.15905:[log in to unmask]:1094 dropped.
> 091211 04:13:24 15661 Add Shoved
> server.21739:[log in to unmask]:1094 to cluster; id=63.78; num=64;
> min=51
> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB NumFS=1
> FSpace=5721854MB MinFR=57218MB Util=0
> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w /atlas
> 091211 04:13:24 15661 server.21739:[log in to unmask]:1094
> do_Space: 5721854MB free; 0% util
> 091211 04:13:24 15661 Protocol:
> server.21739:[log in to unmask]:1094 logged in.
> 091211 04:13:24 15661 XrdLink: Unable to recieve from
> c187.chtc.wisc.edu; connection reset by peer
> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60 seconds
> 091211 04:13:24 15661 Remove_Node
> server.21739:[log in to unmask]:1094 node 63.78
> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask] logged out.
> 091211 04:13:24 15661 server.21739:[log in to unmask] XrdPoll: FD
> 79 detached from poller 2; num=21
> 091211 04:13:27 15661 Dispatch server.24718:[log in to unmask]:1094
> for status dlen=0
> 091211 04:13:27 15661 server.24718:[log in to unmask]:1094 do_Status: suspend
> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended
> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c177.chtc.wisc.edu FD=16
> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
> 091211 04:13:27 15661 Remove_Node
> server.24718:[log in to unmask]:1094 node 0.3
> 091211 04:13:27 15661 Protocol: server.21656:[log in to unmask] logged out.
> 091211 04:13:27 15661 server.21656:[log in to unmask] XrdPoll: FD
> 16 detached from poller 2; num=20
> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c179.chtc.wisc.edu FD=21
> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:27 15661 Remove_Node
> server.17065:[log in to unmask]:1094 node 1.4
> 091211 04:13:27 15661 Protocol: server.7978:[log in to unmask] logged out.
> 091211 04:13:27 15661 server.7978:[log in to unmask] XrdPoll: FD 21
> detached from poller 1; num=21
> 091211 04:13:27 15661 State: Status changed to suspended
> 091211 04:13:27 15661 Send status to redirector.15656:14@atlas-bkp2
> 091211 04:13:27 15661 Dispatch server.12937:[log in to unmask]:1094
> for status dlen=0
> 091211 04:13:27 15661 server.12937:[log in to unmask]:1094 do_Status: suspend
> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended
> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c182.chtc.wisc.edu FD=19
> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
> 091211 04:13:27 15661 Remove_Node
> server.12937:[log in to unmask]:1094 node 7.10
> 091211 04:13:27 15661 Protocol: server.26620:[log in to unmask] logged out.
> 091211 04:13:27 15661 server.26620:[log in to unmask] XrdPoll: FD
> 19 detached from poller 2; num=19
> 091211 04:13:27 15661 Dispatch server.10842:[log in to unmask]:1094
> for status dlen=0
> 091211 04:13:27 15661 server.10842:[log in to unmask]:1094 do_Status: suspend
> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service suspended
> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c178.chtc.wisc.edu FD=15
> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
> 091211 04:13:27 15661 Remove_Node
> server.10842:[log in to unmask]:1094 node 9.12
> 091211 04:13:27 15661 Protocol: server.11901:[log in to unmask] logged out.
> 091211 04:13:27 15661 server.11901:[log in to unmask] XrdPoll: FD
> 15 detached from poller 1; num=20
> 091211 04:13:27 15661 Dispatch server.5535:[log in to unmask]:1094
> for status dlen=0
> 091211 04:13:27 15661 server.5535:[log in to unmask]:1094 do_Status: suspend
> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service suspended
> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c181.chtc.wisc.edu FD=17
> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
> 091211 04:13:27 15661 Remove_Node
> server.5535:[log in to unmask]:1094 node 5.8
> 091211 04:13:27 15661 Protocol: server.13984:[log in to unmask] logged out.
> 091211 04:13:27 15661 server.13984:[log in to unmask] XrdPoll: FD
> 17 detached from poller 0; num=21
> 091211 04:13:27 15661 Dispatch server.23711:[log in to unmask]:1094
> for status dlen=0
> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094 do_Status: suspend
> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended
> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c183.chtc.wisc.edu FD=22
> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
> 091211 04:13:27 15661 Remove_Node
> server.23711:[log in to unmask]:1094 node 8.11
> 091211 04:13:27 15661 Protocol: server.27735:[log in to unmask] logged out.
> 091211 04:13:27 15661 server.27735:[log in to unmask] XrdPoll: FD
> 22 detached from poller 2; num=18
> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c184.chtc.wisc.edu FD=20
> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:27 15661 Remove_Node
> server.4131:[log in to unmask]:1094 node 3.6
> 091211 04:13:27 15661 Protocol: server.26787:[log in to unmask] logged out.
> 091211 04:13:27 15661 server.26787:[log in to unmask] XrdPoll: FD
> 20 detached from poller 0; num=20
> 091211 04:13:27 15661 Dispatch server.10585:[log in to unmask]:1094
> for status dlen=0
> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094 do_Status: suspend
> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended
> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c185.chtc.wisc.edu FD=23
> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
> 091211 04:13:27 15661 Remove_Node
> server.10585:[log in to unmask]:1094 node 6.9
> 091211 04:13:27 15661 Protocol: server.8524:[log in to unmask] logged out.
> 091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll: FD 23
> detached from poller 0; num=19
> 091211 04:13:27 15661 Dispatch server.20264:[log in to unmask]:1094
> for status dlen=0
> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094 do_Status: suspend
> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended
> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c180.chtc.wisc.edu FD=18
> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
> 091211 04:13:27 15661 Remove_Node
> server.20264:[log in to unmask]:1094 node 4.7
> 091211 04:13:27 15661 Protocol: server.14636:[log in to unmask] logged out.
> 091211 04:13:27 15661 server.14636:[log in to unmask] XrdPoll: FD
> 18 detached from poller 1; num=19
> 091211 04:13:27 15661 Dispatch server.1656:[log in to unmask]:1094
> for status dlen=0
> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094 do_Status: suspend
> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c186.chtc.wisc.edu FD=24
> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
> 091211 04:13:27 15661 Remove_Node
> server.1656:[log in to unmask]:1094 node 2.5
> 091211 04:13:27 15661 Protocol: server.7849:[log in to unmask] logged out.
> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll: FD 24
> detached from poller 1; num=18
> 091211 04:14:14 15661 XrdSched: running drop node inq=0
> 091211 04:14:14 15661 XrdSched: running drop node inq=0
> 091211 04:14:14 15661 XrdSched: running drop node inq=0
> 091211 04:14:14 15661 XrdSched: running drop node inq=0
> 091211 04:14:14 15661 XrdSched: running drop node inq=0
> 091211 04:14:14 15661 XrdSched: running drop node inq=0
> 091211 04:14:14 15661 XrdSched: running drop node inq=0
> 091211 04:14:14 15661 XrdSched: running drop node inq=0
> 091211 04:14:14 15661 XrdSched: running drop node inq=0
> 091211 04:14:14 15661 XrdSched: running drop node inq=0
> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
> 091211 04:14:24 15661 XrdSched: Now have 68 workers
> 091211 04:14:24 15661 XrdSched: running drop node inq=0
> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>
> Wen
>
>
> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky
> <[log in to unmask]> wrote:
>> Hi Wen,
>>
>> To go past 64 data servers you will need to set up one or more supervisors.
>> This does not logically change the current configuration you have. You only
>> need to configure one or more *new* servers (or at least xrootd processes)
>> whose role is supervisor. We'd like them to run on separate machines for
>> reliability purposes, but they could run on the manager node as long as you
>> give each one a unique instance name (i.e., the -n option).
>>
>> The front part of the cmsd reference explains how to do this.
>>
>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>
>> Andy
>>
>> On Fri, 11 Dec 2009, wen guan wrote:
>>
>>> Hi Andrew,
>>>
>>>   Is there any way to configure xrootd with more than 65
>>> machines? I used the configuration below but it doesn't work. Should I
>>> configure some machines' role to be supervisor?
>>>
>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>
>>>
>>> Wen
>>>
>>
>