Hi Wen,

I found the problem. It looks like a regression from way back when: there is 
a missing flag on the redirect. This will require a patched cmsd, but you 
only need to replace the redirector's cmsd, as this only affects the 
redirector. How would you like to proceed?

Andy

On Sat, 12 Dec 2009, wen guan wrote:

> Hi Andrew,
>
>      It doesn't work. The atlas-bkp1 manager is still dropping nodes.
> In the supervisor, I still haven't seen any data server registered. I said
> "I updated the ntp" because you said "the log timestamps do not
> overlap".
>
> Wen
>
> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky
> <[log in to unmask]> wrote:
>> Hi Wen,
>>
>> Do you mean that everything is now working? It could be that you removed the
>> xrd.timeout directive. That really could cause problems. As for the delays,
>> that is normal when the redirector thinks something is going wrong. The
>> strategy is to delay clients until it can get back to a stable
>> configuration. This usually prevents jobs from crashing during stressful
>> periods.
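>>
>> If a connection timeout is genuinely needed, the directive takes a form
>> roughly like the sketch below (a minimal sketch only; check the Xrd/xrootd
>> configuration reference for the exact options, and the value shown is just
>> a placeholder, not a recommendation):
>>
>>    # hypothetical example: time out links that stay idle for an hour
>>    xrd.timeout idle 3600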
>>
>> Andy
>>
>> On Sat, 12 Dec 2009, wen guan wrote:
>>
>>> Hi  Andrew,
>>>
>>>   I restarted it to do the supervisor test, and also because the xrootd
>>> manager frequently doesn't respond. (*) is the cms.log, where the file
>>> select is delayed again and again. After a restart, everything is fine.
>>> Now I am trying to find a clue about it.
>>>
>>> (*)
>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc
>>>
>>> /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>> 091212 00:00:19 21318 Select seeking
>>>
>>> /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>> 091212 00:00:19 21318 UnkFile rc=1
>>>
>>> path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select:
>>> delay 5
>>> /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for
>>> select dlen=166
>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>
>>>
>>> There is no core file. I copied new copies of the logs to the link
>>> below.
>>> http://higgs03.cs.wisc.edu/wguan/
>>>
>>> Wen
>>>
>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky
>>> <[log in to unmask]> wrote:
>>>>
>>>> Hi Wen,
>>>>
>>>> I see in the server log that it is restarting often. Could you take a
>>>> look on c193 to see if you have any core files? Also please make sure
>>>> that core files are enabled, as Linux defaults the size to 0. The first
>>>> step here is to find out why your servers are restarting.
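>>>>
>>>> A minimal check could look like the commands below (standard Linux
>>>> commands; the search paths are only guesses and will depend on where the
>>>> daemons actually run):
>>>>
>>>>    # in the shell or init script that starts xrootd/cmsd
>>>>    ulimit -c unlimited       # allow core files to be written
>>>>    ulimit -c                 # verify; should print "unlimited"
>>>>    # then look for cores under the daemons' working directories
>>>>    find /var/spool/xrootd /tmp -name 'core*' 2>/dev/null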
>>>>
>>>> Andy
>>>>
>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>>  The logs can be found here. From the log you can see that the
>>>>> atlas-bkp1 manager is dropping, again and again, the nodes that try to
>>>>> connect to it.
>>>>>  http://higgs03.cs.wisc.edu/wguan/
>>>>>
>>>>>
>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
>>>>> <[log in to unmask]> wrote:
>>>>>>
>>>>>> Hi Wen, Could you start everything up and provide me a pointer to the
>>>>>> manager log file, supervisor log file, and one data server log file,
>>>>>> all of which cover the same time-frame (from start to some point where
>>>>>> you think things are working or not)? That way I can see what is
>>>>>> happening. At the moment I only see two "bad" things in the config file:
>>>>>>
>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager, but you
>>>>>> claim, via the all.manager directive, that there are three (bkp2 and
>>>>>> bkp3 as well). While it should work, the log file will be dense with
>>>>>> error messages. Please correct this to be consistent and make it easier
>>>>>> to see real errors.
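>>>>>>
>>>>>> For illustration only, a consistent layout might look roughly like this
>>>>>> (the host names are the ones from this thread; the port and the server
>>>>>> host pattern are placeholders; see the cms_config reference for exact
>>>>>> syntax):
>>>>>>
>>>>>>    # declare every redirector, one all.manager line per host
>>>>>>    all.manager atlas-bkp1.cs.wisc.edu:<port>
>>>>>>    all.manager atlas-bkp2.cs.wisc.edu:<port>
>>>>>>    all.manager atlas-bkp3.cs.wisc.edu:<port>
>>>>>>    # and give each of those hosts the manager role
>>>>>>    all.role manager if atlas-bkp1.cs.wisc.edu
>>>>>>    all.role manager if atlas-bkp2.cs.wisc.edu
>>>>>>    all.role manager if atlas-bkp3.cs.wisc.edu
>>>>>>    # everything else is a plain data server
>>>>>>    all.role server if <data-server-host-pattern>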
>>>>>
>>>>> This is not a problem for me, because this config is used on the
>>>>> data servers. On the manager, I change the "if atlas-bkp1.cs.wisc.edu"
>>>>> to atlas-bkp2 and so on. This is a historical leftover: at first only
>>>>> atlas-bkp1 was used; atlas-bkp2 and atlas-bkp3 were added later.
>>>>>
>>>>>> 2) Please use cms.space not olb.space (for historical reasons the
>>>>>> latter is still accepted and overrides the former, but that will soon
>>>>>> end), and please use only one (the config file uses both directives).
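>>>>>>
>>>>>> For example, keeping only the newer form might look like this (the
>>>>>> thresholds are placeholders, not a recommendation; check the cms_config
>>>>>> reference for the full option list):
>>>>>>
>>>>>>    # drop the olb.space line entirely and keep a single cms.space line
>>>>>>    cms.space min 2% 5%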
>>>>>
>>>>> Yes, I should remove this line. In fact cms.space is in the cfg too.
>>>>>
>>>>>
>>>>> Thanks
>>>>> Wen
>>>>>
>>>>>> The xrootd has an internal mechanism to connect servers with
>>>>>> supervisors to allow for maximum reliability. You cannot change that
>>>>>> algorithm and there is no need to do so. You should *never* tell anyone
>>>>>> to directly connect to a supervisor. If you do, you will likely get
>>>>>> unreachable nodes.
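>>>>>>
>>>>>> In practice this means every node's config only ever names the
>>>>>> redirectors; there is no directive for pinning a data server to a
>>>>>> particular supervisor. Roughly (the port is a placeholder):
>>>>>>
>>>>>>    # on data servers and supervisors alike: subscribe only to the managers
>>>>>>    all.manager atlas-bkp1.cs.wisc.edu:<port>
>>>>>>    # never list a supervisor host here; the cmsd pairs servers with
>>>>>>    # supervisors automatically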
>>>>>>
>>>>>> As for dropping data servers, it would appear to me, given the flurry
>>>>>> of such activity, that something either crashed or was restarted.
>>>>>> That's why it would be good to see the complete log of each one of the
>>>>>> entities.
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>>    I read the document and wrote a config
>>>>>>> file (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>    With my conf, I can see the manager dispatching messages to the
>>>>>>> supervisor, but I cannot see any data server try to connect to the
>>>>>>> supervisor. At the same time, in the manager's log, I can see that
>>>>>>> some data servers are dropped.
>>>>>>>   How does xrootd decide which data servers will connect to a
>>>>>>> supervisor? Should I specify some data servers to connect to the
>>>>>>> supervisor?
>>>>>>>
>>>>>>>
>>>>>>> (*) supervisor log
>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state
>>>>>>> dlen=42
>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State:
>>>>>>> /atlas/xrootd/users/wguan/test/test131141
>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find
>>>>>>> failed for state /atlas/xrootd/users/wguan/test/test131141
>>>>>>>
>>>>>>> (*)manager log
>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1
>>>>>>> FSpace=5693644MB MinFR=57218MB Util=0
>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
>>>>>>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094
>>>>>>> do_Space: 5696231MB free; 0% util
>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>> server.10585:[log in to unmask]:1094 logged in.
>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from
>>>>>>> [log in to unmask]
>>>>>>> 091211 04:13:24 15661 XrdSched: running ?:[log in to unmask] inq=0
>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79 attached
>>>>>>> to poller 2; num=22
>>>>>>> 091211 04:13:24 15661 Add server.21739:[log in to unmask] bumps
>>>>>>> server.15905:[log in to unmask]:1094 #63
>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>> 091211 04:13:24 15661 Drop_Node:
>>>>>>> server.15905:[log in to unmask]:1094 dropped.
>>>>>>> 091211 04:13:24 15661 Add Shoved
>>>>>>> server.21739:[log in to unmask]:1094 to cluster; id=63.78; num=64;
>>>>>>> min=51
>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB NumFS=1
>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0
>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w /atlas
>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask]:1094
>>>>>>> do_Space: 5721854MB free; 0% util
>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>> server.21739:[log in to unmask]:1094 logged in.
>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from
>>>>>>> c187.chtc.wisc.edu; connection reset by peer
>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60 seconds
>>>>>>> 091211 04:13:24 15661 Remove_Node
>>>>>>> server.21739:[log in to unmask]:1094 node 63.78
>>>>>>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask]
>>>>>>> logged
>>>>>>> out.
>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask] XrdPoll: FD
>>>>>>> 79 detached from poller 2; num=21
>>>>>>> 091211 04:13:27 15661 Dispatch server.24718:[log in to unmask]:1094
>>>>>>> for status dlen=0
>>>>>>> 091211 04:13:27 15661 server.24718:[log in to unmask]:1094
>>>>>>> do_Status:
>>>>>>> suspend
>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended
>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>> c177.chtc.wisc.edu
>>>>>>> FD=16
>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>> server.24718:[log in to unmask]:1094 node 0.3
>>>>>>> 091211 04:13:27 15661 Protocol: server.21656:[log in to unmask]
>>>>>>> logged
>>>>>>> out.
>>>>>>> 091211 04:13:27 15661 server.21656:[log in to unmask] XrdPoll: FD
>>>>>>> 16 detached from poller 2; num=20
>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>> c179.chtc.wisc.edu
>>>>>>> FD=21
>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>> server.17065:[log in to unmask]:1094 node 1.4
>>>>>>> 091211 04:13:27 15661 Protocol: server.7978:[log in to unmask]
>>>>>>> logged
>>>>>>> out.
>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] XrdPoll: FD 21
>>>>>>> detached from poller 1; num=21
>>>>>>> 091211 04:13:27 15661 State: Status changed to suspended
>>>>>>> 091211 04:13:27 15661 Send status to redirector.15656:14@atlas-bkp2
>>>>>>> 091211 04:13:27 15661 Dispatch server.12937:[log in to unmask]:1094
>>>>>>> for status dlen=0
>>>>>>> 091211 04:13:27 15661 server.12937:[log in to unmask]:1094
>>>>>>> do_Status:
>>>>>>> suspend
>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended
>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>> c182.chtc.wisc.edu
>>>>>>> FD=19
>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>> server.12937:[log in to unmask]:1094 node 7.10
>>>>>>> 091211 04:13:27 15661 Protocol: server.26620:[log in to unmask]
>>>>>>> logged
>>>>>>> out.
>>>>>>> 091211 04:13:27 15661 server.26620:[log in to unmask] XrdPoll: FD
>>>>>>> 19 detached from poller 2; num=19
>>>>>>> 091211 04:13:27 15661 Dispatch server.10842:[log in to unmask]:1094
>>>>>>> for status dlen=0
>>>>>>> 091211 04:13:27 15661 server.10842:[log in to unmask]:1094
>>>>>>> do_Status:
>>>>>>> suspend
>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service suspended
>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>> c178.chtc.wisc.edu
>>>>>>> FD=15
>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>> server.10842:[log in to unmask]:1094 node 9.12
>>>>>>> 091211 04:13:27 15661 Protocol: server.11901:[log in to unmask]
>>>>>>> logged
>>>>>>> out.
>>>>>>> 091211 04:13:27 15661 server.11901:[log in to unmask] XrdPoll: FD
>>>>>>> 15 detached from poller 1; num=20
>>>>>>> 091211 04:13:27 15661 Dispatch server.5535:[log in to unmask]:1094
>>>>>>> for status dlen=0
>>>>>>> 091211 04:13:27 15661 server.5535:[log in to unmask]:1094
>>>>>>> do_Status:
>>>>>>> suspend
>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service suspended
>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>> c181.chtc.wisc.edu
>>>>>>> FD=17
>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>> server.5535:[log in to unmask]:1094 node 5.8
>>>>>>> 091211 04:13:27 15661 Protocol: server.13984:[log in to unmask]
>>>>>>> logged
>>>>>>> out.
>>>>>>> 091211 04:13:27 15661 server.13984:[log in to unmask] XrdPoll: FD
>>>>>>> 17 detached from poller 0; num=21
>>>>>>> 091211 04:13:27 15661 Dispatch server.23711:[log in to unmask]:1094
>>>>>>> for status dlen=0
>>>>>>> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094
>>>>>>> do_Status:
>>>>>>> suspend
>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended
>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>> c183.chtc.wisc.edu
>>>>>>> FD=22
>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>> server.23711:[log in to unmask]:1094 node 8.11
>>>>>>> 091211 04:13:27 15661 Protocol: server.27735:[log in to unmask]
>>>>>>> logged
>>>>>>> out.
>>>>>>> 091211 04:13:27 15661 server.27735:[log in to unmask] XrdPoll: FD
>>>>>>> 22 detached from poller 2; num=18
>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>> c184.chtc.wisc.edu
>>>>>>> FD=20
>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>> server.4131:[log in to unmask]:1094 node 3.6
>>>>>>> 091211 04:13:27 15661 Protocol: server.26787:[log in to unmask]
>>>>>>> logged
>>>>>>> out.
>>>>>>> 091211 04:13:27 15661 server.26787:[log in to unmask] XrdPoll: FD
>>>>>>> 20 detached from poller 0; num=20
>>>>>>> 091211 04:13:27 15661 Dispatch server.10585:[log in to unmask]:1094
>>>>>>> for status dlen=0
>>>>>>> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094
>>>>>>> do_Status:
>>>>>>> suspend
>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended
>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>> c185.chtc.wisc.edu
>>>>>>> FD=23
>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>> server.10585:[log in to unmask]:1094 node 6.9
>>>>>>> 091211 04:13:27 15661 Protocol: server.8524:[log in to unmask]
>>>>>>> logged
>>>>>>> out.
>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll: FD 23
>>>>>>> detached from poller 0; num=19
>>>>>>> 091211 04:13:27 15661 Dispatch server.20264:[log in to unmask]:1094
>>>>>>> for status dlen=0
>>>>>>> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094
>>>>>>> do_Status:
>>>>>>> suspend
>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended
>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>> c180.chtc.wisc.edu
>>>>>>> FD=18
>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>> server.20264:[log in to unmask]:1094 node 4.7
>>>>>>> 091211 04:13:27 15661 Protocol: server.14636:[log in to unmask]
>>>>>>> logged
>>>>>>> out.
>>>>>>> 091211 04:13:27 15661 server.14636:[log in to unmask] XrdPoll: FD
>>>>>>> 18 detached from poller 1; num=19
>>>>>>> 091211 04:13:27 15661 Dispatch server.1656:[log in to unmask]:1094
>>>>>>> for status dlen=0
>>>>>>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094
>>>>>>> do_Status:
>>>>>>> suspend
>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>> c186.chtc.wisc.edu
>>>>>>> FD=24
>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>> server.1656:[log in to unmask]:1094 node 2.5
>>>>>>> 091211 04:13:27 15661 Protocol: server.7849:[log in to unmask]
>>>>>>> logged
>>>>>>> out.
>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll: FD 24
>>>>>>> detached from poller 1; num=18
>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>
>>>>>>> Wen
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky
>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>
>>>>>>>> Hi Wen,
>>>>>>>>
>>>>>>>> To go past 64 data servers you will need to set up one or more
>>>>>>>> supervisors. This does not logically change the current configuration
>>>>>>>> you have. You only need to configure one or more *new* servers (or at
>>>>>>>> least xrootd processes) whose role is supervisor. We'd like them to
>>>>>>>> run on separate machines for reliability purposes, but they could run
>>>>>>>> on the manager node as long as you give each one a unique instance
>>>>>>>> name (i.e., the -n option).
>>>>>>>>
>>>>>>>> The front part of the cmsd reference explains how to do this.
>>>>>>>>
>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
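>>>>>>>>
>>>>>>>> As a rough sketch (the supervisor host and instance name below are
>>>>>>>> placeholders, not from this thread; the document above has the
>>>>>>>> authoritative syntax), the additions amount to something like:
>>>>>>>>
>>>>>>>>    # in the shared config: give the new node (or extra process) the
>>>>>>>>    # supervisor role
>>>>>>>>    all.role supervisor if <supervisor-host>
>>>>>>>>
>>>>>>>>    # if it runs on a manager node, start it as a separately named
>>>>>>>>    # instance so it does not clash with the manager's own daemons
>>>>>>>>    cmsd   -n super -c /path/to/xrdcluster.cfg &
>>>>>>>>    xrootd -n super -c /path/to/xrdcluster.cfg &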
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>
>>>>>>>>> Hi Andrew,
>>>>>>>>>
>>>>>>>>>   Is there any way to configure xrootd with more than 65
>>>>>>>>> machines? I used the configuration below but it doesn't work. Should
>>>>>>>>> I configure some machines' manager to be a supervisor?
>>>>>>>>>
>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Wen
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
>