Hi Andrew,


    Thanks.
    I installed the new cmsd on the atlas-bkp1 manager, but it is still
dropping nodes, and in the supervisor's log I cannot find any data server
registering with it.

    The new logs are at http://higgs03.cs.wisc.edu/wguan/*.20091213.
    The manager was patched at 091213 08:38:15.

Wen

On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky
<[log in to unmask]> wrote:
> Hi Wen
>
> You will find the source replacement at:
>
> http://www.slac.stanford.edu/~abh/cmsd/
>
> It's XrdCmsCluster.cc and it replaces xrootd/src/XrdCms/XrdCmsCluster.cc
>
> I'm stepping out for a couple of hours but will be back to see how things
> went. Sorry for the issues :-(
>
> Andy
>
> On Sun, 13 Dec 2009, wen guan wrote:
>
>> Hi Andrew,
>>
>>      I prefer a source replacement.  Then I can compile it.
>>
>> Thanks
>> Wen
>>>
>>> I can do one of two things here:
>>>
>>> 1) Supply a source replacement and then you would recompile, or
>>>
>>> 2) Give me the uname -a of where the cmsd will run and I'll supply a
>>> binary replacement for you.
>>>
>>> Your choice.
>>>
>>> Andy
>>>
>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>
>>>> Hi Andrew
>>>>
>>>> The problem is found. Great. Thanks.
>>>>
>>>> Where can I find the patched cmsd?
>>>>
>>>> Wen
>>>>
>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky
>>>> <[log in to unmask]> wrote:
>>>>>
>>>>> Hi Wen,
>>>>>
>>>>> I found the problem. Looks like a regression from way back when.
>>>>> There is a missing flag on the redirect. This will require a patched
>>>>> cmsd but you need only to replace the redirector's cmsd as this only
>>>>> affects the redirector. How would you like to proceed?
>>>>>
>>>>> Andy
>>>>>
>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>>     It doesn't work; the atlas-bkp1 manager is still dropping nodes
>>>>>> again. In the supervisor, I still haven't seen any data server
>>>>>> registered. I said "I updated the ntp" because you said the log
>>>>>> timestamps do not overlap.
>>>>>>
>>>>>> Wen
>>>>>>
>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky
>>>>>> <[log in to unmask]> wrote:
>>>>>>>
>>>>>>> Hi Wen,
>>>>>>>
>>>>>>> Do you mean that everything is now working? It could be that you
>>>>>>> removed the xrd.timeout directive. That really could cause problems.
>>>>>>> As for the delays, that is normal when the redirector thinks something
>>>>>>> is going wrong. The strategy is to delay clients until it can get back
>>>>>>> to a stable configuration. This usually prevents jobs from crashing
>>>>>>> during stressful periods.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>
>>>>>>>> Hi  Andrew,
>>>>>>>>
>>>>>>>>   I restarted it to run the supervisor test, and also because the
>>>>>>>> xrootd manager frequently does not respond. (*) below is the cms.log;
>>>>>>>> the file select is delayed again and again. After a restart,
>>>>>>>> everything is fine. Now I am trying to find a clue about it.
>>>>>>>>
>>>>>>>> (*)
>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>>>
>>>>>>>>
>>>>>>>> There is no core file. I copied new copies of the logs to the link
>>>>>>>> below:
>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>
>>>>>>>> Wen
>>>>>>>>
>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky
>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>
>>>>>>>>> Hi Wen,
>>>>>>>>>
>>>>>>>>> I see in the server log that it is restarting often. Could you take
>>>>>>>>> a look on c193 to see if you have any core files? Also please make
>>>>>>>>> sure that core files are enabled, as Linux defaults the size to 0.
>>>>>>>>> The first step here is to find out why your servers are restarting.
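>>>>>>>>>
>>>>>>>>> For example (bash syntax, assuming the daemons are started from that
>>>>>>>>> same shell), you can raise the limit before starting them:
>>>>>>>>>
>>>>>>>>>    ulimit -c unlimited   # allow core files of any size
>>>>>>>>>    ulimit -c             # verify; should print "unlimited"
>>>>>>>>>
>>>>>>>>> A crash can then leave a core file in the daemon's working directory.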
>>>>>>>>>
>>>>>>>>> Andy
>>>>>>>>>
>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>
>>>>>>>>>> Hi Andrew,
>>>>>>>>>>
>>>>>>>>>>  The logs can be found here. From the logs you can see that the
>>>>>>>>>> atlas-bkp1 manager is dropping, again and again, the nodes that try
>>>>>>>>>> to connect to it:
>>>>>>>>>>  http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Wen, Could you start everything up and provide me a pointer to
>>>>>>>>>>> the manager log file, supervisor log file, and one data server
>>>>>>>>>>> logfile, all of which cover the same time-frame (from start to some
>>>>>>>>>>> point where you think things are working or not). That way I can
>>>>>>>>>>> see what is happening. At the moment I only see two "bad" things in
>>>>>>>>>>> the config file:
>>>>>>>>>>>
>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager but you
>>>>>>>>>>> claim, via the all.manager directive, that there are three (bkp2
>>>>>>>>>>> and bkp3). While it should work, the log file will be dense with
>>>>>>>>>>> error messages. Please correct this to be consistent and make it
>>>>>>>>>>> easier to see real errors.
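>>>>>>>>>>>
>>>>>>>>>>> A consistent layout could look roughly like the sketch below (the
>>>>>>>>>>> port numbers are placeholders, use whatever your cluster actually
>>>>>>>>>>> runs on):
>>>>>>>>>>>
>>>>>>>>>>>    all.manager atlas-bkp1.cs.wisc.edu:3121
>>>>>>>>>>>    all.manager atlas-bkp2.cs.wisc.edu:3121
>>>>>>>>>>>    all.manager atlas-bkp3.cs.wisc.edu:3121
>>>>>>>>>>>    if atlas-bkp1.cs.wisc.edu atlas-bkp2.cs.wisc.edu atlas-bkp3.cs.wisc.edu
>>>>>>>>>>>       all.role manager
>>>>>>>>>>>    else
>>>>>>>>>>>       all.role server
>>>>>>>>>>>    fi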
>>>>>>>>>>
>>>>>>>>>> This is not a problem for me, because this config is used on the
>>>>>>>>>> data servers. On the managers, I changed the "if
>>>>>>>>>> atlas-bkp1.cs.wisc.edu" to atlas-bkp2 and so on. This is a
>>>>>>>>>> historical issue: at first only atlas-bkp1 was used; atlas-bkp2 and
>>>>>>>>>> atlas-bkp3 were added later.
>>>>>>>>>>
>>>>>>>>>>> 2) Please use cms.space not olb.space (for historical reasons the
>>>>>>>>>>> latter is still accepted and over-rides the former, but that will
>>>>>>>>>>> soon end), and please use only one (the config file uses both
>>>>>>>>>>> directives).
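>>>>>>>>>>>
>>>>>>>>>>> In other words, only the directive prefix changes; whatever options
>>>>>>>>>>> the olb.space line carries move over unchanged. For example (the
>>>>>>>>>>> values here are only an illustration, keep your own), a config that
>>>>>>>>>>> currently has
>>>>>>>>>>>
>>>>>>>>>>>    olb.space min 2% 5%
>>>>>>>>>>>
>>>>>>>>>>> would keep just the single line
>>>>>>>>>>>
>>>>>>>>>>>    cms.space min 2% 5%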
>>>>>>>>>>
>>>>>>>>>> Yes, I should remove this line. In fact cms.space is in the cfg
>>>>>>>>>> too.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Wen
>>>>>>>>>>
>>>>>>>>>>> The xrootd has an internal mechanism to connect servers with
>>>>>>>>>>> supervisors to allow for maximum reliability. You cannot change
>>>>>>>>>>> that algorithm and there is no need to do so. You should *never*
>>>>>>>>>>> tell anyone to directly connect to a supervisor. If you do, you
>>>>>>>>>>> will likely get unreachable nodes.
>>>>>>>>>>>
>>>>>>>>>>> As for dropping data servers, it would appear to me, given the
>>>>>>>>>>> flurry of such activity, that something either crashed or was
>>>>>>>>>>> restarted. That's why it would be good to see the complete log of
>>>>>>>>>>> each one of the entities.
>>>>>>>>>>>
>>>>>>>>>>> Andy
>>>>>>>>>>>
>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>
>>>>>>>>>>>>    I read the document and wrote a config file
>>>>>>>>>>>> (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>>>>>>    With my conf, I can see the manager dispatching messages to the
>>>>>>>>>>>> supervisor, but I cannot see any data server trying to connect to
>>>>>>>>>>>> the supervisor. At the same time, in the manager's log, I can see
>>>>>>>>>>>> that some data servers are Dropped.
>>>>>>>>>>>>    How does xrootd decide which data servers will connect to the
>>>>>>>>>>>> supervisor? Should I specify some data servers to connect to the
>>>>>>>>>>>> supervisor?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> (*) supervisor log
>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find failed for state /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>> (*)manager log
>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB
>>>>>>>>>>>> NumFS=1
>>>>>>>>>>>> FSpace=5693644MB MinFR=57218MB Util=0
>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w
>>>>>>>>>>>> /atlas
>>>>>>>>>>>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094
>>>>>>>>>>>> do_Space: 5696231MB free; 0% util
>>>>>>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>>>>>>> server.10585:[log in to unmask]:1094 logged in.
>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from
>>>>>>>>>>>> [log in to unmask]
>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running ?:[log in to unmask]
>>>>>>>>>>>> inq=0
>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79
>>>>>>>>>>>> attached
>>>>>>>>>>>> to poller 2; num=22
>>>>>>>>>>>> 091211 04:13:24 15661 Add server.21739:[log in to unmask]
>>>>>>>>>>>> bumps
>>>>>>>>>>>> server.15905:[log in to unmask]:1094 #63
>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node:
>>>>>>>>>>>> server.15905:[log in to unmask]:1094 dropped.
>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved
>>>>>>>>>>>> server.21739:[log in to unmask]:1094 to cluster; id=63.78;
>>>>>>>>>>>> num=64;
>>>>>>>>>>>> min=51
>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB
>>>>>>>>>>>> NumFS=1
>>>>>>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0
>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w
>>>>>>>>>>>> /atlas
>>>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask]:1094
>>>>>>>>>>>> do_Space: 5721854MB free; 0% util
>>>>>>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>>>>>>> server.21739:[log in to unmask]:1094 logged in.
>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from
>>>>>>>>>>>> c187.chtc.wisc.edu; connection reset by peer
>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60
>>>>>>>>>>>> seconds
>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node
>>>>>>>>>>>> server.21739:[log in to unmask]:1094 node 63.78
>>>>>>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>>>>>>> server.21739:[log in to unmask]
>>>>>>>>>>>> logged
>>>>>>>>>>>> out.
>>>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask]
>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>> FD
>>>>>>>>>>>> 79 detached from poller 2; num=21
>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>> server.24718:[log in to unmask]:1094
>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>> 091211 04:13:27 15661 server.24718:[log in to unmask]:1094
>>>>>>>>>>>> do_Status:
>>>>>>>>>>>> suspend
>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended
>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>> c177.chtc.wisc.edu
>>>>>>>>>>>> FD=16
>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>> server.24718:[log in to unmask]:1094 node 0.3
>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>> server.21656:[log in to unmask]
>>>>>>>>>>>> logged
>>>>>>>>>>>> out.
>>>>>>>>>>>> 091211 04:13:27 15661 server.21656:[log in to unmask]
>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>> FD
>>>>>>>>>>>> 16 detached from poller 2; num=20
>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>> c179.chtc.wisc.edu
>>>>>>>>>>>> FD=21
>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>> server.17065:[log in to unmask]:1094 node 1.4
>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>> server.7978:[log in to unmask]
>>>>>>>>>>>> logged
>>>>>>>>>>>> out.
>>>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] XrdPoll:
>>>>>>>>>>>> FD
>>>>>>>>>>>> 21
>>>>>>>>>>>> detached from poller 1; num=21
>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to suspended
>>>>>>>>>>>> 091211 04:13:27 15661 Send status to
>>>>>>>>>>>> redirector.15656:14@atlas-bkp2
>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>> server.12937:[log in to unmask]:1094
>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>> 091211 04:13:27 15661 server.12937:[log in to unmask]:1094
>>>>>>>>>>>> do_Status:
>>>>>>>>>>>> suspend
>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended
>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>> c182.chtc.wisc.edu
>>>>>>>>>>>> FD=19
>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>> server.12937:[log in to unmask]:1094 node 7.10
>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>> server.26620:[log in to unmask]
>>>>>>>>>>>> logged
>>>>>>>>>>>> out.
>>>>>>>>>>>> 091211 04:13:27 15661 server.26620:[log in to unmask]
>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>> FD
>>>>>>>>>>>> 19 detached from poller 2; num=19
>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>> server.10842:[log in to unmask]:1094
>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>> 091211 04:13:27 15661 server.10842:[log in to unmask]:1094
>>>>>>>>>>>> do_Status:
>>>>>>>>>>>> suspend
>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service suspended
>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>> c178.chtc.wisc.edu
>>>>>>>>>>>> FD=15
>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>> server.10842:[log in to unmask]:1094 node 9.12
>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>> server.11901:[log in to unmask]
>>>>>>>>>>>> logged
>>>>>>>>>>>> out.
>>>>>>>>>>>> 091211 04:13:27 15661 server.11901:[log in to unmask]
>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>> FD
>>>>>>>>>>>> 15 detached from poller 1; num=20
>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>> server.5535:[log in to unmask]:1094
>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>> 091211 04:13:27 15661 server.5535:[log in to unmask]:1094
>>>>>>>>>>>> do_Status:
>>>>>>>>>>>> suspend
>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service suspended
>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>> c181.chtc.wisc.edu
>>>>>>>>>>>> FD=17
>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>> server.5535:[log in to unmask]:1094 node 5.8
>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>> server.13984:[log in to unmask]
>>>>>>>>>>>> logged
>>>>>>>>>>>> out.
>>>>>>>>>>>> 091211 04:13:27 15661 server.13984:[log in to unmask]
>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>> FD
>>>>>>>>>>>> 17 detached from poller 0; num=21
>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>> server.23711:[log in to unmask]:1094
>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094
>>>>>>>>>>>> do_Status:
>>>>>>>>>>>> suspend
>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended
>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>> c183.chtc.wisc.edu
>>>>>>>>>>>> FD=22
>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>> server.23711:[log in to unmask]:1094 node 8.11
>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>> server.27735:[log in to unmask]
>>>>>>>>>>>> logged
>>>>>>>>>>>> out.
>>>>>>>>>>>> 091211 04:13:27 15661 server.27735:[log in to unmask]
>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>> FD
>>>>>>>>>>>> 22 detached from poller 2; num=18
>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>> c184.chtc.wisc.edu
>>>>>>>>>>>> FD=20
>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>> server.4131:[log in to unmask]:1094 node 3.6
>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>> server.26787:[log in to unmask]
>>>>>>>>>>>> logged
>>>>>>>>>>>> out.
>>>>>>>>>>>> 091211 04:13:27 15661 server.26787:[log in to unmask]
>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>> FD
>>>>>>>>>>>> 20 detached from poller 0; num=20
>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>> server.10585:[log in to unmask]:1094
>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094
>>>>>>>>>>>> do_Status:
>>>>>>>>>>>> suspend
>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended
>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>> c185.chtc.wisc.edu
>>>>>>>>>>>> FD=23
>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>> server.10585:[log in to unmask]:1094 node 6.9
>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>> server.8524:[log in to unmask]
>>>>>>>>>>>> logged
>>>>>>>>>>>> out.
>>>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll:
>>>>>>>>>>>> FD
>>>>>>>>>>>> 23
>>>>>>>>>>>> detached from poller 0; num=19
>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>> server.20264:[log in to unmask]:1094
>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094
>>>>>>>>>>>> do_Status:
>>>>>>>>>>>> suspend
>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended
>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>> c180.chtc.wisc.edu
>>>>>>>>>>>> FD=18
>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>> server.20264:[log in to unmask]:1094 node 4.7
>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>> server.14636:[log in to unmask]
>>>>>>>>>>>> logged
>>>>>>>>>>>> out.
>>>>>>>>>>>> 091211 04:13:27 15661 server.14636:[log in to unmask]
>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>> FD
>>>>>>>>>>>> 18 detached from poller 1; num=19
>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>> server.1656:[log in to unmask]:1094
>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094
>>>>>>>>>>>> do_Status:
>>>>>>>>>>>> suspend
>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>> c186.chtc.wisc.edu
>>>>>>>>>>>> FD=24
>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>> server.1656:[log in to unmask]:1094 node 2.5
>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>> server.7849:[log in to unmask]
>>>>>>>>>>>> logged
>>>>>>>>>>>> out.
>>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll:
>>>>>>>>>>>> FD
>>>>>>>>>>>> 24
>>>>>>>>>>>> detached from poller 1; num=18
>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>> seconds
>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>> seconds
>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>> seconds
>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>> seconds
>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>> seconds
>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>> seconds
>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>> seconds
>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>> seconds
>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>> seconds
>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>> seconds
>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>
>>>>>>>>>>>> Wen
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky
>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>
>>>>>>>>>>>>> To go past 64 data servers you will need to set up one or more
>>>>>>>>>>>>> supervisors. This does not logically change the current
>>>>>>>>>>>>> configuration you have. You only need to configure one or more
>>>>>>>>>>>>> *new* servers (or at least xrootd processes) whose role is
>>>>>>>>>>>>> supervisor. We'd like them to run on separate machines for
>>>>>>>>>>>>> reliability purposes, but they could run on the manager node as
>>>>>>>>>>>>> long as you give each one a unique instance name (i.e., the -n
>>>>>>>>>>>>> option).
>>>>>>>>>>>>>
>>>>>>>>>>>>> The front part of the cmsd reference explains how to do this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
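>>>>>>>>>>>>>
>>>>>>>>>>>>> As a minimal sketch only (the supervisor hostname, instance name,
>>>>>>>>>>>>> and port below are placeholders, not taken from your site), the
>>>>>>>>>>>>> shared config just needs a supervisor role for the new process:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    all.manager atlas-bkp1.cs.wisc.edu:3121
>>>>>>>>>>>>>    if super1.cs.wisc.edu
>>>>>>>>>>>>>       all.role supervisor
>>>>>>>>>>>>>    else
>>>>>>>>>>>>>       all.role server
>>>>>>>>>>>>>    fi
>>>>>>>>>>>>>
>>>>>>>>>>>>> The supervisor's cmsd/xrootd pair is then started like any other,
>>>>>>>>>>>>> e.g. "cmsd -n super1 -c xrdcluster.cfg" and "xrootd -n super1 -c
>>>>>>>>>>>>> xrdcluster.cfg"; the -n name keeps its log and admin paths
>>>>>>>>>>>>> separate if it shares a host with the manager.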
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   Is there a way to configure xrootd with more than 65
>>>>>>>>>>>>>> machines? I used the configuration below but it doesn't work.
>>>>>>>>>>>>>> Should I configure some machines' manager to be a supervisor?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>