Hi Wen,

Oh yes, the permanent fix should be available Monday late afternoon PST.

Andy

On Sun, 13 Dec 2009, wen guan wrote:

> Hi Andrew,
>
> Thanks. I am using the new cmsd on the atlas-bkp1 manager, but it is
> still dropping nodes, and in the supervisor's log I cannot find any
> data server registering with it.
>
> The new logs are in http://higgs03.cs.wisc.edu/wguan/*.20091213.
> The manager was patched at 091213 08:38:15.
>
> Wen
>
> On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky
> <[log in to unmask]> wrote:
>> Hi Wen,
>>
>> You will find the source replacement at:
>>
>> http://www.slac.stanford.edu/~abh/cmsd/
>>
>> It's XrdCmsCluster.cc and it replaces xrootd/src/XrdCms/XrdCmsCluster.cc.
>>
>> I'm stepping out for a couple of hours but will be back to see how
>> things went. Sorry for the issues :-(
>>
>> Andy
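
(For reference, the replace-and-rebuild step looks roughly like the
following. The checkout location and build command are assumptions;
xrootd source trees of this era typically built with configure.classic
and make, so use whatever this tree was originally compiled with.)

    cd xrootd                     # top of the source tree (location assumed)
    wget http://www.slac.stanford.edu/~abh/cmsd/XrdCmsCluster.cc
    cp XrdCmsCluster.cc src/XrdCms/XrdCmsCluster.cc   # replace the shipped file
    make                          # or: ./configure.classic && make

    # per the note above, only the redirector is affected, so only the
    # cmsd on atlas-bkp1 needs the rebuilt binary and a restart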
>>
>> On Sun, 13 Dec 2009, wen guan wrote:
>>> Hi Andrew,
>>>
>>> I prefer a source replacement. Then I can compile it.
>>>
>>> Thanks
>>> Wen
>>>>
>>>> I can do one of two things here:
>>>>
>>>> 1) Supply a source replacement and then you would recompile, or
>>>>
>>>> 2) Give me the uname -a of where the cmsd will run and I'll supply
>>>> a binary replacement for you.
>>>>
>>>> Your choice.
>>>>
>>>> Andy
>>>>
>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>> Hi Andrew,
>>>>>
>>>>> The problem is found. Great. Thanks.
>>>>>
>>>>> Where can I find the patched cmsd?
>>>>>
>>>>> Wen
>>>>>
>>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky
>>>>> <[log in to unmask]> wrote:
>>>>>> Hi Wen,
>>>>>>
>>>>>> I found the problem. It looks like a regression from way back
>>>>>> when: there is a missing flag on the redirect. This will require a
>>>>>> patched cmsd, but you only need to replace the redirector's cmsd,
>>>>>> as this affects only the redirector. How would you like to
>>>>>> proceed?
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> It doesn't work; the atlas-bkp1 manager is dropping nodes again,
>>>>>>> and I still haven't seen any data server registered with the
>>>>>>> supervisor. I said "I updated the ntp" because you said the log
>>>>>>> timestamps do not overlap.
>>>>>>>
>>>>>>> Wen
>>>>>>>
>>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky
>>>>>>> <[log in to unmask]> wrote:
>>>>>>>> Hi Wen,
>>>>>>>>
>>>>>>>> Do you mean that everything is now working? It could be that you
>>>>>>>> removed the xrd.timeout directive; that really could cause
>>>>>>>> problems. As for the delays, that is normal when the redirector
>>>>>>>> thinks something is going wrong. The strategy is to delay
>>>>>>>> clients until it can get back to a stable configuration. This
>>>>>>>> usually prevents jobs from crashing during stressful periods.
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>> Hi Andrew,
>>>>>>>>>
>>>>>>>>> I restarted it to do the supervisor test, and also because the
>>>>>>>>> xrootd manager frequently doesn't respond. (*) below is from
>>>>>>>>> the cms.log: the file select is delayed again and again. After
>>>>>>>>> a restart everything is fine. Now I am trying to find a clue
>>>>>>>>> about it.
>>>>>>>>>
>>>>>>>>> (*)
>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>>>>
>>>>>>>>> There is no core file. I copied new copies of the logs to the
>>>>>>>>> link below.
>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>
>>>>>>>>> Wen
>>>>>>>>>
>>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky
>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>> Hi Wen,
>>>>>>>>>>
>>>>>>>>>> I see in the server log that it is restarting often. Could you
>>>>>>>>>> take a look on c193 to see if you have any core files? Also
>>>>>>>>>> please make sure that core files are enabled, as Linux
>>>>>>>>>> defaults their size to 0. The first step here is to find out
>>>>>>>>>> why your servers are restarting.
>>>>>>>>>>
>>>>>>>>>> Andy
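
(A quick way to act on the core-file advice above, using standard Linux
shell commands. Note the limit has to be raised in the environment that
actually starts xrootd/cmsd, not just in an interactive login shell.)

    ulimit -c                          # 0 means no core files are written
    ulimit -c unlimited                # allow cores for daemons started here
    cat /proc/sys/kernel/core_pattern  # where the kernel writes core files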
>>>>>>>>>>
>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>
>>>>>>>>>>> The logs can be found here. From the log you can see the
>>>>>>>>>>> atlas-bkp1 manager dropping, again and again, the nodes that
>>>>>>>>>>> try to connect to it.
>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>
>>>>>>>>>>>> Could you start everything up and provide me a pointer to
>>>>>>>>>>>> the manager log file, supervisor log file, and one data
>>>>>>>>>>>> server log file, all of which cover the same time-frame
>>>>>>>>>>>> (from start to some point where you think things are working
>>>>>>>>>>>> or not)? That way I can see what is happening. At the moment
>>>>>>>>>>>> I only see two "bad" things in the config file:
>>>>>>>>>>>>
>>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager,
>>>>>>>>>>>> but you claim, via the all.manager directive, that there are
>>>>>>>>>>>> three (bkp2 and bkp3). While it should work, the log file
>>>>>>>>>>>> will be dense with error messages. Please correct this to be
>>>>>>>>>>>> consistent and make it easier to see real errors.
>>>>>>>>>>>
>>>>>>>>>>> This is not a problem for me, because this config is used on
>>>>>>>>>>> the data servers. On the managers I change the "if
>>>>>>>>>>> atlas-bkp1.cs.wisc.edu" clause to atlas-bkp2 and so on. This
>>>>>>>>>>> is historical: at first only atlas-bkp1 was used; atlas-bkp2
>>>>>>>>>>> and atlas-bkp3 were added later.
>>>>>>>>>>>
>>>>>>>>>>>> 2) Please use cms.space, not olb.space (for historical
>>>>>>>>>>>> reasons the latter is still accepted and overrides the
>>>>>>>>>>>> former, but that will soon end), and please use only one
>>>>>>>>>>>> (the config file uses both directives).
>>>>>>>>>>>
>>>>>>>>>>> Yes, I should remove that line; in fact cms.space is in the
>>>>>>>>>>> cfg too.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Wen
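
(To make points 1 and 2 concrete, a single config shared by every node
could carry both fixes. This is only a sketch: the port, the host
pattern, and the cms.space values are placeholders, and the exact
directive forms and ordering should be checked against the cms_config
reference.)

    # every node lists all three redirectors
    all.manager atlas-bkp1.cs.wisc.edu 3121
    all.manager atlas-bkp2.cs.wisc.edu 3121
    all.manager atlas-bkp3.cs.wisc.edu 3121

    # default role, overridden by host, so one file serves the cluster
    all.role server
    all.role manager if atlas-bkp1.cs.wisc.edu
    all.role manager if atlas-bkp2.cs.wisc.edu
    all.role manager if atlas-bkp3.cs.wisc.edu

    # keep only cms.space and delete the olb.space line
    # (olb.space would silently override cms.space)
    cms.space min 2% 10g

With the role selected per host there is no per-machine edit of the "if
atlas-bkp1.cs.wisc.edu" clause, which is what makes the files
inconsistent in point 1.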
>>>>>>>>>>>>
>>>>>>>>>>>> xrootd has an internal mechanism to connect servers with
>>>>>>>>>>>> supervisors so as to allow for maximum reliability. You
>>>>>>>>>>>> cannot change that algorithm, and there is no need to do so.
>>>>>>>>>>>> You should *never* tell anyone to directly connect to a
>>>>>>>>>>>> supervisor. If you do, you will likely get unreachable
>>>>>>>>>>>> nodes.
>>>>>>>>>>>>
>>>>>>>>>>>> As for dropping data servers, it would appear to me, given
>>>>>>>>>>>> the flurry of such activity, that something either crashed
>>>>>>>>>>>> or was restarted. That's why it would be good to see the
>>>>>>>>>>>> complete log of each one of the entities.
>>>>>>>>>>>>
>>>>>>>>>>>> Andy
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I read the document and wrote a config file
>>>>>>>>>>>>> (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg). With my
>>>>>>>>>>>>> conf I can see the manager dispatching messages to the
>>>>>>>>>>>>> supervisor, but I cannot see any data server trying to
>>>>>>>>>>>>> connect to the supervisor. At the same time, in the
>>>>>>>>>>>>> manager's log, I can see some data servers being dropped.
>>>>>>>>>>>>> How does xrootd decide which data servers will connect to
>>>>>>>>>>>>> the supervisor? Should I point some data servers at the
>>>>>>>>>>>>> supervisor?
>>>>>>>>>>>>>
>>>>>>>>>>>>> (*) supervisor log
>>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find failed for state /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>
>>>>>>>>>>>>> (*) manager log
>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5693644MB MinFR=57218MB Util=0
>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
>>>>>>>>>>>>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094 do_Space: 5696231MB free; 0% util
>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.10585:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from [log in to unmask]
>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running ?:[log in to unmask] inq=0
>>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
>>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79 attached to poller 2; num=22
>>>>>>>>>>>>> 091211 04:13:24 15661 Add server.21739:[log in to unmask] bumps server.15905:[log in to unmask]:1094 #63
>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node: server.15905:[log in to unmask]:1094 dropped.
>>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved server.21739:[log in to unmask]:1094 to cluster; id=63.78; num=64; min=51
>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5721854MB MinFR=57218MB Util=0
>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w /atlas
>>>>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask]:1094 do_Space: 5721854MB free; 0% util
>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from c187.chtc.wisc.edu; connection reset by peer
>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60 seconds
>>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node server.21739:[log in to unmask]:1094 node 63.78
>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask] XrdPoll: FD 79 detached from poller 2; num=21
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.24718:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.24718:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c177.chtc.wisc.edu FD=16
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.24718:[log in to unmask]:1094 node 0.3
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.21656:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.21656:[log in to unmask] XrdPoll: FD 16 detached from poller 2; num=20
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c179.chtc.wisc.edu FD=21
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.17065:[log in to unmask]:1094 node 1.4
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.7978:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] XrdPoll: FD 21 detached from poller 1; num=21
>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to redirector.15656:14@atlas-bkp2
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.12937:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.12937:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c182.chtc.wisc.edu FD=19
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.12937:[log in to unmask]:1094 node 7.10
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.26620:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.26620:[log in to unmask] XrdPoll: FD 19 detached from poller 2; num=19
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.10842:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.10842:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c178.chtc.wisc.edu FD=15
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.10842:[log in to unmask]:1094 node 9.12
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.11901:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.11901:[log in to unmask] XrdPoll: FD 15 detached from poller 1; num=20
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.5535:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.5535:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c181.chtc.wisc.edu FD=17
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.5535:[log in to unmask]:1094 node 5.8
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.13984:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.13984:[log in to unmask] XrdPoll: FD 17 detached from poller 0; num=21
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.23711:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c183.chtc.wisc.edu FD=22
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.23711:[log in to unmask]:1094 node 8.11
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.27735:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.27735:[log in to unmask] XrdPoll: FD 22 detached from poller 2; num=18
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c184.chtc.wisc.edu FD=20
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.4131:[log in to unmask]:1094 node 3.6
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.26787:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.26787:[log in to unmask] XrdPoll: FD 20 detached from poller 0; num=20
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.10585:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c185.chtc.wisc.edu FD=23
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.10585:[log in to unmask]:1094 node 6.9
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.8524:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll: FD 23 detached from poller 0; num=19
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.20264:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c180.chtc.wisc.edu FD=18
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.20264:[log in to unmask]:1094 node 4.7
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.14636:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.14636:[log in to unmask] XrdPoll: FD 18 detached from poller 1; num=19
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.1656:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c186.chtc.wisc.edu FD=24
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.1656:[log in to unmask]:1094 node 2.5
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.7849:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll: FD 24 detached from poller 1; num=18
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>
>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky
>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To go past 64 data servers you will need to set up one or
>>>>>>>>>>>>>> more supervisors. This does not logically change your
>>>>>>>>>>>>>> current configuration. You only need to configure one or
>>>>>>>>>>>>>> more *new* servers (or at least xrootd processes) whose
>>>>>>>>>>>>>> role is supervisor. We'd like them to run on separate
>>>>>>>>>>>>>> machines for reliability purposes, but they could run on
>>>>>>>>>>>>>> the manager node as long as you give each one a unique
>>>>>>>>>>>>>> instance name (i.e., the -n option).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The front part of the cmsd reference explains how to do
>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is there any way to configure xrootd with more than 64
>>>>>>>>>>>>>>> machines? I used the configuration below, but it doesn't
>>>>>>>>>>>>>>> work. Should I configure some machines' managers to be
>>>>>>>>>>>>>>> supervisors?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Wen
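
(A minimal sketch of the supervisor setup described above, assuming a
spare host; the name atlas-sup1.cs.wisc.edu and the file paths are
hypothetical, and the exact directive forms are in the cms_config
reference linked above.)

    # added to the shared xrdcluster.cfg
    all.role supervisor if atlas-sup1.cs.wisc.edu

    # a supervisor node runs the usual daemon pair against the same config
    cmsd   -c /etc/xrootd/xrdcluster.cfg -l /var/log/xrootd/cmsd.log &
    xrootd -c /etc/xrootd/xrdcluster.cfg -l /var/log/xrootd/xrootd.log &

    # if it must share the manager node instead, give the extra pair its
    # own instance name so the two instances keep separate state
    cmsd   -n super -c /etc/xrootd/xrdcluster.cfg &
    xrootd -n super -c /etc/xrootd/xrdcluster.cfg &

Data servers then attach to the supervisor through the cluster's own
selection mechanism; as Andy notes above, nothing should be pointed at
a supervisor by hand.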