Hi Wen,

I did some more research with an actual large virtual cluster and
uncovered a couple of problems that would explain everything you are seeing.
I have developed an interim fix that works. I call it interim because it
introduces a delay of 3 minutes when a server is redirected to a supervisor,
which I find unacceptable. However, it is a fix that will get you by
for a while until I can get a permanent fix to you.

In any case, it requires:

1) Replacement of XrdCmsCluster.cc and XrdCmsNode.cc (which you'll find at
http://www.slac.stanford.edu/~abh/cmsd),
2) Recompilation, and
3) Deployment of the new cmsd on *all* of your servers.

As you can see, since it's interim, you will have to do it again with the 
permanent fix. Let me know how you are going to proceed.

Andy
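
Step 1 above can be sketched in Python. This is only an illustration under
stated assumptions: that each file is served under its own name below the
directory URL given above, and that XrdCmsNode.cc sits in the same
xrootd/src/XrdCms directory that the earlier message names for
XrdCmsCluster.cc. The function names are hypothetical.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Directory URL given in the message above.
BASE_URL = "http://www.slac.stanford.edu/~abh/cmsd/"
# The two source files named in step 1.
PATCHED_FILES = ("XrdCmsCluster.cc", "XrdCmsNode.cc")
# Location of XrdCmsCluster.cc per the earlier message in this thread;
# ASSUMPTION: XrdCmsNode.cc lives in the same directory.
CMS_SUBDIR = Path("xrootd") / "src" / "XrdCms"


def plan_replacements(tree_parent):
    """Return (url, destination) pairs; touches neither network nor disk."""
    dest_dir = Path(tree_parent) / CMS_SUBDIR
    return [(BASE_URL + name, dest_dir / name) for name in PATCHED_FILES]


def apply_replacements(plan):
    """Back up each existing file as <name>.orig, then download the replacement."""
    for url, dest in plan:
        if dest.exists():
            dest.replace(dest.with_name(dest.name + ".orig"))
        urlretrieve(url, str(dest))
```

Calling apply_replacements(plan_replacements(parent_dir)) would carry out
step 1, where parent_dir is the directory that contains the xrootd source
tree. Recompilation and deployment (steps 2 and 3) depend on the site's
build setup and are not shown.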

On Sun, 13 Dec 2009, wen guan wrote:

> Hi Andrew,
>
>
>    Thanks.
>    I used the new cmsd on the atlas-bkp1 manager, but it is still dropping
> nodes. And in the supervisor's log, I cannot find any dataserver
> registering with it.
>
>    The new logs are at http://higgs03.cs.wisc.edu/wguan/*.20091213.
>    The manager was patched at 091213 08:38:15.
>
> Wen
>
> On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky
> <[log in to unmask]> wrote:
>> Hi Wen
>>
>> You will find the source replacement at:
>>
>> http://www.slac.stanford.edu/~abh/cmsd/
>>
>> It's XrdCmsCluster.cc and it replaces xrootd/src/XrdCms/XrdCmsCluster.cc
>>
>> I'm stepping out for a couple of hours but will be back to see how things
>> went. Sorry for the issues :-(
>>
>> Andy
>>
>> On Sun, 13 Dec 2009, wen guan wrote:
>>
>>> Hi Andrew,
>>>
>>>      I prefer a source replacement.  Then I can compile it.
>>>
>>> Thanks
>>> Wen
>>>>
>>>> I can do one of two things here:
>>>>
>>>> 1) Supply a source replacement and then you would recompile, or
>>>>
>>>> 2) Give me the uname -a of where the cmsd will run and I'll supply a
>>>> binary
>>>> replacement for you.
>>>>
>>>> Your choice.
>>>>
>>>> Andy
>>>>
>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>
>>>>> Hi Andrew
>>>>>
>>>>> The problem is found. Great. Thanks.
>>>>>
>>>>> Where can I find the patched cmsd?
>>>>>
>>>>> Wen
>>>>>
>>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky
>>>>> <[log in to unmask]> wrote:
>>>>>>
>>>>>> Hi Wen,
>>>>>>
>>>>>> I found the problem. Looks like a regression from way back when. There
>>>>>> is a missing flag on the redirect. This will require a patched cmsd but
>>>>>> you need only to replace the redirector's cmsd as this only affects the
>>>>>> redirector. How would you like to proceed?
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>>     It doesn't work. The atlas-bkp1 manager is still dropping nodes.
>>>>>>> In the supervisor, I still haven't seen any dataserver registered. I
>>>>>>> said "I updated the ntp" because you said "the log timestamp do not
>>>>>>> overlap".
>>>>>>>
>>>>>>> Wen
>>>>>>>
>>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky
>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>
>>>>>>>> Hi Wen,
>>>>>>>>
>>>>>>>> Do you mean that everything is now working? It could be that you
>>>>>>>> removed the xrd.timeout directive. That really could cause problems.
>>>>>>>> As for the delays, that is normal when the redirector thinks something
>>>>>>>> is going wrong. The strategy is to delay clients until it can get back
>>>>>>>> to a stable configuration. This usually prevents jobs from crashing
>>>>>>>> during stressful periods.
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>
>>>>>>>>> Hi  Andrew,
>>>>>>>>>
>>>>>>>>>   I restarted it to do the supervisor test, and also because the
>>>>>>>>> xrootd manager frequently doesn't respond. (*) is the cms.log; the
>>>>>>>>> file select is delayed again and again. After a restart, everything
>>>>>>>>> is fine. Now I am trying to find a clue about it.
>>>>>>>>>
>>>>>>>>> (*)
>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> There is no core file. I copied new copies of the logs to the link
>>>>>>>>> below.
>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>
>>>>>>>>> Wen
>>>>>>>>>
>>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky
>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Wen,
>>>>>>>>>>
>>>>>>>>>> I see in the server log that it is restarting often. Could you take a
>>>>>>>>>> look in the c193 to see if you have any core files? Also please make
>>>>>>>>>> sure that core files are enabled as Linux defaults the size to 0. The
>>>>>>>>>> first step here is to find out why your servers are restarting.
>>>>>>>>>>
>>>>>>>>>> Andy
>>>>>>>>>>
>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>
>>>>>>>>>>>  The logs can be found here. From the log you can see that the
>>>>>>>>>>> atlas-bkp1 manager is again and again dropping nodes that try to
>>>>>>>>>>> connect to it.
>>>>>>>>>>>  http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Wen, Could you start everything up and provide me a pointer to
>>>>>>>>>>>> the manager log file, supervisor log file, and one data server
>>>>>>>>>>>> logfile all of which cover the same time-frame (from start to some
>>>>>>>>>>>> point where you think things are working or not). That way I can
>>>>>>>>>>>> see what is happening. At the moment I only see two "bad" things in
>>>>>>>>>>>> the config file:
>>>>>>>>>>>>
>>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager but you
>>>>>>>>>>>> claim, via the all.manager directive, that there are three (bkp2
>>>>>>>>>>>> and bkp3). While it should work, the log file will be dense with
>>>>>>>>>>>> error messages. Please correct this to be consistent and make it
>>>>>>>>>>>> easier to see real errors.
>>>>>>>>>>>
>>>>>>>>>>> This is not a problem for me, because this config is used on the
>>>>>>>>>>> dataservers. On the managers, I updated the "if
>>>>>>>>>>> atlas-bkp1.cs.wisc.edu" to atlas-bkp2 or something. This is a
>>>>>>>>>>> historical issue: at first only atlas-bkp1 was used; atlas-bkp2 and
>>>>>>>>>>> atlas-bkp3 were added later.
>>>>>>>>>>>
>>>>>>>>>>>> 2) Please use cms.space not olb.space (for historical reasons the
>>>>>>>>>>>> latter is still accepted and over-rides the former, but that will
>>>>>>>>>>>> soon end), and please use only one (the config file uses both
>>>>>>>>>>>> directives).
>>>>>>>>>>>
>>>>>>>>>>> Yes, I should remove this line. In fact cms.space is in the cfg too.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Wen
>>>>>>>>>>>
>>>>>>>>>>>> The xrootd has an internal mechanism to connect servers with
>>>>>>>>>>>> supervisors to allow for maximum reliability. You cannot change that
>>>>>>>>>>>> algorithm and there is no need to do so. You should *never* tell
>>>>>>>>>>>> anyone to directly connect to a supervisor. If you do, you will
>>>>>>>>>>>> likely get unreachable nodes.
>>>>>>>>>>>>
>>>>>>>>>>>> As for dropping data servers, it would appear to me, given the
>>>>>>>>>>>> flurry of such activity, that something either crashed or was
>>>>>>>>>>>> restarted. That's why it would be good to see the complete log of
>>>>>>>>>>>> each one of the entities.
>>>>>>>>>>>>
>>>>>>>>>>>> Andy
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>
>>>>>>>>>>>>>    I read the document and wrote a config
>>>>>>>>>>>>> file (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>>>>>>>    With my config, I can see the manager dispatching messages to
>>>>>>>>>>>>> the supervisor, but I cannot see any dataserver trying to connect
>>>>>>>>>>>>> to the supervisor. At the same time, in the manager's log, I can
>>>>>>>>>>>>> see that some dataservers are dropped.
>>>>>>>>>>>>>   How does xrootd decide which dataservers will connect to the
>>>>>>>>>>>>> supervisor? Should I specify some dataservers to connect to the
>>>>>>>>>>>>> supervisor?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> (*) supervisor log
>>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state
>>>>>>>>>>>>> dlen=42
>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State:
>>>>>>>>>>>>> /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path
>>>>>>>>>>>>> find
>>>>>>>>>>>>> failed for state /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>
>>>>>>>>>>>>> (*)manager log
>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB
>>>>>>>>>>>>> NumFS=1
>>>>>>>>>>>>> FSpace=5693644MB MinFR=57218MB Util=0
>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w
>>>>>>>>>>>>> /atlas
>>>>>>>>>>>>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094
>>>>>>>>>>>>> do_Space: 5696231MB free; 0% util
>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from
>>>>>>>>>>>>> [log in to unmask]
>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running ?:[log in to unmask]
>>>>>>>>>>>>> inq=0
>>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
>>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79
>>>>>>>>>>>>> attached
>>>>>>>>>>>>> to poller 2; num=22
>>>>>>>>>>>>> 091211 04:13:24 15661 Add server.21739:[log in to unmask]
>>>>>>>>>>>>> bumps
>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 #63
>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node:
>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 dropped.
>>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved
>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 to cluster; id=63.78;
>>>>>>>>>>>>> num=64;
>>>>>>>>>>>>> min=51
>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB
>>>>>>>>>>>>> NumFS=1
>>>>>>>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0
>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w
>>>>>>>>>>>>> /atlas
>>>>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask]:1094
>>>>>>>>>>>>> do_Space: 5721854MB free; 0% util
>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from
>>>>>>>>>>>>> c187.chtc.wisc.edu; connection reset by peer
>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60
>>>>>>>>>>>>> seconds
>>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node
>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 node 63.78
>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>>>>>>>> server.21739:[log in to unmask]
>>>>>>>>>>>>> logged
>>>>>>>>>>>>> out.
>>>>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask]
>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>> FD
>>>>>>>>>>>>> 79 detached from poller 2; num=21
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>> server.24718:[log in to unmask]:1094
>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.24718:[log in to unmask]:1094
>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>> suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>> c177.chtc.wisc.edu
>>>>>>>>>>>>> FD=16
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 node 0.3
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>> server.21656:[log in to unmask]
>>>>>>>>>>>>> logged
>>>>>>>>>>>>> out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.21656:[log in to unmask]
>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>> FD
>>>>>>>>>>>>> 16 detached from poller 2; num=20
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>> c179.chtc.wisc.edu
>>>>>>>>>>>>> FD=21
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>> server.17065:[log in to unmask]:1094 node 1.4
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>> server.7978:[log in to unmask]
>>>>>>>>>>>>> logged
>>>>>>>>>>>>> out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] XrdPoll:
>>>>>>>>>>>>> FD
>>>>>>>>>>>>> 21
>>>>>>>>>>>>> detached from poller 1; num=21
>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to
>>>>>>>>>>>>> redirector.15656:14@atlas-bkp2
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>> server.12937:[log in to unmask]:1094
>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.12937:[log in to unmask]:1094
>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>> suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>> c182.chtc.wisc.edu
>>>>>>>>>>>>> FD=19
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 node 7.10
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>> server.26620:[log in to unmask]
>>>>>>>>>>>>> logged
>>>>>>>>>>>>> out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.26620:[log in to unmask]
>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>> FD
>>>>>>>>>>>>> 19 detached from poller 2; num=19
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>> server.10842:[log in to unmask]:1094
>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.10842:[log in to unmask]:1094
>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>> suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>> c178.chtc.wisc.edu
>>>>>>>>>>>>> FD=15
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 node 9.12
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>> server.11901:[log in to unmask]
>>>>>>>>>>>>> logged
>>>>>>>>>>>>> out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.11901:[log in to unmask]
>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>> FD
>>>>>>>>>>>>> 15 detached from poller 1; num=20
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>> server.5535:[log in to unmask]:1094
>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.5535:[log in to unmask]:1094
>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>> suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>> c181.chtc.wisc.edu
>>>>>>>>>>>>> FD=17
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 node 5.8
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>> server.13984:[log in to unmask]
>>>>>>>>>>>>> logged
>>>>>>>>>>>>> out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.13984:[log in to unmask]
>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>> FD
>>>>>>>>>>>>> 17 detached from poller 0; num=21
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>> server.23711:[log in to unmask]:1094
>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094
>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>> suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>> c183.chtc.wisc.edu
>>>>>>>>>>>>> FD=22
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 node 8.11
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>> server.27735:[log in to unmask]
>>>>>>>>>>>>> logged
>>>>>>>>>>>>> out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.27735:[log in to unmask]
>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>> FD
>>>>>>>>>>>>> 22 detached from poller 2; num=18
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>> c184.chtc.wisc.edu
>>>>>>>>>>>>> FD=20
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>> server.4131:[log in to unmask]:1094 node 3.6
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>> server.26787:[log in to unmask]
>>>>>>>>>>>>> logged
>>>>>>>>>>>>> out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.26787:[log in to unmask]
>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>> FD
>>>>>>>>>>>>> 20 detached from poller 0; num=20
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>> server.10585:[log in to unmask]:1094
>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094
>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>> suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>> c185.chtc.wisc.edu
>>>>>>>>>>>>> FD=23
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 node 6.9
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>> server.8524:[log in to unmask]
>>>>>>>>>>>>> logged
>>>>>>>>>>>>> out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll:
>>>>>>>>>>>>> FD
>>>>>>>>>>>>> 23
>>>>>>>>>>>>> detached from poller 0; num=19
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>> server.20264:[log in to unmask]:1094
>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094
>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>> suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>> c180.chtc.wisc.edu
>>>>>>>>>>>>> FD=18
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 node 4.7
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>> server.14636:[log in to unmask]
>>>>>>>>>>>>> logged
>>>>>>>>>>>>> out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.14636:[log in to unmask]
>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>> FD
>>>>>>>>>>>>> 18 detached from poller 1; num=19
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>> server.1656:[log in to unmask]:1094
>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094
>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>> suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>> c186.chtc.wisc.edu
>>>>>>>>>>>>> FD=24
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 node 2.5
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>> server.7849:[log in to unmask]
>>>>>>>>>>>>> logged
>>>>>>>>>>>>> out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll:
>>>>>>>>>>>>> FD
>>>>>>>>>>>>> 24
>>>>>>>>>>>>> detached from poller 1; num=18
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>>> seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>>> seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>>> seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>>> seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>>> seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>>> seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>>> seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>>> seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>>> seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>>> seconds
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>
>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky
>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To go past 64 data servers you will need to set up one or more
>>>>>>>>>>>>>> supervisors. This does not logically change the current
>>>>>>>>>>>>>> configuration you have. You only need to configure one or more
>>>>>>>>>>>>>> *new* servers (or at least xrootd processes) whose role is
>>>>>>>>>>>>>> supervisor. We'd like them to run in separate machines for
>>>>>>>>>>>>>> reliability purposes, but they could run on the manager node as
>>>>>>>>>>>>>> long as you give each one a unique instance name (i.e., -n option).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The front part of the cmsd reference explains how to do this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   Is there any change to configure xrootd with more than 65
>>>>>>>>>>>>>>> machines? I used the config below but it doesn't work. Should I
>>>>>>>>>>>>>>> configure some machines' manager to be supervisor?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
>
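
The manager log quoted in this thread marks each affected data server with
a "Node: <host> service suspended" line. A minimal Python sketch for
tallying those lines per host, assuming the log has been unwrapped to one
entry per line and keeps the format seen above; the function name and the
sample list are illustrative only.

```python
import re
from collections import Counter

# Matches entries such as
# "091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended"
SUSPENDED = re.compile(
    r"^\d{6} \d{2}:\d{2}:\d{2} \d+ Node: (\S+) service suspended$"
)


def suspended_hosts(lines):
    """Count 'service suspended' entries per host name."""
    counts = Counter()
    for line in lines:
        match = SUSPENDED.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# Three entries copied from the manager log quoted in this thread.
sample = [
    "091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended",
    "091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0",
    "091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended",
]
print(suspended_hosts(sample))
```

A host that shows up many times in a short window is a candidate for the
restart or crash check suggested earlier in the thread.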