Hi Wen,

You will find the source replacement at:

http://www.slac.stanford.edu/~abh/cmsd/

It's XrdCmsCluster.cc and it replaces xrootd/src/XrdCms/XrdCmsCluster.cc
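
Roughly, the replacement would go in like this (the wget step and the rebuild
command are only a sketch; use whatever download tool and build procedure you
normally use for your xrootd tree):

   wget http://www.slac.stanford.edu/~abh/cmsd/XrdCmsCluster.cc
   cp XrdCmsCluster.cc xrootd/src/XrdCms/XrdCmsCluster.cc
   # rebuild xrootd/cmsd the same way you built it originally,
   # then restart just the redirector's cmsd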

I'm stepping out for a couple of hours but will be back to see how things 
went. Sorry for the issues :-(

Andy

On Sun, 13 Dec 2009, wen guan wrote:

> Hi Andrew,
>
>       I prefer a source replacement.  Then I can compile it.
>
> Thanks
> Wen
>> I can do one of two things here:
>>
>> 1) Supply a source replacement and then you would recompile, or
>>
>> 2) Give me the uname -a of where the cmsd will run and I'll supply a binary
>> replacement for you.
>>
>> Your choice.
>>
>> Andy
>>
>> On Sun, 13 Dec 2009, wen guan wrote:
>>
>>> Hi Andrew
>>>
>>> The problem is found. Great. Thanks.
>>>
>>> Where can I find the patched cmsd?
>>>
>>> Wen
>>>
>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky
>>> <[log in to unmask]> wrote:
>>>>
>>>> Hi Wen,
>>>>
>>>> I found the problem. Looks like a regression from way back when. There is
>>>> a missing flag on the redirect. This will require a patched cmsd but you
>>>> need only to replace the redirector's cmsd as this only affects the
>>>> redirector. How would you like to proceed?
>>>>
>>>> Andy
>>>>
>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>>     It doesn't work. The atlas-bkp1 manager is still dropping nodes. In
>>>>> the supervisor, I still haven't seen any data server registered. I said
>>>>> "I updated the ntp" because you said "the log timestamps do not overlap".
>>>>>
>>>>> Wen
>>>>>
>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky
>>>>> <[log in to unmask]> wrote:
>>>>>>
>>>>>> Hi Wen,
>>>>>>
>>>>>> Do you mean that everything is now working? It could be that you removed
>>>>>> the xrd.timeout directive. That really could cause problems. As for the
>>>>>> delays, that is normal when the redirector thinks something is going
>>>>>> wrong. The strategy is to delay clients until it can get back to a stable
>>>>>> configuration. This usually prevents jobs from crashing during stressful
>>>>>> periods.
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>
>>>>>>> Hi  Andrew,
>>>>>>>
>>>>>>>   I restarted it to do the supervisor test, and also because the xrootd
>>>>>>> manager frequently stops responding. (*) is the cms.log; the file select
>>>>>>> is delayed again and again. After a restart, everything is fine. Now I
>>>>>>> am trying to find a clue about it.
>>>>>>>
>>>>>>> (*)
>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>>
>>>>>>>
>>>>>>> There is no core file. I copied new copies of the logs to the link
>>>>>>> below.
>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>
>>>>>>> Wen
>>>>>>>
>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky
>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>
>>>>>>>> Hi Wen,
>>>>>>>>
>>>>>>>> I see in the server log that it is restarting often. Could you take a
>>>>>>>> look on c193 to see if you have any core files? Also please make sure
>>>>>>>> that core files are enabled, as Linux defaults the size to 0. The first
>>>>>>>> step here is to find out why your servers are restarting.
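>>>>>>>>
>>>>>>>> For example, one way is to check and raise the core file size limit in
>>>>>>>> the shell that launches the daemons:
>>>>>>>>
>>>>>>>>    ulimit -c            # show the current core size limit (often 0)
>>>>>>>>    ulimit -c unlimited  # allow full core dumps for anything started
>>>>>>>>                         # from this shell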
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>
>>>>>>>>> Hi Andrew,
>>>>>>>>>
>>>>>>>>>  The logs can be found here. From the log you can see the atlas-bkp1
>>>>>>>>> manager is dropping, again and again, the nodes that try to connect to
>>>>>>>>> it.
>>>>>>>>>  http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Wen, Could you start everything up and provide me a pointer to the
>>>>>>>>>> manager log file, supervisor log file, and one data server log file,
>>>>>>>>>> all of which cover the same time-frame (from start to some point where
>>>>>>>>>> you think things are working or not)? That way I can see what is
>>>>>>>>>> happening. At the moment I only see two "bad" things in the config file:
>>>>>>>>>>
>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager but you claim,
>>>>>>>>>> via the all.manager directive, that there are three (bkp2 and bkp3). While
>>>>>>>>>> it should work, the log file will be dense with error messages. Please
>>>>>>>>>> correct this to be consistent and make it easier to see real errors.
>>>>>>>>>
>>>>>>>>> This is not a problem for me, because this config is used on the
>>>>>>>>> data servers. On the managers, I updated the "if atlas-bkp1.cs.wisc.edu"
>>>>>>>>> clause to atlas-bkp2 and so on. This is a historical leftover: at first
>>>>>>>>> only atlas-bkp1 was used; atlas-bkp2 and atlas-bkp3 were added later.
>>>>>>>>>
>>>>>>>>>> 2) Please use cms.space not olb.space (for historical reasons the latter
>>>>>>>>>> is still accepted and over-rides the former, but that will soon end), and
>>>>>>>>>> please use only one (the config file uses both directives).
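>>>>>>>>>>
>>>>>>>>>> For example, keep just the single cms.space directive and carry over
>>>>>>>>>> whatever arguments your olb.space line currently has (the thresholds
>>>>>>>>>> below are only placeholders):
>>>>>>>>>>
>>>>>>>>>>    cms.space min 2% 1g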
>>>>>>>>>
>>>>>>>>> Yes, I should remove this line. In fact cms.space is in the cfg too.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Wen
>>>>>>>>>
>>>>>>>>>> The xrootd has an internal mechanism to connect servers with supervisors
>>>>>>>>>> to allow for maximum reliability. You cannot change that algorithm and
>>>>>>>>>> there is no need to do so. You should *never* tell anyone to directly
>>>>>>>>>> connect to a supervisor. If you do, you will likely get unreachable nodes.
>>>>>>>>>>
>>>>>>>>>> As for dropping data servers, it would appear to me, given the flurry of
>>>>>>>>>> such activity, that something either crashed or was restarted. That's why
>>>>>>>>>> it would be good to see the complete log of each one of the entities.
>>>>>>>>>> Andy
>>>>>>>>>>
>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>
>>>>>>>>>>>    I read the document and wrote a config
>>>>>>>>>>> file (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>>>>>    With my conf, I can see the manager is dispatching messages to the
>>>>>>>>>>> supervisor, but I cannot see any data server trying to connect to the
>>>>>>>>>>> supervisor. At the same time, in the manager's log, I can see some
>>>>>>>>>>> data servers are dropped.
>>>>>>>>>>>   How does xrootd decide which data server will connect to a
>>>>>>>>>>> supervisor? Should I specify some data servers to connect to the
>>>>>>>>>>> supervisor?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> (*) supervisor log
>>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find failed for state /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>
>>>>>>>>>>> (*)manager log
>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB
>>>>>>>>>>> NumFS=1
>>>>>>>>>>> FSpace=5693644MB MinFR=57218MB Util=0
>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w
>>>>>>>>>>> /atlas
>>>>>>>>>>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094
>>>>>>>>>>> do_Space: 5696231MB free; 0% util
>>>>>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>>>>>> server.10585:[log in to unmask]:1094 logged in.
>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from
>>>>>>>>>>> [log in to unmask]
>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running ?:[log in to unmask]
>>>>>>>>>>> inq=0
>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79
>>>>>>>>>>> attached
>>>>>>>>>>> to poller 2; num=22
>>>>>>>>>>> 091211 04:13:24 15661 Add server.21739:[log in to unmask] bumps
>>>>>>>>>>> server.15905:[log in to unmask]:1094 #63
>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node:
>>>>>>>>>>> server.15905:[log in to unmask]:1094 dropped.
>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved
>>>>>>>>>>> server.21739:[log in to unmask]:1094 to cluster; id=63.78;
>>>>>>>>>>> num=64;
>>>>>>>>>>> min=51
>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB
>>>>>>>>>>> NumFS=1
>>>>>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0
>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w
>>>>>>>>>>> /atlas
>>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask]:1094
>>>>>>>>>>> do_Space: 5721854MB free; 0% util
>>>>>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>>>>>> server.21739:[log in to unmask]:1094 logged in.
>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from
>>>>>>>>>>> c187.chtc.wisc.edu; connection reset by peer
>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60 seconds
>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node
>>>>>>>>>>> server.21739:[log in to unmask]:1094 node 63.78
>>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask]
>>>>>>>>>>> logged
>>>>>>>>>>> out.
>>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask] XrdPoll:
>>>>>>>>>>> FD
>>>>>>>>>>> 79 detached from poller 2; num=21
>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>> server.24718:[log in to unmask]:1094
>>>>>>>>>>> for status dlen=0
>>>>>>>>>>> 091211 04:13:27 15661 server.24718:[log in to unmask]:1094
>>>>>>>>>>> do_Status:
>>>>>>>>>>> suspend
>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended
>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>> c177.chtc.wisc.edu
>>>>>>>>>>> FD=16
>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>> server.24718:[log in to unmask]:1094 node 0.3
>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.21656:[log in to unmask]
>>>>>>>>>>> logged
>>>>>>>>>>> out.
>>>>>>>>>>> 091211 04:13:27 15661 server.21656:[log in to unmask] XrdPoll:
>>>>>>>>>>> FD
>>>>>>>>>>> 16 detached from poller 2; num=20
>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>> c179.chtc.wisc.edu
>>>>>>>>>>> FD=21
>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>> server.17065:[log in to unmask]:1094 node 1.4
>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.7978:[log in to unmask]
>>>>>>>>>>> logged
>>>>>>>>>>> out.
>>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] XrdPoll:
>>>>>>>>>>> FD
>>>>>>>>>>> 21
>>>>>>>>>>> detached from poller 1; num=21
>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to suspended
>>>>>>>>>>> 091211 04:13:27 15661 Send status to
>>>>>>>>>>> redirector.15656:14@atlas-bkp2
>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>> server.12937:[log in to unmask]:1094
>>>>>>>>>>> for status dlen=0
>>>>>>>>>>> 091211 04:13:27 15661 server.12937:[log in to unmask]:1094
>>>>>>>>>>> do_Status:
>>>>>>>>>>> suspend
>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended
>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>> c182.chtc.wisc.edu
>>>>>>>>>>> FD=19
>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>> server.12937:[log in to unmask]:1094 node 7.10
>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.26620:[log in to unmask]
>>>>>>>>>>> logged
>>>>>>>>>>> out.
>>>>>>>>>>> 091211 04:13:27 15661 server.26620:[log in to unmask] XrdPoll:
>>>>>>>>>>> FD
>>>>>>>>>>> 19 detached from poller 2; num=19
>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>> server.10842:[log in to unmask]:1094
>>>>>>>>>>> for status dlen=0
>>>>>>>>>>> 091211 04:13:27 15661 server.10842:[log in to unmask]:1094
>>>>>>>>>>> do_Status:
>>>>>>>>>>> suspend
>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service suspended
>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>> c178.chtc.wisc.edu
>>>>>>>>>>> FD=15
>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>> server.10842:[log in to unmask]:1094 node 9.12
>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.11901:[log in to unmask]
>>>>>>>>>>> logged
>>>>>>>>>>> out.
>>>>>>>>>>> 091211 04:13:27 15661 server.11901:[log in to unmask] XrdPoll:
>>>>>>>>>>> FD
>>>>>>>>>>> 15 detached from poller 1; num=20
>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>> server.5535:[log in to unmask]:1094
>>>>>>>>>>> for status dlen=0
>>>>>>>>>>> 091211 04:13:27 15661 server.5535:[log in to unmask]:1094
>>>>>>>>>>> do_Status:
>>>>>>>>>>> suspend
>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service suspended
>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>> c181.chtc.wisc.edu
>>>>>>>>>>> FD=17
>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>> server.5535:[log in to unmask]:1094 node 5.8
>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.13984:[log in to unmask]
>>>>>>>>>>> logged
>>>>>>>>>>> out.
>>>>>>>>>>> 091211 04:13:27 15661 server.13984:[log in to unmask] XrdPoll:
>>>>>>>>>>> FD
>>>>>>>>>>> 17 detached from poller 0; num=21
>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>> server.23711:[log in to unmask]:1094
>>>>>>>>>>> for status dlen=0
>>>>>>>>>>> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094
>>>>>>>>>>> do_Status:
>>>>>>>>>>> suspend
>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended
>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>> c183.chtc.wisc.edu
>>>>>>>>>>> FD=22
>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>> server.23711:[log in to unmask]:1094 node 8.11
>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.27735:[log in to unmask]
>>>>>>>>>>> logged
>>>>>>>>>>> out.
>>>>>>>>>>> 091211 04:13:27 15661 server.27735:[log in to unmask] XrdPoll:
>>>>>>>>>>> FD
>>>>>>>>>>> 22 detached from poller 2; num=18
>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>> c184.chtc.wisc.edu
>>>>>>>>>>> FD=20
>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>> server.4131:[log in to unmask]:1094 node 3.6
>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.26787:[log in to unmask]
>>>>>>>>>>> logged
>>>>>>>>>>> out.
>>>>>>>>>>> 091211 04:13:27 15661 server.26787:[log in to unmask] XrdPoll:
>>>>>>>>>>> FD
>>>>>>>>>>> 20 detached from poller 0; num=20
>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>> server.10585:[log in to unmask]:1094
>>>>>>>>>>> for status dlen=0
>>>>>>>>>>> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094
>>>>>>>>>>> do_Status:
>>>>>>>>>>> suspend
>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended
>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>> c185.chtc.wisc.edu
>>>>>>>>>>> FD=23
>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>> server.10585:[log in to unmask]:1094 node 6.9
>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.8524:[log in to unmask]
>>>>>>>>>>> logged
>>>>>>>>>>> out.
>>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll:
>>>>>>>>>>> FD
>>>>>>>>>>> 23
>>>>>>>>>>> detached from poller 0; num=19
>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>> server.20264:[log in to unmask]:1094
>>>>>>>>>>> for status dlen=0
>>>>>>>>>>> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094
>>>>>>>>>>> do_Status:
>>>>>>>>>>> suspend
>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended
>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>> c180.chtc.wisc.edu
>>>>>>>>>>> FD=18
>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>> server.20264:[log in to unmask]:1094 node 4.7
>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.14636:[log in to unmask]
>>>>>>>>>>> logged
>>>>>>>>>>> out.
>>>>>>>>>>> 091211 04:13:27 15661 server.14636:[log in to unmask] XrdPoll:
>>>>>>>>>>> FD
>>>>>>>>>>> 18 detached from poller 1; num=19
>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>> server.1656:[log in to unmask]:1094
>>>>>>>>>>> for status dlen=0
>>>>>>>>>>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094
>>>>>>>>>>> do_Status:
>>>>>>>>>>> suspend
>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>> c186.chtc.wisc.edu
>>>>>>>>>>> FD=24
>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>> server.1656:[log in to unmask]:1094 node 2.5
>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.7849:[log in to unmask]
>>>>>>>>>>> logged
>>>>>>>>>>> out.
>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll:
>>>>>>>>>>> FD
>>>>>>>>>>> 24
>>>>>>>>>>> detached from poller 1; num=18
>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>
>>>>>>>>>>> Wen
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky
>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>
>>>>>>>>>>>> To go past 64 data servers you will need to set up one or more
>>>>>>>>>>>> supervisors. This does not logically change the current configuration you
>>>>>>>>>>>> have. You only need to configure one or more *new* servers (or at least
>>>>>>>>>>>> xrootd processes) whose role is supervisor. We'd like them to run on
>>>>>>>>>>>> separate machines for reliability purposes, but they could run on the
>>>>>>>>>>>> manager node as long as you give each one a unique instance name (i.e.,
>>>>>>>>>>>> -n option).
>>>>>>>>>>>>
>>>>>>>>>>>> The front part of the cmsd reference explains how to do this.
>>>>>>>>>>>>
>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
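>>>>>>>>>>>>
>>>>>>>>>>>> As a rough sketch (the host names, port, and instance name below are
>>>>>>>>>>>> only placeholders; adapt them to your xrdcluster.cfg), a supervisor is
>>>>>>>>>>>> just another cmsd/xrootd pair whose role is selected with all.role:
>>>>>>>>>>>>
>>>>>>>>>>>>    all.manager atlas-bkp1.cs.wisc.edu:3121
>>>>>>>>>>>>    all.role supervisor if sup1.cs.wisc.edu
>>>>>>>>>>>>    all.role server     if c1*.chtc.wisc.edu
>>>>>>>>>>>>
>>>>>>>>>>>> started with its own instance name, e.g. "cmsd -n sup1 -c
>>>>>>>>>>>> /path/to/xrdcluster.cfg" (and similarly for the xrootd).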
>>>>>>>>>>>>
>>>>>>>>>>>> Andy
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>
>>>>>>>>>>>>>   Is there any way to configure xrootd with more than 65
>>>>>>>>>>>>> machines? I used the config below but it doesn't work. Should I
>>>>>>>>>>>>> configure some machines' manager to be a supervisor?
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
>