Hi Andrew,

I prefer a source replacement; then I can compile it myself.

Thanks
Wen

> I can do one of two things here:
>
> 1) Supply a source replacement and then you would recompile, or
>
> 2) Give me the uname -a output of the machine where the cmsd will run
> and I'll supply a binary replacement for you.
>
> Your choice.
>
> Andy
>
> On Sun, 13 Dec 2009, wen guan wrote:
>
>> Hi Andrew,
>>
>> Good to hear the problem has been found. Thanks.
>>
>> Where can I find the patched cmsd?
>>
>> Wen
>>
>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky
>> <[log in to unmask]> wrote:
>>>
>>> Hi Wen,
>>>
>>> I found the problem. It looks like a regression from way back: a
>>> flag is missing on the redirect. This will require a patched cmsd,
>>> but you only need to replace the redirector's cmsd since the problem
>>> affects only the redirector. How would you like to proceed?
>>>
>>> Andy
>>>
>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>> It doesn't work; the atlas-bkp1 manager is still dropping nodes. In
>>>> the supervisor I still haven't seen any data server register. (I
>>>> said "I updated the ntp" because you had said the log timestamps do
>>>> not overlap.)
>>>>
>>>> Wen
>>>>
>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky
>>>> <[log in to unmask]> wrote:
>>>>>
>>>>> Hi Wen,
>>>>>
>>>>> Do you mean that everything is now working? It could be that you
>>>>> removed the xrd.timeout directive. That really could cause
>>>>> problems.
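>>>>>
>>>>> (If you do want an explicit timeout, a minimal sketch of the
>>>>> directive, with a purely illustrative value, is
>>>>>
>>>>>    xrd.timeout idle 7200
>>>>>
>>>>> to time out idle connections after two hours; see the Xrd
>>>>> configuration reference for the exact option list.)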
>>>>>
>>>>> As for the delays, that is normal when the redirector thinks
>>>>> something is going wrong. The strategy is to delay clients until it
>>>>> can get back to a stable configuration. This usually prevents jobs
>>>>> from crashing during stressful periods.
>>>>>
>>>>> Andy
>>>>>
>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>> I restarted it to test the supervisor, and also because the xrootd
>>>>>> manager frequently does not respond. The excerpt at (*) is from
>>>>>> the cms.log: the file select is delayed again and again, yet after
>>>>>> a restart everything is fine. I am now trying to find a clue about
>>>>>> it.
>>>>>>
>>>>>> (*)
>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>
>>>>>> There is no core file. I copied a fresh set of the logs to the
>>>>>> link below.
>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>
>>>>>> Wen
>>>>>>
>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky
>>>>>> <[log in to unmask]> wrote:
>>>>>>>
>>>>>>> Hi Wen,
>>>>>>>
>>>>>>> I see in the server log that it is restarting often. Could you
>>>>>>> take a look on c193 to see if you have any core files? Also
>>>>>>> please make sure that core files are enabled, as Linux defaults
>>>>>>> the core size limit to 0. The first step here is to find out why
>>>>>>> your servers are restarting.
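>>>>>>>
>>>>>>> A quick way to check and enable them in the shell that starts the
>>>>>>> daemons (the limit may also need to be raised in your init
>>>>>>> scripts) is:
>>>>>>>
>>>>>>>    ulimit -c            # show the current core-file size limit
>>>>>>>    ulimit -c unlimited  # allow core files of any size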
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>
>>>>>>>> Hi Andrew,
>>>>>>>>
>>>>>>>> The logs can be found here. In them you can see the atlas-bkp1
>>>>>>>> manager dropping, again and again, the nodes that try to connect
>>>>>>>> to it.
>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>
>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>
>>>>>>>>> Hi Wen, Could you start everything up and provide me a pointer
>>>>>>>>> to the manager log file, supervisor log file, and one data
>>>>>>>>> server log file, all covering the same time-frame (from start
>>>>>>>>> to some point where you think things are or are not working)?
>>>>>>>>> That way I can see what is happening. At the moment I only see
>>>>>>>>> two "bad" things in the config file:
>>>>>>>>>
>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager but
>>>>>>>>> you claim, via the all.manager directive, that there are three
>>>>>>>>> (bkp1 plus bkp2 and bkp3). While it should work, the log file
>>>>>>>>> will be dense with error messages. Please correct this to be
>>>>>>>>> consistent and make it easier to see real errors.
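>>>>>>>>>
>>>>>>>>> A consistent sketch using your host names (the port number and
>>>>>>>>> the wildcard pattern here are only illustrative) would be:
>>>>>>>>>
>>>>>>>>>    all.manager atlas-bkp1.cs.wisc.edu:3121
>>>>>>>>>    all.manager atlas-bkp2.cs.wisc.edu:3121
>>>>>>>>>    all.manager atlas-bkp3.cs.wisc.edu:3121
>>>>>>>>>    all.role manager if atlas-bkp*.cs.wisc.edu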
>>>>>>>>
>>>>>>>> This is not a problem for me, because this config is used on the
>>>>>>>> data servers; on each manager I change the "if
>>>>>>>> atlas-bkp1.cs.wisc.edu" to atlas-bkp2 and so on. It is a
>>>>>>>> historical artifact: at first only atlas-bkp1 was used, and
>>>>>>>> atlas-bkp2 and atlas-bkp3 were added later.
>>>>>>>>
>>>>>>>>> 2) Please use cms.space not olb.space (for historical reasons
>>>>>>>>> the latter is still accepted and overrides the former, but that
>>>>>>>>> will soon end), and please use only one (the config file uses
>>>>>>>>> both directives).
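>>>>>>>>>
>>>>>>>>> For example, keep a single line such as (the threshold value is
>>>>>>>>> illustrative; see the cms reference for the full option list):
>>>>>>>>>
>>>>>>>>>    cms.space min 10g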
>>>>>>>>
>>>>>>>> Yes, I should remove that line; in fact cms.space is in the cfg
>>>>>>>> too.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Wen
>>>>>>>>
>>>>>>>>> The xrootd system has an internal mechanism to connect servers
>>>>>>>>> with supervisors to allow for maximum reliability. You cannot
>>>>>>>>> change that algorithm and there is no need to do so. You should
>>>>>>>>> *never* tell anyone to directly connect to a supervisor. If you
>>>>>>>>> do, you will likely get unreachable nodes.
>>>>>>>>>
>>>>>>>>> As for dropping data servers, it would appear to me, given the
>>>>>>>>> flurry of such activity, that something either crashed or was
>>>>>>>>> restarted. That's why it would be good to see the complete log
>>>>>>>>> of each one of the entities.
>>>>>>>>>
>>>>>>>>> Andy
>>>>>>>>>
>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>
>>>>>>>>>> Hi Andrew,
>>>>>>>>>>
>>>>>>>>>> I read the document and wrote a config file
>>>>>>>>>> (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>>>> With my conf I can see the manager dispatching messages to the
>>>>>>>>>> supervisor, but I cannot see any data server trying to connect
>>>>>>>>>> to the supervisor. At the same time, in the manager's log, I
>>>>>>>>>> can see some data servers being dropped.
>>>>>>>>>> How does xrootd decide which data servers will connect to a
>>>>>>>>>> supervisor? Should I explicitly point some data servers at the
>>>>>>>>>> supervisor?
>>>>>>>>>>
>>>>>>>>>> (*) supervisor log
>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find failed for state /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>
>>>>>>>>>> (*) manager log
>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5693644MB MinFR=57218MB Util=0
>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
>>>>>>>>>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094 do_Space: 5696231MB free; 0% util
>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.10585:[log in to unmask]:1094 logged in.
>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from [log in to unmask]
>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running ?:[log in to unmask] inq=0
>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79 attached to poller 2; num=22
>>>>>>>>>> 091211 04:13:24 15661 Add server.21739:[log in to unmask] bumps server.15905:[log in to unmask]:1094 #63
>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:24 15661 Drop_Node: server.15905:[log in to unmask]:1094 dropped.
>>>>>>>>>> 091211 04:13:24 15661 Add Shoved server.21739:[log in to unmask]:1094 to cluster; id=63.78; num=64; min=51
>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5721854MB MinFR=57218MB Util=0
>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w /atlas
>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask]:1094 do_Space: 5721854MB free; 0% util
>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask]:1094 logged in.
>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from c187.chtc.wisc.edu; connection reset by peer
>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60 seconds
>>>>>>>>>> 091211 04:13:24 15661 Remove_Node server.21739:[log in to unmask]:1094 node 63.78
>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask] XrdPoll: FD 79 detached from poller 2; num=21
>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.24718:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>> 091211 04:13:27 15661 server.24718:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended
>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c177.chtc.wisc.edu FD=16
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.24718:[log in to unmask]:1094 node 0.3
>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.21656:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:27 15661 server.21656:[log in to unmask] XrdPoll: FD 16 detached from poller 2; num=20
>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c179.chtc.wisc.edu FD=21
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.17065:[log in to unmask]:1094 node 1.4
>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.7978:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] XrdPoll: FD 21 detached from poller 1; num=21
>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to suspended
>>>>>>>>>> 091211 04:13:27 15661 Send status to redirector.15656:14@atlas-bkp2
>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.12937:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>> 091211 04:13:27 15661 server.12937:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended
>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c182.chtc.wisc.edu FD=19
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.12937:[log in to unmask]:1094 node 7.10
>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.26620:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:27 15661 server.26620:[log in to unmask] XrdPoll: FD 19 detached from poller 2; num=19
>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.10842:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>> 091211 04:13:27 15661 server.10842:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service suspended
>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c178.chtc.wisc.edu FD=15
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.10842:[log in to unmask]:1094 node 9.12
>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.11901:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:27 15661 server.11901:[log in to unmask] XrdPoll: FD 15 detached from poller 1; num=20
>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.5535:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>> 091211 04:13:27 15661 server.5535:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service suspended
>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c181.chtc.wisc.edu FD=17
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.5535:[log in to unmask]:1094 node 5.8
>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.13984:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:27 15661 server.13984:[log in to unmask] XrdPoll: FD 17 detached from poller 0; num=21
>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.23711:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended
>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c183.chtc.wisc.edu FD=22
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.23711:[log in to unmask]:1094 node 8.11
>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.27735:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:27 15661 server.27735:[log in to unmask] XrdPoll: FD 22 detached from poller 2; num=18
>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c184.chtc.wisc.edu FD=20
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.4131:[log in to unmask]:1094 node 3.6
>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.26787:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:27 15661 server.26787:[log in to unmask] XrdPoll: FD 20 detached from poller 0; num=20
>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.10585:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended
>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c185.chtc.wisc.edu FD=23
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.10585:[log in to unmask]:1094 node 6.9
>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.8524:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll: FD 23 detached from poller 0; num=19
>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.20264:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended
>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c180.chtc.wisc.edu FD=18
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.20264:[log in to unmask]:1094 node 4.7
>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.14636:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:27 15661 server.14636:[log in to unmask] XrdPoll: FD 18 detached from poller 1; num=19
>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.1656:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c186.chtc.wisc.edu FD=24
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.1656:[log in to unmask]:1094 node 2.5
>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.7849:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll: FD 24 detached from poller 1; num=18
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>
>>>>>>>>>> Wen
>>>>>>>>>>
>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky
>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>
>>>>>>>>>>> To go past 64 data servers you will need to set up one or more
>>>>>>>>>>> supervisors. This does not logically change the current
>>>>>>>>>>> configuration you have. You only need to configure one or more
>>>>>>>>>>> *new* servers (or at least xrootd processes) whose role is
>>>>>>>>>>> supervisor. We'd like them to run on separate machines for
>>>>>>>>>>> reliability purposes, but they could run on the manager node
>>>>>>>>>>> as long as you give each one a unique instance name (i.e., the
>>>>>>>>>>> -n option).
>>>>>>>>>>>
>>>>>>>>>>> The front part of the cmsd reference explains how to do this.
>>>>>>>>>>>
>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
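>>>>>>>>>>>
>>>>>>>>>>> As a sketch (the instance name, host, and paths here are only
>>>>>>>>>>> illustrative), a supervisor can be declared in the same config
>>>>>>>>>>> file with
>>>>>>>>>>>
>>>>>>>>>>>    all.role supervisor if super1.cs.wisc.edu
>>>>>>>>>>>
>>>>>>>>>>> and its pair of daemons started under a unique instance name:
>>>>>>>>>>>
>>>>>>>>>>>    cmsd -n super1 -c /path/to/xrdcluster.cfg &
>>>>>>>>>>>    xrootd -n super1 -c /path/to/xrdcluster.cfg &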
>>>>>>>>>>>
>>>>>>>>>>> Andy
>>>>>>>>>>>
>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>
>>>>>>>>>>>> Is there any way to configure xrootd with more than 65
>>>>>>>>>>>> machines? I used the configuration below but it doesn't
>>>>>>>>>>>> work. Should I configure some machines' managers to be
>>>>>>>>>>>> supervisors?
>>>>>>>>>>>>
>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>
>>>>>>>>>>>> Wen
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>