Hi Wen,

I can do one of two things here:

1) Supply a source replacement and then you would recompile, or
2) Give me the uname -a of where the cmsd will run and I'll supply a binary replacement for you.

Your choice.

Andy

On Sun, 13 Dec 2009, wen guan wrote:

> Hi Andrew,
>
> The problem is found. Great. Thanks.
>
> Where can I find the patched cmsd?
>
> Wen
>
> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky
> <[log in to unmask]> wrote:
>> Hi Wen,
>>
>> I found the problem. Looks like a regression from way back when. There is a
>> missing flag on the redirect. This will require a patched cmsd, but you only
>> need to replace the redirector's cmsd, as this only affects the redirector.
>> How would you like to proceed?
>>
>> Andy
>>
>> On Sat, 12 Dec 2009, wen guan wrote:
>>
>>> Hi Andrew,
>>>
>>> It doesn't work. The atlas-bkp1 manager is still dropping nodes.
>>> In the supervisor, I still haven't seen any data server registered. I said
>>> "I updated the ntp" because you said "the log timestamps do not overlap".
>>>
>>> Wen
>>>
>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky
>>> <[log in to unmask]> wrote:
>>>>
>>>> Hi Wen,
>>>>
>>>> Do you mean that everything is now working? It could be that you removed the
>>>> xrd.timeout directive. That really could cause problems. As for the delays,
>>>> that is normal when the redirector thinks something is going wrong. The
>>>> strategy is to delay clients until it can get back to a stable
>>>> configuration. This usually prevents jobs from crashing during stressful
>>>> periods.
>>>>
>>>> Andy
>>>>
>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>> I restarted it to test the supervisor, and also because the xrootd manager
>>>>> frequently doesn't respond. (*) is the cms.log; the file select is
>>>>> delayed again and again. After a restart, everything is fine. Now I
>>>>> am trying to find a clue about it.
>>>>>
>>>>> (*)
>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>
>>>>> There is no core file. I copied a new set of the logs to the link below.
>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>
>>>>> Wen
>>>>>
>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky
>>>>> <[log in to unmask]> wrote:
>>>>>>
>>>>>> Hi Wen,
>>>>>>
>>>>>> I see in the server log that it is restarting often. Could you take a look
>>>>>> on c193 to see if you have any core files? Also please make sure that
>>>>>> core files are enabled, as Linux defaults the size to 0. The first step
>>>>>> here is to find out why your servers are restarting.
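As a general aside on the core-file point above: the core size is a per-process resource limit, so it has to be raised in the shell or init script that actually starts the daemons. A minimal, generic illustration (the start script name here is made up):

    ulimit -c             # show the current core-file size limit; on most Linux boxes this is 0
    ulimit -c unlimited   # allow full core dumps for anything started from this shell
    ./start-xrootd.sh     # hypothetical start script; the daemons inherit the limit from this shell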
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> The logs can be found here. From the log you can see the atlas-bkp1
>>>>>>> manager is dropping, again and again, the nodes that try to connect to it.
>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>
>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>
>>>>>>>> Hi Wen,
>>>>>>>>
>>>>>>>> Could you start everything up and provide me a pointer to the
>>>>>>>> manager log file, supervisor log file, and one data server log file, all of
>>>>>>>> which cover the same time-frame (from start to some point where you think
>>>>>>>> things are working or not). That way I can see what is happening. At the
>>>>>>>> moment I only see two "bad" things in the config file:
>>>>>>>>
>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager but you claim, via
>>>>>>>> the all.manager directive, that there are three (bkp2 and bkp3). While it
>>>>>>>> should work, the log file will be dense with error messages. Please correct
>>>>>>>> this to be consistent and make it easier to see real errors.
>>>>>>>
>>>>>>> This is not a problem for me, because this config is used on the data
>>>>>>> servers. On the managers, I change the "if atlas-bkp1.cs.wisc.edu" clause to
>>>>>>> atlas-bkp2 and so on. This is historical: at first only atlas-bkp1 was used;
>>>>>>> atlas-bkp2 and atlas-bkp3 were added later.
>>>>>>>
>>>>>>>> 2) Please use cms.space not olb.space (for historical reasons the latter is
>>>>>>>> still accepted and over-rides the former, but that will soon end), and
>>>>>>>> please use only one (the config file uses both directives).
>>>>>>>
>>>>>>> Yes, I should remove this line. In fact cms.space is in the cfg too.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Wen
>>>>>>>
>>>>>>>> The xrootd has an internal mechanism to connect servers with supervisors to
>>>>>>>> allow for maximum reliability. You cannot change that algorithm and there is
>>>>>>>> no need to do so. You should *never* tell anyone to directly connect to a
>>>>>>>> supervisor. If you do, you will likely get unreachable nodes.
>>>>>>>>
>>>>>>>> As for dropping data servers, it would appear to me, given the flurry of
>>>>>>>> such activity, that something either crashed or was restarted. That's why it
>>>>>>>> would be good to see the complete log of each one of the entities.
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>
>>>>>>>>> Hi Andrew,
>>>>>>>>>
>>>>>>>>> I read the document and wrote a config
>>>>>>>>> file (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>>> With my conf, I can see the manager dispatching messages to the
>>>>>>>>> supervisor, but I cannot see any data server try to connect to the
>>>>>>>>> supervisor. At the same time, in the manager's log, I can see some
>>>>>>>>> data servers being dropped.
>>>>>>>>> How does xrootd decide which data servers will connect to a supervisor?
>>>>>>>>> Should I specify some data servers to connect to the supervisor?
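To illustrate the two configuration points raised above (consistent manager declarations, and cms.space rather than olb.space), a rough sketch using the hostnames from this thread; the port number and space values are placeholders, and the cms_config reference linked further down the thread is the authority on exact syntax:

    # One config file shared by every node, so the manager list is the same everywhere.
    # :3121 is only a placeholder; use whatever cmsd port the cluster already runs.
    all.manager atlas-bkp1.cs.wisc.edu:3121
    all.manager atlas-bkp2.cs.wisc.edu:3121
    all.manager atlas-bkp3.cs.wisc.edu:3121

    # Default everyone to the server role, then name the three redirector hosts as managers.
    all.role server
    all.role manager if atlas-bkp1.cs.wisc.edu
    all.role manager if atlas-bkp2.cs.wisc.edu
    all.role manager if atlas-bkp3.cs.wisc.edu

    # Keep only the cms.* form of the space directive and delete the olb.space line.
    cms.space min 2% 10g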
>>>>>>>>>
>>>>>>>>> (*) supervisor log
>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find failed for state /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>
>>>>>>>>> (*) manager log
>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5693644MB MinFR=57218MB Util=0
>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
>>>>>>>>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094 do_Space: 5696231MB free; 0% util
>>>>>>>>> 091211 04:13:24 15661 Protocol: server.10585:[log in to unmask]:1094 logged in.
>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from [log in to unmask]
>>>>>>>>> 091211 04:13:24 15661 XrdSched: running ?:[log in to unmask] inq=0
>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79 attached to poller 2; num=22
>>>>>>>>> 091211 04:13:24 15661 Add server.21739:[log in to unmask] bumps server.15905:[log in to unmask]:1094 #63
>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>> 091211 04:13:24 15661 Drop_Node: server.15905:[log in to unmask]:1094 dropped.
>>>>>>>>> 091211 04:13:24 15661 Add Shoved server.21739:[log in to unmask]:1094 to cluster; id=63.78; num=64; min=51
>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5721854MB MinFR=57218MB Util=0
>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w /atlas
>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask]:1094 do_Space: 5721854MB free; 0% util
>>>>>>>>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask]:1094 logged in.
>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from c187.chtc.wisc.edu; connection reset by peer
>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60 seconds
>>>>>>>>> 091211 04:13:24 15661 Remove_Node server.21739:[log in to unmask]:1094 node 63.78
>>>>>>>>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask] logged out.
>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask] XrdPoll: FD 79 detached from poller 2; num=21
>>>>>>>>> 091211 04:13:27 15661 Dispatch server.24718:[log in to unmask]:1094 for status dlen=0
>>>>>>>>> 091211 04:13:27 15661 server.24718:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended
>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c177.chtc.wisc.edu FD=16
>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.24718:[log in to unmask]:1094 node 0.3
>>>>>>>>> 091211 04:13:27 15661 Protocol: server.21656:[log in to unmask] logged out.
>>>>>>>>> 091211 04:13:27 15661 server.21656:[log in to unmask] XrdPoll: FD 16 detached from poller 2; num=20
>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c179.chtc.wisc.edu FD=21
>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.17065:[log in to unmask]:1094 node 1.4
>>>>>>>>> 091211 04:13:27 15661 Protocol: server.7978:[log in to unmask] logged out.
>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] XrdPoll: FD 21 detached from poller 1; num=21
>>>>>>>>> 091211 04:13:27 15661 State: Status changed to suspended
>>>>>>>>> 091211 04:13:27 15661 Send status to redirector.15656:14@atlas-bkp2
>>>>>>>>> 091211 04:13:27 15661 Dispatch server.12937:[log in to unmask]:1094 for status dlen=0
>>>>>>>>> 091211 04:13:27 15661 server.12937:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended
>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c182.chtc.wisc.edu FD=19
>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.12937:[log in to unmask]:1094 node 7.10
>>>>>>>>> 091211 04:13:27 15661 Protocol: server.26620:[log in to unmask] logged out.
>>>>>>>>> 091211 04:13:27 15661 server.26620:[log in to unmask] XrdPoll: FD 19 detached from poller 2; num=19
>>>>>>>>> 091211 04:13:27 15661 Dispatch server.10842:[log in to unmask]:1094 for status dlen=0
>>>>>>>>> 091211 04:13:27 15661 server.10842:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service suspended
>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c178.chtc.wisc.edu FD=15
>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.10842:[log in to unmask]:1094 node 9.12
>>>>>>>>> 091211 04:13:27 15661 Protocol: server.11901:[log in to unmask] logged out.
>>>>>>>>> 091211 04:13:27 15661 server.11901:[log in to unmask] XrdPoll: FD 15 detached from poller 1; num=20
>>>>>>>>> 091211 04:13:27 15661 Dispatch server.5535:[log in to unmask]:1094 for status dlen=0
>>>>>>>>> 091211 04:13:27 15661 server.5535:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service suspended
>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c181.chtc.wisc.edu FD=17
>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.5535:[log in to unmask]:1094 node 5.8
>>>>>>>>> 091211 04:13:27 15661 Protocol: server.13984:[log in to unmask] logged out.
>>>>>>>>> 091211 04:13:27 15661 server.13984:[log in to unmask] XrdPoll: FD 17 detached from poller 0; num=21
>>>>>>>>> 091211 04:13:27 15661 Dispatch server.23711:[log in to unmask]:1094 for status dlen=0
>>>>>>>>> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended
>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c183.chtc.wisc.edu FD=22
>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.23711:[log in to unmask]:1094 node 8.11
>>>>>>>>> 091211 04:13:27 15661 Protocol: server.27735:[log in to unmask] logged out.
>>>>>>>>> 091211 04:13:27 15661 server.27735:[log in to unmask] XrdPoll: FD 22 detached from poller 2; num=18
>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c184.chtc.wisc.edu FD=20
>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.4131:[log in to unmask]:1094 node 3.6
>>>>>>>>> 091211 04:13:27 15661 Protocol: server.26787:[log in to unmask] logged out.
>>>>>>>>> 091211 04:13:27 15661 server.26787:[log in to unmask] XrdPoll: FD 20 detached from poller 0; num=20
>>>>>>>>> 091211 04:13:27 15661 Dispatch server.10585:[log in to unmask]:1094 for status dlen=0
>>>>>>>>> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended
>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c185.chtc.wisc.edu FD=23
>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.10585:[log in to unmask]:1094 node 6.9
>>>>>>>>> 091211 04:13:27 15661 Protocol: server.8524:[log in to unmask] logged out.
>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll: FD 23 detached from poller 0; num=19
>>>>>>>>> 091211 04:13:27 15661 Dispatch server.20264:[log in to unmask]:1094 for status dlen=0
>>>>>>>>> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended
>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c180.chtc.wisc.edu FD=18
>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.20264:[log in to unmask]:1094 node 4.7
>>>>>>>>> 091211 04:13:27 15661 Protocol: server.14636:[log in to unmask] logged out.
>>>>>>>>> 091211 04:13:27 15661 server.14636:[log in to unmask] XrdPoll: FD 18 detached from poller 1; num=19
>>>>>>>>> 091211 04:13:27 15661 Dispatch server.1656:[log in to unmask]:1094 for status dlen=0
>>>>>>>>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c186.chtc.wisc.edu FD=24
>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.1656:[log in to unmask]:1094 node 2.5
>>>>>>>>> 091211 04:13:27 15661 Protocol: server.7849:[log in to unmask] logged out.
>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll: FD 24 detached from poller 1; num=18
>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>
>>>>>>>>> Wen
>>>>>>>>>
>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky
>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Wen,
>>>>>>>>>>
>>>>>>>>>> To go past 64 data servers you will need to set up one or more supervisors.
>>>>>>>>>> This does not logically change the current configuration you have. You only
>>>>>>>>>> need to configure one or more *new* servers (or at least xrootd processes)
>>>>>>>>>> whose role is supervisor. We'd like them to run on separate machines for
>>>>>>>>>> reliability purposes, but they could run on the manager node as long as you
>>>>>>>>>> give each one a unique instance name (i.e., the -n option).
>>>>>>>>>>
>>>>>>>>>> The front part of the cmsd reference explains how to do this.
>>>>>>>>>>
>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>>>>>>>>
>>>>>>>>>> Andy
>>>>>>>>>>
>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>
>>>>>>>>>>> Is there a way to configure xrootd with more than 65 machines? I used
>>>>>>>>>>> the config below but it doesn't work. Should I configure some machines'
>>>>>>>>>>> manager to be a supervisor?
>>>>>>>>>>>
>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>
>>>>>>>>>>> Wen
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
>
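As a rough sketch of the supervisor setup described in the quoted advice above (a supervisor running as a second instance on the manager host, using the -n option): the directive forms, the "named" qualifier, and the paths here are illustrative and should be checked against the cms_config reference, not taken as a tested recipe.

    # In the shared config: default to server, make atlas-bkp1 the manager, and let the
    # instance started with "-n super" on that same host take the supervisor role.
    all.role server
    all.role manager    if atlas-bkp1.cs.wisc.edu
    all.role supervisor if atlas-bkp1.cs.wisc.edu named super

    # On atlas-bkp1, start a second cmsd/xrootd pair under its own instance name:
    cmsd   -n super -c /path/to/xrdcluster.cfg &
    xrootd -n super -c /path/to/xrdcluster.cfg &

With more than 64 data servers, the manager hands the overflow off to the supervisor on its own; as Andy stresses above, nothing should ever be pointed at the supervisor directly.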