Hi Andrew,

I prefer a source replacement; then I can compile it myself.

Thanks
Wen

> I can do one of two things here:
>
> 1) Supply a source replacement and then you would recompile, or
>
> 2) Give me the uname -a output of the machine where the cmsd will run
> and I'll supply a binary replacement for you.
>
> Your choice.
>
> Andy
>
> On Sun, 13 Dec 2009, wen guan wrote:
>
>> Hi Andrew,
>>
>> Good to hear the problem has been found. Thanks.
>>
>> Where can I find the patched cmsd?
>>
>> Wen
>>
>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky
>> <[log in to unmask]> wrote:
>>>
>>> Hi Wen,
>>>
>>> I found the problem. It looks like a regression from way back: a
>>> flag is missing on the redirect. This will require a patched cmsd,
>>> but you only need to replace the redirector's cmsd since the problem
>>> affects only the redirector. How would you like to proceed?
>>>
>>> Andy
>>>
>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>> It doesn't work; the atlas-bkp1 manager is still dropping nodes. In
>>>> the supervisor I still haven't seen any data server register. (I
>>>> said "I updated the ntp" because you had said the log timestamps do
>>>> not overlap.)
>>>>
>>>> Wen
>>>>
>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky
>>>> <[log in to unmask]> wrote:
>>>>>
>>>>> Hi Wen,
>>>>>
>>>>> Do you mean that everything is now working? It could be that you
>>>>> removed the xrd.timeout directive. That really could cause
>>>>> problems.
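>>>>>
>>>>> (If you do want an explicit timeout, a minimal sketch of the
>>>>> directive, with a purely illustrative value, is
>>>>>
>>>>>    xrd.timeout idle 7200
>>>>>
>>>>> to time out idle connections after two hours; see the Xrd
>>>>> configuration reference for the exact option list.)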
>>>>>
>>>>> As for the delays, that is normal when the redirector thinks
>>>>> something is going wrong. The strategy is to delay clients until it
>>>>> can get back to a stable configuration. This usually prevents jobs
>>>>> from crashing during stressful periods.
>>>>>
>>>>> Andy
>>>>>
>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>> I restarted it to test the supervisor, and also because the xrootd
>>>>>> manager frequently does not respond. The excerpt at (*) is from
>>>>>> the cms.log: the file select is delayed again and again, yet after
>>>>>> a restart everything is fine. I am now trying to find a clue about
>>>>>> it.
>>>>>>
>>>>>> (*)
>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>
>>>>>> There is no core file. I copied a fresh set of the logs to the
>>>>>> link below.
>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>
>>>>>> Wen
>>>>>>
>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky
>>>>>> <[log in to unmask]> wrote:
>>>>>>>
>>>>>>> Hi Wen,
>>>>>>>
>>>>>>> I see in the server log that it is restarting often. Could you
>>>>>>> take a look on c193 to see if you have any core files? Also
>>>>>>> please make sure that core files are enabled, as Linux defaults
>>>>>>> the core size limit to 0. The first step here is to find out why
>>>>>>> your servers are restarting.
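>>>>>>>
>>>>>>> A quick way to check and enable them in the shell that starts the
>>>>>>> daemons (the limit may also need to be raised in your init
>>>>>>> scripts) is:
>>>>>>>
>>>>>>>    ulimit -c            # show the current core-file size limit
>>>>>>>    ulimit -c unlimited  # allow core files of any size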
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>
>>>>>>>> Hi Andrew,
>>>>>>>>
>>>>>>>> The logs can be found here. In them you can see the atlas-bkp1
>>>>>>>> manager dropping, again and again, the nodes that try to connect
>>>>>>>> to it.
>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>
>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>
>>>>>>>>> Hi Wen, Could you start everything up and provide me a pointer
>>>>>>>>> to the manager log file, supervisor log file, and one data
>>>>>>>>> server log file, all covering the same time-frame (from start
>>>>>>>>> to some point where you think things are or are not working)?
>>>>>>>>> That way I can see what is happening. At the moment I only see
>>>>>>>>> two "bad" things in the config file:
>>>>>>>>>
>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager but
>>>>>>>>> you claim, via the all.manager directive, that there are three
>>>>>>>>> (bkp1 plus bkp2 and bkp3). While it should work, the log file
>>>>>>>>> will be dense with error messages. Please correct this to be
>>>>>>>>> consistent and make it easier to see real errors.
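>>>>>>>>>
>>>>>>>>> A consistent sketch using your host names (the port number and
>>>>>>>>> the wildcard pattern here are only illustrative) would be:
>>>>>>>>>
>>>>>>>>>    all.manager atlas-bkp1.cs.wisc.edu:3121
>>>>>>>>>    all.manager atlas-bkp2.cs.wisc.edu:3121
>>>>>>>>>    all.manager atlas-bkp3.cs.wisc.edu:3121
>>>>>>>>>    all.role manager if atlas-bkp*.cs.wisc.edu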
>>>>>>>>
>>>>>>>> This is not a problem for me, because this config is used on the
>>>>>>>> data servers; on each manager I change the "if
>>>>>>>> atlas-bkp1.cs.wisc.edu" to atlas-bkp2 and so on. It is a
>>>>>>>> historical artifact: at first only atlas-bkp1 was used, and
>>>>>>>> atlas-bkp2 and atlas-bkp3 were added later.
>>>>>>>>
>>>>>>>>> 2) Please use cms.space not olb.space (for historical reasons
>>>>>>>>> the latter is still accepted and overrides the former, but that
>>>>>>>>> will soon end), and please use only one (the config file uses
>>>>>>>>> both directives).
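>>>>>>>>>
>>>>>>>>> For example, keep a single line such as (the threshold value is
>>>>>>>>> illustrative; see the cms reference for the full option list):
>>>>>>>>>
>>>>>>>>>    cms.space min 10g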
>>>>>>>>
>>>>>>>> Yes, I should remove that line; in fact cms.space is in the cfg
>>>>>>>> too.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Wen
>>>>>>>>
>>>>>>>>> The xrootd system has an internal mechanism to connect servers
>>>>>>>>> with supervisors to allow for maximum reliability. You cannot
>>>>>>>>> change that algorithm and there is no need to do so. You should
>>>>>>>>> *never* tell anyone to directly connect to a supervisor. If you
>>>>>>>>> do, you will likely get unreachable nodes.
>>>>>>>>>
>>>>>>>>> As for dropping data servers, it would appear to me, given the
>>>>>>>>> flurry of such activity, that something either crashed or was
>>>>>>>>> restarted. That's why it would be good to see the complete log
>>>>>>>>> of each one of the entities.
>>>>>>>>>
>>>>>>>>> Andy
>>>>>>>>>
>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>
>>>>>>>>>> Hi Andrew,
>>>>>>>>>>
>>>>>>>>>> I read the document and wrote a config file
>>>>>>>>>> (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>>>> With my conf I can see the manager dispatching messages to the
>>>>>>>>>> supervisor, but I cannot see any data server trying to connect
>>>>>>>>>> to the supervisor. At the same time, in the manager's log, I
>>>>>>>>>> can see some data servers being dropped.
>>>>>>>>>> How does xrootd decide which data servers will connect to a
>>>>>>>>>> supervisor? Should I explicitly point some data servers at the
>>>>>>>>>> supervisor?
>>>>>>>>>>
>>>>>>>>>> (*) supervisor log
>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find failed for state /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>
>>>>>>>>>> (*) manager log
>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5693644MB MinFR=57218MB Util=0
>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
>>>>>>>>>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094 do_Space: 5696231MB free; 0% util
>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.10585:[log in to unmask]:1094 logged in.
>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from [log in to unmask]
>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running ?:[log in to unmask] inq=0
>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79 attached to poller 2; num=22
>>>>>>>>>> 091211 04:13:24 15661 Add server.21739:[log in to unmask] bumps server.15905:[log in to unmask]:1094 #63
>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:24 15661 Drop_Node: server.15905:[log in to unmask]:1094 dropped.
>>>>>>>>>> 091211 04:13:24 15661 Add Shoved server.21739:[log in to unmask]:1094 to cluster; id=63.78; num=64; min=51
>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5721854MB MinFR=57218MB Util=0
>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w /atlas
>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask]:1094 do_Space: 5721854MB free; 0% util
>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask]:1094 logged in.
>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from c187.chtc.wisc.edu; connection reset by peer
>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60 seconds
>>>>>>>>>> 091211 04:13:24 15661 Remove_Node server.21739:[log in to unmask]:1094 node 63.78
>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask] XrdPoll: FD 79 detached from poller 2; num=21
>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.24718:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>> 091211 04:13:27 15661 server.24718:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended
>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c177.chtc.wisc.edu FD=16
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.24718:[log in to unmask]:1094 node 0.3
>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.21656:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:27 15661 server.21656:[log in to unmask] XrdPoll: FD 16 detached from poller 2; num=20
>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c179.chtc.wisc.edu FD=21
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.17065:[log in to unmask]:1094 node 1.4
>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.7978:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] XrdPoll: FD 21 detached from poller 1; num=21
>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to suspended
>>>>>>>>>> 091211 04:13:27 15661 Send status to redirector.15656:14@atlas-bkp2
>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.12937:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>> 091211 04:13:27 15661 server.12937:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended
>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c182.chtc.wisc.edu FD=19
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.12937:[log in to unmask]:1094 node 7.10
>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.26620:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:27 15661 server.26620:[log in to unmask] XrdPoll: FD 19 detached from poller 2; num=19
>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.10842:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>> 091211 04:13:27 15661 server.10842:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service suspended
>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c178.chtc.wisc.edu FD=15
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.10842:[log in to unmask]:1094 node 9.12
>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.11901:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:27 15661 server.11901:[log in to unmask] XrdPoll: FD 15 detached from poller 1; num=20
>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.5535:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>> 091211 04:13:27 15661 server.5535:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service suspended
>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c181.chtc.wisc.edu FD=17
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.5535:[log in to unmask]:1094 node 5.8
>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.13984:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:27 15661 server.13984:[log in to unmask] XrdPoll: FD 17 detached from poller 0; num=21
>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.23711:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended
>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c183.chtc.wisc.edu FD=22
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.23711:[log in to unmask]:1094 node 8.11
>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.27735:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:27 15661 server.27735:[log in to unmask] XrdPoll: FD 22 detached from poller 2; num=18
>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c184.chtc.wisc.edu FD=20
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.4131:[log in to unmask]:1094 node 3.6
>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.26787:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:27 15661 server.26787:[log in to unmask] XrdPoll: FD 20 detached from poller 0; num=20
>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.10585:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended
>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c185.chtc.wisc.edu FD=23
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.10585:[log in to unmask]:1094 node 6.9
>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.8524:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll: FD 23 detached from poller 0; num=19
>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.20264:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended
>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c180.chtc.wisc.edu FD=18
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.20264:[log in to unmask]:1094 node 4.7
>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.14636:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:27 15661 server.14636:[log in to unmask] XrdPoll: FD 18 detached from poller 1; num=19
>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.1656:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c186.chtc.wisc.edu FD=24
>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.1656:[log in to unmask]:1094 node 2.5
>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.7849:[log in to unmask] logged out.
>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll: FD 24 detached from poller 1; num=18
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>
>>>>>>>>>> Wen
>>>>>>>>>>
>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky
>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>
>>>>>>>>>>> To go past 64 data servers you will need to set up one or more
>>>>>>>>>>> supervisors. This does not logically change the current
>>>>>>>>>>> configuration you have. You only need to configure one or more
>>>>>>>>>>> *new* servers (or at least xrootd processes) whose role is
>>>>>>>>>>> supervisor. We'd like them to run on separate machines for
>>>>>>>>>>> reliability purposes, but they could run on the manager node
>>>>>>>>>>> as long as you give each one a unique instance name (i.e., the
>>>>>>>>>>> -n option).
>>>>>>>>>>>
>>>>>>>>>>> The front part of the cmsd reference explains how to do this.
>>>>>>>>>>>
>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
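>>>>>>>>>>>
>>>>>>>>>>> As a sketch (the instance name, host, and paths here are only
>>>>>>>>>>> illustrative), a supervisor can be declared in the same config
>>>>>>>>>>> file with
>>>>>>>>>>>
>>>>>>>>>>>    all.role supervisor if super1.cs.wisc.edu
>>>>>>>>>>>
>>>>>>>>>>> and its pair of daemons started under a unique instance name:
>>>>>>>>>>>
>>>>>>>>>>>    cmsd -n super1 -c /path/to/xrdcluster.cfg &
>>>>>>>>>>>    xrootd -n super1 -c /path/to/xrdcluster.cfg &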
>>>>>>>>>>>
>>>>>>>>>>> Andy
>>>>>>>>>>>
>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>
>>>>>>>>>>>> Is there any way to configure xrootd with more than 65
>>>>>>>>>>>> machines? I used the configuration below but it doesn't
>>>>>>>>>>>> work. Should I configure some machines' managers to be
>>>>>>>>>>>> supervisors?
>>>>>>>>>>>>
>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>
>>>>>>>>>>>> Wen
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>