Hi Andrew, Thanks. I used the new cmsd at the atlas-bkp1 manager, but it is still dropping nodes, and in the supervisor's log I cannot find any data server registering to it. The new logs are at http://higgs03.cs.wisc.edu/wguan/*.20091213. The manager was patched at 091213 08:38:15. Wen On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky <[log in to unmask]> wrote: > Hi Wen > > You will find the source replacement at: > > http://www.slac.stanford.edu/~abh/cmsd/ > > It's XrdCmsCluster.cc and it replaces xrootd/src/XrdCms/XrdCmsCluster.cc > > I'm stepping out for a couple of hours but will be back to see how things went. Sorry for the issues :-( > > Andy > > On Sun, 13 Dec 2009, wen guan wrote: > >> Hi Andrew, >> >> I prefer a source replacement; then I can compile it myself. >> >> Thanks >> Wen >>> >>> I can do one of two things here: >>> >>> 1) Supply a source replacement and then you would recompile, or >>> >>> 2) Give me the uname -a of where the cmsd will run and I'll supply a binary replacement for you. >>> >>> Your choice. >>> >>> Andy >>> >>> On Sun, 13 Dec 2009, wen guan wrote: >>> >>>> Hi Andrew >>>> >>>> The problem is found. Great, thanks. >>>> >>>> Where can I find the patched cmsd? >>>> >>>> Wen >>>> >>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky <[log in to unmask]> wrote: >>>>> >>>>> Hi Wen, >>>>> >>>>> I found the problem. It looks like a regression from way back when: there is a missing flag on the redirect. This will require a patched cmsd, but you need only replace the redirector's cmsd, as this only affects the redirector. How would you like to proceed? >>>>> >>>>> Andy >>>>> >>>>> On Sat, 12 Dec 2009, wen guan wrote: >>>>> >>>>>> Hi Andrew, >>>>>> >>>>>> It doesn't work; the atlas-bkp1 manager is still dropping nodes. In the supervisor I still haven't seen any data server registered. I said "I updated the ntp" because you said "the log timestamp do not overlap". >>>>>> >>>>>> Wen >>>>>> >>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky <[log in to unmask]> wrote: >>>>>>> >>>>>>> Hi Wen, >>>>>>> >>>>>>> Do you mean that everything is now working? It could be that you removed the xrd.timeout directive; that really could cause problems. As for the delays, that is normal when the redirector thinks something is going wrong. The strategy is to delay clients until it can get back to a stable configuration. This usually prevents jobs from crashing during stressful periods. >>>>>>> >>>>>>> Andy >>>>>>> >>>>>>> On Sat, 12 Dec 2009, wen guan wrote: >>>>>>> >>>>>>>> Hi Andrew, >>>>>>>> >>>>>>>> I restarted it to do the supervisor test, and also because the xrootd manager frequently doesn't respond. (*) below is the cms.log; the file select gets delayed again and again. After a restart everything is fine. Now I am trying to find a clue about it.
>>>>>>>> >>>>>>>> (*) >>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2 >>>>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2 >>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2 >>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2 >>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0 >>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166 >>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0 >>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0 >>>>>>>> >>>>>>>> There is no core file. I copied new copies of the logs to the link below. >>>>>>>> http://higgs03.cs.wisc.edu/wguan/ >>>>>>>> >>>>>>>> Wen >>>>>>>> >>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky <[log in to unmask]> wrote: >>>>>>>>> >>>>>>>>> Hi Wen, >>>>>>>>> >>>>>>>>> I see in the server log that it is restarting often. Could you take a look in c193 to see if you have any core files? Also please make sure that core files are enabled, as Linux defaults the size to 0. The first step here is to find out why your servers are restarting. >>>>>>>>> >>>>>>>>> Andy >>>>>>>>> >>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote: >>>>>>>>> >>>>>>>>>> Hi Andrew, >>>>>>>>>> >>>>>>>>>> The logs can be found here. From the log you can see the atlas-bkp1 manager is dropping, again and again, the nodes that try to connect to it. >>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/ >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky <[log in to unmask]> wrote: >>>>>>>>>>> >>>>>>>>>>> Hi Wen, Could you start everything up and provide me a pointer to the manager log file, supervisor log file, and one data server log file, all of which cover the same time frame (from start to some point where you think things are working or not)? That way I can see what is happening. At the moment I only see two "bad" things in the config file: >>>>>>>>>>> >>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager, but you claim, via the all.manager directive, that there are three (bkp2 and bkp3 as well). While it should work, the log file will be dense with error messages. Please correct this to be consistent and make it easier to see real errors. >>>>>>>>>> >>>>>>>>>> This is not a problem for me, because this config is used on the data servers. On the manager, I updated the "if atlas-bkp1.cs.wisc.edu" clause to atlas-bkp2 and so on.
This is a historical artifact: at first only atlas-bkp1 was used; atlas-bkp2 and atlas-bkp3 were added later. >>>>>>>>>> >>>>>>>>>>> 2) Please use cms.space, not olb.space (for historical reasons the latter is still accepted and overrides the former, but that will soon end), and please use only one (the config file uses both directives). >>>>>>>>>> >>>>>>>>>> Yes, I should remove that line; in fact cms.space is in the cfg too. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Thanks >>>>>>>>>> Wen >>>>>>>>>> >>>>>>>>>>> xrootd has an internal mechanism to connect servers with supervisors to allow for maximum reliability. You cannot change that algorithm and there is no need to do so. You should *never* tell anyone to directly connect to a supervisor. If you do, you will likely get unreachable nodes. >>>>>>>>>>> >>>>>>>>>>> As for dropping data servers, it would appear to me, given the flurry of such activity, that something either crashed or was restarted. That's why it would be good to see the complete log of each one of the entities. >>>>>>>>>>> >>>>>>>>>>> Andy >>>>>>>>>>> >>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>> >>>>>>>>>>>> I read the document and wrote a config file (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg). With my conf I can see the manager dispatching messages to the supervisor, but I cannot see any data server trying to connect to the supervisor. At the same time, in the manager's log, I can see some data servers being Dropped. How does xrootd decide which data servers will connect to the supervisor? Should I specify some data servers to connect to the supervisor? >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> (*) supervisor log >>>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42 >>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: /atlas/xrootd/users/wguan/test/test131141 >>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find failed for state /atlas/xrootd/users/wguan/test/test131141 >>>>>>>>>>>> >>>>>>>>>>>> (*) manager log >>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB >>>>>>>>>>>> NumFS=1 >>>>>>>>>>>> FSpace=5693644MB MinFR=57218MB Util=0 >>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w >>>>>>>>>>>> /atlas >>>>>>>>>>>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094 >>>>>>>>>>>> do_Space: 5696231MB free; 0% util >>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>> server.10585:[log in to unmask]:1094 logged in.
>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from >>>>>>>>>>>> [log in to unmask] >>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running ?:[log in to unmask] >>>>>>>>>>>> inq=0 >>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd >>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79 >>>>>>>>>>>> attached >>>>>>>>>>>> to poller 2; num=22 >>>>>>>>>>>> 091211 04:13:24 15661 Add server.21739:[log in to unmask] >>>>>>>>>>>> bumps >>>>>>>>>>>> server.15905:[log in to unmask]:1094 #63 >>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node: >>>>>>>>>>>> server.15905:[log in to unmask]:1094 dropped. >>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved >>>>>>>>>>>> server.21739:[log in to unmask]:1094 to cluster; id=63.78; >>>>>>>>>>>> num=64; >>>>>>>>>>>> min=51 >>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0 >>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB >>>>>>>>>>>> NumFS=1 >>>>>>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0 >>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w >>>>>>>>>>>> /atlas >>>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask]:1094 >>>>>>>>>>>> do_Space: 5721854MB free; 0% util >>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>> server.21739:[log in to unmask]:1094 logged in. >>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from >>>>>>>>>>>> c187.chtc.wisc.edu; connection reset by peer >>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60 >>>>>>>>>>>> seconds >>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node >>>>>>>>>>>> server.21739:[log in to unmask]:1094 node 63.78 >>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>> logged >>>>>>>>>>>> out. >>>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask] >>>>>>>>>>>> XrdPoll: >>>>>>>>>>>> FD >>>>>>>>>>>> 79 detached from poller 2; num=21 >>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>> 091211 04:13:27 15661 server.24718:[log in to unmask]:1094 >>>>>>>>>>>> do_Status: >>>>>>>>>>>> suspend >>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended >>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>> c177.chtc.wisc.edu >>>>>>>>>>>> FD=16 >>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>> server.24718:[log in to unmask]:1094 node 0.3 >>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>> logged >>>>>>>>>>>> out. >>>>>>>>>>>> 091211 04:13:27 15661 server.21656:[log in to unmask] >>>>>>>>>>>> XrdPoll: >>>>>>>>>>>> FD >>>>>>>>>>>> 16 detached from poller 2; num=20 >>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>> c179.chtc.wisc.edu >>>>>>>>>>>> FD=21 >>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>> server.17065:[log in to unmask]:1094 node 1.4 >>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>> server.7978:[log in to unmask] >>>>>>>>>>>> logged >>>>>>>>>>>> out. 
>>>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] XrdPoll: >>>>>>>>>>>> FD >>>>>>>>>>>> 21 >>>>>>>>>>>> detached from poller 1; num=21 >>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to suspended >>>>>>>>>>>> 091211 04:13:27 15661 Send status to >>>>>>>>>>>> redirector.15656:14@atlas-bkp2 >>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>> 091211 04:13:27 15661 server.12937:[log in to unmask]:1094 >>>>>>>>>>>> do_Status: >>>>>>>>>>>> suspend >>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended >>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>> c182.chtc.wisc.edu >>>>>>>>>>>> FD=19 >>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>> server.12937:[log in to unmask]:1094 node 7.10 >>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>> logged >>>>>>>>>>>> out. >>>>>>>>>>>> 091211 04:13:27 15661 server.26620:[log in to unmask] >>>>>>>>>>>> XrdPoll: >>>>>>>>>>>> FD >>>>>>>>>>>> 19 detached from poller 2; num=19 >>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>> 091211 04:13:27 15661 server.10842:[log in to unmask]:1094 >>>>>>>>>>>> do_Status: >>>>>>>>>>>> suspend >>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service suspended >>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>> c178.chtc.wisc.edu >>>>>>>>>>>> FD=15 >>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>> server.10842:[log in to unmask]:1094 node 9.12 >>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>> logged >>>>>>>>>>>> out. >>>>>>>>>>>> 091211 04:13:27 15661 server.11901:[log in to unmask] >>>>>>>>>>>> XrdPoll: >>>>>>>>>>>> FD >>>>>>>>>>>> 15 detached from poller 1; num=20 >>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>> 091211 04:13:27 15661 server.5535:[log in to unmask]:1094 >>>>>>>>>>>> do_Status: >>>>>>>>>>>> suspend >>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service suspended >>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>> c181.chtc.wisc.edu >>>>>>>>>>>> FD=17 >>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>> server.5535:[log in to unmask]:1094 node 5.8 >>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>> logged >>>>>>>>>>>> out. 
>>>>>>>>>>>> 091211 04:13:27 15661 server.13984:[log in to unmask] >>>>>>>>>>>> XrdPoll: >>>>>>>>>>>> FD >>>>>>>>>>>> 17 detached from poller 0; num=21 >>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094 >>>>>>>>>>>> do_Status: >>>>>>>>>>>> suspend >>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended >>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>> c183.chtc.wisc.edu >>>>>>>>>>>> FD=22 >>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>> server.23711:[log in to unmask]:1094 node 8.11 >>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>> logged >>>>>>>>>>>> out. >>>>>>>>>>>> 091211 04:13:27 15661 server.27735:[log in to unmask] >>>>>>>>>>>> XrdPoll: >>>>>>>>>>>> FD >>>>>>>>>>>> 22 detached from poller 2; num=18 >>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>> c184.chtc.wisc.edu >>>>>>>>>>>> FD=20 >>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>> server.4131:[log in to unmask]:1094 node 3.6 >>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>> logged >>>>>>>>>>>> out. >>>>>>>>>>>> 091211 04:13:27 15661 server.26787:[log in to unmask] >>>>>>>>>>>> XrdPoll: >>>>>>>>>>>> FD >>>>>>>>>>>> 20 detached from poller 0; num=20 >>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094 >>>>>>>>>>>> do_Status: >>>>>>>>>>>> suspend >>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended >>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>> c185.chtc.wisc.edu >>>>>>>>>>>> FD=23 >>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>> server.10585:[log in to unmask]:1094 node 6.9 >>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>> server.8524:[log in to unmask] >>>>>>>>>>>> logged >>>>>>>>>>>> out. >>>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll: >>>>>>>>>>>> FD >>>>>>>>>>>> 23 >>>>>>>>>>>> detached from poller 0; num=19 >>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094 >>>>>>>>>>>> do_Status: >>>>>>>>>>>> suspend >>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended >>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>> c180.chtc.wisc.edu >>>>>>>>>>>> FD=18 >>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>> server.20264:[log in to unmask]:1094 node 4.7 >>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>> server.14636:[log in to unmask] >>>>>>>>>>>> logged >>>>>>>>>>>> out. 
>>>>>>>>>>>> 091211 04:13:27 15661 server.14636:[log in to unmask] >>>>>>>>>>>> XrdPoll: >>>>>>>>>>>> FD >>>>>>>>>>>> 18 detached from poller 1; num=19 >>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>> server.1656:[log in to unmask]:1094 >>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094 >>>>>>>>>>>> do_Status: >>>>>>>>>>>> suspend >>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended >>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>> c186.chtc.wisc.edu >>>>>>>>>>>> FD=24 >>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>> server.1656:[log in to unmask]:1094 node 2.5 >>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>> server.7849:[log in to unmask] >>>>>>>>>>>> logged >>>>>>>>>>>> out. >>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll: >>>>>>>>>>>> FD >>>>>>>>>>>> 24 >>>>>>>>>>>> detached from poller 1; num=18 >>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>> seconds >>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>> seconds >>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>> seconds >>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>> seconds >>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>> seconds >>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>> seconds >>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>> seconds >>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>> seconds >>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>> seconds >>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>> seconds >>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled. >>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled. >>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled. >>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled. >>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled. >>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled. 
>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled. >>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled. >>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled. >>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled. >>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled. >>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers >>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled. >>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>> >>>>>>>>>>>> Wen >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky <[log in to unmask]> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Hi Wen, >>>>>>>>>>>>> >>>>>>>>>>>>> To go past 64 data servers you will need to set up one or more supervisors. This does not logically change the current configuration you have. You only need to configure one or more *new* servers (or at least xrootd processes) whose role is supervisor. We'd like them to run on separate machines for reliability purposes, but they could run on the manager node as long as you give each one a unique instance name (i.e., the -n option). >>>>>>>>>>>>> >>>>>>>>>>>>> The front part of the cmsd reference explains how to do this. >>>>>>>>>>>>> >>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm >>>>>>>>>>>>> >>>>>>>>>>>>> Andy >>>>>>>>>>>>> >>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Is there any way to configure xrootd with more than 65 machines? I used the configuration below but it doesn't work. Should I configure some machines' manager to be a supervisor? >>>>>>>>>>>>>> >>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Wen >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>>>>> >>>>> >>> >
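For reference, a minimal sketch of the supervisor piece Andy describes above, in xrootd/cmsd config-file form. The host name, the instance name "super", and port 1213 are illustrative assumptions, not values taken from the actual xrdcluster.cfg:

    # Extra cmsd/xrootd pair acting as supervisor. If it shares a host with
    # the manager, start it under its own instance name (e.g. -n super);
    # the "named super" clause scopes these lines to that instance only.
    if atlas-bkp2.cs.wisc.edu named super
       all.role supervisor
    fi

    # The supervisor subscribes to the same redirectors the data servers use.
    all.manager atlas-bkp1.cs.wisc.edu:1213
    all.manager atlas-bkp2.cs.wisc.edu:1213
    all.manager atlas-bkp3.cs.wisc.edu:1213

The data servers keep their existing configuration; as Andy notes, the cluster decides on its own which servers attach to a supervisor, so nothing should be pointed at it explicitly.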
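Similarly, Andy's two earlier points about the config file (make the all.manager list and the manager role consistent across all three redirectors, and keep only cms.space) might translate into something like the following; the port, the else branch, and the space thresholds are placeholders rather than Wen's real values:

    # Name all three redirectors on every host, managers and data servers alike.
    all.manager atlas-bkp1.cs.wisc.edu:1213
    all.manager atlas-bkp2.cs.wisc.edu:1213
    all.manager atlas-bkp3.cs.wisc.edu:1213

    # The same three hosts named above should carry the manager role.
    if atlas-bkp1.cs.wisc.edu atlas-bkp2.cs.wisc.edu atlas-bkp3.cs.wisc.edu
       all.role manager
    else
       all.role server
    fi

    # Keep cms.space only; delete the olb.space line instead of carrying both,
    # since olb.space still overrides cms.space while it remains accepted.
    cms.space min 2g 5g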