Hi Wen,

Oh yes, the permanent fix should be available Monday late afternoon PST.

Andy

On Sun, 13 Dec 2009, wen guan wrote:

> Hi Andrew,
>
> Thanks. I am using the new cmsd on the atlas-bkp1 manager, but it is
> still dropping nodes, and in the supervisor's log I cannot find any
> data server registering with it.
>
> The new logs are in http://higgs03.cs.wisc.edu/wguan/*.20091213.
> The manager was patched at 091213 08:38:15.
>
> Wen
>
> On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky
> <[log in to unmask]> wrote:
>> Hi Wen,
>>
>> You will find the source replacement at:
>>
>> http://www.slac.stanford.edu/~abh/cmsd/
>>
>> It's XrdCmsCluster.cc and it replaces xrootd/src/XrdCms/XrdCmsCluster.cc.
>>
>> I'm stepping out for a couple of hours but will be back to see how
>> things went. Sorry for the issues :-(
>>
>> Andy
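
(For reference, the replace-and-rebuild step looks roughly like the
following. The checkout location and build command are assumptions;
xrootd source trees of this era typically built with configure.classic
and make, so use whatever this tree was originally compiled with.)

    cd xrootd                     # top of the source tree (location assumed)
    wget http://www.slac.stanford.edu/~abh/cmsd/XrdCmsCluster.cc
    cp XrdCmsCluster.cc src/XrdCms/XrdCmsCluster.cc   # replace the shipped file
    make                          # or: ./configure.classic && make

    # per the note above, only the redirector is affected, so only the
    # cmsd on atlas-bkp1 needs the rebuilt binary and a restart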
>>
>> On Sun, 13 Dec 2009, wen guan wrote:
>>> Hi Andrew,
>>>
>>> I prefer a source replacement. Then I can compile it.
>>>
>>> Thanks
>>> Wen
>>>>
>>>> I can do one of two things here:
>>>>
>>>> 1) Supply a source replacement and then you would recompile, or
>>>>
>>>> 2) Give me the uname -a of where the cmsd will run and I'll supply
>>>> a binary replacement for you.
>>>>
>>>> Your choice.
>>>>
>>>> Andy
>>>>
>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>> Hi Andrew,
>>>>>
>>>>> The problem is found. Great. Thanks.
>>>>>
>>>>> Where can I find the patched cmsd?
>>>>>
>>>>> Wen
>>>>>
>>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky
>>>>> <[log in to unmask]> wrote:
>>>>>> Hi Wen,
>>>>>>
>>>>>> I found the problem. It looks like a regression from way back
>>>>>> when: there is a missing flag on the redirect. This will require a
>>>>>> patched cmsd, but you only need to replace the redirector's cmsd,
>>>>>> as this affects only the redirector. How would you like to
>>>>>> proceed?
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> It doesn't work; the atlas-bkp1 manager is dropping nodes again,
>>>>>>> and I still haven't seen any data server registered with the
>>>>>>> supervisor. I said "I updated the ntp" because you said the log
>>>>>>> timestamps do not overlap.
>>>>>>>
>>>>>>> Wen
>>>>>>>
>>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky
>>>>>>> <[log in to unmask]> wrote:
>>>>>>>> Hi Wen,
>>>>>>>>
>>>>>>>> Do you mean that everything is now working? It could be that you
>>>>>>>> removed the xrd.timeout directive; that really could cause
>>>>>>>> problems. As for the delays, that is normal when the redirector
>>>>>>>> thinks something is going wrong. The strategy is to delay
>>>>>>>> clients until it can get back to a stable configuration. This
>>>>>>>> usually prevents jobs from crashing during stressful periods.
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>> Hi Andrew,
>>>>>>>>>
>>>>>>>>> I restarted it to do the supervisor test, and also because the
>>>>>>>>> xrootd manager frequently doesn't respond. (*) below is from
>>>>>>>>> the cms.log: the file select is delayed again and again. After
>>>>>>>>> a restart everything is fine. Now I am trying to find a clue
>>>>>>>>> about it.
>>>>>>>>>
>>>>>>>>> (*)
>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>>>>
>>>>>>>>> There is no core file. I copied new copies of the logs to the
>>>>>>>>> link below.
>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>
>>>>>>>>> Wen
>>>>>>>>>
>>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky
>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>> Hi Wen,
>>>>>>>>>>
>>>>>>>>>> I see in the server log that it is restarting often. Could you
>>>>>>>>>> take a look on c193 to see if you have any core files? Also
>>>>>>>>>> please make sure that core files are enabled, as Linux
>>>>>>>>>> defaults their size to 0. The first step here is to find out
>>>>>>>>>> why your servers are restarting.
>>>>>>>>>>
>>>>>>>>>> Andy
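
(A quick way to act on the core-file advice above, using standard Linux
shell commands. Note the limit has to be raised in the environment that
actually starts xrootd/cmsd, not just in an interactive login shell.)

    ulimit -c                          # 0 means no core files are written
    ulimit -c unlimited                # allow cores for daemons started here
    cat /proc/sys/kernel/core_pattern  # where the kernel writes core files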
>>>>>>>>>>
>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>
>>>>>>>>>>> The logs can be found here. From the log you can see the
>>>>>>>>>>> atlas-bkp1 manager dropping, again and again, the nodes that
>>>>>>>>>>> try to connect to it.
>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>
>>>>>>>>>>>> Could you start everything up and provide me a pointer to
>>>>>>>>>>>> the manager log file, supervisor log file, and one data
>>>>>>>>>>>> server log file, all of which cover the same time-frame
>>>>>>>>>>>> (from start to some point where you think things are working
>>>>>>>>>>>> or not)? That way I can see what is happening. At the moment
>>>>>>>>>>>> I only see two "bad" things in the config file:
>>>>>>>>>>>>
>>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager,
>>>>>>>>>>>> but you claim, via the all.manager directive, that there are
>>>>>>>>>>>> three (bkp2 and bkp3). While it should work, the log file
>>>>>>>>>>>> will be dense with error messages. Please correct this to be
>>>>>>>>>>>> consistent and make it easier to see real errors.
>>>>>>>>>>>
>>>>>>>>>>> This is not a problem for me, because this config is used on
>>>>>>>>>>> the data servers. On the managers I change the "if
>>>>>>>>>>> atlas-bkp1.cs.wisc.edu" clause to atlas-bkp2 and so on. This
>>>>>>>>>>> is historical: at first only atlas-bkp1 was used; atlas-bkp2
>>>>>>>>>>> and atlas-bkp3 were added later.
>>>>>>>>>>>
>>>>>>>>>>>> 2) Please use cms.space, not olb.space (for historical
>>>>>>>>>>>> reasons the latter is still accepted and overrides the
>>>>>>>>>>>> former, but that will soon end), and please use only one
>>>>>>>>>>>> (the config file uses both directives).
>>>>>>>>>>>
>>>>>>>>>>> Yes, I should remove that line; in fact cms.space is in the
>>>>>>>>>>> cfg too.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Wen
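
(To make points 1 and 2 concrete, a single config shared by every node
could carry both fixes. This is only a sketch: the port, the host
pattern, and the cms.space values are placeholders, and the exact
directive forms and ordering should be checked against the cms_config
reference.)

    # every node lists all three redirectors
    all.manager atlas-bkp1.cs.wisc.edu 3121
    all.manager atlas-bkp2.cs.wisc.edu 3121
    all.manager atlas-bkp3.cs.wisc.edu 3121

    # default role, overridden by host, so one file serves the cluster
    all.role server
    all.role manager if atlas-bkp1.cs.wisc.edu
    all.role manager if atlas-bkp2.cs.wisc.edu
    all.role manager if atlas-bkp3.cs.wisc.edu

    # keep only cms.space and delete the olb.space line
    # (olb.space would silently override cms.space)
    cms.space min 2% 10g

With the role selected per host there is no per-machine edit of the "if
atlas-bkp1.cs.wisc.edu" clause, which is what makes the files
inconsistent in point 1.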
>>>>>>>>>>>>
>>>>>>>>>>>> xrootd has an internal mechanism to connect servers with
>>>>>>>>>>>> supervisors so as to allow for maximum reliability. You
>>>>>>>>>>>> cannot change that algorithm, and there is no need to do so.
>>>>>>>>>>>> You should *never* tell anyone to directly connect to a
>>>>>>>>>>>> supervisor. If you do, you will likely get unreachable
>>>>>>>>>>>> nodes.
>>>>>>>>>>>>
>>>>>>>>>>>> As for dropping data servers, it would appear to me, given
>>>>>>>>>>>> the flurry of such activity, that something either crashed
>>>>>>>>>>>> or was restarted. That's why it would be good to see the
>>>>>>>>>>>> complete log of each one of the entities.
>>>>>>>>>>>>
>>>>>>>>>>>> Andy
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I read the document and wrote a config file
>>>>>>>>>>>>> (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg). With my
>>>>>>>>>>>>> conf I can see the manager dispatching messages to the
>>>>>>>>>>>>> supervisor, but I cannot see any data server trying to
>>>>>>>>>>>>> connect to the supervisor. At the same time, in the
>>>>>>>>>>>>> manager's log, I can see some data servers being dropped.
>>>>>>>>>>>>> How does xrootd decide which data servers will connect to
>>>>>>>>>>>>> the supervisor? Should I point some data servers at the
>>>>>>>>>>>>> supervisor?
>>>>>>>>>>>>>
>>>>>>>>>>>>> (*) supervisor log
>>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find failed for state /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>
>>>>>>>>>>>>> (*) manager log
>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5693644MB MinFR=57218MB Util=0
>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
>>>>>>>>>>>>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094 do_Space: 5696231MB free; 0% util
>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.10585:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from [log in to unmask]
>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running ?:[log in to unmask] inq=0
>>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
>>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79 attached to poller 2; num=22
>>>>>>>>>>>>> 091211 04:13:24 15661 Add server.21739:[log in to unmask] bumps server.15905:[log in to unmask]:1094 #63
>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node: server.15905:[log in to unmask]:1094 dropped.
>>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved server.21739:[log in to unmask]:1094 to cluster; id=63.78; num=64; min=51
>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5721854MB MinFR=57218MB Util=0
>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w /atlas
>>>>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask]:1094 do_Space: 5721854MB free; 0% util
>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from c187.chtc.wisc.edu; connection reset by peer
>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60 seconds
>>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node server.21739:[log in to unmask]:1094 node 63.78
>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask] XrdPoll: FD 79 detached from poller 2; num=21
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.24718:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.24718:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c177.chtc.wisc.edu FD=16
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.24718:[log in to unmask]:1094 node 0.3
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.21656:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.21656:[log in to unmask] XrdPoll: FD 16 detached from poller 2; num=20
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c179.chtc.wisc.edu FD=21
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.17065:[log in to unmask]:1094 node 1.4
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.7978:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] XrdPoll: FD 21 detached from poller 1; num=21
>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to redirector.15656:14@atlas-bkp2
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.12937:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.12937:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c182.chtc.wisc.edu FD=19
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.12937:[log in to unmask]:1094 node 7.10
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.26620:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.26620:[log in to unmask] XrdPoll: FD 19 detached from poller 2; num=19
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.10842:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.10842:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c178.chtc.wisc.edu FD=15
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.10842:[log in to unmask]:1094 node 9.12
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.11901:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.11901:[log in to unmask] XrdPoll: FD 15 detached from poller 1; num=20
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.5535:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.5535:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c181.chtc.wisc.edu FD=17
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.5535:[log in to unmask]:1094 node 5.8
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.13984:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.13984:[log in to unmask] XrdPoll: FD 17 detached from poller 0; num=21
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.23711:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c183.chtc.wisc.edu FD=22
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.23711:[log in to unmask]:1094 node 8.11
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.27735:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.27735:[log in to unmask] XrdPoll: FD 22 detached from poller 2; num=18
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c184.chtc.wisc.edu FD=20
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.4131:[log in to unmask]:1094 node 3.6
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.26787:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.26787:[log in to unmask] XrdPoll: FD 20 detached from poller 0; num=20
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.10585:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c185.chtc.wisc.edu FD=23
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.10585:[log in to unmask]:1094 node 6.9
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.8524:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll: FD 23 detached from poller 0; num=19
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.20264:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c180.chtc.wisc.edu FD=18
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.20264:[log in to unmask]:1094 node 4.7
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.14636:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.14636:[log in to unmask] XrdPoll: FD 18 detached from poller 1; num=19
>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.1656:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c186.chtc.wisc.edu FD=24
>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.1656:[log in to unmask]:1094 node 2.5
>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.7849:[log in to unmask] logged out.
>>>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll: FD 24 detached from poller 1; num=18
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>
>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky
>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To go past 64 data servers you will need to set up one or
>>>>>>>>>>>>>> more supervisors. This does not logically change your
>>>>>>>>>>>>>> current configuration. You only need to configure one or
>>>>>>>>>>>>>> more *new* servers (or at least xrootd processes) whose
>>>>>>>>>>>>>> role is supervisor. We'd like them to run on separate
>>>>>>>>>>>>>> machines for reliability purposes, but they could run on
>>>>>>>>>>>>>> the manager node as long as you give each one a unique
>>>>>>>>>>>>>> instance name (i.e., the -n option).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The front part of the cmsd reference explains how to do
>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is there any way to configure xrootd with more than 64
>>>>>>>>>>>>>>> machines? I used the configuration below, but it doesn't
>>>>>>>>>>>>>>> work. Should I configure some machines' managers to be
>>>>>>>>>>>>>>> supervisors?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Wen
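
(A minimal sketch of the supervisor setup described above, assuming a
spare host; the name atlas-sup1.cs.wisc.edu and the file paths are
hypothetical, and the exact directive forms are in the cms_config
reference linked above.)

    # added to the shared xrdcluster.cfg
    all.role supervisor if atlas-sup1.cs.wisc.edu

    # a supervisor node runs the usual daemon pair against the same config
    cmsd   -c /etc/xrootd/xrdcluster.cfg -l /var/log/xrootd/cmsd.log &
    xrootd -c /etc/xrootd/xrdcluster.cfg -l /var/log/xrootd/xrootd.log &

    # if it must share the manager node instead, give the extra pair its
    # own instance name so the two instances keep separate state
    cmsd   -n super -c /etc/xrootd/xrdcluster.cfg &
    xrootd -n super -c /etc/xrootd/xrdcluster.cfg &

Data servers then attach to the supervisor through the cluster's own
selection mechanism; as Andy notes above, nothing should be pointed at
a supervisor by hand.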