Hi Andrew,

But when I tried to xrdcp a file to it, it doesn't respond. In atlas-bkp1-xrd.log.20091213, it always prints "stalling client for 10 sec". But in cms.log, I can't find any message about the file.

> I don't see why you say it doesn't work. With the debugging level set so
> high the noise may make it look like something is going wrong but that isn't
> necessarily the case.
>
> 1) The 'too many subscribers' is correct. The manager was simply redirecting
> them because there were already 64 servers. However, in your case the
> supervisor wasn't started until almost 30 minutes after everyone else (i.e.,
> 10:42 AM). Why was that? I'm not surprised about the flurry of messages with
> a critical component missing for 30 minutes.

Because the manager is a 64-bit machine but the supervisor is a 32-bit machine, I had to recompile it. At that time, I was interrupted by something else.

> 2) Once the supervisor started, it started accepting the redirected servers.
>
> 3) Then 10 seconds (10:42:10) later the supervisor was restarted. So, that
> would cause a flurry of activity to occur as there is no backup supervisor
> to take over.
>
> 4) This happened again at 10:42:34 AM then again at 10:48:49. Is the
> supervisor crashing? Is there a core file?
>
> 5) At 11:11 AM the manager restarted. Again, is there a core file here or
> was this a manual action?
>
> During the course of all of this, all nodes connected were operating properly
> and files were being located.
>
> So, the two big questions are:
>
> a) Why was the supervisor not started until 30 minutes after the system was
> started?
>
> b) Is there an explanation of the restarts? If this was a crash then we need
> a core file to figure out what happened.

It's not a crash. There are a couple of reasons why I restarted some daemons.
(1) I thought that if a data server tried many times to connect to a redirector but failed, it would not try to connect to the redirector again. The supervisor was missing for a long time, so maybe some data servers would no longer try to connect to atlas-bkp1. To reactivate these data servers, I restarted the servers.
(2) When I tried to xrdcp, it hung for a long time. I thought the manager might have been affected by something else, so I restarted the manager to see whether a restart could make the xrdcp work.

Thanks
Wen

> Andy
>
> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
> To: "Andrew Hanushevsky" <[log in to unmask]>
> Cc: <[log in to unmask]>
> Sent: Wednesday, December 16, 2009 9:38 AM
> Subject: Re: xrootd with more than 65 machines
>
>
> Hi Andrew,
>
> It still doesn't work.
> The log file is in higgs03.cs.wisc.edu/wguan/. The name is *.20091216
> The manager complains there are too many subscribers and then removes
> nodes.
>
> (*)
> Add server.10040:[log in to unmask] redirected; too many subscribers.
>
> Wen
>
> On Wed, Dec 16, 2009 at 4:25 AM, Andrew Hanushevsky <[log in to unmask]>
> wrote:
>>
>> Hi Wen,
>>
>> It will be easier for me to retrofit as the changes were pretty minor.
>> Please
>> lift the new XrdCmsNode.cc file from
>>
>> http://www.slac.stanford.edu/~abh/cmsd
>>
>> Andy
>>
>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>> To: "Andrew Hanushevsky" <[log in to unmask]>
>> Cc: <[log in to unmask]>
>> Sent: Tuesday, December 15, 2009 5:12 PM
>> Subject: Re: xrootd with more than 65 machines
>>
>>
>> Hi Andy,
>>
>> I can switch to 20091104-1102. Then you don't need to patch
>> another version. How can I download v20091104-1102?
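(Dropping the single replacement file into the existing 20091028-1003 tree amounts to overwriting one source file and rebuilding. A rough sketch only, assuming the file is served directly under the ~abh/cmsd directory Andy gives and that a plain top-level make matches how the tree was built originally:)

    cd xrootd                          # top of the 20091028-1003 source tree
    wget -O src/XrdCms/XrdCmsNode.cc http://www.slac.stanford.edu/~abh/cmsd/XrdCmsNode.cc
    make                               # rebuild the same way the tree was built before
    # then restart the cmsd on the redirector so it picks up the patched code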
>> >> Thanks >> Wen >> >> On Wed, Dec 16, 2009 at 12:52 AM, Andrew Hanushevsky <[log in to unmask]> >> wrote: >>> >>> Hi Wen, >>> >>> Ah yes, I see that now. The file I gave you is based on v20091104-1102. >>> Let >>> me see if I can retrofit the patch for you. >>> >>> Andy >>> >>> ----- Original Message ----- From: "wen guan" <[log in to unmask]> >>> To: "Andrew Hanushevsky" <[log in to unmask]> >>> Cc: <[log in to unmask]> >>> Sent: Tuesday, December 15, 2009 1:04 PM >>> Subject: Re: xrootd with more than 65 machines >>> >>> >>> Hi Andy, >>> >>> Which xrootd version are you using? XrdCmsConfig.hh is different. >>> XrdCmsConfig.hh is downloaded from >>> http://xrootd.slac.stanford.edu/download/20091028-1003/. >>> >>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsNode.cc >>> 6fb3ae40fe4e10bdd4d372818a341f2c src/XrdCms/XrdCmsNode.cc >>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsConfig.hh >>> 7d57753847d9448186c718f98e963cbe src/XrdCms/XrdCmsConfig.hh >>> >>> Thanks >>> Wen >>> >>> On Tue, Dec 15, 2009 at 10:50 PM, Andrew Hanushevsky <[log in to unmask]> >>> wrote: >>>> >>>> Hi Wen, >>>> >>>> Just compiled on Linux and it was clean. Something is really wrong with >>>> your >>>> source files, specifically XrdCmsConfig.cc >>>> >>>> The MD5 checksums on the relevant files are: >>>> >>>> MD5 (XrdCmsNode.cc) = 6fb3ae40fe4e10bdd4d372818a341f2c >>>> >>>> MD5 (XrdCmsConfig.hh) = 4a7d655582a7cd43b098947d0676924b >>>> >>>> Andy >>>> >>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]> >>>> To: "Andrew Hanushevsky" <[log in to unmask]> >>>> Cc: <[log in to unmask]> >>>> Sent: Tuesday, December 15, 2009 4:24 AM >>>> Subject: Re: xrootd with more than 65 machines >>>> >>>> >>>> Hi Andy, >>>> >>>> No problem. Thanks for the fix. But it cannot be compiled. The >>>> version I am using is >>>> http://xrootd.slac.stanford.edu/download/20091028-1003/. >>>> >>>> Making cms component... 
>>>> Compiling XrdCmsNode.cc >>>> XrdCmsNode.cc: In member function `const char* >>>> XrdCmsNode::do_Chmod(XrdCmsRRData&)': >>>> XrdCmsNode.cc:268: error: `fsExec' was not declared in this scope >>>> XrdCmsNode.cc:268: warning: unused variable 'fsExec' >>>> XrdCmsNode.cc:269: error: 'class XrdCmsConfig' has no member named >>>> 'ossFS' >>>> XrdCmsNode.cc:273: error: `fsFail' was not declared in this scope >>>> XrdCmsNode.cc:273: warning: unused variable 'fsFail' >>>> XrdCmsNode.cc: In member function `const char* >>>> XrdCmsNode::do_Mkdir(XrdCmsRRData&)': >>>> XrdCmsNode.cc:600: error: `fsExec' was not declared in this scope >>>> XrdCmsNode.cc:600: warning: unused variable 'fsExec' >>>> XrdCmsNode.cc:601: error: 'class XrdCmsConfig' has no member named >>>> 'ossFS' >>>> XrdCmsNode.cc:605: error: `fsFail' was not declared in this scope >>>> XrdCmsNode.cc:605: warning: unused variable 'fsFail' >>>> XrdCmsNode.cc: In member function `const char* >>>> XrdCmsNode::do_Mkpath(XrdCmsRRData&)': >>>> XrdCmsNode.cc:640: error: `fsExec' was not declared in this scope >>>> XrdCmsNode.cc:640: warning: unused variable 'fsExec' >>>> XrdCmsNode.cc:641: error: 'class XrdCmsConfig' has no member named >>>> 'ossFS' >>>> XrdCmsNode.cc:645: error: `fsFail' was not declared in this scope >>>> XrdCmsNode.cc:645: warning: unused variable 'fsFail' >>>> XrdCmsNode.cc: In member function `const char* >>>> XrdCmsNode::do_Mv(XrdCmsRRData&)': >>>> XrdCmsNode.cc:704: error: `fsExec' was not declared in this scope >>>> XrdCmsNode.cc:704: warning: unused variable 'fsExec' >>>> XrdCmsNode.cc:705: error: 'class XrdCmsConfig' has no member named >>>> 'ossFS' >>>> XrdCmsNode.cc:709: error: `fsFail' was not declared in this scope >>>> XrdCmsNode.cc:709: warning: unused variable 'fsFail' >>>> XrdCmsNode.cc: In member function `const char* >>>> XrdCmsNode::do_Rm(XrdCmsRRData&)': >>>> XrdCmsNode.cc:831: error: `fsExec' was not declared in this scope >>>> XrdCmsNode.cc:831: warning: unused variable 'fsExec' >>>> XrdCmsNode.cc:832: error: 'class XrdCmsConfig' has no member named >>>> 'ossFS' >>>> XrdCmsNode.cc:836: error: `fsFail' was not declared in this scope >>>> XrdCmsNode.cc:836: warning: unused variable 'fsFail' >>>> XrdCmsNode.cc: In member function `const char* >>>> XrdCmsNode::do_Rmdir(XrdCmsRRData&)': >>>> XrdCmsNode.cc:873: error: `fsExec' was not declared in this scope >>>> XrdCmsNode.cc:873: warning: unused variable 'fsExec' >>>> XrdCmsNode.cc:874: error: 'class XrdCmsConfig' has no member named >>>> 'ossFS' >>>> XrdCmsNode.cc:878: error: `fsFail' was not declared in this scope >>>> XrdCmsNode.cc:878: warning: unused variable 'fsFail' >>>> XrdCmsNode.cc: In member function `const char* >>>> XrdCmsNode::do_Trunc(XrdCmsRRData&)': >>>> XrdCmsNode.cc:1377: error: `fsExec' was not declared in this scope >>>> XrdCmsNode.cc:1377: warning: unused variable 'fsExec' >>>> XrdCmsNode.cc:1378: error: 'class XrdCmsConfig' has no member named >>>> 'ossFS' >>>> XrdCmsNode.cc:1382: error: `fsFail' was not declared in this scope >>>> XrdCmsNode.cc:1382: warning: unused variable 'fsFail' >>>> XrdCmsNode.cc: At global scope: >>>> XrdCmsNode.cc:1524: error: no `int XrdCmsNode::fsExec(XrdOucProg*, >>>> char*, char*)' member function declared in class `XrdCmsNode' >>>> XrdCmsNode.cc: In member function `int XrdCmsNode::fsExec(XrdOucProg*, >>>> char*, char*)': >>>> XrdCmsNode.cc:1533: error: `fsL2PFail1' was not declared in this scope >>>> XrdCmsNode.cc:1533: warning: unused variable 'fsL2PFail1' >>>> XrdCmsNode.cc:1537: error: `fsL2PFail2' 
was not declared in this scope >>>> XrdCmsNode.cc:1537: warning: unused variable 'fsL2PFail2' >>>> XrdCmsNode.cc: At global scope: >>>> XrdCmsNode.cc:1553: error: no `const char* XrdCmsNode::fsFail(const >>>> char*, const char*, const char*, int)' member function declared in >>>> class `XrdCmsNode' >>>> XrdCmsNode.cc: In member function `const char* >>>> XrdCmsNode::fsFail(const char*, const char*, const char*, int)': >>>> XrdCmsNode.cc:1559: error: `fsL2PFail1' was not declared in this scope >>>> XrdCmsNode.cc:1559: warning: unused variable 'fsL2PFail1' >>>> XrdCmsNode.cc:1560: error: `fsL2PFail2' was not declared in this scope >>>> XrdCmsNode.cc:1560: warning: unused variable 'fsL2PFail2' >>>> XrdCmsNode.cc: In static member function `static int >>>> XrdCmsNode::isOnline(char*, int)': >>>> XrdCmsNode.cc:1608: error: 'class XrdCmsConfig' has no member named >>>> 'ossFS' >>>> make[4]: *** [../../obj/XrdCmsNode.o] Error 1 >>>> make[3]: *** [Linuxall] Error 2 >>>> make[2]: *** [all] Error 2 >>>> make[1]: *** [XrdCms] Error 2 >>>> make: *** [all] Error 2 >>>> >>>> >>>> Wen >>>> >>>> On Tue, Dec 15, 2009 at 2:08 AM, Andrew Hanushevsky <[log in to unmask]> >>>> wrote: >>>>> >>>>> Hi Wen, >>>>> >>>>> I have developed a permanent fix. You will find the source files in >>>>> >>>>> http://www.slac.stanford.edu/~abh/cmsd/ >>>>> >>>>> There are three files: XrdCmsCluster.cc XrdCmsNode.cc XrdCmsProtocol.cc >>>>> >>>>> Please do a source replacement and recompile. Unfortunately, the cmsd >>>>> will >>>>> need to be replaced on each node regardless of role. My apologies for >>>>> the >>>>> disruption. Please let me know how it goes. >>>>> >>>>> Andy >>>>> >>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]> >>>>> To: "Andrew Hanushevsky" <[log in to unmask]> >>>>> Cc: <[log in to unmask]> >>>>> Sent: Sunday, December 13, 2009 7:04 AM >>>>> Subject: Re: xrootd with more than 65 machines >>>>> >>>>> >>>>> Hi Andrew, >>>>> >>>>> >>>>> Thanks. >>>>> I used the new cmsd at atlas-bkp1 manager. But it's still dropping >>>>> nodes. And in supervisor's log, I cannot find any dataserver to >>>>> register to it. >>>>> >>>>> The new logs are in http://higgs03.cs.wisc.edu/wguan/*.20091213. >>>>> The manager is patched at 091213 08:38:15. >>>>> >>>>> Wen >>>>> >>>>> On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky >>>>> <[log in to unmask]> wrote: >>>>>> >>>>>> Hi Wen >>>>>> >>>>>> You will find the source replacement at: >>>>>> >>>>>> http://www.slac.stanford.edu/~abh/cmsd/ >>>>>> >>>>>> It's XrdCmsCluster.cc and it replaces >>>>>> xrootd/src/XrdCms/XrdCmsCluster.cc >>>>>> >>>>>> I'm stepping out for a couple of hours but will be back to see how >>>>>> things >>>>>> went. Sorry for the issues :-( >>>>>> >>>>>> Andy >>>>>> >>>>>> On Sun, 13 Dec 2009, wen guan wrote: >>>>>> >>>>>>> Hi Andrew, >>>>>>> >>>>>>> I prefer a source replacement. Then I can compile it. >>>>>>> >>>>>>> Thanks >>>>>>> Wen >>>>>>>> >>>>>>>> I can do one of two things here: >>>>>>>> >>>>>>>> 1) Supply a source replacement and then you would recompile, or >>>>>>>> >>>>>>>> 2) Give me the uname -a of where the cmsd will run and I'll supply a >>>>>>>> binary >>>>>>>> replacement for you. >>>>>>>> >>>>>>>> Your choice. >>>>>>>> >>>>>>>> Andy >>>>>>>> >>>>>>>> On Sun, 13 Dec 2009, wen guan wrote: >>>>>>>> >>>>>>>>> Hi Andrew >>>>>>>>> >>>>>>>>> The problem is found. Great. Thanks. >>>>>>>>> >>>>>>>>> Where can I find the patched cmsd? 
>>>>>>>>> >>>>>>>>> Wen >>>>>>>>> >>>>>>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky >>>>>>>>> <[log in to unmask]> wrote: >>>>>>>>>> >>>>>>>>>> Hi Wen, >>>>>>>>>> >>>>>>>>>> I found the problem. Looks like a regression from way back when. >>>>>>>>>> There >>>>>>>>>> is >>>>>>>>>> a >>>>>>>>>> missing flag on the redirect. This will require a patched cmsd but >>>>>>>>>> you >>>>>>>>>> need >>>>>>>>>> only to replace the redirector's cmsd as this only affects the >>>>>>>>>> redirector. >>>>>>>>>> How would you like to proceed? >>>>>>>>>> >>>>>>>>>> Andy >>>>>>>>>> >>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote: >>>>>>>>>> >>>>>>>>>>> Hi Andrew, >>>>>>>>>>> >>>>>>>>>>> It doesn't work. atlas-bkp1 manager still dropping nodes again. >>>>>>>>>>> In supervisor, I still haven't seen any dataserver registered. I >>>>>>>>>>> said >>>>>>>>>>> "I updated the ntp" because you said "the log timestamp do not >>>>>>>>>>> overlap". >>>>>>>>>>> >>>>>>>>>>> Wen >>>>>>>>>>> >>>>>>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky >>>>>>>>>>> <[log in to unmask]> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi Wen, >>>>>>>>>>>> >>>>>>>>>>>> Do you mean that everything is now working? It could be that you >>>>>>>>>>>> removed >>>>>>>>>>>> the >>>>>>>>>>>> xrd.timeout directive. That really could cause problems. As for >>>>>>>>>>>> the >>>>>>>>>>>> delays, >>>>>>>>>>>> that is normal when the redirector thinks something is going >>>>>>>>>>>> wrong. >>>>>>>>>>>> The >>>>>>>>>>>> strategy is to delay clients until it can get back to a stable >>>>>>>>>>>> configuration. This usually prevents jobs from crashing during >>>>>>>>>>>> stressful >>>>>>>>>>>> periods. >>>>>>>>>>>> >>>>>>>>>>>> Andy >>>>>>>>>>>> >>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>>> >>>>>>>>>>>>> I restarted it to do supervisor test. Also because xrootd >>>>>>>>>>>>> manager >>>>>>>>>>>>> frequently doesn't response. (*) is the cms.log, the file >>>>>>>>>>>>> select >>>>>>>>>>>>> is >>>>>>>>>>>>> delayed again and again. When do a restart, all things are >>>>>>>>>>>>> fine. >>>>>>>>>>>>> Now >>>>>>>>>>>>> I >>>>>>>>>>>>> am trying to find a clue about it. 
>>>>>>>>>>>>> >>>>>>>>>>>>> (*) >>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] >>>>>>>>>>>>> do_Select: >>>>>>>>>>>>> wc >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2 >>>>>>>>>>>>> 091212 00:00:19 21318 Select seeking >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2 >>>>>>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2 >>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] >>>>>>>>>>>>> do_Select: >>>>>>>>>>>>> delay 5 >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2 >>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0 >>>>>>>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 >>>>>>>>>>>>> for >>>>>>>>>>>>> select dlen=166 >>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0 >>>>>>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0 >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> There is no core file. I copied a new copies of the logs to the >>>>>>>>>>>>> link >>>>>>>>>>>>> below. >>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/ >>>>>>>>>>>>> >>>>>>>>>>>>> Wen >>>>>>>>>>>>> >>>>>>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky >>>>>>>>>>>>> <[log in to unmask]> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Wen, >>>>>>>>>>>>>> >>>>>>>>>>>>>> I see in the server log that it is restarting often. Could you >>>>>>>>>>>>>> take >>>>>>>>>>>>>> a >>>>>>>>>>>>>> look >>>>>>>>>>>>>> in the c193 to see if you have any core files? Also please >>>>>>>>>>>>>> make >>>>>>>>>>>>>> sure >>>>>>>>>>>>>> that >>>>>>>>>>>>>> core files are enabled as Linux defaults the size to 0. The >>>>>>>>>>>>>> first >>>>>>>>>>>>>> step >>>>>>>>>>>>>> here >>>>>>>>>>>>>> is to find out why your servers are restarting. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Andy >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> the logs can be found here. From the log you can see >>>>>>>>>>>>>>> atlas-bkp1 >>>>>>>>>>>>>>> manager are dropping nodes again and again which tries to >>>>>>>>>>>>>>> connect >>>>>>>>>>>>>>> to >>>>>>>>>>>>>>> it. 
>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/ >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky >>>>>>>>>>>>>>> <[log in to unmask]> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Wen, Could you start everything up and provide me a >>>>>>>>>>>>>>>> pointer >>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> manager log file, supervisor log file, and one data server >>>>>>>>>>>>>>>> logfile >>>>>>>>>>>>>>>> all >>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>> which cover the same time-frame (from start to some point >>>>>>>>>>>>>>>> where >>>>>>>>>>>>>>>> you >>>>>>>>>>>>>>>> think >>>>>>>>>>>>>>>> things are working or not). That way I can see what is >>>>>>>>>>>>>>>> happening. >>>>>>>>>>>>>>>> At >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> moment I only see two "bad" things in the config file: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager >>>>>>>>>>>>>>>> but >>>>>>>>>>>>>>>> you >>>>>>>>>>>>>>>> claim, >>>>>>>>>>>>>>>> via >>>>>>>>>>>>>>>> the all.manager directive, that there are three (bkp2 and >>>>>>>>>>>>>>>> bkp3). >>>>>>>>>>>>>>>> While >>>>>>>>>>>>>>>> it >>>>>>>>>>>>>>>> should work, the log file will be dense with error messages. >>>>>>>>>>>>>>>> Please >>>>>>>>>>>>>>>> correct >>>>>>>>>>>>>>>> this to be consistent and make it easier to see real errors. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This is not a problem for me. Because this config is used in >>>>>>>>>>>>>>> dataserver. In manager, I updated the if >>>>>>>>>>>>>>> atlas-bkp1.cs.wisc.edu >>>>>>>>>>>>>>> to >>>>>>>>>>>>>>> atlas-bkp2 or something. This is a history problem. at first >>>>>>>>>>>>>>> only >>>>>>>>>>>>>>> atlas-bkp1 is used. atlas-bkp2 and atlas-bkp3 are added >>>>>>>>>>>>>>> later. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 2) Please use cms.space not olb.space (for historical >>>>>>>>>>>>>>>> reasons >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> latter >>>>>>>>>>>>>>>> is >>>>>>>>>>>>>>>> still accepted and over-rides the former, but that will soon >>>>>>>>>>>>>>>> end), >>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>> please use only one (the config file uses both directives). >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> yes. I should remove this line. in fact cms.space is in the >>>>>>>>>>>>>>> cfg >>>>>>>>>>>>>>> too. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>> Wen >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The xrootd has an internal mechanism to connect servers with >>>>>>>>>>>>>>>> supervisors >>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>> allow for maximum reliability. You cannot change that >>>>>>>>>>>>>>>> algorithm >>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>> there >>>>>>>>>>>>>>>> is >>>>>>>>>>>>>>>> no need to do so. You should *never* tell anyone to directly >>>>>>>>>>>>>>>> connect >>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>> a >>>>>>>>>>>>>>>> supervisor. If you do, you will likely get unreachable >>>>>>>>>>>>>>>> nodes. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> As for dropping data servers, it would appear to me, given >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> flurry >>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>> such activity, that something either crashed or was >>>>>>>>>>>>>>>> restarted. >>>>>>>>>>>>>>>> That's >>>>>>>>>>>>>>>> why >>>>>>>>>>>>>>>> it >>>>>>>>>>>>>>>> would be good to see the complete log of each one of the >>>>>>>>>>>>>>>> entities. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Andy >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I read the document. 
and write a config >>>>>>>>>>>>>>>>> file(http://wisconsin.cern.ch/~wguan/xrdcluster.cfg). >>>>>>>>>>>>>>>>> I used my conf, I can see manager is dispatch message to >>>>>>>>>>>>>>>>> supervisor. But I cannot see any dataserver tries to >>>>>>>>>>>>>>>>> connect >>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> supervisor. At the same time, in the manager's log, I can >>>>>>>>>>>>>>>>> see >>>>>>>>>>>>>>>>> some >>>>>>>>>>>>>>>>> dataserver are Dropped. >>>>>>>>>>>>>>>>> How does xrootd decide which dataserver will connect >>>>>>>>>>>>>>>>> supervisor? >>>>>>>>>>>>>>>>> should I specify some dataservers to connect the >>>>>>>>>>>>>>>>> supervisor? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> (*) supervisor log >>>>>>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for >>>>>>>>>>>>>>>>> state >>>>>>>>>>>>>>>>> dlen=42 >>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: >>>>>>>>>>>>>>>>> /atlas/xrootd/users/wguan/test/test131141 >>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: >>>>>>>>>>>>>>>>> Path >>>>>>>>>>>>>>>>> find >>>>>>>>>>>>>>>>> failed for state /atlas/xrootd/users/wguan/test/test131141 >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> (*)manager log >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu >>>>>>>>>>>>>>>>> TSpace=5587GB >>>>>>>>>>>>>>>>> NumFS=1 >>>>>>>>>>>>>>>>> FSpace=5693644MB MinFR=57218MB Util=0 >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: >>>>>>>>>>>>>>>>> w >>>>>>>>>>>>>>>>> /atlas >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>> do_Space: 5696231MB free; 0% util >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 logged in. >>>>>>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from >>>>>>>>>>>>>>>>> [log in to unmask] >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running >>>>>>>>>>>>>>>>> ?:[log in to unmask] >>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD >>>>>>>>>>>>>>>>> 79 >>>>>>>>>>>>>>>>> attached >>>>>>>>>>>>>>>>> to poller 2; num=22 >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add >>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>> bumps >>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 #63 >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node: >>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 dropped. >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved >>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 to cluster; >>>>>>>>>>>>>>>>> id=63.78; >>>>>>>>>>>>>>>>> num=64; >>>>>>>>>>>>>>>>> min=51 >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu >>>>>>>>>>>>>>>>> TSpace=5587GB >>>>>>>>>>>>>>>>> NumFS=1 >>>>>>>>>>>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0 >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: >>>>>>>>>>>>>>>>> w >>>>>>>>>>>>>>>>> /atlas >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 >>>>>>>>>>>>>>>>> do_Space: 5721854MB free; 0% util >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 logged in. 
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from >>>>>>>>>>>>>>>>> c187.chtc.wisc.edu; connection reset by peer >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60 >>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node >>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 node 63.78 >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask] >>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>> 79 detached from poller 2; num=21 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service >>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>> c177.chtc.wisc.edu >>>>>>>>>>>>>>>>> FD=16 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 node 0.3 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.21656:[log in to unmask] >>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>> 16 detached from poller 2; num=20 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>> c179.chtc.wisc.edu >>>>>>>>>>>>>>>>> FD=21 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>> server.17065:[log in to unmask]:1094 node 1.4 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>> server.7978:[log in to unmask] >>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] >>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>> 21 >>>>>>>>>>>>>>>>> detached from poller 1; num=21 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to suspended >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to >>>>>>>>>>>>>>>>> redirector.15656:14@atlas-bkp2 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service >>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>> c182.chtc.wisc.edu >>>>>>>>>>>>>>>>> FD=19 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 node 7.10 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.26620:[log in to unmask] >>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>> 19 detached from poller 2; num=19 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service >>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>> c178.chtc.wisc.edu >>>>>>>>>>>>>>>>> FD=15 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 node 9.12 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.11901:[log in to unmask] >>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>> 15 detached from poller 1; num=20 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service >>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>> c181.chtc.wisc.edu >>>>>>>>>>>>>>>>> FD=17 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 node 5.8 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.13984:[log in to unmask] >>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>> 17 detached from poller 0; num=21 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service >>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>> c183.chtc.wisc.edu >>>>>>>>>>>>>>>>> FD=22 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 node 8.11 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.27735:[log in to unmask] >>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>> 22 detached from poller 2; num=18 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>> c184.chtc.wisc.edu >>>>>>>>>>>>>>>>> FD=20 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>> server.4131:[log in to unmask]:1094 node 3.6 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.26787:[log in to unmask] >>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>> 20 detached from poller 0; num=20 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service >>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>> c185.chtc.wisc.edu >>>>>>>>>>>>>>>>> FD=23 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 node 6.9 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>> server.8524:[log in to unmask] >>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] >>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>> 23 >>>>>>>>>>>>>>>>> detached from poller 0; num=19 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service >>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>> c180.chtc.wisc.edu >>>>>>>>>>>>>>>>> FD=18 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 node 4.7 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>> server.14636:[log in to unmask] >>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.14636:[log in to unmask] >>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>> 18 detached from poller 1; num=19 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 >>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 >>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service >>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>> c186.chtc.wisc.edu >>>>>>>>>>>>>>>>> FD=24 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 node 2.5 >>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>> server.7849:[log in to unmask] >>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] >>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>> 24 >>>>>>>>>>>>>>>>> detached from poller 1; num=18 >>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled. >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled. >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled. >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled. >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled. >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled. >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled. >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled. >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled. >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled. >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled. 
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled. >>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Wen >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky >>>>>>>>>>>>>>>>> <[log in to unmask]> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi Wen, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> To go past 64 data servers you will need to setup one or >>>>>>>>>>>>>>>>>> more >>>>>>>>>>>>>>>>>> supervisors. >>>>>>>>>>>>>>>>>> This does not logically change the current configuration >>>>>>>>>>>>>>>>>> you >>>>>>>>>>>>>>>>>> have. >>>>>>>>>>>>>>>>>> You >>>>>>>>>>>>>>>>>> only >>>>>>>>>>>>>>>>>> need to configure one or more *new* servers (or at least >>>>>>>>>>>>>>>>>> xrootd >>>>>>>>>>>>>>>>>> processes) >>>>>>>>>>>>>>>>>> whose role is supervisor. We'd like them to run in >>>>>>>>>>>>>>>>>> separate >>>>>>>>>>>>>>>>>> machines >>>>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>>>> reliability purposes, but they could run on the manager >>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>> as >>>>>>>>>>>>>>>>>> long >>>>>>>>>>>>>>>>>> as >>>>>>>>>>>>>>>>>> you >>>>>>>>>>>>>>>>>> give each one a unique instance name (i.e., -n option). >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The front part of the cmsd reference explains how to do >>>>>>>>>>>>>>>>>> this. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Andy >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Is there any change to configure xrootd with more than 65 >>>>>>>>>>>>>>>>>>> machines? I used the configure below but it doesn't work. >>>>>>>>>>>>>>>>>>> Should >>>>>>>>>>>>>>>>>>> I >>>>>>>>>>>>>>>>>>> configure some machines' manager to be supvervisor? >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Wen >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>> >>>>> >>>>> >>>>> >>>> >>>> >>>> >>> >>> >>> >> >> >> > > >
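(To make the supervisor recipe in Andy's last message above concrete: a minimal sketch, assuming a hypothetical supervisor host atlas-sup1.cs.wisc.edu and placeholder port and file paths; the authoritative directive syntax is the cms_config reference he links, not this sketch.)

    # in the shared xrdcluster.cfg (existing manager/server role lines stay as they are)
    all.manager atlas-bkp1.cs.wisc.edu:3121
    all.role    supervisor if atlas-sup1.cs.wisc.edu

    # started on atlas-sup1 (or on the manager host, under its own -n instance name)
    xrootd -n super -c /path/to/xrdcluster.cfg &
    cmsd   -n super -c /path/to/xrdcluster.cfg &

Data servers keep pointing all.manager at the regular manager; once the manager already has 64 direct subscribers it redirects the extra servers to a supervisor on its own, which is why, as Andy notes, no server should ever be told to connect to a supervisor directly.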