Hi Andy,

> OK, I understand. As for stalling, too many nodes were deemed to be in
> trouble for the manager to allow service resumption.
>
> Please make sure that all of the nodes in the cluster receive the new cmsd
> as they will drop off with the old one and you'll see the same kind of
> activity. Perhaps the best way to know that you succeeded in putting
> everything in sync is to start with 63 data nodes plus one supervisor. Once
> all connections are established, adding an additional server should simply
> send it to the supervisor.

I will do it. You said to start 63 data servers and one supervisor. Does
that mean the supervisor is managed using the same policy as a data server?
If there are 64 data servers connected before the supervisor, will the
supervisor be dropped? Or does the supervisor have a higher priority when
being added to the manager? I mean, if there are already 64 data servers and
a supervisor comes in, will the supervisor be accepted and a data server
redirected to the supervisor?

Thanks
Wen

> Hi Andrew,
>
> But when I tried to xrdcp a file to it, it doesn't respond. In
> atlas-bkp1-xrd.log.20091213 it keeps printing "stalling client for 10
> sec", but in cms.log I can't find any message about the file.
>
>> I don't see why you say it doesn't work. With the debugging level set so
>> high the noise may make it look like something is going wrong but that
>> isn't necessarily the case.
>>
>> 1) The 'too many subscribers' is correct. The manager was simply
>> redirecting them because there were already 64 servers. However, in your
>> case the supervisor wasn't started until almost 30 minutes after everyone
>> else (i.e., 10:42 AM). Why was that? I'm not surprised about the flurry
>> of messages with a critical component missing for 30 minutes.
>
> Because the manager is a 64-bit machine but the supervisor is a 32-bit
> machine, so I had to recompile it. At that time, I was interrupted by
> something else.
>
>> 2) Once the supervisor started, it started accepting the redirected
>> servers.
>>
>> 3) Then 10 seconds (10:42:10) later the supervisor was restarted. So, that
>> would cause a flurry of activity to occur as there is no backup supervisor
>> to take over.
>>
>> 4) This happened again at 10:42:34 AM then again at 10:48:49. Is the
>> supervisor crashing? Is there a core file?
>>
>> 5) At 11:11 AM the manager restarted. Again, is there a core file here or
>> was this a manual action?
>>
>> During the course of all of this, all connected nodes were operating
>> properly and files were being located.
>>
>> So, the two big questions are:
>>
>> a) Why was the supervisor not started until 30 minutes after the system
>> was started?
>>
>> b) Is there an explanation of the restarts? If this was a crash then we
>> need a core file to figure out what happened.
>
> It's not a crash. There are a couple of reasons why I restarted some
> daemons.
> (1) I thought that if a data server tried many times to connect to a
> redirector and failed, it would not try to connect to that redirector
> again. The supervisor was missing for a long time, so maybe some data
> servers would no longer try to connect to atlas-bkp1. To reactivate these
> data servers, I restarted the servers.
> (2) When I tried to xrdcp, it hung for a long time. I thought maybe the
> manager was affected by something else, so I restarted the manager to see
> whether a restart would make the xrdcp work.
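For reference, the copy test being described is of this form (a sketch put
together from the host and path names that appear in the logs further down
this thread; the local source file name is purely illustrative):

    xrdcp /tmp/testfile \
        root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test131141

When the redirector thinks the cluster is not in a stable state it delays
clients rather than failing them, which is what the repeated "stalling
client for 10 sec" messages in atlas-bkp1-xrd.log correspond to (see Andy's
note on this further down the thread).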
> > > Thanks > Wen > >> Andy >> >> ----- Original Message ----- From: "wen guan" <[log in to unmask]> >> To: "Andrew Hanushevsky" <[log in to unmask]> >> Cc: <[log in to unmask]> >> Sent: Wednesday, December 16, 2009 9:38 AM >> Subject: Re: xrootd with more than 65 machines >> >> >> Hi Andrew, >> >> It still doesn't work. >> The log file is in higgs03.cs.wisc.edu/wguan/. The name is *.20091216 >> The manager complains there are too many subscribers and the removes >> nodes. >> >> (*) >> Add server.10040:[log in to unmask] redirected; too many subscribers. >> >> Wen >> >> On Wed, Dec 16, 2009 at 4:25 AM, Andrew Hanushevsky <[log in to unmask]> >> wrote: >>> >>> Hi Wen, >>> >>> It will be easier for me to retroft as the changes were pretty minor. >>> Please >>> lift the new XrdCmsNode.cc file from >>> >>> http://www.slac.stanford.edu/~abh/cmsd >>> >>> Andy >>> >>> ----- Original Message ----- From: "wen guan" <[log in to unmask]> >>> To: "Andrew Hanushevsky" <[log in to unmask]> >>> Cc: <[log in to unmask]> >>> Sent: Tuesday, December 15, 2009 5:12 PM >>> Subject: Re: xrootd with more than 65 machines >>> >>> >>> Hi Andy, >>> >>> I can switch to 20091104-1102. Then you don't need to patch >>> another version. How can I download v20091104-1102? >>> >>> Thanks >>> Wen >>> >>> On Wed, Dec 16, 2009 at 12:52 AM, Andrew Hanushevsky <[log in to unmask]> >>> wrote: >>>> >>>> Hi Wen, >>>> >>>> Ah yes, I see that now. The file I gave you is based on v20091104-1102. >>>> Let >>>> me see if I can retrofit the patch for you. >>>> >>>> Andy >>>> >>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]> >>>> To: "Andrew Hanushevsky" <[log in to unmask]> >>>> Cc: <[log in to unmask]> >>>> Sent: Tuesday, December 15, 2009 1:04 PM >>>> Subject: Re: xrootd with more than 65 machines >>>> >>>> >>>> Hi Andy, >>>> >>>> Which xrootd version are you using? XrdCmsConfig.hh is different. >>>> XrdCmsConfig.hh is downloaded from >>>> http://xrootd.slac.stanford.edu/download/20091028-1003/. >>>> >>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsNode.cc >>>> 6fb3ae40fe4e10bdd4d372818a341f2c src/XrdCms/XrdCmsNode.cc >>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsConfig.hh >>>> 7d57753847d9448186c718f98e963cbe src/XrdCms/XrdCmsConfig.hh >>>> >>>> Thanks >>>> Wen >>>> >>>> On Tue, Dec 15, 2009 at 10:50 PM, Andrew Hanushevsky <[log in to unmask]> >>>> wrote: >>>>> >>>>> Hi Wen, >>>>> >>>>> Just compiled on Linux and it was clean. Something is really wrong with >>>>> your >>>>> source files, specifically XrdCmsConfig.cc >>>>> >>>>> The MD5 checksums on the relevant files are: >>>>> >>>>> MD5 (XrdCmsNode.cc) = 6fb3ae40fe4e10bdd4d372818a341f2c >>>>> >>>>> MD5 (XrdCmsConfig.hh) = 4a7d655582a7cd43b098947d0676924b >>>>> >>>>> Andy >>>>> >>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]> >>>>> To: "Andrew Hanushevsky" <[log in to unmask]> >>>>> Cc: <[log in to unmask]> >>>>> Sent: Tuesday, December 15, 2009 4:24 AM >>>>> Subject: Re: xrootd with more than 65 machines >>>>> >>>>> >>>>> Hi Andy, >>>>> >>>>> No problem. Thanks for the fix. But it cannot be compiled. The >>>>> version I am using is >>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/. >>>>> >>>>> Making cms component... 
>>>>> Compiling XrdCmsNode.cc >>>>> XrdCmsNode.cc: In member function `const char* >>>>> XrdCmsNode::do_Chmod(XrdCmsRRData&)': >>>>> XrdCmsNode.cc:268: error: `fsExec' was not declared in this scope >>>>> XrdCmsNode.cc:268: warning: unused variable 'fsExec' >>>>> XrdCmsNode.cc:269: error: 'class XrdCmsConfig' has no member named >>>>> 'ossFS' >>>>> XrdCmsNode.cc:273: error: `fsFail' was not declared in this scope >>>>> XrdCmsNode.cc:273: warning: unused variable 'fsFail' >>>>> XrdCmsNode.cc: In member function `const char* >>>>> XrdCmsNode::do_Mkdir(XrdCmsRRData&)': >>>>> XrdCmsNode.cc:600: error: `fsExec' was not declared in this scope >>>>> XrdCmsNode.cc:600: warning: unused variable 'fsExec' >>>>> XrdCmsNode.cc:601: error: 'class XrdCmsConfig' has no member named >>>>> 'ossFS' >>>>> XrdCmsNode.cc:605: error: `fsFail' was not declared in this scope >>>>> XrdCmsNode.cc:605: warning: unused variable 'fsFail' >>>>> XrdCmsNode.cc: In member function `const char* >>>>> XrdCmsNode::do_Mkpath(XrdCmsRRData&)': >>>>> XrdCmsNode.cc:640: error: `fsExec' was not declared in this scope >>>>> XrdCmsNode.cc:640: warning: unused variable 'fsExec' >>>>> XrdCmsNode.cc:641: error: 'class XrdCmsConfig' has no member named >>>>> 'ossFS' >>>>> XrdCmsNode.cc:645: error: `fsFail' was not declared in this scope >>>>> XrdCmsNode.cc:645: warning: unused variable 'fsFail' >>>>> XrdCmsNode.cc: In member function `const char* >>>>> XrdCmsNode::do_Mv(XrdCmsRRData&)': >>>>> XrdCmsNode.cc:704: error: `fsExec' was not declared in this scope >>>>> XrdCmsNode.cc:704: warning: unused variable 'fsExec' >>>>> XrdCmsNode.cc:705: error: 'class XrdCmsConfig' has no member named >>>>> 'ossFS' >>>>> XrdCmsNode.cc:709: error: `fsFail' was not declared in this scope >>>>> XrdCmsNode.cc:709: warning: unused variable 'fsFail' >>>>> XrdCmsNode.cc: In member function `const char* >>>>> XrdCmsNode::do_Rm(XrdCmsRRData&)': >>>>> XrdCmsNode.cc:831: error: `fsExec' was not declared in this scope >>>>> XrdCmsNode.cc:831: warning: unused variable 'fsExec' >>>>> XrdCmsNode.cc:832: error: 'class XrdCmsConfig' has no member named >>>>> 'ossFS' >>>>> XrdCmsNode.cc:836: error: `fsFail' was not declared in this scope >>>>> XrdCmsNode.cc:836: warning: unused variable 'fsFail' >>>>> XrdCmsNode.cc: In member function `const char* >>>>> XrdCmsNode::do_Rmdir(XrdCmsRRData&)': >>>>> XrdCmsNode.cc:873: error: `fsExec' was not declared in this scope >>>>> XrdCmsNode.cc:873: warning: unused variable 'fsExec' >>>>> XrdCmsNode.cc:874: error: 'class XrdCmsConfig' has no member named >>>>> 'ossFS' >>>>> XrdCmsNode.cc:878: error: `fsFail' was not declared in this scope >>>>> XrdCmsNode.cc:878: warning: unused variable 'fsFail' >>>>> XrdCmsNode.cc: In member function `const char* >>>>> XrdCmsNode::do_Trunc(XrdCmsRRData&)': >>>>> XrdCmsNode.cc:1377: error: `fsExec' was not declared in this scope >>>>> XrdCmsNode.cc:1377: warning: unused variable 'fsExec' >>>>> XrdCmsNode.cc:1378: error: 'class XrdCmsConfig' has no member named >>>>> 'ossFS' >>>>> XrdCmsNode.cc:1382: error: `fsFail' was not declared in this scope >>>>> XrdCmsNode.cc:1382: warning: unused variable 'fsFail' >>>>> XrdCmsNode.cc: At global scope: >>>>> XrdCmsNode.cc:1524: error: no `int XrdCmsNode::fsExec(XrdOucProg*, >>>>> char*, char*)' member function declared in class `XrdCmsNode' >>>>> XrdCmsNode.cc: In member function `int XrdCmsNode::fsExec(XrdOucProg*, >>>>> char*, char*)': >>>>> XrdCmsNode.cc:1533: error: `fsL2PFail1' was not declared in this scope >>>>> XrdCmsNode.cc:1533: warning: unused 
variable 'fsL2PFail1' >>>>> XrdCmsNode.cc:1537: error: `fsL2PFail2' was not declared in this scope >>>>> XrdCmsNode.cc:1537: warning: unused variable 'fsL2PFail2' >>>>> XrdCmsNode.cc: At global scope: >>>>> XrdCmsNode.cc:1553: error: no `const char* XrdCmsNode::fsFail(const >>>>> char*, const char*, const char*, int)' member function declared in >>>>> class `XrdCmsNode' >>>>> XrdCmsNode.cc: In member function `const char* >>>>> XrdCmsNode::fsFail(const char*, const char*, const char*, int)': >>>>> XrdCmsNode.cc:1559: error: `fsL2PFail1' was not declared in this scope >>>>> XrdCmsNode.cc:1559: warning: unused variable 'fsL2PFail1' >>>>> XrdCmsNode.cc:1560: error: `fsL2PFail2' was not declared in this scope >>>>> XrdCmsNode.cc:1560: warning: unused variable 'fsL2PFail2' >>>>> XrdCmsNode.cc: In static member function `static int >>>>> XrdCmsNode::isOnline(char*, int)': >>>>> XrdCmsNode.cc:1608: error: 'class XrdCmsConfig' has no member named >>>>> 'ossFS' >>>>> make[4]: *** [../../obj/XrdCmsNode.o] Error 1 >>>>> make[3]: *** [Linuxall] Error 2 >>>>> make[2]: *** [all] Error 2 >>>>> make[1]: *** [XrdCms] Error 2 >>>>> make: *** [all] Error 2 >>>>> >>>>> >>>>> Wen >>>>> >>>>> On Tue, Dec 15, 2009 at 2:08 AM, Andrew Hanushevsky <[log in to unmask]> >>>>> wrote: >>>>>> >>>>>> Hi Wen, >>>>>> >>>>>> I have developed a permanent fix. You will find the source files in >>>>>> >>>>>> http://www.slac.stanford.edu/~abh/cmsd/ >>>>>> >>>>>> There are three files: XrdCmsCluster.cc XrdCmsNode.cc >>>>>> XrdCmsProtocol.cc >>>>>> >>>>>> Please do a source replacement and recompile. Unfortunately, the cmsd >>>>>> will >>>>>> need to be replaced on each node regardless of role. My apologies for >>>>>> the >>>>>> disruption. Please let me know how it goes. >>>>>> >>>>>> Andy >>>>>> >>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]> >>>>>> To: "Andrew Hanushevsky" <[log in to unmask]> >>>>>> Cc: <[log in to unmask]> >>>>>> Sent: Sunday, December 13, 2009 7:04 AM >>>>>> Subject: Re: xrootd with more than 65 machines >>>>>> >>>>>> >>>>>> Hi Andrew, >>>>>> >>>>>> >>>>>> Thanks. >>>>>> I used the new cmsd at atlas-bkp1 manager. But it's still dropping >>>>>> nodes. And in supervisor's log, I cannot find any dataserver to >>>>>> register to it. >>>>>> >>>>>> The new logs are in http://higgs03.cs.wisc.edu/wguan/*.20091213. >>>>>> The manager is patched at 091213 08:38:15. >>>>>> >>>>>> Wen >>>>>> >>>>>> On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky >>>>>> <[log in to unmask]> wrote: >>>>>>> >>>>>>> Hi Wen >>>>>>> >>>>>>> You will find the source replacement at: >>>>>>> >>>>>>> http://www.slac.stanford.edu/~abh/cmsd/ >>>>>>> >>>>>>> It's XrdCmsCluster.cc and it replaces >>>>>>> xrootd/src/XrdCms/XrdCmsCluster.cc >>>>>>> >>>>>>> I'm stepping out for a couple of hours but will be back to see how >>>>>>> things >>>>>>> went. Sorry for the issues :-( >>>>>>> >>>>>>> Andy >>>>>>> >>>>>>> On Sun, 13 Dec 2009, wen guan wrote: >>>>>>> >>>>>>>> Hi Andrew, >>>>>>>> >>>>>>>> I prefer a source replacement. Then I can compile it. >>>>>>>> >>>>>>>> Thanks >>>>>>>> Wen >>>>>>>>> >>>>>>>>> I can do one of two things here: >>>>>>>>> >>>>>>>>> 1) Supply a source replacement and then you would recompile, or >>>>>>>>> >>>>>>>>> 2) Give me the uname -a of where the cmsd will run and I'll supply >>>>>>>>> a >>>>>>>>> binary >>>>>>>>> replacement for you. >>>>>>>>> >>>>>>>>> Your choice. 
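The source-replacement route (option 1 above) comes down to dropping the
replacement files from http://www.slac.stanford.edu/~abh/cmsd/ into the
existing source tree and rebuilding. A rough sketch, assuming a tree that
has already been configured and built once; the exact file list depends on
which fix is being picked up, and, as the compile errors quoted above show,
the replacements must match the installed source version:

    # sketch only: fetch the replacement sources into the cms component
    cd xrootd/src/XrdCms
    wget http://www.slac.stanford.edu/~abh/cmsd/XrdCmsCluster.cc
    wget http://www.slac.stanford.edu/~abh/cmsd/XrdCmsNode.cc
    wget http://www.slac.stanford.edu/~abh/cmsd/XrdCmsProtocol.cc
    cd ../..
    make
    # then restart the rebuilt cmsd on every node that needs it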
>>>>>>>>> >>>>>>>>> Andy >>>>>>>>> >>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote: >>>>>>>>> >>>>>>>>>> Hi Andrew >>>>>>>>>> >>>>>>>>>> The problem is found. Great. Thanks. >>>>>>>>>> >>>>>>>>>> Where can I find the patched cmsd? >>>>>>>>>> >>>>>>>>>> Wen >>>>>>>>>> >>>>>>>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky >>>>>>>>>> <[log in to unmask]> wrote: >>>>>>>>>>> >>>>>>>>>>> Hi Wen, >>>>>>>>>>> >>>>>>>>>>> I found the problem. Looks like a regression from way back when. >>>>>>>>>>> There >>>>>>>>>>> is >>>>>>>>>>> a >>>>>>>>>>> missing flag on the redirect. This will require a patched cmsd >>>>>>>>>>> but >>>>>>>>>>> you >>>>>>>>>>> need >>>>>>>>>>> only to replace the redirector's cmsd as this only affects the >>>>>>>>>>> redirector. >>>>>>>>>>> How would you like to proceed? >>>>>>>>>>> >>>>>>>>>>> Andy >>>>>>>>>>> >>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>> >>>>>>>>>>>> It doesn't work. atlas-bkp1 manager still dropping nodes again. >>>>>>>>>>>> In supervisor, I still haven't seen any dataserver registered. I >>>>>>>>>>>> said >>>>>>>>>>>> "I updated the ntp" because you said "the log timestamp do not >>>>>>>>>>>> overlap". >>>>>>>>>>>> >>>>>>>>>>>> Wen >>>>>>>>>>>> >>>>>>>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky >>>>>>>>>>>> <[log in to unmask]> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Hi Wen, >>>>>>>>>>>>> >>>>>>>>>>>>> Do you mean that everything is now working? It could be that >>>>>>>>>>>>> you >>>>>>>>>>>>> removed >>>>>>>>>>>>> the >>>>>>>>>>>>> xrd.timeout directive. That really could cause problems. As for >>>>>>>>>>>>> the >>>>>>>>>>>>> delays, >>>>>>>>>>>>> that is normal when the redirector thinks something is going >>>>>>>>>>>>> wrong. >>>>>>>>>>>>> The >>>>>>>>>>>>> strategy is to delay clients until it can get back to a stable >>>>>>>>>>>>> configuration. This usually prevents jobs from crashing during >>>>>>>>>>>>> stressful >>>>>>>>>>>>> periods. >>>>>>>>>>>>> >>>>>>>>>>>>> Andy >>>>>>>>>>>>> >>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>>>> >>>>>>>>>>>>>> I restarted it to do supervisor test. Also because xrootd >>>>>>>>>>>>>> manager >>>>>>>>>>>>>> frequently doesn't response. (*) is the cms.log, the file >>>>>>>>>>>>>> select >>>>>>>>>>>>>> is >>>>>>>>>>>>>> delayed again and again. When do a restart, all things are >>>>>>>>>>>>>> fine. >>>>>>>>>>>>>> Now >>>>>>>>>>>>>> I >>>>>>>>>>>>>> am trying to find a clue about it. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> (*) >>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] >>>>>>>>>>>>>> do_Select: >>>>>>>>>>>>>> wc >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2 >>>>>>>>>>>>>> 091212 00:00:19 21318 Select seeking >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2 >>>>>>>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2 >>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] >>>>>>>>>>>>>> do_Select: >>>>>>>>>>>>>> delay 5 >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2 >>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0 >>>>>>>>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 >>>>>>>>>>>>>> for >>>>>>>>>>>>>> select dlen=166 >>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0 >>>>>>>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0 >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> There is no core file. I copied a new copies of the logs to >>>>>>>>>>>>>> the >>>>>>>>>>>>>> link >>>>>>>>>>>>>> below. >>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> Wen >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky >>>>>>>>>>>>>> <[log in to unmask]> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Wen, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I see in the server log that it is restarting often. Could >>>>>>>>>>>>>>> you >>>>>>>>>>>>>>> take >>>>>>>>>>>>>>> a >>>>>>>>>>>>>>> look >>>>>>>>>>>>>>> in the c193 to see if you have any core files? Also please >>>>>>>>>>>>>>> make >>>>>>>>>>>>>>> sure >>>>>>>>>>>>>>> that >>>>>>>>>>>>>>> core files are enabled as Linux defaults the size to 0. The >>>>>>>>>>>>>>> first >>>>>>>>>>>>>>> step >>>>>>>>>>>>>>> here >>>>>>>>>>>>>>> is to find out why your servers are restarting. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Andy >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> the logs can be found here. From the log you can see >>>>>>>>>>>>>>>> atlas-bkp1 >>>>>>>>>>>>>>>> manager are dropping nodes again and again which tries to >>>>>>>>>>>>>>>> connect >>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>> it. 
>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/ >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky >>>>>>>>>>>>>>>> <[log in to unmask]> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Wen, Could you start everything up and provide me a >>>>>>>>>>>>>>>>> pointer >>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> manager log file, supervisor log file, and one data server >>>>>>>>>>>>>>>>> logfile >>>>>>>>>>>>>>>>> all >>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>> which cover the same time-frame (from start to some point >>>>>>>>>>>>>>>>> where >>>>>>>>>>>>>>>>> you >>>>>>>>>>>>>>>>> think >>>>>>>>>>>>>>>>> things are working or not). That way I can see what is >>>>>>>>>>>>>>>>> happening. >>>>>>>>>>>>>>>>> At >>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> moment I only see two "bad" things in the config file: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager >>>>>>>>>>>>>>>>> but >>>>>>>>>>>>>>>>> you >>>>>>>>>>>>>>>>> claim, >>>>>>>>>>>>>>>>> via >>>>>>>>>>>>>>>>> the all.manager directive, that there are three (bkp2 and >>>>>>>>>>>>>>>>> bkp3). >>>>>>>>>>>>>>>>> While >>>>>>>>>>>>>>>>> it >>>>>>>>>>>>>>>>> should work, the log file will be dense with error >>>>>>>>>>>>>>>>> messages. >>>>>>>>>>>>>>>>> Please >>>>>>>>>>>>>>>>> correct >>>>>>>>>>>>>>>>> this to be consistent and make it easier to see real >>>>>>>>>>>>>>>>> errors. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This is not a problem for me. Because this config is used in >>>>>>>>>>>>>>>> dataserver. In manager, I updated the if >>>>>>>>>>>>>>>> atlas-bkp1.cs.wisc.edu >>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>> atlas-bkp2 or something. This is a history problem. at first >>>>>>>>>>>>>>>> only >>>>>>>>>>>>>>>> atlas-bkp1 is used. atlas-bkp2 and atlas-bkp3 are added >>>>>>>>>>>>>>>> later. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 2) Please use cms.space not olb.space (for historical >>>>>>>>>>>>>>>>> reasons >>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> latter >>>>>>>>>>>>>>>>> is >>>>>>>>>>>>>>>>> still accepted and over-rides the former, but that will >>>>>>>>>>>>>>>>> soon >>>>>>>>>>>>>>>>> end), >>>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>> please use only one (the config file uses both directives). >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> yes. I should remove this line. in fact cms.space is in the >>>>>>>>>>>>>>>> cfg >>>>>>>>>>>>>>>> too. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>> Wen >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> The xrootd has an internal mechanism to connect servers >>>>>>>>>>>>>>>>> with >>>>>>>>>>>>>>>>> supervisors >>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>> allow for maximum reliability. You cannot change that >>>>>>>>>>>>>>>>> algorithm >>>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>> there >>>>>>>>>>>>>>>>> is >>>>>>>>>>>>>>>>> no need to do so. You should *never* tell anyone to >>>>>>>>>>>>>>>>> directly >>>>>>>>>>>>>>>>> connect >>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>> a >>>>>>>>>>>>>>>>> supervisor. If you do, you will likely get unreachable >>>>>>>>>>>>>>>>> nodes. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> As for dropping data servers, it would appear to me, given >>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> flurry >>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>> such activity, that something either crashed or was >>>>>>>>>>>>>>>>> restarted. >>>>>>>>>>>>>>>>> That's >>>>>>>>>>>>>>>>> why >>>>>>>>>>>>>>>>> it >>>>>>>>>>>>>>>>> would be good to see the complete log of each one of the >>>>>>>>>>>>>>>>> entities. 
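Points 1) and 2) above translate into config changes of roughly this shape
(a sketch against the xrdcluster.cfg linked in this thread; the port, the
threshold, and the decision to declare all three hosts as managers are
illustrative, not taken from the actual file):

    # list every manager host consistently with all.manager ...
    all.manager atlas-bkp1.cs.wisc.edu:3121
    all.manager atlas-bkp2.cs.wisc.edu:3121
    all.manager atlas-bkp3.cs.wisc.edu:3121

    # ... and give those same hosts the manager role
    if atlas-bkp1.cs.wisc.edu
       all.role manager
    else if atlas-bkp2.cs.wisc.edu
       all.role manager
    else if atlas-bkp3.cs.wisc.edu
       all.role manager
    else
       all.role server
    fi

    # keep only cms.space (the old olb.space form overrides it and is being
    # phased out); the actual threshold is site-specific
    cms.space min 2%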
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Andy >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I read the document. and write a config >>>>>>>>>>>>>>>>>> file(http://wisconsin.cern.ch/~wguan/xrdcluster.cfg). >>>>>>>>>>>>>>>>>> I used my conf, I can see manager is dispatch message to >>>>>>>>>>>>>>>>>> supervisor. But I cannot see any dataserver tries to >>>>>>>>>>>>>>>>>> connect >>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>> supervisor. At the same time, in the manager's log, I can >>>>>>>>>>>>>>>>>> see >>>>>>>>>>>>>>>>>> some >>>>>>>>>>>>>>>>>> dataserver are Dropped. >>>>>>>>>>>>>>>>>> How does xrootd decide which dataserver will connect >>>>>>>>>>>>>>>>>> supervisor? >>>>>>>>>>>>>>>>>> should I specify some dataservers to connect the >>>>>>>>>>>>>>>>>> supervisor? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> (*) supervisor log >>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for >>>>>>>>>>>>>>>>>> state >>>>>>>>>>>>>>>>>> dlen=42 >>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: >>>>>>>>>>>>>>>>>> /atlas/xrootd/users/wguan/test/test131141 >>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: >>>>>>>>>>>>>>>>>> Path >>>>>>>>>>>>>>>>>> find >>>>>>>>>>>>>>>>>> failed for state /atlas/xrootd/users/wguan/test/test131141 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> (*)manager log >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu >>>>>>>>>>>>>>>>>> TSpace=5587GB >>>>>>>>>>>>>>>>>> NumFS=1 >>>>>>>>>>>>>>>>>> FSpace=5693644MB MinFR=57218MB Util=0 >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding >>>>>>>>>>>>>>>>>> path: >>>>>>>>>>>>>>>>>> w >>>>>>>>>>>>>>>>>> /atlas >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> do_Space: 5696231MB free; 0% util >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 logged in. >>>>>>>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from >>>>>>>>>>>>>>>>>> [log in to unmask] >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running >>>>>>>>>>>>>>>>>> ?:[log in to unmask] >>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD >>>>>>>>>>>>>>>>>> 79 >>>>>>>>>>>>>>>>>> attached >>>>>>>>>>>>>>>>>> to poller 2; num=22 >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add >>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>> bumps >>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 #63 >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node: >>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 dropped. 
>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved >>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 to cluster; >>>>>>>>>>>>>>>>>> id=63.78; >>>>>>>>>>>>>>>>>> num=64; >>>>>>>>>>>>>>>>>> min=51 >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu >>>>>>>>>>>>>>>>>> TSpace=5587GB >>>>>>>>>>>>>>>>>> NumFS=1 >>>>>>>>>>>>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0 >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding >>>>>>>>>>>>>>>>>> path: >>>>>>>>>>>>>>>>>> w >>>>>>>>>>>>>>>>>> /atlas >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> do_Space: 5721854MB free; 0% util >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 logged in. >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from >>>>>>>>>>>>>>>>>> c187.chtc.wisc.edu; connection reset by peer >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 node 63.78 >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 79 detached from poller 2; num=21 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service >>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>> c177.chtc.wisc.edu >>>>>>>>>>>>>>>>>> FD=16 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 node 0.3 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.21656:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 16 detached from poller 2; num=20 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>> c179.chtc.wisc.edu >>>>>>>>>>>>>>>>>> FD=21 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.17065:[log in to unmask]:1094 node 1.4 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>> server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 21 >>>>>>>>>>>>>>>>>> detached from poller 1; num=21 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to suspended >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to >>>>>>>>>>>>>>>>>> redirector.15656:14@atlas-bkp2 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service >>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>> c182.chtc.wisc.edu >>>>>>>>>>>>>>>>>> FD=19 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 node 7.10 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 19 detached from poller 2; num=19 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service >>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>> c178.chtc.wisc.edu >>>>>>>>>>>>>>>>>> FD=15 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 node 9.12 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 15 detached from poller 1; num=20 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service >>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>> c181.chtc.wisc.edu >>>>>>>>>>>>>>>>>> FD=17 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 node 5.8 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 17 detached from poller 0; num=21 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service >>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>> c183.chtc.wisc.edu >>>>>>>>>>>>>>>>>> FD=22 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 node 8.11 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.27735:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 22 detached from poller 2; num=18 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>> c184.chtc.wisc.edu >>>>>>>>>>>>>>>>>> FD=20 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.4131:[log in to unmask]:1094 node 3.6 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.26787:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 20 detached from poller 0; num=20 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service >>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>> c185.chtc.wisc.edu >>>>>>>>>>>>>>>>>> FD=23 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 node 6.9 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>> server.8524:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 23 >>>>>>>>>>>>>>>>>> detached from poller 0; num=19 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service >>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>> c180.chtc.wisc.edu >>>>>>>>>>>>>>>>>> FD=18 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 node 4.7 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>> server.14636:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.14636:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 18 detached from poller 1; num=19 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service >>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>> c186.chtc.wisc.edu >>>>>>>>>>>>>>>>>> FD=24 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 node 2.5 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>> server.7849:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 24 >>>>>>>>>>>>>>>>>> detached from poller 1; num=18 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled. 
>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Wen >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky >>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hi Wen, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> To go past 64 data servers you will need to setup one or >>>>>>>>>>>>>>>>>>> more >>>>>>>>>>>>>>>>>>> supervisors. >>>>>>>>>>>>>>>>>>> This does not logically change the current configuration >>>>>>>>>>>>>>>>>>> you >>>>>>>>>>>>>>>>>>> have. >>>>>>>>>>>>>>>>>>> You >>>>>>>>>>>>>>>>>>> only >>>>>>>>>>>>>>>>>>> need to configure one or more *new* servers (or at least >>>>>>>>>>>>>>>>>>> xrootd >>>>>>>>>>>>>>>>>>> processes) >>>>>>>>>>>>>>>>>>> whose role is supervisor. We'd like them to run in >>>>>>>>>>>>>>>>>>> separate >>>>>>>>>>>>>>>>>>> machines >>>>>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>>>>> reliability purposes, but they could run on the manager >>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>> as >>>>>>>>>>>>>>>>>>> long >>>>>>>>>>>>>>>>>>> as >>>>>>>>>>>>>>>>>>> you >>>>>>>>>>>>>>>>>>> give each one a unique instance name (i.e., -n option). >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> The front part of the cmsd reference explains how to do >>>>>>>>>>>>>>>>>>> this. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Andy >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Is there any change to configure xrootd with more than >>>>>>>>>>>>>>>>>>>> 65 >>>>>>>>>>>>>>>>>>>> machines? I used the configure below but it doesn't >>>>>>>>>>>>>>>>>>>> work. >>>>>>>>>>>>>>>>>>>> Should >>>>>>>>>>>>>>>>>>>> I >>>>>>>>>>>>>>>>>>>> configure some machines' manager to be supvervisor? >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Wen >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> >>>> >>>> >>>> >>> >>> >>> >> >> >> > > >
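Concretely, the supervisor setup Andy describes in his first reply (quoted
just above) has two parts: the new node gets the supervisor role in the
shared config, and if an extra cmsd/xrootd pair is run on an existing
machine it gets its own instance name via -n. A minimal sketch; the
supervisor host name and the file paths here are hypothetical, and the full
details are in the cms configuration reference linked above:

    # in the shared config (e.g. xrdcluster.cfg): one new node takes the
    # supervisor role, everything else keeps its current role
    if supervisor1.cs.wisc.edu
       all.role supervisor
    fi

    # alternatively, start an extra cmsd/xrootd pair on an existing host
    # under its own instance name (-n), pointing at the same config file
    cmsd   -n super -c /path/to/xrdcluster.cfg -l /var/log/xrootd/cmsd.log &
    xrootd -n super -c /path/to/xrdcluster.cfg -l /var/log/xrootd/xrootd.log &

Which data servers end up subscribed to the supervisor is decided internally
by the cluster management protocol; as Andy notes above, servers should
never be pointed at a supervisor directly.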