Hi Andy,

Which xrootd version are you using? My XrdCmsConfig.hh is different; it was downloaded from http://xrootd.slac.stanford.edu/download/20091028-1003/.

[root@c121 xrootd]# md5sum src/XrdCms/XrdCmsNode.cc
6fb3ae40fe4e10bdd4d372818a341f2c  src/XrdCms/XrdCmsNode.cc
[root@c121 xrootd]# md5sum src/XrdCms/XrdCmsConfig.hh
7d57753847d9448186c718f98e963cbe  src/XrdCms/XrdCmsConfig.hh

Thanks
Wen

On Tue, Dec 15, 2009 at 10:50 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
> Hi Wen,
>
> Just compiled on Linux and it was clean. Something is really wrong with your
> source files, specifically XrdCmsConfig.cc
>
> The MD5 checksums on the relevant files are:
>
> MD5 (XrdCmsNode.cc) = 6fb3ae40fe4e10bdd4d372818a341f2c
> MD5 (XrdCmsConfig.hh) = 4a7d655582a7cd43b098947d0676924b
>
> Andy
>
> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
> To: "Andrew Hanushevsky" <[log in to unmask]>
> Cc: <[log in to unmask]>
> Sent: Tuesday, December 15, 2009 4:24 AM
> Subject: Re: xrootd with more than 65 machines
>
> Hi Andy,
>
> No problem, and thanks for the fix, but it does not compile. The version I am
> using is http://xrootd.slac.stanford.edu/download/20091028-1003/.
>
> Making cms component...
> Compiling XrdCmsNode.cc
> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Chmod(XrdCmsRRData&)':
> XrdCmsNode.cc:268: error: `fsExec' was not declared in this scope
> XrdCmsNode.cc:268: warning: unused variable 'fsExec'
> XrdCmsNode.cc:269: error: 'class XrdCmsConfig' has no member named 'ossFS'
> XrdCmsNode.cc:273: error: `fsFail' was not declared in this scope
> XrdCmsNode.cc:273: warning: unused variable 'fsFail'
> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mkdir(XrdCmsRRData&)':
> XrdCmsNode.cc:600: error: `fsExec' was not declared in this scope
> XrdCmsNode.cc:600: warning: unused variable 'fsExec'
> XrdCmsNode.cc:601: error: 'class XrdCmsConfig' has no member named 'ossFS'
> XrdCmsNode.cc:605: error: `fsFail' was not declared in this scope
> XrdCmsNode.cc:605: warning: unused variable 'fsFail'
> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mkpath(XrdCmsRRData&)':
> XrdCmsNode.cc:640: error: `fsExec' was not declared in this scope
> XrdCmsNode.cc:640: warning: unused variable 'fsExec'
> XrdCmsNode.cc:641: error: 'class XrdCmsConfig' has no member named 'ossFS'
> XrdCmsNode.cc:645: error: `fsFail' was not declared in this scope
> XrdCmsNode.cc:645: warning: unused variable 'fsFail'
> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mv(XrdCmsRRData&)':
> XrdCmsNode.cc:704: error: `fsExec' was not declared in this scope
> XrdCmsNode.cc:704: warning: unused variable 'fsExec'
> XrdCmsNode.cc:705: error: 'class XrdCmsConfig' has no member named 'ossFS'
> XrdCmsNode.cc:709: error: `fsFail' was not declared in this scope
> XrdCmsNode.cc:709: warning: unused variable 'fsFail'
> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Rm(XrdCmsRRData&)':
> XrdCmsNode.cc:831: error: `fsExec' was not declared in this scope
> XrdCmsNode.cc:831: warning: unused variable 'fsExec'
> XrdCmsNode.cc:832: error: 'class XrdCmsConfig' has no member named 'ossFS'
> XrdCmsNode.cc:836: error: `fsFail' was not declared in this scope
> XrdCmsNode.cc:836: warning: unused variable 'fsFail'
> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Rmdir(XrdCmsRRData&)':
> XrdCmsNode.cc:873: error: `fsExec' was not declared in this scope
> XrdCmsNode.cc:873: warning: unused variable 'fsExec'
> XrdCmsNode.cc:874: error: 'class XrdCmsConfig' has no member named 'ossFS'
> XrdCmsNode.cc:878: error: `fsFail' was not declared in this scope
> XrdCmsNode.cc:878: warning: unused variable 'fsFail'
> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Trunc(XrdCmsRRData&)':
> XrdCmsNode.cc:1377: error: `fsExec' was not declared in this scope
> XrdCmsNode.cc:1377: warning: unused variable 'fsExec'
> XrdCmsNode.cc:1378: error: 'class XrdCmsConfig' has no member named 'ossFS'
> XrdCmsNode.cc:1382: error: `fsFail' was not declared in this scope
> XrdCmsNode.cc:1382: warning: unused variable 'fsFail'
> XrdCmsNode.cc: At global scope:
> XrdCmsNode.cc:1524: error: no `int XrdCmsNode::fsExec(XrdOucProg*, char*, char*)' member function declared in class `XrdCmsNode'
> XrdCmsNode.cc: In member function `int XrdCmsNode::fsExec(XrdOucProg*, char*, char*)':
> XrdCmsNode.cc:1533: error: `fsL2PFail1' was not declared in this scope
> XrdCmsNode.cc:1533: warning: unused variable 'fsL2PFail1'
> XrdCmsNode.cc:1537: error: `fsL2PFail2' was not declared in this scope
> XrdCmsNode.cc:1537: warning: unused variable 'fsL2PFail2'
> XrdCmsNode.cc: At global scope:
> XrdCmsNode.cc:1553: error: no `const char* XrdCmsNode::fsFail(const char*, const char*, const char*, int)' member function declared in class `XrdCmsNode'
> XrdCmsNode.cc: In member function `const char* XrdCmsNode::fsFail(const char*, const char*, const char*, int)':
> XrdCmsNode.cc:1559: error: `fsL2PFail1' was not declared in this scope
> XrdCmsNode.cc:1559: warning: unused variable 'fsL2PFail1'
> XrdCmsNode.cc:1560: error: `fsL2PFail2' was not declared in this scope
> XrdCmsNode.cc:1560: warning: unused variable 'fsL2PFail2'
> XrdCmsNode.cc: In static member function `static int XrdCmsNode::isOnline(char*, int)':
> XrdCmsNode.cc:1608: error: 'class XrdCmsConfig' has no member named 'ossFS'
> make[4]: *** [../../obj/XrdCmsNode.o] Error 1
> make[3]: *** [Linuxall] Error 2
> make[2]: *** [all] Error 2
> make[1]: *** [XrdCms] Error 2
> make: *** [all] Error 2
>
> Wen
>
> On Tue, Dec 15, 2009 at 2:08 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>> Hi Wen,
>>
>> I have developed a permanent fix. You will find the source files in
>>
>> http://www.slac.stanford.edu/~abh/cmsd/
>>
>> There are three files: XrdCmsCluster.cc XrdCmsNode.cc XrdCmsProtocol.cc
>>
>> Please do a source replacement and recompile. Unfortunately, the cmsd will
>> need to be replaced on each node regardless of role. My apologies for the
>> disruption. Please let me know how it goes.
>>
>> Andy
>>
>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>> To: "Andrew Hanushevsky" <[log in to unmask]>
>> Cc: <[log in to unmask]>
>> Sent: Sunday, December 13, 2009 7:04 AM
>> Subject: Re: xrootd with more than 65 machines
>>
>> Hi Andrew,
>>
>> Thanks. I used the new cmsd at the atlas-bkp1 manager, but it is still
>> dropping nodes, and in the supervisor's log I cannot find any data server
>> registering with it.
>>
>> The new logs are in http://higgs03.cs.wisc.edu/wguan/*.20091213.
>> The manager was patched at 091213 08:38:15.
>>
>> Wen
>>
>> On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>> Hi Wen
>>>
>>> You will find the source replacement at:
>>>
>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>
>>> It's XrdCmsCluster.cc and it replaces xrootd/src/XrdCms/XrdCmsCluster.cc
>>>
>>> I'm stepping out for a couple of hours but will be back to see how things went.
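For reference on the build failure quoted above: the patched XrdCmsNode.cc evidently expects newer headers than the 20091028-1003 release provides. The sketch below collects the declarations the compiler says are missing; the member-function signatures are taken verbatim from the error messages, while the parameter names, the type of ossFS, and the constant values are assumptions, not the actual xrootd headers.

    // Sketch only: what the patched XrdCmsNode.cc appears to require from
    // the headers, inferred from the g++ messages above. Names and values
    // marked "assumed" are illustrative, not the real xrootd declarations.

    class XrdOss;       // storage-system interface; assumed type of ossFS
    class XrdOucProg;   // external program wrapper passed to fsExec()

    class XrdCmsConfig
    {
    public:
       XrdOss *ossFS;   // referenced by do_Chmod(), do_Mkdir(), do_Mkpath(),
                        // do_Mv(), do_Rm(), do_Rmdir(), do_Trunc(), isOnline()
       // ... remaining members unchanged ...
    };

    class XrdCmsNode
    {
    private:
       // Result codes referenced by fsExec()/fsFail(); values are assumed.
       static const int fsL2PFail1 = 1001;
       static const int fsL2PFail2 = 1002;

       int         fsExec(XrdOucProg *Prog, char *Arg1, char *Arg2);
       const char *fsFail(const char *Who, const char *What,
                          const char *Path, int rc);
       // ... remaining members unchanged ...
    };

In other words, the errors suggest the header files would also have to come from the same patched source tree as the three replacement .cc files, not just the 20091028-1003 release headers.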
>>> Sorry for the issues :-(
>>>
>>> Andy
>>>
>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>> I prefer a source replacement, so I can compile it myself.
>>>>
>>>> Thanks
>>>> Wen
>>>>
>>>>> I can do one of two things here:
>>>>>
>>>>> 1) Supply a source replacement and then you would recompile, or
>>>>>
>>>>> 2) Give me the uname -a of where the cmsd will run and I'll supply a
>>>>> binary replacement for you.
>>>>>
>>>>> Your choice.
>>>>>
>>>>> Andy
>>>>>
>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>
>>>>>> Hi Andrew
>>>>>>
>>>>>> The problem is found. Great. Thanks.
>>>>>>
>>>>>> Where can I find the patched cmsd?
>>>>>>
>>>>>> Wen
>>>>>>
>>>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>> Hi Wen,
>>>>>>>
>>>>>>> I found the problem. Looks like a regression from way back when. There is
>>>>>>> a missing flag on the redirect. This will require a patched cmsd but you
>>>>>>> need only to replace the redirector's cmsd as this only affects the
>>>>>>> redirector. How would you like to proceed?
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>
>>>>>>>> Hi Andrew,
>>>>>>>>
>>>>>>>> It doesn't work. The atlas-bkp1 manager is still dropping nodes, and in
>>>>>>>> the supervisor I still haven't seen any data server registered. I said
>>>>>>>> "I updated the ntp" because you said the log timestamps do not overlap.
>>>>>>>>
>>>>>>>> Wen
>>>>>>>>
>>>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>> Hi Wen,
>>>>>>>>>
>>>>>>>>> Do you mean that everything is now working? It could be that you removed
>>>>>>>>> the xrd.timeout directive. That really could cause problems. As for the
>>>>>>>>> delays, that is normal when the redirector thinks something is going wrong.
>>>>>>>>> The strategy is to delay clients until it can get back to a stable
>>>>>>>>> configuration. This usually prevents jobs from crashing during stressful
>>>>>>>>> periods.
>>>>>>>>>
>>>>>>>>> Andy
>>>>>>>>>
>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>
>>>>>>>>>> Hi Andrew,
>>>>>>>>>>
>>>>>>>>>> I restarted it to test the supervisor, and also because the xrootd
>>>>>>>>>> manager frequently does not respond. In the cms.log excerpt (*) below,
>>>>>>>>>> the file select is delayed again and again; after a restart everything
>>>>>>>>>> is fine. Now I am trying to find a clue about it.
>>>>>>>>>> (*)
>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>>>>>
>>>>>>>>>> There is no core file. I copied new copies of the logs to the link below.
>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>
>>>>>>>>>> Wen
>>>>>>>>>>
>>>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>
>>>>>>>>>>> I see in the server log that it is restarting often. Could you take a look
>>>>>>>>>>> on c193 to see if you have any core files? Also please make sure that core
>>>>>>>>>>> files are enabled, as Linux defaults the size to 0. The first step here is
>>>>>>>>>>> to find out why your servers are restarting.
>>>>>>>>>>>
>>>>>>>>>>> Andy
>>>>>>>>>>>
>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>
>>>>>>>>>>>> The logs can be found here. From the log you can see the atlas-bkp1
>>>>>>>>>>>> manager dropping, again and again, the nodes that try to connect to it.
>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Could you start everything up and provide me a pointer to the manager log
>>>>>>>>>>>>> file, supervisor log file, and one data server log file, all of which cover
>>>>>>>>>>>>> the same time-frame (from start to some point where you think things are
>>>>>>>>>>>>> working or not). That way I can see what is happening. At the moment I only
>>>>>>>>>>>>> see two "bad" things in the config file:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager but you claim,
>>>>>>>>>>>>> via the all.manager directive, that there are three (bkp2 and bkp3).
>>>>>>>>>>>>> While it should work, the log file will be dense with error messages.
>>>>>>>>>>>>> Please correct this to be consistent and make it easier to see real errors.
>>>>>>>>>>>>
>>>>>>>>>>>> This is not a problem for me, because this config is used on the data
>>>>>>>>>>>> servers. On the managers, I changed the "if atlas-bkp1.cs.wisc.edu" line to
>>>>>>>>>>>> atlas-bkp2 and so on. This is historical: at first only atlas-bkp1 was used;
>>>>>>>>>>>> atlas-bkp2 and atlas-bkp3 were added later.
>>>>>>>>>>>>
>>>>>>>>>>>>> 2) Please use cms.space not olb.space (for historical reasons the latter is
>>>>>>>>>>>>> still accepted and overrides the former, but that will soon end), and please
>>>>>>>>>>>>> use only one (the config file uses both directives).
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, I should remove this line; in fact cms.space is in the cfg too.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Wen
>>>>>>>>>>>>
>>>>>>>>>>>>> The xrootd has an internal mechanism to connect servers with supervisors to
>>>>>>>>>>>>> allow for maximum reliability. You cannot change that algorithm and there is
>>>>>>>>>>>>> no need to do so. You should *never* tell anyone to directly connect to a
>>>>>>>>>>>>> supervisor. If you do, you will likely get unreachable nodes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As for dropping data servers, it would appear to me, given the flurry of
>>>>>>>>>>>>> such activity, that something either crashed or was restarted. That's why it
>>>>>>>>>>>>> would be good to see the complete log of each one of the entities.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I read the document and wrote a config file
>>>>>>>>>>>>>> (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>>>>>>>> With my conf I can see the manager dispatching messages to the supervisor,
>>>>>>>>>>>>>> but I cannot see any data server trying to connect to the supervisor. At the
>>>>>>>>>>>>>> same time, in the manager's log, I can see some data servers are dropped.
>>>>>>>>>>>>>> How does xrootd decide which data servers will connect to a supervisor?
>>>>>>>>>>>>>> Should I point some data servers at the supervisor?
>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> (*) supervisor log >>>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for >>>>>>>>>>>>>> state >>>>>>>>>>>>>> dlen=42 >>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: >>>>>>>>>>>>>> /atlas/xrootd/users/wguan/test/test131141 >>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: >>>>>>>>>>>>>> Path >>>>>>>>>>>>>> find >>>>>>>>>>>>>> failed for state /atlas/xrootd/users/wguan/test/test131141 >>>>>>>>>>>>>> >>>>>>>>>>>>>> (*)manager log >>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB >>>>>>>>>>>>>> NumFS=1 >>>>>>>>>>>>>> FSpace=5693644MB MinFR=57218MB Util=0 >>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w >>>>>>>>>>>>>> /atlas >>>>>>>>>>>>>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>> do_Space: 5696231MB free; 0% util >>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 logged in. >>>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from >>>>>>>>>>>>>> [log in to unmask] >>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running >>>>>>>>>>>>>> ?:[log in to unmask] >>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd >>>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79 >>>>>>>>>>>>>> attached >>>>>>>>>>>>>> to poller 2; num=22 >>>>>>>>>>>>>> 091211 04:13:24 15661 Add server.21739:[log in to unmask] >>>>>>>>>>>>>> bumps >>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 #63 >>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node: >>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 dropped. >>>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved >>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 to cluster; id=63.78; >>>>>>>>>>>>>> num=64; >>>>>>>>>>>>>> min=51 >>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB >>>>>>>>>>>>>> NumFS=1 >>>>>>>>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0 >>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w >>>>>>>>>>>>>> /atlas >>>>>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask]:1094 >>>>>>>>>>>>>> do_Space: 5721854MB free; 0% util >>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 logged in. >>>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from >>>>>>>>>>>>>> c187.chtc.wisc.edu; connection reset by peer >>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60 >>>>>>>>>>>>>> seconds >>>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node >>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 node 63.78 >>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>> logged >>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask] >>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>> FD >>>>>>>>>>>>>> 79 detached from poller 2; num=21 >>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>> suspend >>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service >>>>>>>>>>>>>> suspended >>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>> c177.chtc.wisc.edu >>>>>>>>>>>>>> FD=16 >>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 node 0.3 >>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>>>> logged >>>>>>>>>>>>>> out. >>>>>>>>>>>>>> 091211 04:13:27 15661 server.21656:[log in to unmask] >>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>> FD >>>>>>>>>>>>>> 16 detached from poller 2; num=20 >>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>> c179.chtc.wisc.edu >>>>>>>>>>>>>> FD=21 >>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>> server.17065:[log in to unmask]:1094 node 1.4 >>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>> server.7978:[log in to unmask] >>>>>>>>>>>>>> logged >>>>>>>>>>>>>> out. >>>>>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] >>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>> FD >>>>>>>>>>>>>> 21 >>>>>>>>>>>>>> detached from poller 1; num=21 >>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to suspended >>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to >>>>>>>>>>>>>> redirector.15656:14@atlas-bkp2 >>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>> suspend >>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service >>>>>>>>>>>>>> suspended >>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>> c182.chtc.wisc.edu >>>>>>>>>>>>>> FD=19 >>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 node 7.10 >>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>> logged >>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>> 091211 04:13:27 15661 server.26620:[log in to unmask] >>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>> FD >>>>>>>>>>>>>> 19 detached from poller 2; num=19 >>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>> suspend >>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service >>>>>>>>>>>>>> suspended >>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>> c178.chtc.wisc.edu >>>>>>>>>>>>>> FD=15 >>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 node 9.12 >>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>> logged >>>>>>>>>>>>>> out. >>>>>>>>>>>>>> 091211 04:13:27 15661 server.11901:[log in to unmask] >>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>> FD >>>>>>>>>>>>>> 15 detached from poller 1; num=20 >>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>> suspend >>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service >>>>>>>>>>>>>> suspended >>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>> c181.chtc.wisc.edu >>>>>>>>>>>>>> FD=17 >>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 node 5.8 >>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>> logged >>>>>>>>>>>>>> out. >>>>>>>>>>>>>> 091211 04:13:27 15661 server.13984:[log in to unmask] >>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>> FD >>>>>>>>>>>>>> 17 detached from poller 0; num=21 >>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>> suspend >>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service >>>>>>>>>>>>>> suspended >>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>> c183.chtc.wisc.edu >>>>>>>>>>>>>> FD=22 >>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 node 8.11 >>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>>>> logged >>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>> 091211 04:13:27 15661 server.27735:[log in to unmask] >>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>> FD >>>>>>>>>>>>>> 22 detached from poller 2; num=18 >>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>> c184.chtc.wisc.edu >>>>>>>>>>>>>> FD=20 >>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>> server.4131:[log in to unmask]:1094 node 3.6 >>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>>>> logged >>>>>>>>>>>>>> out. >>>>>>>>>>>>>> 091211 04:13:27 15661 server.26787:[log in to unmask] >>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>> FD >>>>>>>>>>>>>> 20 detached from poller 0; num=20 >>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>> suspend >>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service >>>>>>>>>>>>>> suspended >>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>> c185.chtc.wisc.edu >>>>>>>>>>>>>> FD=23 >>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 node 6.9 >>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>> server.8524:[log in to unmask] >>>>>>>>>>>>>> logged >>>>>>>>>>>>>> out. >>>>>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] >>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>> FD >>>>>>>>>>>>>> 23 >>>>>>>>>>>>>> detached from poller 0; num=19 >>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>> suspend >>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service >>>>>>>>>>>>>> suspended >>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>> c180.chtc.wisc.edu >>>>>>>>>>>>>> FD=18 >>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 node 4.7 >>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>> server.14636:[log in to unmask] >>>>>>>>>>>>>> logged >>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>> 091211 04:13:27 15661 server.14636:[log in to unmask] >>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>> FD >>>>>>>>>>>>>> 18 detached from poller 1; num=19 >>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 >>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094 >>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>> suspend >>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service >>>>>>>>>>>>>> suspended >>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>> c186.chtc.wisc.edu >>>>>>>>>>>>>> FD=24 >>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 node 2.5 >>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>> server.7849:[log in to unmask] >>>>>>>>>>>>>> logged >>>>>>>>>>>>>> out. >>>>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] >>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>> FD >>>>>>>>>>>>>> 24 >>>>>>>>>>>>>> detached from poller 1; num=18 >>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>> seconds >>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>> seconds >>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>> seconds >>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>> seconds >>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>> seconds >>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>> seconds >>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>> seconds >>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>> seconds >>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>> seconds >>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>> seconds >>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled. >>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled. >>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled. >>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled. >>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled. 
>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To go past 64 data servers you will need to set up one or more supervisors.
>>>>>>>>>>>>>>> This does not logically change the current configuration you have. You only
>>>>>>>>>>>>>>> need to configure one or more *new* servers (or at least xrootd processes)
>>>>>>>>>>>>>>> whose role is supervisor. We'd like them to run on separate machines for
>>>>>>>>>>>>>>> reliability purposes, but they could run on the manager node as long as you
>>>>>>>>>>>>>>> give each one a unique instance name (i.e., -n option).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The front part of the cmsd reference explains how to do this.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is there a way to configure xrootd with more than 65 machines? I used the
>>>>>>>>>>>>>>>> configuration below but it doesn't work. Should I configure some machines'
>>>>>>>>>>>>>>>> role to be supervisor?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Wen
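To make the supervisor recipe above concrete, here is a minimal configuration sketch in the spirit of the cms_config reference Andy cites. The manager host name and the /atlas export path come from this thread; the supervisor host name (super01.chtc.wisc.edu) and the port number are illustrative placeholders, so check the cmsd reference for the exact directives before using it.

    # Sketch only -- atlas-bkp1.cs.wisc.edu and /atlas are from this thread;
    # super01.chtc.wisc.edu and port 3121 are placeholders, not real settings.

    all.export  /atlas
    all.manager atlas-bkp1.cs.wisc.edu:3121

    if atlas-bkp1.cs.wisc.edu
       all.role manager
    else if super01.chtc.wisc.edu
       all.role supervisor
    else
       all.role server
    fi

A supervisor could also share the manager's machine if started with its own instance name via -n, as Andy notes. Each supervisor subscribes to the manager the same way a data server does, and data servers are assigned to supervisors automatically, so nothing needs to be pointed at a supervisor by hand.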