Hi Andy,

I put new logs on the web. It still doesn't work: I cannot copy files in or out. It seems the xrootd daemon at atlas-bkp1 hasn't talked with the cmsd. Normally, when the xrootd daemon tries to copy a file, I should see "do_Select: filename" in the cms.log. But in this cms.log, there is nothing from atlas-bkp1.

(*)
[root@atlas-bkp1 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
Last server error 10000 ('')
Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
[root@atlas-bkp1 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123133

Wen

On Thu, Dec 17, 2009 at 10:54 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
> Hi Wen,
>
> I reviewed the log files, apart from the odd redirect of c131 at 17:47:25, which I can't comment on because its logs on the web site do not overlap with the manager's or supervisor's. Unless all the logs cover the full time in question I can't say much of anything. Can you provide me with inclusive logs?
>
> atlas-bkp1 cms: 17:20:57 to 17:42:19, xrd: 17:20:57 to 17:40:57
> higgs07 cms & xrd: 17:22:33 to 17:42:33
> c131 cms & xrd: 17:31:57 to 17:47:28
>
> That said, it certainly looks like things were working and files were being accessed and discovered on all the machines. You even were able to open /atlas/xrootd/users/wguan/test/test98123313, though not /atlas/xrootd/users/wguan/test/test123131. The other issue is that you did not specify a stable adminpath, and the adminpath defaults to /tmp. If you have a "cleanup" script that runs periodically on /tmp then eventually your cluster will go catatonic as important (but not often used) files are deleted by that script. Could you please find a stable home for the adminpath?
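>
> For example, one line in the config would do it (just a sketch; the directory here is only an illustration, any persistent path that your cleanup scripts leave alone is fine):
>
> all.adminpath /var/spool/xrootd
>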
> I reran my tests here and things worked as expected. I will ramp up some more tests. So, what is your status today?
>
> Andy
>
> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
> To: "Andrew Hanushevsky" <[log in to unmask]>
> Cc: <[log in to unmask]>
> Sent: Thursday, December 17, 2009 5:05 AM
> Subject: Re: xrootd with more than 65 machines
>
> Hi Andy,
>
> Yes. I am using the file downloaded from http://www.slac.stanford.edu/~abh/cmsd/, which I compiled yesterday. I just now compiled it again and compared it with the one I compiled yesterday. They are the same (same md5sum).
>
> Wen
>
> On Thu, Dec 17, 2009 at 2:09 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>
>> Hi Wen,
>>
>> If c131 cannot connect then either c131 does not have the new cmsd or atlas-bkp1 does not have the new cmsd, as that is what would happen if either were true. Looking at the log on c131 it would appear that atlas-bkp1 is still using the old cmsd, as the response data length is wrong. Could you please verify?
>>
>> Andy
>>
>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>> To: "Andrew Hanushevsky" <[log in to unmask]>
>> Cc: <[log in to unmask]>
>> Sent: Wednesday, December 16, 2009 3:58 PM
>> Subject: Re: xrootd with more than 65 machines
>>
>> Hi Andy,
>>
>> I tried it, but there are still some problems. I put the logs in higgs03.cs.wisc.edu/wguan/.
>>
>> In my test, c131 is the 65th node to be added to the manager. I can copy a file into the pool through the manager, but I cannot copy out a file that is on c131.
>>
>> In c131's cms.log, I see "Manager: manager.0:[log in to unmask] removed; redirected" again and again, and I cannot see anything about c131 in higgs07's log (the supervisor). Does it mean the manager tries to redirect it to higgs07, but c131 never tries to connect to higgs07? It only tries to connect to the manager again.
>>
>> (*)
>> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
>> Last server error 10000 ('')
>> Error accessing path/file for root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
>> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123311
>> [xrootd] Total 0.06 MB |====================| 100.00 % [3.1 MB/s]
>> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123312
>> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
>> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/
>> test123131
>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>> Last server error 3011 ('No servers are available to read the file.')
>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/test123131
>> /atlas/xrootd/users/wguan/test/test123131
>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>> Last server error 3011 ('No servers are available to read the file.')
>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123313
>> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>> Last server error 3011 ('No servers are available to read the file.')
>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>> Last server error 3011 ('No servers are available to read the file.')
>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>> Last server error 3011 ('No servers are available to read the file.')
>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>> [root@c131 ~]# tail -f /var/log/xrootd/cms.log
>> 091216 17:45:52 3103 manager.0:[log in to unmask] XrdLink: Setting ref to 2+-1 post=0
>> 091216 17:45:55 3103 Pander trying to connect to lvl 0 atlas-bkp1.cs.wisc.edu:3121
>> 091216 17:45:55 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
>> 091216 17:45:55 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
>> 091216 17:45:55 3103 ManTree: Now connected to 3 root node(s)
>> 091216 17:45:55 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
>> 091216 17:45:55 3103 Dispatch manager.0:[log in to unmask] for try dlen=3
>> 091216 17:45:55 3103 manager.0:[log in to unmask] do_Try:
>> 091216 17:45:55 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.95
>> 091216 17:45:55 3103 Manager: manager.0:[log in to unmask] removed; redirected
>> 091216 17:46:04 3103 Pander trying to connect to lvl 0 atlas-bkp1.cs.wisc.edu:3121
>> 091216 17:46:04 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
>> 091216 17:46:04 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
>> 091216 17:46:04 3103 ManTree: Now connected to 3 root node(s)
>> 091216 17:46:04 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
>> 091216 17:46:04 3103 Dispatch manager.0:[log in to unmask] for try dlen=3
>> 091216 17:46:04 3103 Protocol: No buffers to serve atlas-bkp1.cs.wisc.edu
>> 091216 17:46:04 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.96
>> 091216 17:46:04 3103 Manager: manager.0:[log in to unmask] removed; insufficient buffers
>> 091216 17:46:11 3103 Dispatch manager.0:[log in to unmask] for state dlen=169
>> 091216 17:46:11 3103 manager.0:[log in to unmask] XrdLink: Setting ref to 1+1 post=0
>>
>> Thanks
>> Wen
>>
>> On Thu, Dec 17, 2009 at 12:10 AM, wen guan <[log in to unmask]> wrote:
>>>
>>> Hi Andy,
>>>
>>>> OK, I understand. As for stalling, too many nodes were deemed to be in trouble for the manager to allow service resumption.
>>>>
>>>> Please make sure that all of the nodes in the cluster receive the new cmsd, as they will drop off with the old one and you'll see the same kind of activity. Perhaps the best way to know that you succeeded in putting everything in sync is to start with 63 data nodes plus one supervisor. Once all connections are established, adding an additional server should simply send it to the supervisor.
>>>
>>> I will do it.
>>> You said to start 63 data servers and one supervisor. Does it mean the supervisor is managed using the same policy? If there are 64 dataservers which connected before the supervisor, will the supervisor be dropped? Does the supervisor have high priority to be added to the manager? I mean, if there are already 64 dataservers and a supervisor comes in, will the supervisor be accepted and a dataserver be redirected to the supervisor?
>>>
>>> Thanks
>>> Wen
>>>
>>>> Hi Andrew,
>>>>
>>>> But when I tried to xrdcp a file to it, it doesn't respond. In atlas-bkp1-xrd.log.20091213, it always prints "stalling client for 10 sec". But in cms.log, I can't find any message about the file.
>>>>
>>>>> I don't see why you say it doesn't work. With the debugging level set so high the noise may make it look like something is going wrong but that isn't necessarily the case.
>>>>>
>>>>> 1) The 'too many subscribers' is correct. The manager was simply redirecting them because there were already 64 servers. However, in your case the supervisor wasn't started until almost 30 minutes after everyone else (i.e., 10:42 AM). Why was that? I'm not surprised about the flurry of messages with a critical component missing for 30 minutes.
>>>>
>>>> Because the manager is a 64-bit machine but the supervisor is a 32-bit machine, I had to recompile it. At that time, I was interrupted by something else.
>>>>
>>>>> 2) Once the supervisor started, it started accepting the redirected servers.
>>>>>
>>>>> 3) Then 10 seconds (10:42:10) later the supervisor was restarted. So, that would cause a flurry of activity to occur as there is no backup supervisor to take over.
>>>>>
>>>>> 4) This happened again at 10:42:34 AM then again at 10:48:49. Is the supervisor crashing? Is there a core file?
>>>>>
>>>>> 5) At 11:11 AM the manager restarted. Again, is there a core file here or was this a manual action?
>>>>>
>>>>> During the course of all of this, all nodes connected were operating properly and files were being located.
>>>>>
>>>>> So, the two big questions are:
>>>>>
>>>>> a) Why was the supervisor not started until 30 minutes after the system was started?
>>>>>
>>>>> b) Is there an explanation of the restarts? If this was a crash then we need a core file to figure out what happened.
>>>>
>>>> It's not a crash. There are some reasons that I restarted some daemons.
>>>> (1) I thought that if a dataserver tried many times to connect to a redirector but failed, the dataserver would not try to connect to the redirector again. The supervisor was missing for a long time, so maybe some dataservers would not try to connect to atlas-bkp1 again. To reactivate these dataservers, I restarted the servers.
>>>> (2) When I tried to xrdcp, it was hanging for a long time. I thought maybe the manager was affected by something else, so I restarted the manager to see whether a restart could make the xrdcp work.
>>>>
>>>> Thanks
>>>> Wen
>>>>
>>>>> Andy
>>>>>
>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>> Cc: <[log in to unmask]>
>>>>> Sent: Wednesday, December 16, 2009 9:38 AM
>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>
>>>>> Hi Andrew,
>>>>>
>>>>> It still doesn't work.
>>>>> The log files are in higgs03.cs.wisc.edu/wguan/. The names are *.20091216.
>>>>> The manager complains there are too many subscribers and then removes nodes.
>>>>>
>>>>> (*)
>>>>> Add server.10040:[log in to unmask] redirected; too many subscribers.
>>>>>
>>>>> Wen
>>>>>
>>>>> On Wed, Dec 16, 2009 at 4:25 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>
>>>>>> Hi Wen,
>>>>>>
>>>>>> It will be easier for me to retrofit as the changes were pretty minor. Please lift the new XrdCmsNode.cc file from
>>>>>>
>>>>>> http://www.slac.stanford.edu/~abh/cmsd
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>> Cc: <[log in to unmask]>
>>>>>> Sent: Tuesday, December 15, 2009 5:12 PM
>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>
>>>>>> Hi Andy,
>>>>>>
>>>>>> I can switch to 20091104-1102. Then you don't need to patch another version. How can I download v20091104-1102?
>>>>>>
>>>>>> Thanks
>>>>>> Wen
>>>>>>
>>>>>> On Wed, Dec 16, 2009 at 12:52 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>
>>>>>>> Hi Wen,
>>>>>>>
>>>>>>> Ah yes, I see that now. The file I gave you is based on v20091104-1102. Let me see if I can retrofit the patch for you.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>> Cc: <[log in to unmask]>
>>>>>>> Sent: Tuesday, December 15, 2009 1:04 PM
>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>
>>>>>>> Hi Andy,
>>>>>>>
>>>>>>> Which xrootd version are you using? XrdCmsConfig.hh is different. My XrdCmsConfig.hh is downloaded from http://xrootd.slac.stanford.edu/download/20091028-1003/.
>>>>>>> >>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsNode.cc >>>>>>> 6fb3ae40fe4e10bdd4d372818a341f2c src/XrdCms/XrdCmsNode.cc >>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsConfig.hh >>>>>>> 7d57753847d9448186c718f98e963cbe src/XrdCms/XrdCmsConfig.hh >>>>>>> >>>>>>> Thanks >>>>>>> Wen >>>>>>> >>>>>>> On Tue, Dec 15, 2009 at 10:50 PM, Andrew Hanushevsky >>>>>>> <[log in to unmask]> >>>>>>> wrote: >>>>>>>> >>>>>>>> Hi Wen, >>>>>>>> >>>>>>>> Just compiled on Linux and it was clean. Something is really wrong >>>>>>>> with >>>>>>>> your >>>>>>>> source files, specifically XrdCmsConfig.cc >>>>>>>> >>>>>>>> The MD5 checksums on the relevant files are: >>>>>>>> >>>>>>>> MD5 (XrdCmsNode.cc) = 6fb3ae40fe4e10bdd4d372818a341f2c >>>>>>>> >>>>>>>> MD5 (XrdCmsConfig.hh) = 4a7d655582a7cd43b098947d0676924b >>>>>>>> >>>>>>>> Andy >>>>>>>> >>>>>>>> ----- Original Message ----- From: "wen guan" >>>>>>>> <[log in to unmask]> >>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]> >>>>>>>> Cc: <[log in to unmask]> >>>>>>>> Sent: Tuesday, December 15, 2009 4:24 AM >>>>>>>> Subject: Re: xrootd with more than 65 machines >>>>>>>> >>>>>>>> >>>>>>>> Hi Andy, >>>>>>>> >>>>>>>> No problem. Thanks for the fix. But it cannot be compiled. The >>>>>>>> version I am using is >>>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/. >>>>>>>> >>>>>>>> Making cms component... >>>>>>>> Compiling XrdCmsNode.cc >>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>> XrdCmsNode::do_Chmod(XrdCmsRRData&)': >>>>>>>> XrdCmsNode.cc:268: error: `fsExec' was not declared in this scope >>>>>>>> XrdCmsNode.cc:268: warning: unused variable 'fsExec' >>>>>>>> XrdCmsNode.cc:269: error: 'class XrdCmsConfig' has no member named >>>>>>>> 'ossFS' >>>>>>>> XrdCmsNode.cc:273: error: `fsFail' was not declared in this scope >>>>>>>> XrdCmsNode.cc:273: warning: unused variable 'fsFail' >>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>> XrdCmsNode::do_Mkdir(XrdCmsRRData&)': >>>>>>>> XrdCmsNode.cc:600: error: `fsExec' was not declared in this scope >>>>>>>> XrdCmsNode.cc:600: warning: unused variable 'fsExec' >>>>>>>> XrdCmsNode.cc:601: error: 'class XrdCmsConfig' has no member named >>>>>>>> 'ossFS' >>>>>>>> XrdCmsNode.cc:605: error: `fsFail' was not declared in this scope >>>>>>>> XrdCmsNode.cc:605: warning: unused variable 'fsFail' >>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>> XrdCmsNode::do_Mkpath(XrdCmsRRData&)': >>>>>>>> XrdCmsNode.cc:640: error: `fsExec' was not declared in this scope >>>>>>>> XrdCmsNode.cc:640: warning: unused variable 'fsExec' >>>>>>>> XrdCmsNode.cc:641: error: 'class XrdCmsConfig' has no member named >>>>>>>> 'ossFS' >>>>>>>> XrdCmsNode.cc:645: error: `fsFail' was not declared in this scope >>>>>>>> XrdCmsNode.cc:645: warning: unused variable 'fsFail' >>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>> XrdCmsNode::do_Mv(XrdCmsRRData&)': >>>>>>>> XrdCmsNode.cc:704: error: `fsExec' was not declared in this scope >>>>>>>> XrdCmsNode.cc:704: warning: unused variable 'fsExec' >>>>>>>> XrdCmsNode.cc:705: error: 'class XrdCmsConfig' has no member named >>>>>>>> 'ossFS' >>>>>>>> XrdCmsNode.cc:709: error: `fsFail' was not declared in this scope >>>>>>>> XrdCmsNode.cc:709: warning: unused variable 'fsFail' >>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>> XrdCmsNode::do_Rm(XrdCmsRRData&)': >>>>>>>> XrdCmsNode.cc:831: error: `fsExec' was not declared in this scope >>>>>>>> XrdCmsNode.cc:831: warning: unused variable 'fsExec' >>>>>>>> 
XrdCmsNode.cc:832: error: 'class XrdCmsConfig' has no member named >>>>>>>> 'ossFS' >>>>>>>> XrdCmsNode.cc:836: error: `fsFail' was not declared in this scope >>>>>>>> XrdCmsNode.cc:836: warning: unused variable 'fsFail' >>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>> XrdCmsNode::do_Rmdir(XrdCmsRRData&)': >>>>>>>> XrdCmsNode.cc:873: error: `fsExec' was not declared in this scope >>>>>>>> XrdCmsNode.cc:873: warning: unused variable 'fsExec' >>>>>>>> XrdCmsNode.cc:874: error: 'class XrdCmsConfig' has no member named >>>>>>>> 'ossFS' >>>>>>>> XrdCmsNode.cc:878: error: `fsFail' was not declared in this scope >>>>>>>> XrdCmsNode.cc:878: warning: unused variable 'fsFail' >>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>> XrdCmsNode::do_Trunc(XrdCmsRRData&)': >>>>>>>> XrdCmsNode.cc:1377: error: `fsExec' was not declared in this scope >>>>>>>> XrdCmsNode.cc:1377: warning: unused variable 'fsExec' >>>>>>>> XrdCmsNode.cc:1378: error: 'class XrdCmsConfig' has no member named >>>>>>>> 'ossFS' >>>>>>>> XrdCmsNode.cc:1382: error: `fsFail' was not declared in this scope >>>>>>>> XrdCmsNode.cc:1382: warning: unused variable 'fsFail' >>>>>>>> XrdCmsNode.cc: At global scope: >>>>>>>> XrdCmsNode.cc:1524: error: no `int XrdCmsNode::fsExec(XrdOucProg*, >>>>>>>> char*, char*)' member function declared in class `XrdCmsNode' >>>>>>>> XrdCmsNode.cc: In member function `int >>>>>>>> XrdCmsNode::fsExec(XrdOucProg*, >>>>>>>> char*, char*)': >>>>>>>> XrdCmsNode.cc:1533: error: `fsL2PFail1' was not declared in this >>>>>>>> scope >>>>>>>> XrdCmsNode.cc:1533: warning: unused variable 'fsL2PFail1' >>>>>>>> XrdCmsNode.cc:1537: error: `fsL2PFail2' was not declared in this >>>>>>>> scope >>>>>>>> XrdCmsNode.cc:1537: warning: unused variable 'fsL2PFail2' >>>>>>>> XrdCmsNode.cc: At global scope: >>>>>>>> XrdCmsNode.cc:1553: error: no `const char* XrdCmsNode::fsFail(const >>>>>>>> char*, const char*, const char*, int)' member function declared in >>>>>>>> class `XrdCmsNode' >>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>> XrdCmsNode::fsFail(const char*, const char*, const char*, int)': >>>>>>>> XrdCmsNode.cc:1559: error: `fsL2PFail1' was not declared in this >>>>>>>> scope >>>>>>>> XrdCmsNode.cc:1559: warning: unused variable 'fsL2PFail1' >>>>>>>> XrdCmsNode.cc:1560: error: `fsL2PFail2' was not declared in this >>>>>>>> scope >>>>>>>> XrdCmsNode.cc:1560: warning: unused variable 'fsL2PFail2' >>>>>>>> XrdCmsNode.cc: In static member function `static int >>>>>>>> XrdCmsNode::isOnline(char*, int)': >>>>>>>> XrdCmsNode.cc:1608: error: 'class XrdCmsConfig' has no member named >>>>>>>> 'ossFS' >>>>>>>> make[4]: *** [../../obj/XrdCmsNode.o] Error 1 >>>>>>>> make[3]: *** [Linuxall] Error 2 >>>>>>>> make[2]: *** [all] Error 2 >>>>>>>> make[1]: *** [XrdCms] Error 2 >>>>>>>> make: *** [all] Error 2 >>>>>>>> >>>>>>>> >>>>>>>> Wen >>>>>>>> >>>>>>>> On Tue, Dec 15, 2009 at 2:08 AM, Andrew Hanushevsky >>>>>>>> <[log in to unmask]> >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Hi Wen, >>>>>>>>> >>>>>>>>> I have developed a permanent fix. You will find the source files in >>>>>>>>> >>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/ >>>>>>>>> >>>>>>>>> There are three files: XrdCmsCluster.cc XrdCmsNode.cc >>>>>>>>> XrdCmsProtocol.cc >>>>>>>>> >>>>>>>>> Please do a source replacement and recompile. Unfortunately, the >>>>>>>>> cmsd >>>>>>>>> will >>>>>>>>> need to be replaced on each node regardless of role. My apologies >>>>>>>>> for >>>>>>>>> the >>>>>>>>> disruption. Please let me know how it goes. 
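>>>>>>>>>
>>>>>>>>> Roughly like this on your build machine (just a sketch; it assumes you unpacked the source tree as xrootd/ and build with the same make invocation as before):
>>>>>>>>>
>>>>>>>>> cd xrootd/src/XrdCms
>>>>>>>>> # -O overwrites the existing files in place
>>>>>>>>> wget -O XrdCmsCluster.cc http://www.slac.stanford.edu/~abh/cmsd/XrdCmsCluster.cc
>>>>>>>>> wget -O XrdCmsNode.cc http://www.slac.stanford.edu/~abh/cmsd/XrdCmsNode.cc
>>>>>>>>> wget -O XrdCmsProtocol.cc http://www.slac.stanford.edu/~abh/cmsd/XrdCmsProtocol.cc
>>>>>>>>> cd ../..
>>>>>>>>> make
>>>>>>>>> # then copy the new cmsd binary to every node and restart it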
>>>>>>>>>
>>>>>>>>> Andy
>>>>>>>>>
>>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>> Sent: Sunday, December 13, 2009 7:04 AM
>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>
>>>>>>>>> Hi Andrew,
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>> I used the new cmsd on the atlas-bkp1 manager, but it's still dropping nodes. And in the supervisor's log, I cannot find any dataserver registering with it.
>>>>>>>>>
>>>>>>>>> The new logs are in http://higgs03.cs.wisc.edu/wguan/*.20091213.
>>>>>>>>> The manager was patched at 091213 08:38:15.
>>>>>>>>>
>>>>>>>>> Wen
>>>>>>>>>
>>>>>>>>> On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Wen
>>>>>>>>>>
>>>>>>>>>> You will find the source replacement at:
>>>>>>>>>>
>>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>>>
>>>>>>>>>> It's XrdCmsCluster.cc and it replaces xrootd/src/XrdCms/XrdCmsCluster.cc
>>>>>>>>>>
>>>>>>>>>> I'm stepping out for a couple of hours but will be back to see how things went. Sorry for the issues :-(
>>>>>>>>>>
>>>>>>>>>> Andy
>>>>>>>>>>
>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>
>>>>>>>>>>> I prefer a source replacement. Then I can compile it.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Wen
>>>>>>>>>>>>
>>>>>>>>>>>> I can do one of two things here:
>>>>>>>>>>>>
>>>>>>>>>>>> 1) Supply a source replacement and then you would recompile, or
>>>>>>>>>>>>
>>>>>>>>>>>> 2) Give me the uname -a of where the cmsd will run and I'll supply a binary replacement for you.
>>>>>>>>>>>>
>>>>>>>>>>>> Your choice.
>>>>>>>>>>>>
>>>>>>>>>>>> Andy
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Andrew
>>>>>>>>>>>>>
>>>>>>>>>>>>> The problem is found. Great. Thanks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Where can I find the patched cmsd?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I found the problem. Looks like a regression from way back when. There is a missing flag on the redirect. This will require a patched cmsd, but you need only replace the redirector's cmsd as this only affects the redirector. How would you like to proceed?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It doesn't work. The atlas-bkp1 manager is dropping nodes again. In the supervisor, I still haven't seen any dataserver registered. I said "I updated the ntp" because you said the log timestamps do not overlap.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Do you mean that everything is now working? It could be that you removed the xrd.timeout directive. That really could cause problems. As for the delays, that is normal when the redirector thinks something is going wrong. The strategy is to delay clients until it can get back to a stable configuration. This usually prevents jobs from crashing during stressful periods.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I restarted it to do the supervisor test, and also because the xrootd manager frequently doesn't respond. (*) is from the cms.log; the file select is delayed again and again. When I do a restart, all things are fine. Now I am trying to find a clue about it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (*)
>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There is no core file. I copied new copies of the logs to the link below.
>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I see in the server log that it is restarting often. Could you take a look on c193 to see if you have any core files? Also please make sure that core files are enabled, as Linux defaults the size to 0.
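>>>>>>>>>>>>>>>>>> For example (a sketch; put it wherever the daemons get started, e.g. the init script, so it takes effect in their environment):
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ulimit -c unlimited   # allow core dumps of any size
>>>>>>>>>>>>>>>>>> ulimit -c             # verify: should print "unlimited"
>>>>>>>>>>>>>>>>>>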
>>>>>>>>>>>>>>>>>> The first step here is to find out why your servers are restarting.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The logs can be found here. From the log you can see the atlas-bkp1 manager is dropping, again and again, the nodes that try to connect to it.
>>>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Wen, Could you start everything up and provide me a pointer to the manager log file, supervisor log file, and one data server logfile, all of which cover the same time-frame (from start to some point where you think things are working or not)? That way I can see what is happening. At the moment I only see two "bad" things in the config file:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager but you claim, via the all.manager directive, that there are three (bkp2 and bkp3). While it should work, the log file will be dense with error messages. Please correct this to be consistent and make it easier to see real errors.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This is not a problem for me, because this config is used on the dataservers. On the manager, I update the "if atlas-bkp1.cs.wisc.edu" to atlas-bkp2 and so on. This is a historical artifact: at first only atlas-bkp1 was used; atlas-bkp2 and atlas-bkp3 were added later.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 2) Please use cms.space not olb.space (for historical reasons the latter is still accepted and over-rides the former, but that will soon end), and please use only one (the config file uses both directives).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Yes, I should remove this line. In fact cms.space is in the cfg too.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The xrootd has an internal mechanism to connect servers with supervisors to allow for maximum reliability. You cannot change that algorithm and there is no need to do so. You should *never* tell anyone to directly connect to a supervisor. If you do, you will likely get unreachable nodes.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> As for dropping data servers, it would appear to me, given the flurry of such activity, that something either crashed or was restarted. That's why it would be good to see the complete log of each one of the entities.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I read the document and wrote a config file (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>>>>>>>>>>>>>>> With my conf, I can see the manager dispatching messages to the supervisor, but I cannot see any dataserver trying to connect to the supervisor. At the same time, in the manager's log, I can see some dataservers are Dropped.
>>>>>>>>>>>>>>>>>>>>> How does xrootd decide which dataservers will connect to the supervisor? Should I specify some dataservers to connect to the supervisor?
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> (*) supervisor log >>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 >>>>>>>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>>>>>>> state >>>>>>>>>>>>>>>>>>>>> dlen=42 >>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: >>>>>>>>>>>>>>>>>>>>> /atlas/xrootd/users/wguan/test/test131141 >>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 >>>>>>>>>>>>>>>>>>>>> do_StateFWD: >>>>>>>>>>>>>>>>>>>>> Path >>>>>>>>>>>>>>>>>>>>> find >>>>>>>>>>>>>>>>>>>>> failed for state >>>>>>>>>>>>>>>>>>>>> /atlas/xrootd/users/wguan/test/test131141 >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> (*)manager log >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>> TSpace=5587GB >>>>>>>>>>>>>>>>>>>>> NumFS=1 >>>>>>>>>>>>>>>>>>>>> FSpace=5693644MB MinFR=57218MB Util=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding >>>>>>>>>>>>>>>>>>>>> path: >>>>>>>>>>>>>>>>>>>>> w >>>>>>>>>>>>>>>>>>>>> /atlas >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>> do_Space: 5696231MB free; 0% util >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 logged in. >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from >>>>>>>>>>>>>>>>>>>>> [log in to unmask] >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running >>>>>>>>>>>>>>>>>>>>> ?:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol >>>>>>>>>>>>>>>>>>>>> cmsd >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: >>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>> 79 >>>>>>>>>>>>>>>>>>>>> attached >>>>>>>>>>>>>>>>>>>>> to poller 2; num=22 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add >>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> bumps >>>>>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 #63 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node: >>>>>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 dropped. >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved >>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 to cluster; >>>>>>>>>>>>>>>>>>>>> id=63.78; >>>>>>>>>>>>>>>>>>>>> num=64; >>>>>>>>>>>>>>>>>>>>> min=51 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>> TSpace=5587GB >>>>>>>>>>>>>>>>>>>>> NumFS=1 >>>>>>>>>>>>>>>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding >>>>>>>>>>>>>>>>>>>>> path: >>>>>>>>>>>>>>>>>>>>> w >>>>>>>>>>>>>>>>>>>>> /atlas >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>> do_Space: 5721854MB free; 0% util >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 logged in. 
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from >>>>>>>>>>>>>>>>>>>>> c187.chtc.wisc.edu; connection reset by peer >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>> 60 >>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 node 63.78 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>> 79 detached from poller 2; num=21 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>> c177.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>> FD=16 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 node 0.3 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>> 16 detached from poller 2; num=20 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>> c179.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>> FD=21 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>> server.17065:[log in to unmask]:1094 node 1.4 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>> server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>> 21 >>>>>>>>>>>>>>>>>>>>> detached from poller 1; num=21 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to >>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to >>>>>>>>>>>>>>>>>>>>> redirector.15656:14@atlas-bkp2 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>> c182.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>> FD=19 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 node 7.10 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>> 19 detached from poller 2; num=19 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>> c178.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>> FD=15 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 node 9.12 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>> 15 detached from poller 1; num=20 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>> c181.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>> FD=17 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 node 5.8 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>> 17 detached from poller 0; num=21 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>> c183.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>> FD=22 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 node 8.11 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>> 22 detached from poller 2; num=18 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>> c184.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>> FD=20 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>> server.4131:[log in to unmask]:1094 node 3.6 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>> 20 detached from poller 0; num=20 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>> c185.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>> FD=23 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 node 6.9 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>> server.8524:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>> 23 >>>>>>>>>>>>>>>>>>>>> detached from poller 0; num=19 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>> c180.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>> FD=18 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 node 4.7 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>> server.14636:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>> server.14636:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>> 18 detached from poller 1; num=19 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>> c186.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>> FD=24 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 node 2.5 >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>> server.7849:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] >>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>> 24 >>>>>>>>>>>>>>>>>>>>> detached from poller 1; num=18 >>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>> seconds 
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> To go past 64 data servers you will need to set up one or more supervisors. This does not logically change the current configuration you have. You only need to configure one or more *new* servers (or at least xrootd processes) whose role is supervisor. We'd like them to run on separate machines for reliability purposes, but they could run on the manager node as long as you give each one a unique instance name (i.e., -n option).
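>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> As a sketch, the role section of the config would look something like this (the hostnames and the host pattern are only placeholders for whichever machines you pick; the rest of your config stays as it is):
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> all.manager atlas-bkp1.cs.wisc.edu:3121
>>>>>>>>>>>>>>>>>>>>>> all.role manager if atlas-bkp1.cs.wisc.edu
>>>>>>>>>>>>>>>>>>>>>> all.role supervisor if higgs07.cs.wisc.edu
>>>>>>>>>>>>>>>>>>>>>> all.role server if c*.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> If a supervisor shares a machine with another instance, start its xrootd and cmsd with a unique instance name, e.g. "-n super".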
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The front part of the cmsd reference explains how to do this:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Is there anything I need to change to configure xrootd with more than 65 machines? I used the config below but it doesn't work. Should I configure some machines' role to be supervisor?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Wen