Hi Wen,

I reviewed the log files. The one thing I can't comment on is the odd redirect of c131 at 17:47:25, because its logs on the web site do not overlap with the manager or supervisor logs. Unless all the logs cover the full time in question, I can't say much of anything. Can you provide me with inclusive logs? The coverage I have now is:

atlas-bkp1: cms 17:20:57 to 17:42:19, xrd 17:20:57 to 17:40:57
higgs07:    cms & xrd 17:22:33 to 17:42:33
c131:       cms & xrd 17:31:57 to 17:47:28

That said, it certainly looks like things were working and files were being accessed and discovered on all the machines. You were even able to open /atlas/xrootd/users/wguan/test/test98123313, though not /atlas/xrootd/users/wguan/test/test123131.

The other issue is that you did not specify a stable adminpath, and the adminpath defaults to /tmp. If you have a "cleanup" script that runs periodically against /tmp, then eventually your cluster will go catatonic as important (but not often used) files are deleted by that script. Could you please find a stable home for the adminpath? (A sketch of one possible setting appears further down.)

I reran my tests here and things worked as expected. I will ramp up some more tests. So, what is your status today?

Andy

----- Original Message -----
From: "wen guan" <[log in to unmask]>
To: "Andrew Hanushevsky" <[log in to unmask]>
Cc: <[log in to unmask]>
Sent: Thursday, December 17, 2009 5:05 AM
Subject: Re: xrootd with more than 65 machines

Hi Andy,

Yes, I am using the file downloaded from http://www.slac.stanford.edu/~abh/cmsd/, which I compiled yesterday. I just now compiled it again and compared it with the one I compiled yesterday; they are the same (same md5sum).

Wen

On Thu, Dec 17, 2009 at 2:09 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
> Hi Wen,
>
> If c131 cannot connect then either c131 does not have the new cmsd or
> atlas-bkp1 does not have the new cmsd, as that is what would happen if
> either were true. Looking at the log on c131 it would appear that
> atlas-bkp1 is still using the old cmsd, as the response data length is
> wrong. Could you verify, please.
>
> Andy
>
> ----- Original Message -----
> From: "wen guan" <[log in to unmask]>
> To: "Andrew Hanushevsky" <[log in to unmask]>
> Cc: <[log in to unmask]>
> Sent: Wednesday, December 16, 2009 3:58 PM
> Subject: Re: xrootd with more than 65 machines
>
> Hi Andy,
>
> I tried it, but there are still some problems. I put the logs in
> higgs03.cs.wisc.edu/wguan/
>
> In my test, c131 is the 65th node to be added to the manager. I can copy
> a file into the pool through the manager, but I cannot copy out a file
> that is on c131.
>
> In c131's cms.log, I see "Manager: manager.0:[log in to unmask]
> removed; redirected" again and again, and I cannot see anything about
> c131 in higgs07's (supervisor) log. Does that mean the manager tries to
> redirect c131 to higgs07, but c131 never tries to connect to higgs07? It
> only tries to connect to the manager again.
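(For the adminpath point above: a minimal sketch of what a stable setting could look like, assuming the standard all.adminpath directive; the path here is only an example, not Wen's actual layout:

    # any stable directory outside /tmp that cleanup scripts leave alone
    all.adminpath /var/spool/xrootd

Because it is an all.* directive, the one line in the shared config applies to both the xrootd and the cmsd on every node.)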
>
> (*)
> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
> Last server error 10000 ('')
> Error accessing path/file for root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123311
> [xrootd] Total 0.06 MB |====================| 100.00 % [3.1 MB/s]
> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123312
> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/
> test123131
> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
> Last server error 3011 ('No servers are available to read the file.')
> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/test123131
> /atlas/xrootd/users/wguan/test/test123131
> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
> Last server error 3011 ('No servers are available to read the file.')
> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123313
> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
> Last server error 3011 ('No servers are available to read the file.')
> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
> Last server error 3011 ('No servers are available to read the file.')
> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
> Last server error 3011 ('No servers are available to read the file.')
> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
> [root@c131 ~]# tail -f /var/log/xrootd/cms.log
> 091216 17:45:52 3103 manager.0:[log in to unmask] XrdLink: Setting ref to 2+-1 post=0
> 091216 17:45:55 3103 Pander trying to connect to lvl 0 atlas-bkp1.cs.wisc.edu:3121
> 091216 17:45:55 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
> 091216 17:45:55 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
> 091216 17:45:55 3103 ManTree: Now connected to 3 root node(s)
> 091216 17:45:55 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
> 091216 17:45:55 3103 Dispatch manager.0:[log in to unmask] for try dlen=3
> 091216 17:45:55 3103 manager.0:[log in to unmask] do_Try:
> 091216 17:45:55 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.95
> 091216 17:45:55 3103 Manager: manager.0:[log in to unmask] removed; redirected
> 091216 17:46:04 3103 Pander trying to connect to lvl 0 atlas-bkp1.cs.wisc.edu:3121
> 091216 17:46:04 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
> 091216 17:46:04 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
> 091216 17:46:04 3103 ManTree: Now connected to 3 root node(s)
> 091216 17:46:04 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
> 091216 17:46:04 3103 Dispatch manager.0:[log in to unmask] for try dlen=3
> 091216 17:46:04 3103 Protocol: No buffers to serve atlas-bkp1.cs.wisc.edu
> 091216 17:46:04 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.96
> 091216 17:46:04 3103 Manager: manager.0:[log in to unmask] removed; insufficient buffers
> 091216 17:46:11 3103 Dispatch manager.0:[log in to unmask] for state dlen=169
> 091216 17:46:11 3103 manager.0:[log in to unmask] XrdLink: Setting ref to 1+1 post=0
>
> Thanks
> Wen
>
> On Thu, Dec 17, 2009 at 12:10 AM, wen guan <[log in to unmask]> wrote:
>>
>> Hi Andy,
>>
>>> OK, I understand. As for stalling, too many nodes were deemed to be in
>>> trouble for the manager to allow service resumption.
>>>
>>> Please make sure that all of the nodes in the cluster receive the new
>>> cmsd, as they will drop off with the old one and you'll see the same
>>> kind of activity. Perhaps the best way to know that you succeeded in
>>> putting everything in sync is to start with 63 data nodes plus one
>>> supervisor. Once all connections are established, adding an additional
>>> server should simply send it to the supervisor.
>>
>> I will do it.
>> You said to start 63 data servers and one supervisor. Does that mean the
>> supervisor is managed using the same policy? If there are 64 dataservers
>> connected before the supervisor, will the supervisor be dropped, or does
>> the supervisor have high priority to be added to the manager? I mean, if
>> there are already 64 dataservers and a supervisor comes in, will the
>> supervisor be accepted and a dataserver be redirected to the supervisor?
>>
>> Thanks
>> Wen
>>
>>> Hi Andrew,
>>>
>>> But when I tried to xrdcp a file to it, it doesn't respond. In
>>> atlas-bkp1-xrd.log.20091213, it always prints "stalling client for 10
>>> sec". But in cms.log, I can't find any message about the file.
>>>
>>>> I don't see why you say it doesn't work. With the debugging level set
>>>> so high, the noise may make it look like something is going wrong, but
>>>> that isn't necessarily the case.
>>>>
>>>> 1) The 'too many subscribers' is correct. The manager was simply
>>>> redirecting them because there were already 64 servers. However, in
>>>> your case the supervisor wasn't started until almost 30 minutes after
>>>> everyone else (i.e., 10:42 AM). Why was that? I'm not surprised about
>>>> the flurry of messages with a critical component missing for 30
>>>> minutes.
>>>
>>> Because the manager is a 64-bit machine but the supervisor is a 32-bit
>>> machine, I had to recompile it. At that time, I was interrupted by
>>> something else.
>>>
>>>> 2) Once the supervisor started, it started accepting the redirected
>>>> servers.
>>>>
>>>> 3) Then 10 seconds later (10:42:10) the supervisor was restarted. That
>>>> would cause a flurry of activity, as there is no backup supervisor to
>>>> take over.
>>>>
>>>> 4) This happened again at 10:42:34 AM and then again at 10:48:49. Is
>>>> the supervisor crashing? Is there a core file?
>>>>
>>>> 5) At 11:11 AM the manager restarted. Again, is there a core file here,
>>>> or was this a manual action?
>>>>
>>>> During the course of all of this, all connected nodes were operating
>>>> properly and files were being located.
>>>>
>>>> So, the two big questions are:
>>>>
>>>> a) Why was the supervisor not started until 30 minutes after the
>>>> system was started?
>>>>
>>>> b) Is there an explanation of the restarts?
>>>> If this was a crash then we need a core file to figure out what
>>>> happened.
>>>
>>> It's not a crash. There are some reasons that I restarted some daemons.
>>> (1) I thought that if a dataserver tried many times to connect to a
>>> redirector but failed, the dataserver would not try to connect to the
>>> redirector again. The supervisor was missing for a long time, so maybe
>>> some dataservers would no longer try to connect to atlas-bkp1. To
>>> reactivate these dataservers, I restarted the servers.
>>> (2) When I tried to xrdcp, it was hanging for a long time. I thought
>>> maybe the manager was affected by something else, so I restarted the
>>> manager to see whether a restart could make the xrdcp work.
>>>
>>> Thanks
>>> Wen
>>>
>>>> Andy
>>>>
>>>> ----- Original Message -----
>>>> From: "wen guan" <[log in to unmask]>
>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>> Cc: <[log in to unmask]>
>>>> Sent: Wednesday, December 16, 2009 9:38 AM
>>>> Subject: Re: xrootd with more than 65 machines
>>>>
>>>> Hi Andrew,
>>>>
>>>> It still doesn't work.
>>>> The log files are in higgs03.cs.wisc.edu/wguan/. The names are
>>>> *.20091216.
>>>> The manager complains that there are too many subscribers and then
>>>> removes nodes.
>>>>
>>>> (*)
>>>> Add server.10040:[log in to unmask] redirected; too many subscribers.
>>>>
>>>> Wen
>>>>
>>>> On Wed, Dec 16, 2009 at 4:25 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>
>>>>> Hi Wen,
>>>>>
>>>>> It will be easier for me to retrofit, as the changes were pretty
>>>>> minor. Please lift the new XrdCmsNode.cc file from
>>>>>
>>>>> http://www.slac.stanford.edu/~abh/cmsd
>>>>>
>>>>> Andy
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "wen guan" <[log in to unmask]>
>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>> Cc: <[log in to unmask]>
>>>>> Sent: Tuesday, December 15, 2009 5:12 PM
>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>
>>>>> Hi Andy,
>>>>>
>>>>> I can switch to 20091104-1102; then you don't need to patch another
>>>>> version. How can I download v20091104-1102?
>>>>>
>>>>> Thanks
>>>>> Wen
>>>>>
>>>>> On Wed, Dec 16, 2009 at 12:52 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>
>>>>>> Hi Wen,
>>>>>>
>>>>>> Ah yes, I see that now. The file I gave you is based on
>>>>>> v20091104-1102. Let me see if I can retrofit the patch for you.
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> ----- Original Message -----
>>>>>> From: "wen guan" <[log in to unmask]>
>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>> Cc: <[log in to unmask]>
>>>>>> Sent: Tuesday, December 15, 2009 1:04 PM
>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>
>>>>>> Hi Andy,
>>>>>>
>>>>>> Which xrootd version are you using? XrdCmsConfig.hh is different.
>>>>>> XrdCmsConfig.hh was downloaded from
>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/.
>>>>>>
>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsNode.cc
>>>>>> 6fb3ae40fe4e10bdd4d372818a341f2c  src/XrdCms/XrdCmsNode.cc
>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsConfig.hh
>>>>>> 7d57753847d9448186c718f98e963cbe  src/XrdCms/XrdCmsConfig.hh
>>>>>>
>>>>>> Thanks
>>>>>> Wen
>>>>>>
>>>>>> On Tue, Dec 15, 2009 at 10:50 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>
>>>>>>> Hi Wen,
>>>>>>>
>>>>>>> Just compiled on Linux and it was clean.
>>>>>>> Something is really wrong with your source files, specifically
>>>>>>> XrdCmsConfig.cc
>>>>>>>
>>>>>>> The MD5 checksums on the relevant files are:
>>>>>>>
>>>>>>> MD5 (XrdCmsNode.cc) = 6fb3ae40fe4e10bdd4d372818a341f2c
>>>>>>> MD5 (XrdCmsConfig.hh) = 4a7d655582a7cd43b098947d0676924b
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> From: "wen guan" <[log in to unmask]>
>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>> Cc: <[log in to unmask]>
>>>>>>> Sent: Tuesday, December 15, 2009 4:24 AM
>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>
>>>>>>> Hi Andy,
>>>>>>>
>>>>>>> No problem. Thanks for the fix. But it cannot be compiled. The
>>>>>>> version I am using is
>>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/.
>>>>>>>
>>>>>>> Making cms component...
>>>>>>> Compiling XrdCmsNode.cc
>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Chmod(XrdCmsRRData&)':
>>>>>>> XrdCmsNode.cc:268: error: `fsExec' was not declared in this scope
>>>>>>> XrdCmsNode.cc:268: warning: unused variable 'fsExec'
>>>>>>> XrdCmsNode.cc:269: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>> XrdCmsNode.cc:273: error: `fsFail' was not declared in this scope
>>>>>>> XrdCmsNode.cc:273: warning: unused variable 'fsFail'
>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mkdir(XrdCmsRRData&)':
>>>>>>> XrdCmsNode.cc:600: error: `fsExec' was not declared in this scope
>>>>>>> XrdCmsNode.cc:600: warning: unused variable 'fsExec'
>>>>>>> XrdCmsNode.cc:601: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>> XrdCmsNode.cc:605: error: `fsFail' was not declared in this scope
>>>>>>> XrdCmsNode.cc:605: warning: unused variable 'fsFail'
>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mkpath(XrdCmsRRData&)':
>>>>>>> XrdCmsNode.cc:640: error: `fsExec' was not declared in this scope
>>>>>>> XrdCmsNode.cc:640: warning: unused variable 'fsExec'
>>>>>>> XrdCmsNode.cc:641: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>> XrdCmsNode.cc:645: error: `fsFail' was not declared in this scope
>>>>>>> XrdCmsNode.cc:645: warning: unused variable 'fsFail'
>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mv(XrdCmsRRData&)':
>>>>>>> XrdCmsNode.cc:704: error: `fsExec' was not declared in this scope
>>>>>>> XrdCmsNode.cc:704: warning: unused variable 'fsExec'
>>>>>>> XrdCmsNode.cc:705: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>> XrdCmsNode.cc:709: error: `fsFail' was not declared in this scope
>>>>>>> XrdCmsNode.cc:709: warning: unused variable 'fsFail'
>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Rm(XrdCmsRRData&)':
>>>>>>> XrdCmsNode.cc:831: error: `fsExec' was not declared in this scope
>>>>>>> XrdCmsNode.cc:831: warning: unused variable 'fsExec'
>>>>>>> XrdCmsNode.cc:832: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>> XrdCmsNode.cc:836: error: `fsFail' was not declared in this scope
>>>>>>> XrdCmsNode.cc:836: warning: unused variable 'fsFail'
>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Rmdir(XrdCmsRRData&)':
>>>>>>> XrdCmsNode.cc:873: error: `fsExec' was not declared in this scope
>>>>>>> XrdCmsNode.cc:873: warning: unused variable 'fsExec'
>>>>>>> XrdCmsNode.cc:874: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>> XrdCmsNode.cc:878: error: `fsFail' was not declared in this scope
>>>>>>> XrdCmsNode.cc:878: warning: unused variable 'fsFail'
>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Trunc(XrdCmsRRData&)':
>>>>>>> XrdCmsNode.cc:1377: error: `fsExec' was not declared in this scope
>>>>>>> XrdCmsNode.cc:1377: warning: unused variable 'fsExec'
>>>>>>> XrdCmsNode.cc:1378: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>> XrdCmsNode.cc:1382: error: `fsFail' was not declared in this scope
>>>>>>> XrdCmsNode.cc:1382: warning: unused variable 'fsFail'
>>>>>>> XrdCmsNode.cc: At global scope:
>>>>>>> XrdCmsNode.cc:1524: error: no `int XrdCmsNode::fsExec(XrdOucProg*, char*, char*)' member function declared in class `XrdCmsNode'
>>>>>>> XrdCmsNode.cc: In member function `int XrdCmsNode::fsExec(XrdOucProg*, char*, char*)':
>>>>>>> XrdCmsNode.cc:1533: error: `fsL2PFail1' was not declared in this scope
>>>>>>> XrdCmsNode.cc:1533: warning: unused variable 'fsL2PFail1'
>>>>>>> XrdCmsNode.cc:1537: error: `fsL2PFail2' was not declared in this scope
>>>>>>> XrdCmsNode.cc:1537: warning: unused variable 'fsL2PFail2'
>>>>>>> XrdCmsNode.cc: At global scope:
>>>>>>> XrdCmsNode.cc:1553: error: no `const char* XrdCmsNode::fsFail(const char*, const char*, const char*, int)' member function declared in class `XrdCmsNode'
>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::fsFail(const char*, const char*, const char*, int)':
>>>>>>> XrdCmsNode.cc:1559: error: `fsL2PFail1' was not declared in this scope
>>>>>>> XrdCmsNode.cc:1559: warning: unused variable 'fsL2PFail1'
>>>>>>> XrdCmsNode.cc:1560: error: `fsL2PFail2' was not declared in this scope
>>>>>>> XrdCmsNode.cc:1560: warning: unused variable 'fsL2PFail2'
>>>>>>> XrdCmsNode.cc: In static member function `static int XrdCmsNode::isOnline(char*, int)':
>>>>>>> XrdCmsNode.cc:1608: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>> make[4]: *** [../../obj/XrdCmsNode.o] Error 1
>>>>>>> make[3]: *** [Linuxall] Error 2
>>>>>>> make[2]: *** [all] Error 2
>>>>>>> make[1]: *** [XrdCms] Error 2
>>>>>>> make: *** [all] Error 2
>>>>>>>
>>>>>>> Wen
>>>>>>>
>>>>>>> On Tue, Dec 15, 2009 at 2:08 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>
>>>>>>>> Hi Wen,
>>>>>>>>
>>>>>>>> I have developed a permanent fix. You will find the source files in
>>>>>>>>
>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>
>>>>>>>> There are three files: XrdCmsCluster.cc XrdCmsNode.cc XrdCmsProtocol.cc
>>>>>>>>
>>>>>>>> Please do a source replacement and recompile. Unfortunately, the
>>>>>>>> cmsd will need to be replaced on each node regardless of role. My
>>>>>>>> apologies for the disruption. Please let me know how it goes.
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>> From: "wen guan" <[log in to unmask]>
>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>> Sent: Sunday, December 13, 2009 7:04 AM
>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>
>>>>>>>> Hi Andrew,
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>> I used the new cmsd on the atlas-bkp1 manager, but it is still
>>>>>>>> dropping nodes, and in the supervisor's log I cannot find any
>>>>>>>> dataserver registering with it.
>>>>>>>>
>>>>>>>> The new logs are in http://higgs03.cs.wisc.edu/wguan/*.20091213.
>>>>>>>> The manager was patched at 091213 08:38:15.
>>>>>>>>
>>>>>>>> Wen
>>>>>>>>
>>>>>>>> On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>
>>>>>>>>> Hi Wen
>>>>>>>>>
>>>>>>>>> You will find the source replacement at:
>>>>>>>>>
>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>>
>>>>>>>>> It's XrdCmsCluster.cc and it replaces xrootd/src/XrdCms/XrdCmsCluster.cc
>>>>>>>>>
>>>>>>>>> I'm stepping out for a couple of hours but will be back to see how
>>>>>>>>> things went. Sorry for the issues :-(
>>>>>>>>>
>>>>>>>>> Andy
>>>>>>>>>
>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>
>>>>>>>>>> Hi Andrew,
>>>>>>>>>>
>>>>>>>>>> I prefer a source replacement. Then I can compile it.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Wen
>>>>>>>>>>>
>>>>>>>>>>> I can do one of two things here:
>>>>>>>>>>>
>>>>>>>>>>> 1) Supply a source replacement and then you would recompile, or
>>>>>>>>>>>
>>>>>>>>>>> 2) Give me the uname -a of where the cmsd will run and I'll
>>>>>>>>>>> supply a binary replacement for you.
>>>>>>>>>>>
>>>>>>>>>>> Your choice.
>>>>>>>>>>>
>>>>>>>>>>> Andy
>>>>>>>>>>>
>>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Andrew
>>>>>>>>>>>>
>>>>>>>>>>>> The problem is found. Great. Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>> Where can I find the patched cmsd?
>>>>>>>>>>>>
>>>>>>>>>>>> Wen
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I found the problem. Looks like a regression from way back
>>>>>>>>>>>>> when. There is a missing flag on the redirect. This will
>>>>>>>>>>>>> require a patched cmsd, but you need only replace the
>>>>>>>>>>>>> redirector's cmsd, as this only affects the redirector.
>>>>>>>>>>>>> How would you like to proceed?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It doesn't work. The atlas-bkp1 manager is still dropping
>>>>>>>>>>>>>> nodes, and in the supervisor I still haven't seen any
>>>>>>>>>>>>>> dataserver registered. I said "I updated the ntp" because you
>>>>>>>>>>>>>> said "the log timestamps do not overlap".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do you mean that everything is now working? It could be that
>>>>>>>>>>>>>>> you removed the xrd.timeout directive. That really could
>>>>>>>>>>>>>>> cause problems. As for the delays, that is normal when the
>>>>>>>>>>>>>>> redirector thinks something is going wrong. The strategy is
>>>>>>>>>>>>>>> to delay clients until it can get back to a stable
>>>>>>>>>>>>>>> configuration. This usually prevents jobs from crashing
>>>>>>>>>>>>>>> during stressful periods.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I restarted it to do a supervisor test, and also because the
>>>>>>>>>>>>>>>> xrootd manager frequently doesn't respond. (*) is from the
>>>>>>>>>>>>>>>> cms.log; the file select is delayed again and again. After a
>>>>>>>>>>>>>>>> restart, all things are fine. Now I am trying to find a clue
>>>>>>>>>>>>>>>> about it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (*)
>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There is no core file. I copied new copies of the logs to
>>>>>>>>>>>>>>>> the link below.
>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I see in the server log that it is restarting often. Could
>>>>>>>>>>>>>>>>> you take a look on c193 to see if you have any core files?
>>>>>>>>>>>>>>>>> Also please make sure that core files are enabled, as Linux
>>>>>>>>>>>>>>>>> defaults the size to 0.
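(On enabling core files: a sketch only, assuming the daemons are started from a shell or init script where the limit can be raised:

    # Linux defaults the core file size limit to 0, so raise it
    # before starting xrootd/cmsd
    ulimit -c unlimited

A crash should then leave a core file, typically in the daemon's working directory.)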
>>>>>>>>>>>>>>>>> The first step here is to find out why your servers are
>>>>>>>>>>>>>>>>> restarting.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The logs can be found here. From the logs you can see the
>>>>>>>>>>>>>>>>>> atlas-bkp1 manager dropping, again and again, the nodes
>>>>>>>>>>>>>>>>>> that try to connect to it.
>>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Wen, Could you start everything up and provide me a
>>>>>>>>>>>>>>>>>>> pointer to the manager log file, supervisor log file, and
>>>>>>>>>>>>>>>>>>> one data server log file, all of which cover the same
>>>>>>>>>>>>>>>>>>> time-frame (from start to some point where you think
>>>>>>>>>>>>>>>>>>> things are working or not)? That way I can see what is
>>>>>>>>>>>>>>>>>>> happening. At the moment I only see two "bad" things in
>>>>>>>>>>>>>>>>>>> the config file:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a
>>>>>>>>>>>>>>>>>>> manager, but you claim, via the all.manager directive,
>>>>>>>>>>>>>>>>>>> that there are three (bkp2 and bkp3). While it should
>>>>>>>>>>>>>>>>>>> work, the log file will be dense with error messages.
>>>>>>>>>>>>>>>>>>> Please correct this to be consistent and make it easier
>>>>>>>>>>>>>>>>>>> to see real errors.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This is not a problem for me, because this config is used
>>>>>>>>>>>>>>>>>> on the dataservers. On the manager, I update the "if
>>>>>>>>>>>>>>>>>> atlas-bkp1.cs.wisc.edu" to atlas-bkp2 and so on. This is a
>>>>>>>>>>>>>>>>>> historical artifact: at first only atlas-bkp1 was used;
>>>>>>>>>>>>>>>>>> atlas-bkp2 and atlas-bkp3 were added later.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2) Please use cms.space, not olb.space (for historical
>>>>>>>>>>>>>>>>>>> reasons the latter is still accepted and overrides the
>>>>>>>>>>>>>>>>>>> former, but that will soon end), and please use only one
>>>>>>>>>>>>>>>>>>> (the config file uses both directives).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Yes, I should remove this line. In fact cms.space is in
>>>>>>>>>>>>>>>>>> the cfg too.
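(For reference, keeping only the cms.space form might look like the line below; the watermark values are purely illustrative, not taken from Wen's cfg:

    # stop selecting a server for new files below 2% free space, resume at 4%
    cms.space min 2% 4%
)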
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The xrootd has an internal mechanism to connect servers
>>>>>>>>>>>>>>>>>>> with supervisors to allow for maximum reliability. You
>>>>>>>>>>>>>>>>>>> cannot change that algorithm and there is no need to do
>>>>>>>>>>>>>>>>>>> so. You should *never* tell anyone to directly connect to
>>>>>>>>>>>>>>>>>>> a supervisor. If you do, you will likely get unreachable
>>>>>>>>>>>>>>>>>>> nodes.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> As for dropping data servers, it would appear to me,
>>>>>>>>>>>>>>>>>>> given the flurry of such activity, that something either
>>>>>>>>>>>>>>>>>>> crashed or was restarted. That's why it would be good to
>>>>>>>>>>>>>>>>>>> see the complete log of each one of the entities.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I read the document and wrote a config file
>>>>>>>>>>>>>>>>>>>> (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>>>>>>>>>>>>>> Using my conf, I can see the manager dispatching
>>>>>>>>>>>>>>>>>>>> messages to the supervisor, but I cannot see any
>>>>>>>>>>>>>>>>>>>> dataserver trying to connect to the supervisor. At the
>>>>>>>>>>>>>>>>>>>> same time, in the manager's log, I can see some
>>>>>>>>>>>>>>>>>>>> dataservers being Dropped.
>>>>>>>>>>>>>>>>>>>> How does xrootd decide which dataservers will connect to
>>>>>>>>>>>>>>>>>>>> the supervisor? Should I specify some dataservers to
>>>>>>>>>>>>>>>>>>>> connect to the supervisor?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (*) supervisor log
>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find failed for state /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (*) manager log
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5693644MB MinFR=57218MB Util=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094 do_Space: 5696231MB free; 0% util
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.10585:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from >>>>>>>>>>>>>>>>>>>> [log in to unmask] >>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running >>>>>>>>>>>>>>>>>>>> ?:[log in to unmask] >>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol >>>>>>>>>>>>>>>>>>>> cmsd >>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: >>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>> 79 >>>>>>>>>>>>>>>>>>>> attached >>>>>>>>>>>>>>>>>>>> to poller 2; num=22 >>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add >>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>> bumps >>>>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 #63 >>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node: >>>>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 dropped. >>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved >>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 to cluster; >>>>>>>>>>>>>>>>>>>> id=63.78; >>>>>>>>>>>>>>>>>>>> num=64; >>>>>>>>>>>>>>>>>>>> min=51 >>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>> TSpace=5587GB >>>>>>>>>>>>>>>>>>>> NumFS=1 >>>>>>>>>>>>>>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding >>>>>>>>>>>>>>>>>>>> path: >>>>>>>>>>>>>>>>>>>> w >>>>>>>>>>>>>>>>>>>> /atlas >>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>> do_Space: 5721854MB free; 0% util >>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 logged in. >>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from >>>>>>>>>>>>>>>>>>>> c187.chtc.wisc.edu; connection reset by peer >>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>> 60 >>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node >>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 node 63.78 >>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>> 79 detached from poller 2; num=21 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>> c177.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>> FD=16 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 node 0.3 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>> 16 detached from poller 2; num=20 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>> c179.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>> FD=21 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>> server.17065:[log in to unmask]:1094 node 1.4 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>> server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>> 21 >>>>>>>>>>>>>>>>>>>> detached from poller 1; num=21 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to >>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to >>>>>>>>>>>>>>>>>>>> redirector.15656:14@atlas-bkp2 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>> c182.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>> FD=19 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 node 7.10 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>> 19 detached from poller 2; num=19 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>> c178.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>> FD=15 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 node 9.12 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>> 15 detached from poller 1; num=20 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>> c181.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>> FD=17 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 node 5.8 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>> 17 detached from poller 0; num=21 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>> c183.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>> FD=22 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 node 8.11 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>> 22 detached from poller 2; num=18 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>> c184.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>> FD=20 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>> server.4131:[log in to unmask]:1094 node 3.6 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>> 20 detached from poller 0; num=20 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>> c185.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>> FD=23 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 node 6.9 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>> server.8524:[log in to unmask] >>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] >>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>> 23 >>>>>>>>>>>>>>>>>>>> detached from poller 0; num=19 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>> c180.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>> FD=18 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 node 4.7 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>> server.14636:[log in to unmask] >>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>> server.14636:[log in to unmask] >>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>> 18 detached from poller 1; num=19 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>> c186.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>> FD=24 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 node 2.5 >>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>> server.7849:[log in to unmask] >>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] >>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>> 24 >>>>>>>>>>>>>>>>>>>> detached from poller 1; num=18 >>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled. >>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled. >>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled. >>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled. >>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled. >>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled. >>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled. >>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled. 
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> To go past 64 data servers you will need to set up one
>>>>>>>>>>>>>>>>>>>>> or more supervisors. This does not logically change the
>>>>>>>>>>>>>>>>>>>>> current configuration you have. You only need to
>>>>>>>>>>>>>>>>>>>>> configure one or more *new* servers (or at least xrootd
>>>>>>>>>>>>>>>>>>>>> processes) whose role is supervisor. We'd like them to
>>>>>>>>>>>>>>>>>>>>> run on separate machines for reliability purposes, but
>>>>>>>>>>>>>>>>>>>>> they could run on the manager node as long as you give
>>>>>>>>>>>>>>>>>>>>> each one a unique instance name (i.e., the -n option).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The front part of the cmsd reference explains how to do
>>>>>>>>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Is there any change needed to configure xrootd with
>>>>>>>>>>>>>>>>>>>>>> more than 65 machines? I used the configuration below
>>>>>>>>>>>>>>>>>>>>>> but it doesn't work. Should I configure some machines
>>>>>>>>>>>>>>>>>>>>>> to be supervisors?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Wen
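(To make the supervisor setup concrete: a minimal sketch of the role selection Andy describes, reusing hostnames from this thread; it is an illustration only, not Wen's actual xrdcluster.cfg:

    all.manager atlas-bkp1.cs.wisc.edu:3121

    if atlas-bkp1.cs.wisc.edu
       all.role manager
    else if higgs07.cs.wisc.edu
       all.role supervisor
    else
       all.role server
    fi

A supervisor can also share the manager's machine if each daemon gets its own instance name, along the lines of "cmsd -n super -c /path/to/xrdcluster.cfg", where the instance name "super" and the config path are just examples.)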