Hi Wen, I see that you are getting error 10000, which means "generic error before any interaction". Could you please run the same command with debug level 3 and post the log showing the same kind of issue? Something like xrdcp -d 3 .... Most likely this time the problem is different. I may be wrong here, but a possible reason for that error is that the servers require authentication and xrdcp does not find some library in the LD_LIBRARY_PATH.

Fabrizio
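A full debug invocation of the kind Fabrizio suggests might look like the following sketch, reusing the host and file name from the failing transfer quoted below (substitute your own path):

    xrdcp -d 3 root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/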
wen guan wrote:
> Hi Andy,
>
> I put new logs on the web.
>
> It still doesn't work. I cannot copy files in or out.
>
> It seems the xrootd daemon at atlas-bkp1 hasn't talked to the cmsd.
> Normally, if the xrootd daemon tries to copy a file, in the cms.log I
> should see "do_Select: filename". But in this cms.log, there is
> nothing from atlas-bkp1.
>
> (*)
> [root@atlas-bkp1 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
> Last server error 10000 ('')
> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
> [root@atlas-bkp1 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123133
>
> Wen
>
> On Thu, Dec 17, 2009 at 10:54 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>> Hi Wen,
>>
>> I reviewed the log file. The odd redirect of c131 at 17:47:25 I can't
>> comment on, because its logs on the web site do not overlap with the
>> manager or supervisor. Unless all the logs include the full time in
>> question, I can't say much of anything. Can you provide me with
>> inclusive logs?
>>
>> atlas-bkp1 cms: 17:20:57 to 17:42:19 xrd: 17:20:57 to 17:40:57
>> higgs07 cms & xrd 17:22:33 to 17:42:33
>> c131 cms & xrd 17:31:57 to 17:47:28
>>
>> That said, it certainly looks like things were working and files were
>> being accessed and discovered on all the machines. You were even able to open
>> /atlas/xrootd/users/wguan/test/test98123313 though not
>> /atlas/xrootd/users/wguan/test/test123131. The other issue is that you did not
>> specify a stable adminpath, and the adminpath defaults to /tmp. If you have a
>> "cleanup" script that runs periodically for /tmp then eventually your
>> cluster will go catatonic as important (but not often used) files are deleted
>> by that script. Could you please find a stable home for the adminpath?
>>
>> I reran my tests here and things worked as expected. I will ramp up some
>> more tests. So, what is your status today?
>>
>> Andy
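The stable adminpath Andy asks for above is a single directive in the config file; a minimal sketch, assuming the daemons can write to /var/spool/xrootd (the path is illustrative, not from the thread):

    all.adminpath /var/spool/xrootd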
>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>> To: "Andrew Hanushevsky" <[log in to unmask]>
>> Cc: <[log in to unmask]>
>> Sent: Thursday, December 17, 2009 5:05 AM
>> Subject: Re: xrootd with more than 65 machines
>>
>> Hi Andy,
>>
>> Yes. I am using the file downloaded from
>> http://www.slac.stanford.edu/~abh/cmsd/ which I compiled yesterday. I
>> just now compiled it again and compared it with the one I compiled
>> yesterday. They are the same (same md5sum).
>>
>> Wen
>>
>> On Thu, Dec 17, 2009 at 2:09 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>> Hi Wen,
>>>
>>> If c131 cannot connect, then either c131 does not have the new cmsd or
>>> atlas-bkp1 does not have the new cmsd, as that is what would happen if
>>> either were true. Looking at the log on c131, it would appear that
>>> atlas-bkp1 is still using the old cmsd, as the response data length is
>>> wrong. Could you verify, please?
>>>
>>> Andy
>>>
>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>> Cc: <[log in to unmask]>
>>> Sent: Wednesday, December 16, 2009 3:58 PM
>>> Subject: Re: xrootd with more than 65 machines
>>>
>>> Hi Andy,
>>>
>>> I tried it, but there are still some problems. I put the logs in
>>> higgs03.cs.wisc.edu/wguan/
>>>
>>> In my test, c131 is the 65th node to be added to the manager. I can
>>> copy files to the pool through the manager, but I cannot copy out a
>>> file that is on c131.
>>>
>>> In c131's cms.log, I see "Manager: manager.0:[log in to unmask]
>>> removed; redirected" again and again, and I cannot see anything about
>>> c131 in higgs07's (supervisor) log. Does it mean the manager tries to
>>> redirect c131 to higgs07, but c131 doesn't try to connect to higgs07?
>>> It only tries to connect to the manager again.
>>>
>>> (*)
>>> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
>>> Last server error 10000 ('')
>>> Error accessing path/file for root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
>>> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123311
>>> [xrootd] Total 0.06 MB |====================| 100.00 % [3.1 MB/s]
>>> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123312
>>> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
>>> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/
>>> test123131
>>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>> Last server error 3011 ('No servers are available to read the file.')
>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/test123131
>>> /atlas/xrootd/users/wguan/test/test123131
>>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>> Last server error 3011 ('No servers are available to read the file.')
>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123313
>>> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
>>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>> Last server error 3011 ('No servers are available to read the file.')
>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>> Last server error 3011 ('No servers are available to read the file.')
>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>> Last server error 3011 ('No servers are available to read the file.')
>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>> [root@c131 ~]# tail -f /var/log/xrootd/cms.log
>>> 091216 17:45:52 3103 manager.0:[log in to unmask] XrdLink: Setting ref to 2+-1 post=0
>>> 091216 17:45:55 3103 Pander trying to connect to lvl 0 atlas-bkp1.cs.wisc.edu:3121
>>> 091216 17:45:55 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
>>> 091216 17:45:55 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
>>> 091216 17:45:55 3103 ManTree: Now connected to 3 root node(s)
>>> 091216 17:45:55 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
>>> 091216 17:45:55 3103 Dispatch manager.0:[log in to unmask] for try dlen=3
>>> 091216 17:45:55 3103 manager.0:[log in to unmask] do_Try:
>>> 091216 17:45:55 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.95
>>> 091216 17:45:55 3103 Manager: manager.0:[log in to unmask] removed; redirected
>>> 091216 17:46:04 3103 Pander trying to connect to lvl 0 atlas-bkp1.cs.wisc.edu:3121
>>> 091216 17:46:04 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
>>> 091216 17:46:04 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
>>> 091216 17:46:04 3103 ManTree: Now connected to 3 root node(s)
>>> 091216 17:46:04 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
>>> 091216 17:46:04 3103 Dispatch manager.0:[log in to unmask] for try dlen=3
>>> 091216 17:46:04 3103 Protocol: No buffers to serve atlas-bkp1.cs.wisc.edu
>>> 091216 17:46:04 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.96
>>> 091216 17:46:04 3103 Manager: manager.0:[log in to unmask] removed; insufficient buffers
>>> 091216 17:46:11 3103 Dispatch manager.0:[log in to unmask] for state dlen=169
>>> 091216 17:46:11 3103 manager.0:[log in to unmask] XrdLink: Setting ref to 1+1 post=0
>>>
>>> Thanks
>>> Wen
>>>
>>> On Thu, Dec 17, 2009 at 12:10 AM, wen guan <[log in to unmask]> wrote:
>>>> Hi Andy,
>>>>
>>>>> OK, I understand. As for stalling, too many nodes were deemed to be in
>>>>> trouble for the manager to allow service resumption.
>>>>>
>>>>> Please make sure that all of the nodes in the cluster receive the new
>>>>> cmsd, as they will drop off with the old one and you'll see the same
>>>>> kind of activity. Perhaps the best way to know that you succeeded in
>>>>> putting everything in sync is to start with 63 data nodes plus one
>>>>> supervisor. Once all connections are established, adding an additional
>>>>> server should simply send it to the supervisor.
>>>> I will do it.
>>>> You said to start 63 data servers and one supervisor. Does it mean the
>>>> supervisor is managed using the same policy? If there are 64 data
>>>> servers connected before the supervisor, will the supervisor be
>>>> dropped? Does the supervisor have high priority to be added to the
>>>> manager? I mean, if there are already 64 data servers and a supervisor
>>>> comes in, will the supervisor be accepted and a data server redirected
>>>> to the supervisor?
>>>>
>>>> Thanks
>>>> Wen
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>> But when I tried to xrdcp a file to it, it doesn't respond. In
>>>>> atlas-bkp1-xrd.log.20091213, it always prints "stalling client for 10
>>>>> sec". But in the cms.log, I can't find any message about the file.
>>>>>
>>>>>> I don't see why you say it doesn't work. With the debugging level set
>>>>>> so high, the noise may make it look like something is going wrong, but
>>>>>> that isn't necessarily the case.
>>>>>>
>>>>>> 1) The 'too many subscribers' message is correct. The manager was simply
>>>>>> redirecting them because there were already 64 servers. However, in your
>>>>>> case the supervisor wasn't started until almost 30 minutes after everyone
>>>>>> else (i.e., 10:42 AM). Why was that?
>>>>>> I'm not surprised about the flurry of messages with a critical component
>>>>>> missing for 30 minutes.
>>>>> Because the manager is a 64-bit machine but the supervisor is a 32-bit
>>>>> machine, I had to recompile it. At that time, I was interrupted by
>>>>> something else.
>>>>>
>>>>>> 2) Once the supervisor started, it started accepting the redirected
>>>>>> servers.
>>>>>>
>>>>>> 3) Then, 10 seconds later (10:42:10), the supervisor was restarted. So,
>>>>>> that would cause a flurry of activity to occur, as there is no backup
>>>>>> supervisor to take over.
>>>>>>
>>>>>> 4) This happened again at 10:42:34 AM and then again at 10:48:49. Is the
>>>>>> supervisor crashing? Is there a core file?
>>>>>>
>>>>>> 5) At 11:11 AM the manager restarted. Again, is there a core file here,
>>>>>> or was this a manual action?
>>>>>>
>>>>>> During the course of all of this, all connected nodes were operating
>>>>>> properly and files were being located.
>>>>>>
>>>>>> So, the two big questions are:
>>>>>>
>>>>>> a) Why was the supervisor not started until 30 minutes after the system
>>>>>> was started?
>>>>>>
>>>>>> b) Is there an explanation of the restarts? If this was a crash then we
>>>>>> need a core file to figure out what happened.
>>>>> It's not a crash. There are some reasons why I restarted some daemons.
>>>>> (1) I thought that if a data server tried many times to connect to a
>>>>> redirector but failed, it would not try to connect to the redirector
>>>>> again. The supervisor was missing for a long time, so maybe some data
>>>>> servers would not try to connect to atlas-bkp1 again. To reactivate
>>>>> these data servers, I restarted them.
>>>>> (2) When I tried to xrdcp, it was hanging for a long time. I thought
>>>>> maybe the manager was affected by something else, so I restarted the
>>>>> manager to see whether a restart could make the xrdcp work.
>>>>>
>>>>> Thanks
>>>>> Wen
>>>>>
>>>>>> Andy
>>>>>>
>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>> Cc: <[log in to unmask]>
>>>>>> Sent: Wednesday, December 16, 2009 9:38 AM
>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>> It still doesn't work.
>>>>>> The log files are in higgs03.cs.wisc.edu/wguan/. The names are *.20091216.
>>>>>> The manager complains that there are too many subscribers and then
>>>>>> removes nodes.
>>>>>>
>>>>>> (*)
>>>>>> Add server.10040:[log in to unmask] redirected; too many subscribers.
>>>>>>
>>>>>> Wen
>>>>>>
>>>>>> On Wed, Dec 16, 2009 at 4:25 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>> Hi Wen,
>>>>>>>
>>>>>>> It will be easier for me to retrofit, as the changes were pretty minor.
>>>>>>> Please lift the new XrdCmsNode.cc file from
>>>>>>>
>>>>>>> http://www.slac.stanford.edu/~abh/cmsd
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>> Cc: <[log in to unmask]>
>>>>>>> Sent: Tuesday, December 15, 2009 5:12 PM
>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>
>>>>>>> Hi Andy,
>>>>>>>
>>>>>>> I can switch to 20091104-1102. Then you don't need to patch
>>>>>>> another version. How can I download v20091104-1102?
>>>>>>> >>>>>>> Thanks >>>>>>> Wen >>>>>>> >>>>>>> On Wed, Dec 16, 2009 at 12:52 AM, Andrew Hanushevsky >>>>>>> <[log in to unmask]> >>>>>>> wrote: >>>>>>>> Hi Wen, >>>>>>>> >>>>>>>> Ah yes, I see that now. The file I gave you is based on >>>>>>>> v20091104-1102. >>>>>>>> Let >>>>>>>> me see if I can retrofit the patch for you. >>>>>>>> >>>>>>>> Andy >>>>>>>> >>>>>>>> ----- Original Message ----- From: "wen guan" >>>>>>>> <[log in to unmask]> >>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]> >>>>>>>> Cc: <[log in to unmask]> >>>>>>>> Sent: Tuesday, December 15, 2009 1:04 PM >>>>>>>> Subject: Re: xrootd with more than 65 machines >>>>>>>> >>>>>>>> >>>>>>>> Hi Andy, >>>>>>>> >>>>>>>> Which xrootd version are you using? XrdCmsConfig.hh is different. >>>>>>>> XrdCmsConfig.hh is downloaded from >>>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/. >>>>>>>> >>>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsNode.cc >>>>>>>> 6fb3ae40fe4e10bdd4d372818a341f2c src/XrdCms/XrdCmsNode.cc >>>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsConfig.hh >>>>>>>> 7d57753847d9448186c718f98e963cbe src/XrdCms/XrdCmsConfig.hh >>>>>>>> >>>>>>>> Thanks >>>>>>>> Wen >>>>>>>> >>>>>>>> On Tue, Dec 15, 2009 at 10:50 PM, Andrew Hanushevsky >>>>>>>> <[log in to unmask]> >>>>>>>> wrote: >>>>>>>>> Hi Wen, >>>>>>>>> >>>>>>>>> Just compiled on Linux and it was clean. Something is really wrong >>>>>>>>> with >>>>>>>>> your >>>>>>>>> source files, specifically XrdCmsConfig.cc >>>>>>>>> >>>>>>>>> The MD5 checksums on the relevant files are: >>>>>>>>> >>>>>>>>> MD5 (XrdCmsNode.cc) = 6fb3ae40fe4e10bdd4d372818a341f2c >>>>>>>>> >>>>>>>>> MD5 (XrdCmsConfig.hh) = 4a7d655582a7cd43b098947d0676924b >>>>>>>>> >>>>>>>>> Andy >>>>>>>>> >>>>>>>>> ----- Original Message ----- From: "wen guan" >>>>>>>>> <[log in to unmask]> >>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]> >>>>>>>>> Cc: <[log in to unmask]> >>>>>>>>> Sent: Tuesday, December 15, 2009 4:24 AM >>>>>>>>> Subject: Re: xrootd with more than 65 machines >>>>>>>>> >>>>>>>>> >>>>>>>>> Hi Andy, >>>>>>>>> >>>>>>>>> No problem. Thanks for the fix. But it cannot be compiled. The >>>>>>>>> version I am using is >>>>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/. >>>>>>>>> >>>>>>>>> Making cms component... 
>>>>>>>>> Compiling XrdCmsNode.cc >>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>> XrdCmsNode::do_Chmod(XrdCmsRRData&)': >>>>>>>>> XrdCmsNode.cc:268: error: `fsExec' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:268: warning: unused variable 'fsExec' >>>>>>>>> XrdCmsNode.cc:269: error: 'class XrdCmsConfig' has no member named >>>>>>>>> 'ossFS' >>>>>>>>> XrdCmsNode.cc:273: error: `fsFail' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:273: warning: unused variable 'fsFail' >>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>> XrdCmsNode::do_Mkdir(XrdCmsRRData&)': >>>>>>>>> XrdCmsNode.cc:600: error: `fsExec' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:600: warning: unused variable 'fsExec' >>>>>>>>> XrdCmsNode.cc:601: error: 'class XrdCmsConfig' has no member named >>>>>>>>> 'ossFS' >>>>>>>>> XrdCmsNode.cc:605: error: `fsFail' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:605: warning: unused variable 'fsFail' >>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>> XrdCmsNode::do_Mkpath(XrdCmsRRData&)': >>>>>>>>> XrdCmsNode.cc:640: error: `fsExec' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:640: warning: unused variable 'fsExec' >>>>>>>>> XrdCmsNode.cc:641: error: 'class XrdCmsConfig' has no member named >>>>>>>>> 'ossFS' >>>>>>>>> XrdCmsNode.cc:645: error: `fsFail' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:645: warning: unused variable 'fsFail' >>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>> XrdCmsNode::do_Mv(XrdCmsRRData&)': >>>>>>>>> XrdCmsNode.cc:704: error: `fsExec' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:704: warning: unused variable 'fsExec' >>>>>>>>> XrdCmsNode.cc:705: error: 'class XrdCmsConfig' has no member named >>>>>>>>> 'ossFS' >>>>>>>>> XrdCmsNode.cc:709: error: `fsFail' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:709: warning: unused variable 'fsFail' >>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>> XrdCmsNode::do_Rm(XrdCmsRRData&)': >>>>>>>>> XrdCmsNode.cc:831: error: `fsExec' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:831: warning: unused variable 'fsExec' >>>>>>>>> XrdCmsNode.cc:832: error: 'class XrdCmsConfig' has no member named >>>>>>>>> 'ossFS' >>>>>>>>> XrdCmsNode.cc:836: error: `fsFail' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:836: warning: unused variable 'fsFail' >>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>> XrdCmsNode::do_Rmdir(XrdCmsRRData&)': >>>>>>>>> XrdCmsNode.cc:873: error: `fsExec' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:873: warning: unused variable 'fsExec' >>>>>>>>> XrdCmsNode.cc:874: error: 'class XrdCmsConfig' has no member named >>>>>>>>> 'ossFS' >>>>>>>>> XrdCmsNode.cc:878: error: `fsFail' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:878: warning: unused variable 'fsFail' >>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>> XrdCmsNode::do_Trunc(XrdCmsRRData&)': >>>>>>>>> XrdCmsNode.cc:1377: error: `fsExec' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:1377: warning: unused variable 'fsExec' >>>>>>>>> XrdCmsNode.cc:1378: error: 'class XrdCmsConfig' has no member named >>>>>>>>> 'ossFS' >>>>>>>>> XrdCmsNode.cc:1382: error: `fsFail' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:1382: warning: unused variable 'fsFail' >>>>>>>>> XrdCmsNode.cc: At global scope: >>>>>>>>> XrdCmsNode.cc:1524: error: no `int XrdCmsNode::fsExec(XrdOucProg*, >>>>>>>>> char*, char*)' member function declared in 
class `XrdCmsNode'
>>>>>>>>> XrdCmsNode.cc: In member function `int XrdCmsNode::fsExec(XrdOucProg*, char*, char*)':
>>>>>>>>> XrdCmsNode.cc:1533: error: `fsL2PFail1' was not declared in this scope
>>>>>>>>> XrdCmsNode.cc:1533: warning: unused variable 'fsL2PFail1'
>>>>>>>>> XrdCmsNode.cc:1537: error: `fsL2PFail2' was not declared in this scope
>>>>>>>>> XrdCmsNode.cc:1537: warning: unused variable 'fsL2PFail2'
>>>>>>>>> XrdCmsNode.cc: At global scope:
>>>>>>>>> XrdCmsNode.cc:1553: error: no `const char* XrdCmsNode::fsFail(const char*, const char*, const char*, int)' member function declared in class `XrdCmsNode'
>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::fsFail(const char*, const char*, const char*, int)':
>>>>>>>>> XrdCmsNode.cc:1559: error: `fsL2PFail1' was not declared in this scope
>>>>>>>>> XrdCmsNode.cc:1559: warning: unused variable 'fsL2PFail1'
>>>>>>>>> XrdCmsNode.cc:1560: error: `fsL2PFail2' was not declared in this scope
>>>>>>>>> XrdCmsNode.cc:1560: warning: unused variable 'fsL2PFail2'
>>>>>>>>> XrdCmsNode.cc: In static member function `static int XrdCmsNode::isOnline(char*, int)':
>>>>>>>>> XrdCmsNode.cc:1608: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>> make[4]: *** [../../obj/XrdCmsNode.o] Error 1
>>>>>>>>> make[3]: *** [Linuxall] Error 2
>>>>>>>>> make[2]: *** [all] Error 2
>>>>>>>>> make[1]: *** [XrdCms] Error 2
>>>>>>>>> make: *** [all] Error 2
>>>>>>>>>
>>>>>>>>> Wen
>>>>>>>>>
>>>>>>>>> On Tue, Dec 15, 2009 at 2:08 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>> Hi Wen,
>>>>>>>>>>
>>>>>>>>>> I have developed a permanent fix. You will find the source files in
>>>>>>>>>>
>>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>>>
>>>>>>>>>> There are three files: XrdCmsCluster.cc XrdCmsNode.cc XrdCmsProtocol.cc
>>>>>>>>>>
>>>>>>>>>> Please do a source replacement and recompile. Unfortunately, the cmsd
>>>>>>>>>> will need to be replaced on each node regardless of role. My apologies
>>>>>>>>>> for the disruption. Please let me know how it goes.
>>>>>>>>>>
>>>>>>>>>> Andy
>>>>>>>>>>
>>>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>> Sent: Sunday, December 13, 2009 7:04 AM
>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>
>>>>>>>>>> Hi Andrew,
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>> I used the new cmsd at the atlas-bkp1 manager, but it's still dropping
>>>>>>>>>> nodes, and in the supervisor's log I cannot see any data server
>>>>>>>>>> registering to it.
>>>>>>>>>>
>>>>>>>>>> The new logs are in http://higgs03.cs.wisc.edu/wguan/*.20091213.
>>>>>>>>>> The manager was patched at 091213 08:38:15.
>>>>>>>>>>
>>>>>>>>>> Wen
>>>>>>>>>>
>>>>>>>>>> On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>> Hi Wen
>>>>>>>>>>>
>>>>>>>>>>> You will find the source replacement at:
>>>>>>>>>>>
>>>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>>>>
>>>>>>>>>>> It's XrdCmsCluster.cc and it replaces xrootd/src/XrdCms/XrdCmsCluster.cc
>>>>>>>>>>>
>>>>>>>>>>> I'm stepping out for a couple of hours but will be back to see how
>>>>>>>>>>> things went. Sorry for the issues :-(
>>>>>>>>>>>
>>>>>>>>>>> Andy
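The replace-and-rebuild step Andy describes above is mechanical; a minimal sketch, assuming the source tree sits in ~/xrootd and builds with the classic make shown in the compile output (paths and the exact build invocation may differ on your system):

    cd ~/xrootd/src/XrdCms
    wget http://www.slac.stanford.edu/~abh/cmsd/XrdCmsCluster.cc   # for the later fix, also XrdCmsNode.cc and XrdCmsProtocol.cc
    cd ~/xrootd
    make
    # then install the rebuilt cmsd on every node (manager, supervisor, servers) and restart the daemons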
>>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>
>>>>>>>>>>>> I prefer a source replacement. Then I can compile it.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Wen
>>>>>>>>>>>>> I can do one of two things here:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) Supply a source replacement and then you would recompile, or
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2) Give me the uname -a of where the cmsd will run and I'll supply a
>>>>>>>>>>>>> binary replacement for you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Your choice.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Andrew
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The problem is found. Great. Thanks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Where can I find the patched cmsd?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I found the problem. Looks like a regression from way back when. There
>>>>>>>>>>>>>>> is a missing flag on the redirect. This will require a patched cmsd, but
>>>>>>>>>>>>>>> you need only to replace the redirector's cmsd, as this only affects the
>>>>>>>>>>>>>>> redirector. How would you like to proceed?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It doesn't work. The atlas-bkp1 manager is still dropping nodes.
>>>>>>>>>>>>>>>> In the supervisor, I still haven't seen any data server registered. I
>>>>>>>>>>>>>>>> said "I updated the ntp" because you said "the log timestamps do not
>>>>>>>>>>>>>>>> overlap".
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Do you mean that everything is now working? It could be that you removed
>>>>>>>>>>>>>>>>> the xrd.timeout directive. That really could cause problems. As for the
>>>>>>>>>>>>>>>>> delays, that is normal when the redirector thinks something is going
>>>>>>>>>>>>>>>>> wrong. The strategy is to delay clients until it can get back to a stable
>>>>>>>>>>>>>>>>> configuration. This usually prevents jobs from crashing during stressful
>>>>>>>>>>>>>>>>> periods.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I restarted it to do the supervisor test, and also because the xrootd
>>>>>>>>>>>>>>>>>> manager frequently doesn't respond. (*) is the cms.log; the file select
>>>>>>>>>>>>>>>>>> is delayed again and again. After a restart, all things are fine.
>>>>>>>>>>>>>>>>>> Now I am trying to find a clue about it.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> (*)
>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> There is no core file. I copied new copies of the logs to the link below.
>>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I see in the server log that it is restarting often. Could you take a
>>>>>>>>>>>>>>>>>>> look on c193 to see if you have any core files? Also, please make sure
>>>>>>>>>>>>>>>>>>> that core files are enabled, as Linux defaults the size to 0. The first
>>>>>>>>>>>>>>>>>>> step here is to find out why your servers are restarting.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Andy
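Enabling the core files Andy asks about above is a one-line shell setting; it must be in effect in the shell that starts the daemons:

    ulimit -c             # show the current core file size limit (0 means no core dumps)
    ulimit -c unlimited   # allow core files of any size for processes started from this shell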
>>>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The logs can be found here. From the log you can see that the atlas-bkp1
>>>>>>>>>>>>>>>>>>>> manager is dropping, again and again, the nodes that try to connect to it.
>>>>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>> Hi Wen, Could you start everything up and provide me with a pointer to
>>>>>>>>>>>>>>>>>>>>> the manager log file, supervisor log file, and one data server log file,
>>>>>>>>>>>>>>>>>>>>> all of which cover the same time-frame (from start to some point where
>>>>>>>>>>>>>>>>>>>>> you think things are working or not)? That way I can see what is
>>>>>>>>>>>>>>>>>>>>> happening. At the moment I only see two "bad" things in the config file:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager, but you claim,
>>>>>>>>>>>>>>>>>>>>> via the all.manager directive, that there are three (bkp2 and bkp3). While
>>>>>>>>>>>>>>>>>>>>> it should work, the log file will be dense with error messages. Please
>>>>>>>>>>>>>>>>>>>>> correct this to be consistent and make it easier to see real errors.
>>>>>>>>>>>>>>>>>>>> This is not a problem for me, because this config is used on the data
>>>>>>>>>>>>>>>>>>>> servers. On the manager, I updated the "if atlas-bkp1.cs.wisc.edu" to
>>>>>>>>>>>>>>>>>>>> atlas-bkp2 and so on. This is a historical problem: at first only
>>>>>>>>>>>>>>>>>>>> atlas-bkp1 was used; atlas-bkp2 and atlas-bkp3 were added later.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 2) Please use cms.space, not olb.space (for historical reasons the latter
>>>>>>>>>>>>>>>>>>>>> is still accepted and overrides the former, but that will soon end), and
>>>>>>>>>>>>>>>>>>>>> please use only one (the config file uses both directives).
>>>>>>>>>>>>>>>>>>>> Yes, I should remove this line; in fact cms.space is in the cfg too.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>> Wen
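A consistent manager designation for Andy's point 1) might look like the following sketch in the shared config file (host names and port are from this thread; the exact if-clause form should be checked against the cmsd reference):

    all.manager atlas-bkp1.cs.wisc.edu:3121
    all.manager atlas-bkp2.cs.wisc.edu:3121
    all.manager atlas-bkp3.cs.wisc.edu:3121
    all.role manager if atlas-bkp1.cs.wisc.edu
    all.role manager if atlas-bkp2.cs.wisc.edu
    all.role manager if atlas-bkp3.cs.wisc.edu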
>>>>>>>>>>>>>>>>>>>>> The xrootd has an internal mechanism to connect servers with supervisors
>>>>>>>>>>>>>>>>>>>>> to allow for maximum reliability. You cannot change that algorithm, and
>>>>>>>>>>>>>>>>>>>>> there is no need to do so. You should *never* tell anyone to directly
>>>>>>>>>>>>>>>>>>>>> connect to a supervisor. If you do, you will likely get unreachable nodes.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> As for dropping data servers, it would appear to me, given the flurry of
>>>>>>>>>>>>>>>>>>>>> such activity, that something either crashed or was restarted. That's why
>>>>>>>>>>>>>>>>>>>>> it would be good to see the complete log of each one of the entities.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I read the document and wrote a config file
>>>>>>>>>>>>>>>>>>>>>> (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>>>>>>>>>>>>>>>> Using my conf, I can see the manager dispatching messages to the
>>>>>>>>>>>>>>>>>>>>>> supervisor, but I cannot see any data server trying to connect to the
>>>>>>>>>>>>>>>>>>>>>> supervisor. At the same time, in the manager's log, I can see that some
>>>>>>>>>>>>>>>>>>>>>> data servers are dropped.
>>>>>>>>>>>>>>>>>>>>>> How does xrootd decide which data servers will connect to the supervisor?
>>>>>>>>>>>>>>>>>>>>>> Should I specify some data servers to connect to the supervisor?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> (*) supervisor log
>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find failed for state /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> (*) manager log
>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5693644MB MinFR=57218MB Util=0
>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094 do_Space: 5696231MB free; 0% util
>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.10585:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from >>>>>>>>>>>>>>>>>>>>>> [log in to unmask] >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running >>>>>>>>>>>>>>>>>>>>>> ?:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol >>>>>>>>>>>>>>>>>>>>>> cmsd >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 79 >>>>>>>>>>>>>>>>>>>>>> attached >>>>>>>>>>>>>>>>>>>>>> to poller 2; num=22 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add >>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> bumps >>>>>>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 #63 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node: >>>>>>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 dropped. >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved >>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 to cluster; >>>>>>>>>>>>>>>>>>>>>> id=63.78; >>>>>>>>>>>>>>>>>>>>>> num=64; >>>>>>>>>>>>>>>>>>>>>> min=51 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> TSpace=5587GB >>>>>>>>>>>>>>>>>>>>>> NumFS=1 >>>>>>>>>>>>>>>>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding >>>>>>>>>>>>>>>>>>>>>> path: >>>>>>>>>>>>>>>>>>>>>> w >>>>>>>>>>>>>>>>>>>>>> /atlas >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> do_Space: 5721854MB free; 0% util >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 logged in. >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from >>>>>>>>>>>>>>>>>>>>>> c187.chtc.wisc.edu; connection reset by peer >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 60 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 node 63.78 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 79 detached from poller 2; num=21 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>> c177.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> FD=16 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 node 0.3 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 16 detached from poller 2; num=20 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>> c179.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> FD=21 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.17065:[log in to unmask]:1094 node 1.4 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 21 >>>>>>>>>>>>>>>>>>>>>> detached from poller 1; num=21 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to >>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to >>>>>>>>>>>>>>>>>>>>>> redirector.15656:14@atlas-bkp2 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>> c182.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> FD=19 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 node 7.10 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 19 detached from poller 2; num=19 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>> c178.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> FD=15 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 node 9.12 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 15 detached from poller 1; num=20 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>> c181.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> FD=17 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 node 5.8 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 17 detached from poller 0; num=21 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>> c183.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> FD=22 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 node 8.11 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 22 detached from poller 2; num=18 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>> c184.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> FD=20 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.4131:[log in to unmask]:1094 node 3.6 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 20 detached from poller 0; num=20 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>> c185.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> FD=23 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 node 6.9 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.8524:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 23 >>>>>>>>>>>>>>>>>>>>>> detached from poller 0; num=19 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>> c180.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> FD=18 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 node 4.7 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.14636:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.14636:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 18 detached from poller 1; num=19 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>> c186.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> FD=24 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 node 2.5 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.7849:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 24 >>>>>>>>>>>>>>>>>>>>>> detached from poller 1; num=18 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled. >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled. >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled. >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled. >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled. >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled. >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled. 
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> To go past 64 data servers you will need to set up one or more
>>>>>>>>>>>>>>>>>>>>>>> supervisors. This does not logically change the current configuration
>>>>>>>>>>>>>>>>>>>>>>> you have. You only need to configure one or more *new* servers (or at
>>>>>>>>>>>>>>>>>>>>>>> least xrootd processes) whose role is supervisor. We'd like them to run
>>>>>>>>>>>>>>>>>>>>>>> on separate machines for reliability purposes, but they could run on the
>>>>>>>>>>>>>>>>>>>>>>> manager node as long as you give each one a unique instance name (i.e.,
>>>>>>>>>>>>>>>>>>>>>>> the -n option).
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> The front part of the cmsd reference explains how to do this.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Is there any way to configure xrootd with more than 65 machines? I used
>>>>>>>>>>>>>>>>>>>>>>>> the configuration below, but it doesn't work. Should I configure some
>>>>>>>>>>>>>>>>>>>>>>>> machines to be supervisors?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Wen
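Concretely, the recipe Andy describes above amounts to one config line giving a host the supervisor role, plus a cmsd/xrootd pair started with an instance name. A sketch, where the host, manager port, and log locations are from this thread, while the config path and instance name "super" are illustrative (the cmsd reference linked above is authoritative):

    # in the shared config file, make higgs07 a supervisor:
    all.role supervisor if higgs07.cs.wisc.edu
    all.manager atlas-bkp1.cs.wisc.edu:3121

    # on higgs07 (or on the manager node, with a unique -n instance name):
    cmsd -c /etc/xrootd/xrdcluster.cfg -n super -l /var/log/xrootd/cms.log &
    xrootd -c /etc/xrootd/xrdcluster.cfg -n super -l /var/log/xrootd/xrd.log &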