Hi Fabrizio,

This is the xrdcp debug message.

ClientHeader.header.dlen = 41
=================== END CLIENT HEADER DUMPING ===================
091217 16:47:54 15961 Xrd: WriteRaw: Writing 24 bytes to physical connection
091217 16:47:54 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:47:54 15961 Xrd: WriteRaw: Writing 41 bytes to physical connection
091217 16:47:54 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:47:54 15961 Xrd: ReadPartialAnswer: Reading a XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: sid: 1, IsAttn: 0, substreamid: 0
091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4 bytes) from substream 0
091217 16:47:54 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
091217 16:47:54 15961 Xrd: BuildMessage: posting id 1
091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8 bytes).
091217 16:47:54 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
======== DUMPING SERVER RESPONSE HEADER ========
ServerHeader.streamid = 0x01 0x00
ServerHeader.status = kXR_wait (4005)
ServerHeader.dlen = 4
========== END DUMPING SERVER HEADER ===========
091217 16:47:54 15961 Xrd: ReadPartialAnswer: Server [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
091217 16:47:54 15961 Xrd: CheckErrorStatus: Server [atlas-bkp1.cs.wisc.edu:1094] requested 10 seconds of wait
091217 16:48:04 15961 Xrd: DumpPhyConn: Phyconn entry, [log in to unmask]:1094', LogCnt=1 Valid
091217 16:48:04 15961 Xrd: SendGenCommand: Sending command Open
================= DUMPING CLIENT REQUEST HEADER =================
ClientHeader.streamid = 0x01 0x00
ClientHeader.requestid = kXR_open (3010)
ClientHeader.open.mode = 0x00 0x00
ClientHeader.open.options = 0x40 0x04
ClientHeader.open.reserved = 0 repeated 12 times
ClientHeader.header.dlen = 41
=================== END CLIENT HEADER DUMPING ===================
091217 16:48:04 15961 Xrd: WriteRaw: Writing 24 bytes to physical connection
091217 16:48:04 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:48:04 15961 Xrd: WriteRaw: Writing 41 bytes to physical connection
091217 16:48:04 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:48:04 15961 Xrd: ReadPartialAnswer: Reading a XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: sid: 1, IsAttn: 0, substreamid: 0
091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4 bytes) from substream 0
091217 16:48:04 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
091217 16:48:04 15961 Xrd: BuildMessage: posting id 1
091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8 bytes).
091217 16:48:04 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
======== DUMPING SERVER RESPONSE HEADER ========
ServerHeader.streamid = 0x01 0x00
ServerHeader.status = kXR_wait (4005)
ServerHeader.dlen = 4
========== END DUMPING SERVER HEADER ===========
091217 16:48:04 15961 Xrd: ReadPartialAnswer: Server [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
091217 16:48:04 15961 Xrd: CheckErrorStatus: Server [atlas-bkp1.cs.wisc.edu:1094] requested 10 seconds of wait
091217 16:48:14 15961 Xrd: SendGenCommand: Sending command Open
================= DUMPING CLIENT REQUEST HEADER =================
ClientHeader.streamid = 0x01 0x00
ClientHeader.requestid = kXR_open (3010)
ClientHeader.open.mode = 0x00 0x00
ClientHeader.open.options = 0x40 0x04
ClientHeader.open.reserved = 0 repeated 12 times
ClientHeader.header.dlen = 41
=================== END CLIENT HEADER DUMPING ===================
091217 16:48:14 15961 Xrd: WriteRaw: Writing 24 bytes to physical connection
091217 16:48:14 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:48:14 15961 Xrd: WriteRaw: Writing 41 bytes to physical connection
091217 16:48:14 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:48:14 15961 Xrd: ReadPartialAnswer: Reading a XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: sid: 1, IsAttn: 0, substreamid: 0
091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4 bytes) from substream 0
091217 16:48:14 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
091217 16:48:14 15961 Xrd: BuildMessage: posting id 1
091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8 bytes).
091217 16:48:14 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
======== DUMPING SERVER RESPONSE HEADER ========
ServerHeader.streamid = 0x01 0x00
ServerHeader.status = kXR_wait (4005)
ServerHeader.dlen = 4
========== END DUMPING SERVER HEADER ===========
091217 16:48:14 15961 Xrd: ReadPartialAnswer: Server [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
091217 16:48:14 15961 Xrd: SendGenCommand: Max time limit elapsed for request kXR_open. Aborting command.
Last server error 10000 ('')
Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131

Wen

On Thu, Dec 17, 2009 at 11:27 PM, Fabrizio Furano <[log in to unmask]> wrote:
> Hi Wen,
>
> I see that you are getting error 10000, which means "generic error before
> any interaction". Could you please run the same command with debug level 3
> and post the log with the same kind of issue? Something like
>
> xrdcp -d 3 ....
>
> Most likely this time the problem is different. I may be wrong here, but a
> possible reason for that error is that the servers require authentication
> and xrdcp does not find some library in the LD_LIBRARY_PATH.
>
> Fabrizio
>
> wen guan wrote:
>>
>> Hi Andy,
>>
>> I put the new logs on the web.
>>
>> It still doesn't work. I cannot copy files in or out.
>>
>> It seems the xrootd daemon at atlas-bkp1 hasn't talked with the cmsd.
>> Normally, when the xrootd daemon tries to copy a file, I should see
>> "do_Select: filename" in the cms.log. But in this cms.log there is
>> nothing from atlas-bkp1.
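(An aside on Fabrizio's LD_LIBRARY_PATH theory quoted above: one quick way to
check it on the client host is sketched below. The libXrdSec* name pattern is
only illustrative of the xrootd security plug-ins; none of these lines are
taken from the logs in this thread.

  echo $LD_LIBRARY_PATH
  for d in $(echo $LD_LIBRARY_PATH | tr ':' ' '); do
      ls $d/libXrdSec* 2>/dev/null    # list any security plug-ins the client can see
  done

If the servers require authentication and no such library is visible to
xrdcp, the copy can fail before any real interaction with the server, which
matches the "generic error before any interaction" meaning of error 10000.)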
>>
>> (*)
>> [root@atlas-bkp1 ~]# xrdcp
>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>> Last server error 10000 ('')
>> Error accessing path/file for
>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>> [root@atlas-bkp1 ~]# xrdcp /bin/mv
>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123133
>>
>>
>> Wen
>>
>> On Thu, Dec 17, 2009 at 10:54 PM, Andrew Hanushevsky <[log in to unmask]>
>> wrote:
>>>
>>> Hi Wen,
>>>
>>> I reviewed the log file, other than the odd redirect of c131 at
>>> 17:47:25, which I can't comment on because its logs on the web site do
>>> not overlap with the manager or supervisor. Unless all the logs include
>>> the full time in question I can't say much of anything. Can you provide
>>> me with inclusive logs?
>>>
>>> atlas-bkp1 cms: 17:20:57 to 17:42:19  xrd: 17:20:57 to 17:40:57
>>> higgs07    cms & xrd 17:22:33 to 17:42:33
>>> c131       cms & xrd 17:31:57 to 17:47:28
>>>
>>> That said, it certainly looks like things were working and files were
>>> being accessed and discovered on all the machines. You even were able
>>> to open /atlas/xrootd/users/wguan/test/test98123313 though not
>>> /atlas/xrootd/users/wguan/test/test123131. The other issue is that you
>>> did not specify a stable adminpath, and the adminpath defaults to /tmp.
>>> If you have a "cleanup" script that runs periodically for /tmp then
>>> eventually your cluster will go catatonic as important (but not often
>>> used) files are deleted by that script. Could you please find a stable
>>> home for the adminpath?
>>>
>>> I reran my tests here and things worked as expected. I will ramp up
>>> some more tests. So, what is your status today?
>>>
>>> Andy
>>>
>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>> Cc: <[log in to unmask]>
>>> Sent: Thursday, December 17, 2009 5:05 AM
>>> Subject: Re: xrootd with more than 65 machines
>>>
>>>
>>> Hi Andy,
>>>
>>> Yes. I am using the file downloaded from
>>> http://www.slac.stanford.edu/~abh/cmsd/ which I compiled yesterday. I
>>> just now compiled it again and compared it with the one I compiled
>>> yesterday; they are the same (same md5sum).
>>>
>>> Wen
>>>
>>> On Thu, Dec 17, 2009 at 2:09 AM, Andrew Hanushevsky <[log in to unmask]>
>>> wrote:
>>>>
>>>> Hi Wen,
>>>>
>>>> If c131 cannot connect then either c131 does not have the new cmsd or
>>>> atlas-bkp1 does not have the new cmsd, as that is what would happen if
>>>> either were true. Looking at the log on c131 it would appear that
>>>> atlas-bkp1 is still using the old cmsd as the response data length is
>>>> wrong. Could you verify, please?
>>>>
>>>> Andy
>>>>
>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>> Cc: <[log in to unmask]>
>>>> Sent: Wednesday, December 16, 2009 3:58 PM
>>>> Subject: Re: xrootd with more than 65 machines
>>>>
>>>>
>>>> Hi Andy,
>>>>
>>>> I tried it, but there are still some problems. I put the logs in
>>>> higgs03.cs.wisc.edu/wguan/
>>>>
>>>> In my test, c131 is the 65th node to be added to the manager. I can
>>>> copy files into the pool through the manager, but I cannot copy out a
>>>> file which is on c131.
>>>>
>>>> In c131's cms.log, I see "Manager:
>>>> manager.0:[log in to unmask] removed; redirected" again and
>>>> again, and I cannot see anything about c131 in higgs07's log (the
>>>> supervisor).
>>>> Does it mean the manager tries to redirect it to higgs07, but c131
>>>> hasn't tried to connect to higgs07? It only tries to connect to the
>>>> manager again.
>>>>
>>>> (*)
>>>> [root@c131 ~]# xrdcp /bin/mv
>>>> root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
>>>> Last server error 10000 ('')
>>>> Error accessing path/file for
>>>> root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
>>>> [root@c131 ~]# xrdcp /bin/mv
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123311
>>>> [xrootd] Total 0.06 MB |====================| 100.00 % [3.1 MB/s]
>>>> [root@c131 ~]# xrdcp /bin/mv
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123312
>>>> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
>>>> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/
>>>> test123131
>>>> [root@c131 ~]# xrdcp
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>>> Last server error 3011 ('No servers are available to read the file.')
>>>> Error accessing path/file for
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/test123131
>>>> /atlas/xrootd/users/wguan/test/test123131
>>>> [root@c131 ~]# xrdcp
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>>> Last server error 3011 ('No servers are available to read the file.')
>>>> Error accessing path/file for
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>> [root@c131 ~]# xrdcp /bin/mv
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123313
>>>> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
>>>> [root@c131 ~]# xrdcp
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>>> Last server error 3011 ('No servers are available to read the file.')
>>>> Error accessing path/file for
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>> [root@c131 ~]# xrdcp
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>>> Last server error 3011 ('No servers are available to read the file.')
>>>> Error accessing path/file for
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>> [root@c131 ~]# xrdcp
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>>> Last server error 3011 ('No servers are available to read the file.')
>>>> Error accessing path/file for
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>> [root@c131 ~]# tail -f /var/log/xrootd/cms.log
>>>> 091216 17:45:52 3103 manager.0:[log in to unmask] XrdLink: Setting ref to 2+-1 post=0
>>>> 091216 17:45:55 3103 Pander trying to connect to lvl 0 atlas-bkp1.cs.wisc.edu:3121
>>>> 091216 17:45:55 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
>>>> 091216 17:45:55 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
>>>> 091216 17:45:55 3103 ManTree: Now connected to 3 root node(s)
>>>> 091216 17:45:55 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
>>>> 091216 17:45:55 3103 Dispatch manager.0:[log in to unmask] for try dlen=3
>>>> 091216 17:45:55 3103 manager.0:[log in to unmask] do_Try:
>>>> 091216 17:45:55 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.95
>>>> 091216 17:45:55 3103 Manager: manager.0:[log in to unmask] removed; redirected
>>>> 091216 17:46:04 3103 Pander trying to connect to lvl 0 atlas-bkp1.cs.wisc.edu:3121
>>>> 091216 17:46:04 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
>>>> 091216 17:46:04 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
>>>> 091216 17:46:04 3103 ManTree: Now connected to 3 root node(s)
>>>> 091216 17:46:04 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
>>>> 091216 17:46:04 3103 Dispatch manager.0:[log in to unmask] for try dlen=3
>>>> 091216 17:46:04 3103 Protocol: No buffers to serve atlas-bkp1.cs.wisc.edu
>>>> 091216 17:46:04 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.96
>>>> 091216 17:46:04 3103 Manager: manager.0:[log in to unmask] removed; insufficient buffers
>>>> 091216 17:46:11 3103 Dispatch manager.0:[log in to unmask] for state dlen=169
>>>> 091216 17:46:11 3103 manager.0:[log in to unmask] XrdLink: Setting ref to 1+1 post=0
>>>>
>>>> Thanks
>>>> Wen
>>>>
>>>> On Thu, Dec 17, 2009 at 12:10 AM, wen guan <[log in to unmask]>
>>>> wrote:
>>>>>
>>>>> Hi Andy,
>>>>>
>>>>>> OK, I understand. As for stalling, too many nodes were deemed to be
>>>>>> in trouble for the manager to allow service resumption.
>>>>>>
>>>>>> Please make sure that all of the nodes in the cluster receive the
>>>>>> new cmsd, as they will drop off with the old one and you'll see the
>>>>>> same kind of activity. Perhaps the best way to know that you
>>>>>> succeeded in putting everything in sync is to start with 63 data
>>>>>> nodes plus one supervisor. Once all connections are established,
>>>>>> adding an additional server should simply send it to the supervisor.
>>>>>
>>>>> I will do it.
>>>>> You said to start 63 data servers and one supervisor. Does it mean
>>>>> the supervisor is managed using the same policy? If there are 64 data
>>>>> servers connected before the supervisor, will the supervisor be
>>>>> dropped? Does the supervisor have high priority to be added to the
>>>>> manager? I mean, if there are already 64 data servers and a
>>>>> supervisor comes in, will the supervisor be accepted and a data
>>>>> server be redirected to the supervisor?
>>>>>
>>>>> Thanks
>>>>> Wen
>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>> But when I tried to xrdcp a file to it, it doesn't respond. In
>>>>>> atlas-bkp1-xrd.log.20091213, it always prints "stalling client for
>>>>>> 10 sec", but in the cms.log I can't find any message about the file.
>>>>>>
>>>>>>> I don't see why you say it doesn't work. With the debugging level
>>>>>>> set so high the noise may make it look like something is going
>>>>>>> wrong but that isn't necessarily the case.
>>>>>>>
>>>>>>> 1) The 'too many subscribers' is correct. The manager was simply
>>>>>>> redirecting them because there were already 64 servers. However, in
>>>>>>> your case the supervisor wasn't started until almost 30 minutes
>>>>>>> after everyone else (i.e., 10:42 AM). Why was that? I'm not
>>>>>>> surprised about the flurry of messages with a critical component
>>>>>>> missing for 30 minutes.
>>>>>>
>>>>>> Because the manager is a 64-bit machine but the supervisor is a
>>>>>> 32-bit machine, I had to recompile it. At that time, I was
>>>>>> interrupted by something else.
>>>>>>
>>>>>>> 2) Once the supervisor started, it started accepting the redirected
>>>>>>> servers.
>>>>>>>
>>>>>>> 3) Then 10 seconds (10:42:10) later the supervisor was restarted.
>>>>>>> So, that would cause a flurry of activity to occur as there is no
>>>>>>> backup supervisor to take over.
>>>>>>>
>>>>>>> 4) This happened again at 10:42:34 AM and then again at 10:48:49.
>>>>>>> Is the supervisor crashing? Is there a core file?
>>>>>>>
>>>>>>> 5) At 11:11 AM the manager restarted. Again, is there a core file
>>>>>>> here or was this a manual action?
>>>>>>>
>>>>>>> During the course of all of this, all connected nodes were
>>>>>>> operating properly and files were being located.
>>>>>>>
>>>>>>> So, the two big questions are:
>>>>>>>
>>>>>>> a) Why was the supervisor not started until 30 minutes after the
>>>>>>> system was started?
>>>>>>>
>>>>>>> b) Is there an explanation of the restarts? If this was a crash
>>>>>>> then we need a core file to figure out what happened.
>>>>>>
>>>>>> It's not a crash. There are some reasons why I restarted some
>>>>>> daemons.
>>>>>> (1) I thought that if a data server tried many times to connect to a
>>>>>> redirector but failed, it would not try to connect to the redirector
>>>>>> again. The supervisor was missing for a long time, so maybe some
>>>>>> data servers would not try to connect to atlas-bkp1 again. To
>>>>>> reactivate these data servers, I restarted the servers.
>>>>>> (2) When I tried to xrdcp, it was hanging for a long time. I thought
>>>>>> maybe the manager was affected by something else, so I restarted the
>>>>>> manager to see whether a restart could make the xrdcp work.
>>>>>>
>>>>>> Thanks
>>>>>> Wen
>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>> Cc: <[log in to unmask]>
>>>>>>> Sent: Wednesday, December 16, 2009 9:38 AM
>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>
>>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> It still doesn't work.
>>>>>>> The log files are in higgs03.cs.wisc.edu/wguan/; the names are
>>>>>>> *.20091216.
>>>>>>> The manager complains there are too many subscribers and then
>>>>>>> removes nodes.
>>>>>>>
>>>>>>> (*)
>>>>>>> Add server.10040:[log in to unmask] redirected; too many
>>>>>>> subscribers.
>>>>>>>
>>>>>>> Wen
>>>>>>>
>>>>>>> On Wed, Dec 16, 2009 at 4:25 AM, Andrew Hanushevsky
>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>
>>>>>>>> Hi Wen,
>>>>>>>>
>>>>>>>> It will be easier for me to retrofit as the changes were pretty
>>>>>>>> minor. Please lift the new XrdCmsNode.cc file from
>>>>>>>>
>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>> Sent: Tuesday, December 15, 2009 5:12 PM
>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Andy,
>>>>>>>>
>>>>>>>> I can switch to 20091104-1102; then you don't need to patch
>>>>>>>> another version. How can I download v20091104-1102?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Wen
>>>>>>>>
>>>>>>>> On Wed, Dec 16, 2009 at 12:52 AM, Andrew Hanushevsky
>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>
>>>>>>>>> Hi Wen,
>>>>>>>>>
>>>>>>>>> Ah yes, I see that now. The file I gave you is based on
>>>>>>>>> v20091104-1102. Let me see if I can retrofit the patch for you.
>>>>>>>>>
>>>>>>>>> Andy
>>>>>>>>>
>>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>> Sent: Tuesday, December 15, 2009 1:04 PM
>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Andy,
>>>>>>>>>
>>>>>>>>> Which xrootd version are you using? My XrdCmsConfig.hh is
>>>>>>>>> different; it was downloaded from
>>>>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/.
>>>>>>>>>
>>>>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsNode.cc
>>>>>>>>> 6fb3ae40fe4e10bdd4d372818a341f2c src/XrdCms/XrdCmsNode.cc
>>>>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsConfig.hh
>>>>>>>>> 7d57753847d9448186c718f98e963cbe src/XrdCms/XrdCmsConfig.hh
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Wen
>>>>>>>>>
>>>>>>>>> On Tue, Dec 15, 2009 at 10:50 PM, Andrew Hanushevsky
>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Wen,
>>>>>>>>>>
>>>>>>>>>> Just compiled on Linux and it was clean. Something is really
>>>>>>>>>> wrong with your source files, specifically XrdCmsConfig.cc.
>>>>>>>>>>
>>>>>>>>>> The MD5 checksums on the relevant files are:
>>>>>>>>>>
>>>>>>>>>> MD5 (XrdCmsNode.cc) = 6fb3ae40fe4e10bdd4d372818a341f2c
>>>>>>>>>>
>>>>>>>>>> MD5 (XrdCmsConfig.hh) = 4a7d655582a7cd43b098947d0676924b
>>>>>>>>>>
>>>>>>>>>> Andy
>>>>>>>>>>
>>>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>> Sent: Tuesday, December 15, 2009 4:24 AM
>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Andy,
>>>>>>>>>>
>>>>>>>>>> No problem. Thanks for the fix. But it cannot be compiled. The
>>>>>>>>>> version I am using is
>>>>>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/.
>>>>>>>>>>
>>>>>>>>>> Making cms component...
>>>>>>>>>> Compiling XrdCmsNode.cc
>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Chmod(XrdCmsRRData&)':
>>>>>>>>>> XrdCmsNode.cc:268: error: `fsExec' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:268: warning: unused variable 'fsExec'
>>>>>>>>>> XrdCmsNode.cc:269: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>> XrdCmsNode.cc:273: error: `fsFail' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:273: warning: unused variable 'fsFail'
>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mkdir(XrdCmsRRData&)':
>>>>>>>>>> XrdCmsNode.cc:600: error: `fsExec' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:600: warning: unused variable 'fsExec'
>>>>>>>>>> XrdCmsNode.cc:601: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>> XrdCmsNode.cc:605: error: `fsFail' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:605: warning: unused variable 'fsFail'
>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mkpath(XrdCmsRRData&)':
>>>>>>>>>> XrdCmsNode.cc:640: error: `fsExec' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:640: warning: unused variable 'fsExec'
>>>>>>>>>> XrdCmsNode.cc:641: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>> XrdCmsNode.cc:645: error: `fsFail' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:645: warning: unused variable 'fsFail'
>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mv(XrdCmsRRData&)':
>>>>>>>>>> XrdCmsNode.cc:704: error: `fsExec' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:704: warning: unused variable 'fsExec'
>>>>>>>>>> XrdCmsNode.cc:705: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>> XrdCmsNode.cc:709: error: `fsFail' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:709: warning: unused variable 'fsFail'
>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Rm(XrdCmsRRData&)':
>>>>>>>>>> XrdCmsNode.cc:831: error: `fsExec' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:831: warning: unused variable 'fsExec'
>>>>>>>>>> XrdCmsNode.cc:832: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>> XrdCmsNode.cc:836: error: `fsFail' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:836: warning: unused variable 'fsFail'
>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Rmdir(XrdCmsRRData&)':
>>>>>>>>>> XrdCmsNode.cc:873: error: `fsExec' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:873: warning: unused variable 'fsExec'
>>>>>>>>>> XrdCmsNode.cc:874: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>> XrdCmsNode.cc:878: error: `fsFail' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:878: warning: unused variable 'fsFail'
>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Trunc(XrdCmsRRData&)':
>>>>>>>>>> XrdCmsNode.cc:1377: error: `fsExec' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:1377: warning: unused variable 'fsExec'
>>>>>>>>>> XrdCmsNode.cc:1378: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>> XrdCmsNode.cc:1382: error: `fsFail' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:1382: warning: unused variable 'fsFail'
>>>>>>>>>> XrdCmsNode.cc: At global scope:
>>>>>>>>>> XrdCmsNode.cc:1524: error: no `int XrdCmsNode::fsExec(XrdOucProg*, char*, char*)' member function declared in class `XrdCmsNode'
>>>>>>>>>> XrdCmsNode.cc: In member function `int XrdCmsNode::fsExec(XrdOucProg*, char*, char*)':
>>>>>>>>>> XrdCmsNode.cc:1533: error: `fsL2PFail1' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:1533: warning: unused variable 'fsL2PFail1'
>>>>>>>>>> XrdCmsNode.cc:1537: error: `fsL2PFail2' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:1537: warning: unused variable 'fsL2PFail2'
>>>>>>>>>> XrdCmsNode.cc: At global scope:
>>>>>>>>>> XrdCmsNode.cc:1553: error: no `const char* XrdCmsNode::fsFail(const char*, const char*, const char*, int)' member function declared in class `XrdCmsNode'
>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::fsFail(const char*, const char*, const char*, int)':
>>>>>>>>>> XrdCmsNode.cc:1559: error: `fsL2PFail1' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:1559: warning: unused variable 'fsL2PFail1'
>>>>>>>>>> XrdCmsNode.cc:1560: error: `fsL2PFail2' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:1560: warning: unused variable 'fsL2PFail2'
>>>>>>>>>> XrdCmsNode.cc: In static member function `static int XrdCmsNode::isOnline(char*, int)':
>>>>>>>>>> XrdCmsNode.cc:1608: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>> make[4]: *** [../../obj/XrdCmsNode.o] Error 1
>>>>>>>>>> make[3]: *** [Linuxall] Error 2
>>>>>>>>>> make[2]: *** [all] Error 2
>>>>>>>>>> make[1]: *** [XrdCms] Error 2
>>>>>>>>>> make: *** [all] Error 2
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Wen
>>>>>>>>>>
>>>>>>>>>> On Tue, Dec 15, 2009 at 2:08 AM, Andrew Hanushevsky
>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>
>>>>>>>>>>> I have developed a permanent fix. You will find the source
>>>>>>>>>>> files in
>>>>>>>>>>>
>>>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>>>>
>>>>>>>>>>> There are three files: XrdCmsCluster.cc XrdCmsNode.cc
>>>>>>>>>>> XrdCmsProtocol.cc
>>>>>>>>>>>
>>>>>>>>>>> Please do a source replacement and recompile. Unfortunately,
>>>>>>>>>>> the cmsd will need to be replaced on each node regardless of
>>>>>>>>>>> role. My apologies for the disruption. Please let me know how
>>>>>>>>>>> it goes.
>>>>>>>>>>>
>>>>>>>>>>> Andy
>>>>>>>>>>>
>>>>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>>> Sent: Sunday, December 13, 2009 7:04 AM
>>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>> I used the new cmsd at the atlas-bkp1 manager, but it's still
>>>>>>>>>>> dropping nodes, and in the supervisor's log I cannot see any
>>>>>>>>>>> data server registering with it.
>>>>>>>>>>>
>>>>>>>>>>> The new logs are in http://higgs03.cs.wisc.edu/wguan/*.20091213.
>>>>>>>>>>> The manager was patched at 091213 08:38:15.
>>>>>>>>>>>
>>>>>>>>>>> Wen
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky
>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Wen
>>>>>>>>>>>>
>>>>>>>>>>>> You will find the source replacement at:
>>>>>>>>>>>>
>>>>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>>>>>
>>>>>>>>>>>> It's XrdCmsCluster.cc and it replaces
>>>>>>>>>>>> xrootd/src/XrdCms/XrdCmsCluster.cc
>>>>>>>>>>>>
>>>>>>>>>>>> I'm stepping out for a couple of hours but will be back to see
>>>>>>>>>>>> how things went. Sorry for the issues :-(
>>>>>>>>>>>>
>>>>>>>>>>>> Andy
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I prefer a source replacement. Then I can compile it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>> I can do one of two things here:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) Supply a source replacement and then you would recompile,
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2) Give me the uname -a of where the cmsd will run and I'll
>>>>>>>>>>>>>> supply a binary replacement for you.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Your choice.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Andrew
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The problem is found. Great. Thanks.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Where can I find the patched cmsd?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky
>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I found the problem. Looks like a regression from way back
>>>>>>>>>>>>>>>> when. There is a missing flag on the redirect. This will
>>>>>>>>>>>>>>>> require a patched cmsd, but you need only to replace the
>>>>>>>>>>>>>>>> redirector's cmsd as this only affects the redirector. How
>>>>>>>>>>>>>>>> would you like to proceed?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It doesn't work. The atlas-bkp1 manager is still dropping
>>>>>>>>>>>>>>>>> nodes again. In the supervisor, I still haven't seen any
>>>>>>>>>>>>>>>>> data server registered. I said "I updated the ntp"
>>>>>>>>>>>>>>>>> because you said "the log timestamps do not overlap".
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky
>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Do you mean that everything is now working? It could be
>>>>>>>>>>>>>>>>>> that you removed the xrd.timeout directive. That really
>>>>>>>>>>>>>>>>>> could cause problems. As for the delays, that is normal
>>>>>>>>>>>>>>>>>> when the redirector thinks something is going wrong. The
>>>>>>>>>>>>>>>>>> strategy is to delay clients until it can get back to a
>>>>>>>>>>>>>>>>>> stable configuration. This usually prevents jobs from
>>>>>>>>>>>>>>>>>> crashing during stressful periods.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I restarted it to do the supervisor test, and also
>>>>>>>>>>>>>>>>>>> because the xrootd manager frequently doesn't respond.
>>>>>>>>>>>>>>>>>>> (*) is the cms.log; the file select is delayed again
>>>>>>>>>>>>>>>>>>> and again. When I do a restart, all things are fine.
>>>>>>>>>>>>>>>>>>> Now I am trying to find a clue about it.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (*)
>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> There is no core file. I copied new copies of the logs
>>>>>>>>>>>>>>>>>>> to the link below.
>>>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky
>>>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I see in the server log that it is restarting often.
>>>>>>>>>>>>>>>>>>>> Could you take a look at c193 to see if you have any
>>>>>>>>>>>>>>>>>>>> core files? Also please make sure that core files are
>>>>>>>>>>>>>>>>>>>> enabled as Linux defaults the size to 0. The first
>>>>>>>>>>>>>>>>>>>> step here is to find out why your servers are
>>>>>>>>>>>>>>>>>>>> restarting.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The logs can be found here. From the logs you can see
>>>>>>>>>>>>>>>>>>>>> the atlas-bkp1 manager is dropping, again and again,
>>>>>>>>>>>>>>>>>>>>> nodes which try to connect to it.
>>>>>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
>>>>>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Wen, Could you start everything up and provide me
>>>>>>>>>>>>>>>>>>>>>> a pointer to the manager log file, supervisor log
>>>>>>>>>>>>>>>>>>>>>> file, and one data server log file, all of which
>>>>>>>>>>>>>>>>>>>>>> cover the same time-frame (from start to some point
>>>>>>>>>>>>>>>>>>>>>> where you think things are working or not). That way
>>>>>>>>>>>>>>>>>>>>>> I can see what is happening. At the moment I only
>>>>>>>>>>>>>>>>>>>>>> see two "bad" things in the config file:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a
>>>>>>>>>>>>>>>>>>>>>> manager but you claim, via the all.manager
>>>>>>>>>>>>>>>>>>>>>> directive, that there are three (bkp2 and bkp3).
>>>>>>>>>>>>>>>>>>>>>> While it should work, the log file will be dense
>>>>>>>>>>>>>>>>>>>>>> with error messages. Please correct this to be
>>>>>>>>>>>>>>>>>>>>>> consistent and make it easier to see real errors.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> This is not a problem for me, because this config is
>>>>>>>>>>>>>>>>>>>>> used on the data servers. In the manager's config, I
>>>>>>>>>>>>>>>>>>>>> updated the "if atlas-bkp1.cs.wisc.edu" to atlas-bkp2
>>>>>>>>>>>>>>>>>>>>> or something. This is a historical problem: at first
>>>>>>>>>>>>>>>>>>>>> only atlas-bkp1 was used; atlas-bkp2 and atlas-bkp3
>>>>>>>>>>>>>>>>>>>>> were added later.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 2) Please use cms.space not olb.space (for
>>>>>>>>>>>>>>>>>>>>>> historical reasons the latter is still accepted and
>>>>>>>>>>>>>>>>>>>>>> over-rides the former, but that will soon end), and
>>>>>>>>>>>>>>>>>>>>>> please use only one (the config file uses both
>>>>>>>>>>>>>>>>>>>>>> directives).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Yes, I should remove this line; in fact cms.space is
>>>>>>>>>>>>>>>>>>>>> in the cfg too.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The xrootd has an internal mechanism to connect
>>>>>>>>>>>>>>>>>>>>>> servers with supervisors to allow for maximum
>>>>>>>>>>>>>>>>>>>>>> reliability. You cannot change that algorithm and
>>>>>>>>>>>>>>>>>>>>>> there is no need to do so. You should *never* tell
>>>>>>>>>>>>>>>>>>>>>> anyone to directly connect to a supervisor. If you
>>>>>>>>>>>>>>>>>>>>>> do, you will likely get unreachable nodes.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> As for dropping data servers, it would appear to me,
>>>>>>>>>>>>>>>>>>>>>> given the flurry of such activity, that something
>>>>>>>>>>>>>>>>>>>>>> either crashed or was restarted. That's why it would
>>>>>>>>>>>>>>>>>>>>>> be good to see the complete log of each one of the
>>>>>>>>>>>>>>>>>>>>>> entities.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I read the document and wrote a config file
>>>>>>>>>>>>>>>>>>>>>>> (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>>>>>>>>>>>>>>>>> With my conf, I can see the manager dispatching
>>>>>>>>>>>>>>>>>>>>>>> messages to the supervisor, but I cannot see any
>>>>>>>>>>>>>>>>>>>>>>> data server try to connect to the supervisor. At
>>>>>>>>>>>>>>>>>>>>>>> the same time, in the manager's log, I can see some
>>>>>>>>>>>>>>>>>>>>>>> data servers being dropped.
>>>>>>>>>>>>>>>>>>>>>>> How does xrootd decide which data server will
>>>>>>>>>>>>>>>>>>>>>>> connect to the supervisor? Should I specify some
>>>>>>>>>>>>>>>>>>>>>>> data servers to connect to the supervisor?
>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> (*) supervisor log >>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch >>>>>>>>>>>>>>>>>>>>>>> manager.0:20@atlas-bkp2 >>>>>>>>>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>>>>>>>>> state >>>>>>>>>>>>>>>>>>>>>>> dlen=42 >>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 >>>>>>>>>>>>>>>>>>>>>>> do_State: >>>>>>>>>>>>>>>>>>>>>>> /atlas/xrootd/users/wguan/test/test131141 >>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 >>>>>>>>>>>>>>>>>>>>>>> do_StateFWD: >>>>>>>>>>>>>>>>>>>>>>> Path >>>>>>>>>>>>>>>>>>>>>>> find >>>>>>>>>>>>>>>>>>>>>>> failed for state >>>>>>>>>>>>>>>>>>>>>>> /atlas/xrootd/users/wguan/test/test131141 >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> (*)manager log >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> TSpace=5587GB >>>>>>>>>>>>>>>>>>>>>>> NumFS=1 >>>>>>>>>>>>>>>>>>>>>>> FSpace=5693644MB MinFR=57218MB Util=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding >>>>>>>>>>>>>>>>>>>>>>> path: >>>>>>>>>>>>>>>>>>>>>>> w >>>>>>>>>>>>>>>>>>>>>>> /atlas >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> do_Space: 5696231MB free; 0% util >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 logged in. >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from >>>>>>>>>>>>>>>>>>>>>>> [log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running >>>>>>>>>>>>>>>>>>>>>>> ?:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol >>>>>>>>>>>>>>>>>>>>>>> cmsd >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>> 79 >>>>>>>>>>>>>>>>>>>>>>> attached >>>>>>>>>>>>>>>>>>>>>>> to poller 2; num=22 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add >>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> bumps >>>>>>>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 #63 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node: >>>>>>>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 dropped. >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved >>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 to cluster; >>>>>>>>>>>>>>>>>>>>>>> id=63.78; >>>>>>>>>>>>>>>>>>>>>>> num=64; >>>>>>>>>>>>>>>>>>>>>>> min=51 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> TSpace=5587GB >>>>>>>>>>>>>>>>>>>>>>> NumFS=1 >>>>>>>>>>>>>>>>>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding >>>>>>>>>>>>>>>>>>>>>>> path: >>>>>>>>>>>>>>>>>>>>>>> w >>>>>>>>>>>>>>>>>>>>>>> /atlas >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> do_Space: 5721854MB free; 0% util >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 logged in. 
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from >>>>>>>>>>>>>>>>>>>>>>> c187.chtc.wisc.edu; connection reset by peer >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node >>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>> 60 >>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 node 63.78 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>> 79 detached from poller 2; num=21 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>>> c177.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> FD=16 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 node 0.3 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>> 16 detached from poller 2; num=20 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>>> c179.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> FD=21 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>> server.17065:[log in to unmask]:1094 node 1.4 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>> 21 >>>>>>>>>>>>>>>>>>>>>>> detached from poller 1; num=21 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to >>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to >>>>>>>>>>>>>>>>>>>>>>> redirector.15656:14@atlas-bkp2 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>>> c182.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> FD=19 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 node 7.10 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>> 19 detached from poller 2; num=19 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>>> c178.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> FD=15 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 node 9.12 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>> 15 detached from poller 1; num=20 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>>> c181.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> FD=17 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 node 5.8 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>> 17 detached from poller 0; num=21 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>>> c183.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> FD=22 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 node 8.11 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>> 22 detached from poller 2; num=18 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>>> c184.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> FD=20 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>> server.4131:[log in to unmask]:1094 node 3.6 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>> 20 detached from poller 0; num=20 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>>> c185.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> FD=23 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 node 6.9 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.8524:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.8524:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>> 23 >>>>>>>>>>>>>>>>>>>>>>> detached from poller 0; num=19 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>>> c180.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> FD=18 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 node 4.7 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.14636:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.14636:[log in to unmask] XrdPoll: FD 18 detached from poller 1; num=19
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.1656:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c186.chtc.wisc.edu FD=24
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.1656:[log in to unmask]:1094 node 2.5
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.7849:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll: FD 24 detached from poller 1; num=18
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> To go past 64 data servers you will need to set up one or more supervisors.
>>>>>>>>>>>>>>>>>>>>>>>> This does not logically change the current configuration you have. You
>>>>>>>>>>>>>>>>>>>>>>>> only need to configure one or more *new* servers (or at least xrootd
>>>>>>>>>>>>>>>>>>>>>>>> processes) whose role is supervisor. We'd like them to run on separate
>>>>>>>>>>>>>>>>>>>>>>>> machines for reliability purposes, but they could run on the manager
>>>>>>>>>>>>>>>>>>>>>>>> node as long as you give each one a unique instance name (i.e., the -n
>>>>>>>>>>>>>>>>>>>>>>>> option).
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The front part of the cmsd reference explains how to do this:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Is there any way to configure xrootd with more than 65 machines? I
>>>>>>>>>>>>>>>>>>>>>>>>> used the configuration below, but it doesn't work. Should I configure
>>>>>>>>>>>>>>>>>>>>>>>>> some machines' role to be supervisor?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Wen
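
For what it's worth, Andy's recipe amounts to one role line per host class in
the shared config file, plus a unique instance name for every extra cmsd/xrootd
pair. Below is a minimal sketch, not a tested configuration: the host names
head.example.org and super1.example.org and the config path are hypothetical,
and the cms_config reference above remains the authoritative syntax.

  # xrdcluster.cfg -- shared by the manager, supervisor, and data servers
  # (head.example.org and super1.example.org are hypothetical names)
  all.manager head.example.org:3121

  all.role manager    if head.example.org
  all.role supervisor if super1.example.org
  all.role server

With something like that in place, data servers can subscribe to a supervisor
rather than directly to the manager, which is how a cluster grows past the 64
direct subscribers a single cmsd accepts. If a supervisor shares a machine with
the manager, start it under its own instance name:

  cmsd   -n super1 -c /opt/xrootd/etc/xrdcluster.cfg &
  xrootd -n super1 -c /opt/xrootd/etc/xrdcluster.cfg &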