Hi Andy,

I am sure I am using the right cmsd code. Today I recompiled and reinstalled cmsd and xrootd everywhere, but it still doesn't work. I will create an account for you so you can log in to these machines and check what happened. In fact, today while doing some restarts I saw some machines register themselves to higgs07, but unfortunately the logs were cleaned during the reinstall. I also found that the supervisor goes into the "suspend" state a while after it is started; could that cause the supervisor to fail to get some information?

Wen

On Fri, Dec 18, 2009 at 3:05 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
> Hi Wen,
>
> Something is really going wrong with your data servers. For instance, c109
> was quite happy from midnight to 7:23am. Then it dropped the connection. It
> reconnected at 7:24:03 and was again happy until 12:37:20, but at that point
> it reported that its xrootd died, and then the cmsd promptly killed its
> connection afterward. This appears as if someone restarted the xrootd
> followed by the cmsd on c109. It continued like this until 12:43:00 (i.e.,
> connect, suspend, die, repeat). All your servers, in fact, started doing
> this between 12:36:41 and 12:42:51, causing a massive swap of servers. New
> servers were added and old ones reconnecting were redirected to the
> supervisor. However, it would appear that those machines could not connect
> there, as they kept coming back to atlas-bkp1. I can't tell you anything
> about what was happening on higgs07. As far as I can tell it was happily
> connected to the redirector cmsd. The reason is that there is no log for
> higgs07 on the web site for 12/17 starting at midnight. Perhaps you can put
> one there.
>
> So,
>
> 1) Are you *absolutely* sure that *all* your (data, etc.) servers are running
> the corrected cmsd?
>
> 2) Please provide the higgs07 log for 12/17.
>
> 3) Please provide logs for a sampling of data servers, say c0109, c094,
> higgs15, and higgs13, between 12/17 12:00:00 and 15:44.
>
> I have never seen a situation like yours, so something is very wrong here.
> In the meantime I will add more debugging information to the redirector and
> supervisor and let you know when that is available.
>
> Andy
>
>
> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
> To: "Fabrizio Furano" <[log in to unmask]>
> Cc: "Andrew Hanushevsky" <[log in to unmask]>; <[log in to unmask]>
> Sent: Thursday, December 17, 2009 3:12 PM
> Subject: Re: xrootd with more than 65 machines
>
>
> Hi Fabrizio,
>
> This is the xrdcp debug message.
>
> ClientHeader.header.dlen = 41
> =================== END CLIENT HEADER DUMPING ===================
>
> 091217 16:47:54 15961 Xrd: WriteRaw: Writing 24 bytes to physical connection
> 091217 16:47:54 15961 Xrd: WriteRaw: Writing to substreamid 0
> 091217 16:47:54 15961 Xrd: WriteRaw: Writing 41 bytes to physical connection
> 091217 16:47:54 15961 Xrd: WriteRaw: Writing to substreamid 0
> 091217 16:47:54 15961 Xrd: ReadPartialAnswer: Reading a XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
> 091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: sid: 1, IsAttn: 0, substreamid: 0
> 091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4 bytes) from substream 0
> 091217 16:47:54 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
> 091217 16:47:54 15961 Xrd: BuildMessage: posting id 1
> 091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8 bytes).
> 091217 16:47:54 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
>
> ======== DUMPING SERVER RESPONSE HEADER ========
> ServerHeader.streamid = 0x01 0x00
> ServerHeader.status = kXR_wait (4005)
> ServerHeader.dlen = 4
> ========== END DUMPING SERVER HEADER ===========
>
> 091217 16:47:54 15961 Xrd: ReadPartialAnswer: Server [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
> 091217 16:47:54 15961 Xrd: CheckErrorStatus: Server [atlas-bkp1.cs.wisc.edu:1094] requested 10 seconds of wait
> 091217 16:48:04 15961 Xrd: DumpPhyConn: Phyconn entry, [log in to unmask]:1094', LogCnt=1 Valid
> 091217 16:48:04 15961 Xrd: SendGenCommand: Sending command Open
>
> ================= DUMPING CLIENT REQUEST HEADER =================
> ClientHeader.streamid = 0x01 0x00
> ClientHeader.requestid = kXR_open (3010)
> ClientHeader.open.mode = 0x00 0x00
> ClientHeader.open.options = 0x40 0x04
> ClientHeader.open.reserved = 0 repeated 12 times
> ClientHeader.header.dlen = 41
> =================== END CLIENT HEADER DUMPING ===================
>
> 091217 16:48:04 15961 Xrd: WriteRaw: Writing 24 bytes to physical connection
> 091217 16:48:04 15961 Xrd: WriteRaw: Writing to substreamid 0
> 091217 16:48:04 15961 Xrd: WriteRaw: Writing 41 bytes to physical connection
> 091217 16:48:04 15961 Xrd: WriteRaw: Writing to substreamid 0
> 091217 16:48:04 15961 Xrd: ReadPartialAnswer: Reading a XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
> 091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: sid: 1, IsAttn: 0, substreamid: 0
> 091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4 bytes) from substream 0
> 091217 16:48:04 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
> 091217 16:48:04 15961 Xrd: BuildMessage: posting id 1
> 091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8 bytes).
> 091217 16:48:04 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
>
> ======== DUMPING SERVER RESPONSE HEADER ========
> ServerHeader.streamid = 0x01 0x00
> ServerHeader.status = kXR_wait (4005)
> ServerHeader.dlen = 4
> ========== END DUMPING SERVER HEADER ===========
>
> 091217 16:48:04 15961 Xrd: ReadPartialAnswer: Server [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
> 091217 16:48:04 15961 Xrd: CheckErrorStatus: Server [atlas-bkp1.cs.wisc.edu:1094] requested 10 seconds of wait
> 091217 16:48:14 15961 Xrd: SendGenCommand: Sending command Open
>
> ================= DUMPING CLIENT REQUEST HEADER =================
> ClientHeader.streamid = 0x01 0x00
> ClientHeader.requestid = kXR_open (3010)
> ClientHeader.open.mode = 0x00 0x00
> ClientHeader.open.options = 0x40 0x04
> ClientHeader.open.reserved = 0 repeated 12 times
> ClientHeader.header.dlen = 41
> =================== END CLIENT HEADER DUMPING ===================
>
> 091217 16:48:14 15961 Xrd: WriteRaw: Writing 24 bytes to physical connection
> 091217 16:48:14 15961 Xrd: WriteRaw: Writing to substreamid 0
> 091217 16:48:14 15961 Xrd: WriteRaw: Writing 41 bytes to physical connection
> 091217 16:48:14 15961 Xrd: WriteRaw: Writing to substreamid 0
> 091217 16:48:14 15961 Xrd: ReadPartialAnswer: Reading a XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
> 091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: sid: 1, IsAttn: 0, substreamid: 0
> 091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4 bytes) from substream 0
> 091217 16:48:14 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
> 091217 16:48:14 15961 Xrd: BuildMessage: posting id 1
> 091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8 bytes).
> 091217 16:48:14 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
>
> ======== DUMPING SERVER RESPONSE HEADER ========
> ServerHeader.streamid = 0x01 0x00
> ServerHeader.status = kXR_wait (4005)
> ServerHeader.dlen = 4
> ========== END DUMPING SERVER HEADER ===========
>
> 091217 16:48:14 15961 Xrd: ReadPartialAnswer: Server [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
> 091217 16:48:14 15961 Xrd: SendGenCommand: Max time limit elapsed for request kXR_open. Aborting command.
> Last server error 10000 ('')
> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>
>
> Wen
>
> On Thu, Dec 17, 2009 at 11:27 PM, Fabrizio Furano <[log in to unmask]> wrote:
>>
>> Hi Wen,
>>
>> I see that you are getting error 10000, which means "generic error before
>> any interaction". Could you please run the same command with debug level 3
>> and post the log with the same kind of issue? Something like
>>
>> xrdcp -d 3 ....
>>
>> Most likely this time the problem is different. I may be wrong here, but a
>> possible reason for that error is that the servers require authentication
>> and xrdcp does not find some library in the LD_LIBRARY_PATH.
>>
>> Fabrizio
>>
>>
>> wen guan wrote:
>>>
>>> Hi Andy,
>>>
>>> I put new logs on the web.
>>>
>>> It still doesn't work. I cannot copy files in or out.
>>>
>>> It seems the xrootd daemon at atlas-bkp1 hasn't talked with the cmsd.
>>> Normally when the xrootd daemon tries to copy a file, I should see
>>> "do_Select: filename" in cms.log. But in this cms.log, there is
>>> nothing from atlas-bkp1.
>>>
>>> (*)
>>> [root@atlas-bkp1 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>> Last server error 10000 ('')
>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>> [root@atlas-bkp1 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123133
>>>
>>>
>>> Wen
>>>
>>> On Thu, Dec 17, 2009 at 10:54 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>
>>>> Hi Wen,
>>>>
>>>> I reviewed the log file. Other than the odd redirect of c131 at 17:47:25,
>>>> which I can't comment on because its logs on the web site do not overlap
>>>> with the manager or supervisor, I can't say much of anything unless all
>>>> the logs include the full time in question. Can you provide me with
>>>> inclusive logs?
>>>>
>>>> atlas-bkp1 cms: 17:20:57 to 17:42:19  xrd: 17:20:57 to 17:40:57
>>>> higgs07 cms & xrd 17:22:33 to 17:42:33
>>>> c131 cms & xrd 17:31:57 to 17:47:28
>>>>
>>>> That said, it certainly looks like things were working and files were
>>>> being accessed and discovered on all the machines. You even were able to open
>>>> /atlas/xrootd/users/wguan/test/test98123313
>>>> though not
>>>> /atlas/xrootd/users/wguan/test/test123131
>>>>
>>>> The other issue is that you did not specify a stable adminpath, and the
>>>> adminpath defaults to /tmp. If you have a "cleanup" script that runs
>>>> periodically for /tmp, then eventually your cluster will go catatonic as
>>>> important (but not often used) files are deleted by that script. Could
>>>> you please find a stable home for the adminpath?
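>>>>
>>>> For example, something like this in the config file on every node would
>>>> do it (a sketch only; /var/spool/xrootd is just an illustration, any
>>>> stable local directory that the daemons can write to will work):
>>>>
>>>>    all.adminpath /var/spool/xrootd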
>>>>
>>>> I reran my tests here and things worked as expected. I will ramp up some
>>>> more tests. So, what is your status today?
>>>>
>>>> Andy
>>>>
>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>> Cc: <[log in to unmask]>
>>>> Sent: Thursday, December 17, 2009 5:05 AM
>>>> Subject: Re: xrootd with more than 65 machines
>>>>
>>>>
>>>> Hi Andy,
>>>>
>>>> Yes. I am using the file downloaded from
>>>> http://www.slac.stanford.edu/~abh/cmsd/, which I compiled yesterday. I
>>>> just now compiled it again and compared it with the one I compiled
>>>> yesterday; they are the same (same md5sum).
>>>>
>>>> Wen
>>>>
>>>> On Thu, Dec 17, 2009 at 2:09 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>
>>>>> Hi Wen,
>>>>>
>>>>> If c131 cannot connect, then either c131 does not have the new cmsd or
>>>>> atlas-bkp1 does not have the new cmsd, as that is what would happen if
>>>>> either were true. Looking at the log on c131, it would appear that
>>>>> atlas-bkp1 is still using the old cmsd, as the response data length is
>>>>> wrong. Could you verify, please?
>>>>>
>>>>> Andy
>>>>>
>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>> Cc: <[log in to unmask]>
>>>>> Sent: Wednesday, December 16, 2009 3:58 PM
>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>
>>>>>
>>>>> Hi Andy,
>>>>>
>>>>> I tried it, but there are still some problems. I put the logs in
>>>>> higgs03.cs.wisc.edu/wguan/
>>>>>
>>>>> In my test, c131 is the 65th node to be added to the manager,
>>>>> and I can copy a file to the pool through the manager. But I cannot
>>>>> copy a file out that is on c131.
>>>>>
>>>>> In c131's cms.log, I see "Manager:
>>>>> manager.0:[log in to unmask] removed; redirected" again and
>>>>> again, and I cannot see anything about c131 in higgs07's (supervisor)
>>>>> log. Does that mean the manager tries to redirect it to higgs07,
>>>>> but c131 hasn't tried to connect to higgs07? It only tries to connect
>>>>> to the manager again.
>>>>>
>>>>> (*)
>>>>> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
>>>>> Last server error 10000 ('')
>>>>> Error accessing path/file for root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
>>>>> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123311
>>>>> [xrootd] Total 0.06 MB |====================| 100.00 % [3.1 MB/s]
>>>>> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123312
>>>>> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
>>>>> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/
>>>>> test123131
>>>>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>>>> Last server error 3011 ('No servers are available to read the file.')
>>>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/test123131
>>>>> /atlas/xrootd/users/wguan/test/test123131
>>>>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>>>> Last server error 3011 ('No servers are available to read the file.')
>>>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123313
>>>>> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
>>>>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>>>> Last server error 3011 ('No servers are available to read the file.')
>>>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>>>> Last server error 3011 ('No servers are available to read the file.')
>>>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>> [root@c131 ~]# tail -f /var/log/xrootd/cms.log
>>>>> 091216 17:45:52 3103 manager.0:[log in to unmask] XrdLink: Setting ref to 2+-1 post=0
>>>>> 091216 17:45:55 3103 Pander trying to connect to lvl 0 atlas-bkp1.cs.wisc.edu:3121
>>>>> 091216 17:45:55 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
>>>>> 091216 17:45:55 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
>>>>> 091216 17:45:55 3103 ManTree: Now connected to 3 root node(s)
>>>>> 091216 17:45:55 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
>>>>> 091216 17:45:55 3103 Dispatch manager.0:[log in to unmask] for try dlen=3
>>>>> 091216 17:45:55 3103 manager.0:[log in to unmask] do_Try:
>>>>> 091216 17:45:55 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.95
>>>>> 091216 17:45:55 3103 Manager: manager.0:[log in to unmask] removed; redirected
>>>>> 091216 17:46:04 3103 Pander trying to connect to lvl 0 atlas-bkp1.cs.wisc.edu:3121
>>>>> 091216 17:46:04 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
>>>>> 091216 17:46:04 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
>>>>> 091216 17:46:04 3103 ManTree: Now connected to 3 root node(s)
>>>>> 091216 17:46:04 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
>>>>> 091216 17:46:04 3103 Dispatch manager.0:[log in to unmask] for try dlen=3
>>>>> 091216 17:46:04 3103 Protocol: No buffers to serve atlas-bkp1.cs.wisc.edu
>>>>> 091216 17:46:04 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.96
>>>>> 091216 17:46:04 3103 Manager: manager.0:[log in to unmask] removed; insufficient buffers
>>>>> 091216 17:46:11 3103 Dispatch manager.0:[log in to unmask] for state dlen=169
>>>>> 091216 17:46:11 3103 manager.0:[log in to unmask] XrdLink: Setting ref to 1+1 post=0
>>>>>
>>>>> Thanks
>>>>> Wen
>>>>>
>>>>> On Thu, Dec 17, 2009 at 12:10 AM, wen guan <[log in to unmask]> wrote:
>>>>>>
>>>>>> Hi Andy,
>>>>>>
>>>>>>> OK, I understand. As for stalling, too many nodes were deemed to be in
>>>>>>> trouble for the manager to allow service resumption.
>>>>>>>
>>>>>>> Please make sure that all of the nodes in the cluster receive the new
>>>>>>> cmsd, as they will drop off with the old one and you'll see the same
>>>>>>> kind of activity. Perhaps the best way to know that you succeeded in
>>>>>>> putting everything in sync is to start with 63 data nodes plus one
>>>>>>> supervisor. Once all connections are established, adding an additional
>>>>>>> server should simply send it to the supervisor.
>>>>>>
>>>>>> I will do it.
>>>>>> You said to start 63 data servers and one supervisor. Does that mean the
>>>>>> supervisor is managed using the same policy? If there are 64 data
>>>>>> servers that connected before the supervisor, will the supervisor be
>>>>>> dropped? Or does the supervisor have high priority to be added to the
>>>>>> manager? I mean, if there are already 64 data servers and a supervisor
>>>>>> comes in, will the supervisor be accepted and a data server be
>>>>>> redirected to the supervisor?
>>>>>>
>>>>>> Thanks
>>>>>> Wen
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> But when I tried to xrdcp a file to it, it doesn't respond. In
>>>>>>> atlas-bkp1-xrd.log.20091213, it always prints "stalling client for 10
>>>>>>> sec". But in cms.log, I can't find any message about the file.
>>>>>>>
>>>>>>>> I don't see why you say it doesn't work. With the debugging level set
>>>>>>>> so high, the noise may make it look like something is going wrong, but
>>>>>>>> that isn't necessarily the case.
>>>>>>>>
>>>>>>>> 1) The 'too many subscribers' is correct. The manager was simply
>>>>>>>> redirecting them because there were already 64 servers. However, in
>>>>>>>> your case the supervisor wasn't started until almost 30 minutes after
>>>>>>>> everyone else (i.e., 10:42 AM). Why was that? I'm not surprised about
>>>>>>>> the flurry of messages with a critical component missing for 30 minutes.
>>>>>>>
>>>>>>> Because the manager is a 64-bit machine but the supervisor is a 32-bit
>>>>>>> machine, I had to recompile it. At that time, I was interrupted by
>>>>>>> something else.
>>>>>>>
>>>>>>>> 2) Once the supervisor started, it started accepting the redirected
>>>>>>>> servers.
>>>>>>>>
>>>>>>>> 3) Then 10 seconds later (10:42:10) the supervisor was restarted.
>>>>>>>> So, that would cause a flurry of activity to occur, as there is no
>>>>>>>> backup supervisor to take over.
>>>>>>>>
>>>>>>>> 4) This happened again at 10:42:34 AM and then again at 10:48:49. Is
>>>>>>>> the supervisor crashing? Is there a core file?
>>>>>>>>
>>>>>>>> 5) At 11:11 AM the manager restarted. Again, is there a core file here
>>>>>>>> or was this a manual action?
>>>>>>>>
>>>>>>>> During the course of all of this, all connected nodes were operating
>>>>>>>> properly and files were being located.
>>>>>>>>
>>>>>>>> So, the two big questions are:
>>>>>>>>
>>>>>>>> a) Why was the supervisor not started until 30 minutes after the
>>>>>>>> system was started?
>>>>>>>>
>>>>>>>> b) Is there an explanation of the restarts? If this was a crash then
>>>>>>>> we need a core file to figure out what happened.
>>>>>>>
>>>>>>> It's not a crash. There are some reasons why I restarted some daemons.
>>>>>>> (1) I thought that if a data server tried many times to connect to a
>>>>>>> redirector but failed, the data server would not try to connect to the
>>>>>>> redirector again. The supervisor was missing for a long time, so maybe
>>>>>>> some data servers would not try to connect to atlas-bkp1 again. To
>>>>>>> reactivate these data servers, I restarted the servers.
>>>>>>> (2) When I tried to xrdcp, it hung for a long time. I thought maybe the
>>>>>>> manager was affected by something else, so I restarted the manager to
>>>>>>> see whether a restart would make the xrdcp work.
>>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>> Wen
>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>> Sent: Wednesday, December 16, 2009 9:38 AM
>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Andrew,
>>>>>>>>
>>>>>>>> It still doesn't work.
>>>>>>>> The log files are in higgs03.cs.wisc.edu/wguan/. The names are *.20091216.
>>>>>>>> The manager complains there are too many subscribers and then removes
>>>>>>>> nodes.
>>>>>>>>
>>>>>>>> (*)
>>>>>>>> Add server.10040:[log in to unmask] redirected; too many subscribers.
>>>>>>>>
>>>>>>>> Wen
>>>>>>>>
>>>>>>>> On Wed, Dec 16, 2009 at 4:25 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>
>>>>>>>>> Hi Wen,
>>>>>>>>>
>>>>>>>>> It will be easier for me to retrofit it, as the changes were pretty
>>>>>>>>> minor. Please lift the new XrdCmsNode.cc file from
>>>>>>>>>
>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd
>>>>>>>>>
>>>>>>>>> Andy
>>>>>>>>>
>>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>> Sent: Tuesday, December 15, 2009 5:12 PM
>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Andy,
>>>>>>>>>
>>>>>>>>> I can switch to 20091104-1102; then you don't need to patch another
>>>>>>>>> version. How can I download v20091104-1102?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Wen
>>>>>>>>>
>>>>>>>>> On Wed, Dec 16, 2009 at 12:52 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Wen,
>>>>>>>>>>
>>>>>>>>>> Ah yes, I see that now. The file I gave you is based on v20091104-1102.
>>>>>>>>>> Let me see if I can retrofit the patch for you.
>>>>>>>>>>
>>>>>>>>>> Andy
>>>>>>>>>>
>>>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>> Sent: Tuesday, December 15, 2009 1:04 PM
>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Andy,
>>>>>>>>>>
>>>>>>>>>> Which xrootd version are you using? My XrdCmsConfig.hh is different;
>>>>>>>>>> it was downloaded from http://xrootd.slac.stanford.edu/download/20091028-1003/.
>>>>>>>>>>
>>>>>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsNode.cc
>>>>>>>>>> 6fb3ae40fe4e10bdd4d372818a341f2c  src/XrdCms/XrdCmsNode.cc
>>>>>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsConfig.hh
>>>>>>>>>> 7d57753847d9448186c718f98e963cbe  src/XrdCms/XrdCmsConfig.hh
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Wen
>>>>>>>>>>
>>>>>>>>>> On Tue, Dec 15, 2009 at 10:50 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>
>>>>>>>>>>> I just compiled on Linux and it was clean. Something is really wrong
>>>>>>>>>>> with your source files, specifically XrdCmsConfig.hh.
>>>>>>>>>>>
>>>>>>>>>>> The MD5 checksums on the relevant files are:
>>>>>>>>>>>
>>>>>>>>>>> MD5 (XrdCmsNode.cc) = 6fb3ae40fe4e10bdd4d372818a341f2c
>>>>>>>>>>>
>>>>>>>>>>> MD5 (XrdCmsConfig.hh) = 4a7d655582a7cd43b098947d0676924b
>>>>>>>>>>>
>>>>>>>>>>> Andy
>>>>>>>>>>>
>>>>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>>> Sent: Tuesday, December 15, 2009 4:24 AM
>>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Andy,
>>>>>>>>>>>
>>>>>>>>>>> No problem, and thanks for the fix. But it cannot be compiled. The
>>>>>>>>>>> version I am using is http://xrootd.slac.stanford.edu/download/20091028-1003/.
>>>>>>>>>>>
>>>>>>>>>>> Making cms component...
>>>>>>>>>>> Compiling XrdCmsNode.cc
>>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Chmod(XrdCmsRRData&)':
>>>>>>>>>>> XrdCmsNode.cc:268: error: `fsExec' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:268: warning: unused variable 'fsExec'
>>>>>>>>>>> XrdCmsNode.cc:269: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>>> XrdCmsNode.cc:273: error: `fsFail' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:273: warning: unused variable 'fsFail'
>>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mkdir(XrdCmsRRData&)':
>>>>>>>>>>> XrdCmsNode.cc:600: error: `fsExec' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:600: warning: unused variable 'fsExec'
>>>>>>>>>>> XrdCmsNode.cc:601: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>>> XrdCmsNode.cc:605: error: `fsFail' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:605: warning: unused variable 'fsFail'
>>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mkpath(XrdCmsRRData&)':
>>>>>>>>>>> XrdCmsNode.cc:640: error: `fsExec' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:640: warning: unused variable 'fsExec'
>>>>>>>>>>> XrdCmsNode.cc:641: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>>> XrdCmsNode.cc:645: error: `fsFail' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:645: warning: unused variable 'fsFail'
>>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mv(XrdCmsRRData&)':
>>>>>>>>>>> XrdCmsNode.cc:704: error: `fsExec' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:704: warning: unused variable 'fsExec'
>>>>>>>>>>> XrdCmsNode.cc:705: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>>> XrdCmsNode.cc:709: error: `fsFail' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:709: warning: unused variable 'fsFail'
>>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Rm(XrdCmsRRData&)':
>>>>>>>>>>> XrdCmsNode.cc:831: error: `fsExec' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:831: warning: unused variable 'fsExec'
>>>>>>>>>>> XrdCmsNode.cc:832: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>>> XrdCmsNode.cc:836: error: `fsFail' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:836: warning: unused variable 'fsFail'
>>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Rmdir(XrdCmsRRData&)':
>>>>>>>>>>> XrdCmsNode.cc:873: error: `fsExec' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:873: warning: unused variable 'fsExec'
>>>>>>>>>>> XrdCmsNode.cc:874: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>>> XrdCmsNode.cc:878: error: `fsFail' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:878: warning: unused variable 'fsFail'
>>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Trunc(XrdCmsRRData&)':
>>>>>>>>>>> XrdCmsNode.cc:1377: error: `fsExec' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:1377: warning: unused variable 'fsExec'
>>>>>>>>>>> XrdCmsNode.cc:1378: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>>> XrdCmsNode.cc:1382: error: `fsFail' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:1382: warning: unused variable 'fsFail'
>>>>>>>>>>> XrdCmsNode.cc: At global scope:
>>>>>>>>>>> XrdCmsNode.cc:1524: error: no `int XrdCmsNode::fsExec(XrdOucProg*, char*, char*)' member function declared in class `XrdCmsNode'
>>>>>>>>>>> XrdCmsNode.cc: In member function `int XrdCmsNode::fsExec(XrdOucProg*, char*, char*)':
>>>>>>>>>>> XrdCmsNode.cc:1533: error: `fsL2PFail1' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:1533: warning: unused variable 'fsL2PFail1'
>>>>>>>>>>> XrdCmsNode.cc:1537: error: `fsL2PFail2' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:1537: warning: unused variable 'fsL2PFail2'
>>>>>>>>>>> XrdCmsNode.cc: At global scope:
>>>>>>>>>>> XrdCmsNode.cc:1553: error: no `const char* XrdCmsNode::fsFail(const char*, const char*, const char*, int)' member function declared in class `XrdCmsNode'
>>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::fsFail(const char*, const char*, const char*, int)':
>>>>>>>>>>> XrdCmsNode.cc:1559: error: `fsL2PFail1' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:1559: warning: unused variable 'fsL2PFail1'
>>>>>>>>>>> XrdCmsNode.cc:1560: error: `fsL2PFail2' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:1560: warning: unused variable 'fsL2PFail2'
>>>>>>>>>>> XrdCmsNode.cc: In static member function `static int XrdCmsNode::isOnline(char*, int)':
>>>>>>>>>>> XrdCmsNode.cc:1608: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>>> make[4]: *** [../../obj/XrdCmsNode.o] Error 1
>>>>>>>>>>> make[3]: *** [Linuxall] Error 2
>>>>>>>>>>> make[2]: *** [all] Error 2
>>>>>>>>>>> make[1]: *** [XrdCms] Error 2
>>>>>>>>>>> make: *** [all] Error 2
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Wen
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Dec 15, 2009 at 2:08 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>
>>>>>>>>>>>> I have developed a permanent fix. You will find the source files in
>>>>>>>>>>>>
>>>>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>>>>>
>>>>>>>>>>>> There are three files: XrdCmsCluster.cc, XrdCmsNode.cc, and XrdCmsProtocol.cc.
>>>>>>>>>>>>
>>>>>>>>>>>> Please do a source replacement and recompile. Unfortunately, the
>>>>>>>>>>>> cmsd will need to be replaced on each node regardless of role. My
>>>>>>>>>>>> apologies for the disruption. Please let me know how it goes.
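>>>>>>>>>>>>
>>>>>>>>>>>> A sketch of the replacement procedure, assuming the source tree is
>>>>>>>>>>>> in ./xrootd and has already been configured (adjust the paths to
>>>>>>>>>>>> your setup):
>>>>>>>>>>>>
>>>>>>>>>>>>    cd xrootd/src/XrdCms
>>>>>>>>>>>>    # overwrite the three files with the fixed versions
>>>>>>>>>>>>    wget -O XrdCmsCluster.cc http://www.slac.stanford.edu/~abh/cmsd/XrdCmsCluster.cc
>>>>>>>>>>>>    wget -O XrdCmsNode.cc http://www.slac.stanford.edu/~abh/cmsd/XrdCmsNode.cc
>>>>>>>>>>>>    wget -O XrdCmsProtocol.cc http://www.slac.stanford.edu/~abh/cmsd/XrdCmsProtocol.cc
>>>>>>>>>>>>    cd ../..
>>>>>>>>>>>>    make
>>>>>>>>>>>>    # then redeploy the new cmsd binary on every node and restart it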
>>>>>>>>>>>>
>>>>>>>>>>>> Andy
>>>>>>>>>>>>
>>>>>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>>>> Sent: Sunday, December 13, 2009 7:04 AM
>>>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>> I used the new cmsd on the atlas-bkp1 manager, but it's still
>>>>>>>>>>>> dropping nodes, and in the supervisor's log I cannot find any data
>>>>>>>>>>>> server registering to it.
>>>>>>>>>>>>
>>>>>>>>>>>> The new logs are in http://higgs03.cs.wisc.edu/wguan/*.20091213.
>>>>>>>>>>>> The manager was patched at 091213 08:38:15.
>>>>>>>>>>>>
>>>>>>>>>>>> Wen
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Wen
>>>>>>>>>>>>>
>>>>>>>>>>>>> You will find the source replacement at:
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's XrdCmsCluster.cc and it replaces xrootd/src/XrdCms/XrdCmsCluster.cc
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm stepping out for a couple of hours but will be back to see how
>>>>>>>>>>>>> things went. Sorry for the issues :-(
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I prefer a source replacement. Then I can compile it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I can do one of two things here:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) Supply a source replacement and then you would recompile, or
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2) Give me the uname -a of where the cmsd will run and I'll
>>>>>>>>>>>>>>> supply a binary replacement for you.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Your choice.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Andrew
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The problem is found. Great. Thanks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Where can I find the patched cmsd?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I found the problem. It looks like a regression from way back
>>>>>>>>>>>>>>>>> when: there is a missing flag on the redirect. This will
>>>>>>>>>>>>>>>>> require a patched cmsd, but you only need to replace the
>>>>>>>>>>>>>>>>> redirector's cmsd, as this only affects the redirector. How
>>>>>>>>>>>>>>>>> would you like to proceed?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> It doesn't work. The atlas-bkp1 manager is still dropping
>>>>>>>>>>>>>>>>>> nodes. In the supervisor, I still haven't seen any data
>>>>>>>>>>>>>>>>>> server registered. I said "I updated the ntp" because you
>>>>>>>>>>>>>>>>>> said "the log timestamps do not overlap".
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Do you mean that everything is now working? It could be that
>>>>>>>>>>>>>>>>>>> you removed the xrd.timeout directive. That really could
>>>>>>>>>>>>>>>>>>> cause problems.
>>>>>>>>>>>>>>>>>>> As for the delays, that is normal when the redirector thinks
>>>>>>>>>>>>>>>>>>> something is going wrong. The strategy is to delay clients
>>>>>>>>>>>>>>>>>>> until it can get back to a stable configuration. This usually
>>>>>>>>>>>>>>>>>>> prevents jobs from crashing during stressful periods.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I restarted it to do the supervisor test, and also because
>>>>>>>>>>>>>>>>>>>> the xrootd manager frequently doesn't respond. (*) is the
>>>>>>>>>>>>>>>>>>>> cms.log; the file select is delayed again and again. After
>>>>>>>>>>>>>>>>>>>> a restart, everything is fine. Now I am trying to find a
>>>>>>>>>>>>>>>>>>>> clue about it.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (*)
>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> There is no core file. I copied new copies of the logs to
>>>>>>>>>>>>>>>>>>>> the link below.
>>>>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I see in the server log that it is restarting often. Could
>>>>>>>>>>>>>>>>>>>>> you take a look on c193 to see if you have any core files?
>>>>>>>>>>>>>>>>>>>>> Also, please make sure that core files are enabled, as
>>>>>>>>>>>>>>>>>>>>> Linux defaults the size to 0. The first step here is to
>>>>>>>>>>>>>>>>>>>>> find out why your servers are restarting.
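>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> A quick way to check this (a sketch; run it in the shell
>>>>>>>>>>>>>>>>>>>>> that starts the daemons, before restarting them):
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>    ulimit -c             # prints the core file size limit; 0 means cores are disabled
>>>>>>>>>>>>>>>>>>>>>    ulimit -c unlimited   # enables core files for anything started from this shell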
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The logs can be found here. From the logs you can see that
>>>>>>>>>>>>>>>>>>>>>> the atlas-bkp1 manager is dropping the nodes that try to
>>>>>>>>>>>>>>>>>>>>>> connect to it, again and again.
>>>>>>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Wen, Could you start everything up and provide me a
>>>>>>>>>>>>>>>>>>>>>>> pointer to the manager log file, supervisor log file, and
>>>>>>>>>>>>>>>>>>>>>>> one data server log file, all of which cover the same
>>>>>>>>>>>>>>>>>>>>>>> time frame (from start to some point where you think
>>>>>>>>>>>>>>>>>>>>>>> things are working or not)? That way I can see what is
>>>>>>>>>>>>>>>>>>>>>>> happening. At the moment I only see two "bad" things in
>>>>>>>>>>>>>>>>>>>>>>> the config file:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a
>>>>>>>>>>>>>>>>>>>>>>> manager, but you claim, via the all.manager directive,
>>>>>>>>>>>>>>>>>>>>>>> that there are three (bkp2 and bkp3). While it should
>>>>>>>>>>>>>>>>>>>>>>> work, the log file will be dense with error messages.
>>>>>>>>>>>>>>>>>>>>>>> Please correct this to be consistent and make it easier
>>>>>>>>>>>>>>>>>>>>>>> to see real errors.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> This is not a problem for me, because this config is used
>>>>>>>>>>>>>>>>>>>>>> on the data servers. On the manager, I updated the
>>>>>>>>>>>>>>>>>>>>>> "if atlas-bkp1.cs.wisc.edu" clause to atlas-bkp2 and so
>>>>>>>>>>>>>>>>>>>>>> on. This is a historical artifact: at first only
>>>>>>>>>>>>>>>>>>>>>> atlas-bkp1 was used; atlas-bkp2 and atlas-bkp3 were added
>>>>>>>>>>>>>>>>>>>>>> later.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> 2) Please use cms.space, not olb.space (for historical
>>>>>>>>>>>>>>>>>>>>>>> reasons the latter is still accepted and overrides the
>>>>>>>>>>>>>>>>>>>>>>> former, but that will soon end), and please use only one
>>>>>>>>>>>>>>>>>>>>>>> (the config file uses both directives).
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Yes, I should remove that line; in fact cms.space is in
>>>>>>>>>>>>>>>>>>>>>> the cfg too.
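>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> For reference, a consistent version of those lines might
>>>>>>>>>>>>>>>>>>>>>> look like this (a sketch only; port 3121 is taken from the
>>>>>>>>>>>>>>>>>>>>>> logs above, the host pattern and the cms.space values are
>>>>>>>>>>>>>>>>>>>>>> illustrative and would need to be tuned, and the olb.space
>>>>>>>>>>>>>>>>>>>>>> line would be dropped):
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>    all.manager atlas-bkp1.cs.wisc.edu:3121
>>>>>>>>>>>>>>>>>>>>>>    all.manager atlas-bkp2.cs.wisc.edu:3121
>>>>>>>>>>>>>>>>>>>>>>    all.manager atlas-bkp3.cs.wisc.edu:3121
>>>>>>>>>>>>>>>>>>>>>>    all.role manager if atlas-bkp*.cs.wisc.edu
>>>>>>>>>>>>>>>>>>>>>>    cms.space min 2% 5%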
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> The xrootd has an internal mechanism to connect servers
>>>>>>>>>>>>>>>>>>>>>>> with supervisors to allow for maximum reliability. You
>>>>>>>>>>>>>>>>>>>>>>> cannot change that algorithm and there is no need to do
>>>>>>>>>>>>>>>>>>>>>>> so. You should *never* tell anyone to directly connect to
>>>>>>>>>>>>>>>>>>>>>>> a supervisor. If you do, you will likely get unreachable
>>>>>>>>>>>>>>>>>>>>>>> nodes.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> As for dropping data servers, it would appear to me,
>>>>>>>>>>>>>>>>>>>>>>> given the flurry of such activity, that something either
>>>>>>>>>>>>>>>>>>>>>>> crashed or was restarted. That's why it would be good to
>>>>>>>>>>>>>>>>>>>>>>> see the complete log of each one of the entities.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I read the document and wrote a config file
>>>>>>>>>>>>>>>>>>>>>>>> (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>>>>>>>>>>>>>>>>>> Using my conf, I can see the manager dispatching
>>>>>>>>>>>>>>>>>>>>>>>> messages to the supervisor, but I cannot see any data
>>>>>>>>>>>>>>>>>>>>>>>> server trying to connect to the supervisor. At the same
>>>>>>>>>>>>>>>>>>>>>>>> time, in the manager's log, I can see some data servers
>>>>>>>>>>>>>>>>>>>>>>>> being dropped.
>>>>>>>>>>>>>>>>>>>>>>>> How does xrootd decide which data server will connect to
>>>>>>>>>>>>>>>>>>>>>>>> the supervisor? Should I specify some data servers to
>>>>>>>>>>>>>>>>>>>>>>>> connect to the supervisor?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> (*) supervisor log
>>>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
>>>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find failed for state /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> (*) manager log
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5693644MB MinFR=57218MB Util=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094 do_Space: 5696231MB free; 0% util
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.10585:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from [log in to unmask]
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running ?:[log in to unmask] inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79 attached to poller 2; num=22
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add server.21739:[log in to unmask] bumps server.15905:[log in to unmask]:1094 #63
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node: server.15905:[log in to unmask]:1094 dropped.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved >>>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 to cluster; >>>>>>>>>>>>>>>>>>>>>>>> id=63.78; >>>>>>>>>>>>>>>>>>>>>>>> num=64; >>>>>>>>>>>>>>>>>>>>>>>> min=51 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>> TSpace=5587GB >>>>>>>>>>>>>>>>>>>>>>>> NumFS=1 >>>>>>>>>>>>>>>>>>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>> adding >>>>>>>>>>>>>>>>>>>>>>>> path: >>>>>>>>>>>>>>>>>>>>>>>> w >>>>>>>>>>>>>>>>>>>>>>>> /atlas >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>> do_Space: 5721854MB free; 0% util >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 logged in. >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve >>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>> c187.chtc.wisc.edu; connection reset by peer >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node >>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>> 60 >>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 node 63.78 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>> 79 detached from poller 2; num=21 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data >>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>> c177.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>> FD=16 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 node 0.3 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>> 16 detached from poller 2; num=20 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data >>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>> c179.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>> FD=21 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>> server.17065:[log in to unmask]:1094 node 1.4 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>> server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>> server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>> 21 >>>>>>>>>>>>>>>>>>>>>>>> detached from poller 1; num=21 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to >>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to >>>>>>>>>>>>>>>>>>>>>>>> redirector.15656:14@atlas-bkp2 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data >>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>> c182.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>> FD=19 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 node 7.10 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>> 19 detached from poller 2; num=19 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data >>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>> c178.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>> FD=15 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 node 9.12 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>> 15 detached from poller 1; num=20 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data >>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>> c181.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>> FD=17 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 node 5.8 >>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.13984:[log in to unmask] XrdPoll: FD 17 detached from poller 0; num=21
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.23711:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c183.chtc.wisc.edu FD=22
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.23711:[log in to unmask]:1094 node 8.11
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.27735:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.27735:[log in to unmask] XrdPoll: FD 22 detached from poller 2; num=18
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c184.chtc.wisc.edu FD=20
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.4131:[log in to unmask]:1094 node 3.6
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.26787:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.26787:[log in to unmask] XrdPoll: FD 20 detached from poller 0; num=20
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.10585:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c185.chtc.wisc.edu FD=23
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.10585:[log in to unmask]:1094 node 6.9
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.8524:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll: FD 23 detached from poller 0; num=19
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.20264:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c180.chtc.wisc.edu FD=18
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.20264:[log in to unmask]:1094 node 4.7
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.14636:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.14636:[log in to unmask] XrdPoll: FD 18 detached from poller 1; num=19
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.1656:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c186.chtc.wisc.edu FD=24
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.1656:[log in to unmask]:1094 node 2.5
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.7849:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll: FD 24 detached from poller 1; num=18
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> To go past 64 data servers you will need to set up one or more
>>>>>>>>>>>>>>>>>>>>>>>>> supervisors. This does not logically change the configuration you
>>>>>>>>>>>>>>>>>>>>>>>>> currently have. You only need to configure one or more *new* servers
>>>>>>>>>>>>>>>>>>>>>>>>> (or at least xrootd processes) whose role is supervisor. Ideally they
>>>>>>>>>>>>>>>>>>>>>>>>> should run on separate machines for reliability, but they can run on
>>>>>>>>>>>>>>>>>>>>>>>>> the manager node as long as you give each one a unique instance name
>>>>>>>>>>>>>>>>>>>>>>>>> (i.e., the -n option).
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> The front part of the cmsd reference explains how to do this:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Is there a way to configure xrootd with more than 65 machines? I
>>>>>>>>>>>>>>>>>>>>>>>>>> used the configuration below, but it doesn't work. Should I
>>>>>>>>>>>>>>>>>>>>>>>>>> configure some machines to be supervisors?
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Wen
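[Editor's sketch, for readers following the thread] The supervisor setup Andy describes amounts to adding a supervisor role to the shared cmsd/xrootd configuration file, per the cms_config reference linked above. Below is a minimal sketch, assuming atlas-bkp1.cs.wisc.edu is the redirector, higgs07.cs.wisc.edu is the supervisor, and the default cmsd port 3121; these hosts are inferred from this thread, not taken from Wen's actual xrdcluster.cfg, which is not reproduced here.

    # Sketch of a cluster config with a supervisor tier (assumed hosts/port).
    # Every node -- manager, supervisor, and data servers -- can share this file.
    all.manager atlas-bkp1.cs.wisc.edu:3121

    # Select the role by host; any host matching neither clause
    # falls through to the plain data-server role.
    if atlas-bkp1.cs.wisc.edu
       all.role manager
    else if higgs07.cs.wisc.edu
       all.role supervisor
    else
       all.role server
    fi

If a supervisor instead shares a machine with the manager, as Andy notes, each daemon pair needs its own instance name via -n so their log and admin paths do not collide, for example (instance name and config path are illustrative):

    cmsd   -n super -c /opt/xrootd/etc/xrdcluster.cfg &
    xrootd -n super -c /opt/xrootd/etc/xrdcluster.cfg &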