Wen,

I was wondering if you finally did succeed in getting >64 data server nodes working using a supervisor, etc.

thanks,
Rob

On Dec 18, 2009, at 8:58 AM, wen guan wrote:

Hi Andy,

I am sure I am using the right cmsd code. Today I compiled and reinstalled all of cmsd and xrootd, but it still doesn't work. I will create an account for you so you can log in to these machines and check what happened.

In fact, today while doing some restarts I saw some machines register themselves to higgs07. Unfortunately the logs were cleaned during the reinstall.

I also found that the supervisor goes to the "suspend" state a while after it is started. Could that cause the supervisor to fail to get some information?

Wen

On Fri, Dec 18, 2009 at 3:05 AM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Wen,

Something is really going wrong with your data servers. For instance, c109 is quite happy from midnight to 7:23am. Then it dropped the connection. Then it reconnected at 7:24:03 and was again happy until 12:37:20, but here it reported that its xrootd died, and then the cmsd promptly killed its connection afterward. This appears as if someone restarted the xrootd followed by the cmsd on c109. It continued like this until 12:43:00 (i.e., connect, suspend, die, repeat). All your servers, in fact, started doing this between 12:36:41 and 12:42:51, causing a massive swap of servers. New servers were added, and old ones reconnecting were redirected to the supervisor. However, it would appear that those machines could not connect there, as they kept coming back to atlas-bkp1. I can't tell you anything about what was happening on higgs07. As far as I can tell it was happily connected to the redirector cmsd. The reason is that there is no log for higgs07 on the web site for 12/17 starting at midnight. Perhaps you can put one there.

So,

1) Are you *absolutely* sure that *all* your (data, etc.) servers are running the corrected cmsd?

2) Please provide the higgs07 log for 12/17.

3) Please provide logs for a sampling of data servers, say c0109, c094, higgs15, and higgs13, for 12/17 between 12:00:00 and 15:44.

I have never seen a situation like yours, so something is very wrong here. In the meantime I will add more debugging information to the redirector and supervisor and let you know when that is available.

Andy

----- Original Message ----- From: "wen guan" <[log in to unmask]>
To: "Fabrizio Furano" <[log in to unmask]>
Cc: "Andrew Hanushevsky" <[log in to unmask]>; <[log in to unmask]>
Sent: Thursday, December 17, 2009 3:12 PM
Subject: Re: xrootd with more than 65 machines

Hi Fabrizio,

This is the xrdcp debug message.

ClientHeader.header.dlen = 41
=================== END CLIENT HEADER DUMPING ===================

091217 16:47:54 15961 Xrd: WriteRaw: Writing 24 bytes to physical connection
091217 16:47:54 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:47:54 15961 Xrd: WriteRaw: Writing 41 bytes to physical connection
091217 16:47:54 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:47:54 15961 Xrd: ReadPartialAnswer: Reading a XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: sid: 1, IsAttn: 0, substreamid: 0
091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4 bytes) from substream 0
091217 16:47:54 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
091217 16:47:54 15961 Xrd: BuildMessage: posting id 1
091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8 bytes).
091217 16:47:54 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094

======== DUMPING SERVER RESPONSE HEADER ========
ServerHeader.streamid = 0x01 0x00
ServerHeader.status = kXR_wait (4005)
ServerHeader.dlen = 4
========== END DUMPING SERVER HEADER ===========

091217 16:47:54 15961 Xrd: ReadPartialAnswer: Server [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
091217 16:47:54 15961 Xrd: CheckErrorStatus: Server [atlas-bkp1.cs.wisc.edu:1094] requested 10 seconds of wait
091217 16:48:04 15961 Xrd: DumpPhyConn: Phyconn entry, [log in to unmask]:1094', LogCnt=1 Valid
091217 16:48:04 15961 Xrd: SendGenCommand: Sending command Open

================= DUMPING CLIENT REQUEST HEADER =================
ClientHeader.streamid = 0x01 0x00
ClientHeader.requestid = kXR_open (3010)
ClientHeader.open.mode = 0x00 0x00
ClientHeader.open.options = 0x40 0x04
ClientHeader.open.reserved = 0 repeated 12 times
ClientHeader.header.dlen = 41
=================== END CLIENT HEADER DUMPING ===================

091217 16:48:04 15961 Xrd: WriteRaw: Writing 24 bytes to physical connection
091217 16:48:04 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:48:04 15961 Xrd: WriteRaw: Writing 41 bytes to physical connection
091217 16:48:04 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:48:04 15961 Xrd: ReadPartialAnswer: Reading a XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: sid: 1, IsAttn: 0, substreamid: 0
091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4 bytes) from substream 0
091217 16:48:04 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
091217 16:48:04 15961 Xrd: BuildMessage: posting id 1
091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8 bytes).
091217 16:48:04 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094

======== DUMPING SERVER RESPONSE HEADER ========
ServerHeader.streamid = 0x01 0x00
ServerHeader.status = kXR_wait (4005)
ServerHeader.dlen = 4
========== END DUMPING SERVER HEADER ===========

091217 16:48:04 15961 Xrd: ReadPartialAnswer: Server [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
091217 16:48:04 15961 Xrd: CheckErrorStatus: Server [atlas-bkp1.cs.wisc.edu:1094] requested 10 seconds of wait
091217 16:48:14 15961 Xrd: SendGenCommand: Sending command Open

================= DUMPING CLIENT REQUEST HEADER =================
ClientHeader.streamid = 0x01 0x00
ClientHeader.requestid = kXR_open (3010)
ClientHeader.open.mode = 0x00 0x00
ClientHeader.open.options = 0x40 0x04
ClientHeader.open.reserved = 0 repeated 12 times
ClientHeader.header.dlen = 41
=================== END CLIENT HEADER DUMPING ===================

091217 16:48:14 15961 Xrd: WriteRaw: Writing 24 bytes to physical connection
091217 16:48:14 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:48:14 15961 Xrd: WriteRaw: Writing 41 bytes to physical connection
091217 16:48:14 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:48:14 15961 Xrd: ReadPartialAnswer: Reading a XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: sid: 1, IsAttn: 0, substreamid: 0
091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4 bytes) from substream 0
091217 16:48:14 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
091217 16:48:14 15961 Xrd: BuildMessage: posting id 1
091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8 bytes).
091217 16:48:14 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094

======== DUMPING SERVER RESPONSE HEADER ========
ServerHeader.streamid = 0x01 0x00
ServerHeader.status = kXR_wait (4005)
ServerHeader.dlen = 4
========== END DUMPING SERVER HEADER ===========

091217 16:48:14 15961 Xrd: ReadPartialAnswer: Server [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
091217 16:48:14 15961 Xrd: SendGenCommand: Max time limit elapsed for request kXR_open. Aborting command.
Last server error 10000 ('')
Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131

Wen

On Thu, Dec 17, 2009 at 11:27 PM, Fabrizio Furano <[log in to unmask]> wrote:

Hi Wen,

I see that you are getting error 10000, which means "generic error before any interaction". Could you please run the same command with debug level 3 and post the log showing the same kind of issue? Something like

xrdcp -d 3 ....

Most likely the problem is different this time. I may be wrong here, but a possible reason for that error is that the servers require authentication and xrdcp does not find some library in the LD_LIBRARY_PATH.

Fabrizio

wen guan wrote:

Hi Andy,

I put new logs on the web.

It still doesn't work. I cannot copy files in or out.

It seems the xrootd daemon at atlas-bkp1 hasn't talked with the cmsd. Normally, when xrootd tries to copy a file, I should see "do_Select: filename" in the cms.log. But in this cms.log there is nothing from atlas-bkp1.

(*)
[root@atlas-bkp1 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
Last server error 10000 ('')
Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
[root@atlas-bkp1 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123133

Wen
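As a side note, a minimal way to check for that xrootd-to-cmsd handshake (a sketch; the cms.log path is taken from the tail command used later in this thread, so adjust to your layout):

    # Every open that xrootd forwards to its cmsd should leave a do_Select
    # line naming the requested file; silence here means the two daemons
    # on this host are not talking to each other.
    grep do_Select /var/log/xrootd/cms.log | tail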
On Thu, Dec 17, 2009 at 10:54 PM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Wen,

I reviewed the log file. Other than the odd redirect of c131 at 17:47:25, which I can't comment on because its logs on the web site do not overlap with the manager or supervisor, I can't say much of anything unless all the logs include the full time in question. Can you provide me with inclusive logs?

atlas-bkp1 cms: 17:20:57 to 17:42:19, xrd: 17:20:57 to 17:40:57
higgs07 cms & xrd 17:22:33 to 17:42:33
c131 cms & xrd 17:31:57 to 17:47:28

That said, it certainly looks like things were working and files were being accessed and discovered on all the machines. You even were able to open /atlas/xrootd/users/wguan/test/test98123313, though not /atlas/xrootd/users/wguan/test/test123131. The other issue is that you did not specify a stable adminpath, and the adminpath defaults to /tmp. If you have a "cleanup" script that runs periodically for /tmp, then eventually your cluster will go catatonic as important (but not often used) files are deleted by that script. Could you please find a stable home for the adminpath?

I reran my tests here and things worked as expected. I will ramp up some more tests. So, what is your status today?

Andy
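A stable adminpath is a one-line change in the config (a sketch; /var/spool/xrootd is only an example of a persistent local directory, not a value from this thread):

    all.adminpath /var/spool/xrootd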
----- Original Message ----- From: "wen guan" <[log in to unmask]>
To: "Andrew Hanushevsky" <[log in to unmask]>
Cc: <[log in to unmask]>
Sent: Thursday, December 17, 2009 5:05 AM
Subject: Re: xrootd with more than 65 machines

Hi Andy,

Yes. I am using the file downloaded from http://www.slac.stanford.edu/~abh/cmsd/, which I compiled yesterday. I just now compiled it again and compared it with the one I compiled yesterday; they are the same (same md5sum).

Wen

On Thu, Dec 17, 2009 at 2:09 AM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Wen,

If c131 cannot connect, then either c131 does not have the new cmsd or atlas-bkp1 does not have the new cmsd, as that is what would happen if either were true. Looking at the log on c131, it would appear that atlas-bkp1 is still using the old cmsd, as the response data length is wrong. Could you verify, please?

Andy

----- Original Message ----- From: "wen guan" <[log in to unmask]>
To: "Andrew Hanushevsky" <[log in to unmask]>
Cc: <[log in to unmask]>
Sent: Wednesday, December 16, 2009 3:58 PM
Subject: Re: xrootd with more than 65 machines

Hi Andy,

I tried it, but there are still some problems. I put the logs in higgs03.cs.wisc.edu/wguan/

In my test, c131 is the 65th node to be added to the manager, and I can copy a file into the pool through the manager. But I cannot copy out a file that is on c131.

In c131's cms.log, I see "Manager: manager.0:[log in to unmask] removed; redirected" again and again, and I cannot see anything about c131 in higgs07's (supervisor) log. Does it mean the manager tries to redirect it to higgs07, but c131 doesn't try to connect to higgs07? It only tries to connect to the manager again.
(*)
[root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
Last server error 10000 ('')
Error accessing path/file for root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
[root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123311
[xrootd] Total 0.06 MB |====================| 100.00 % [3.1 MB/s]
[root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123312
[xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
[root@c131 ~]# ls /atlas/xrootd/users/wguan/test/
test123131
[root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
Last server error 3011 ('No servers are available to read the file.')
Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
[root@c131 ~]# ls /atlas/xrootd/users/wguan/test/test123131
/atlas/xrootd/users/wguan/test/test123131
[root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
Last server error 3011 ('No servers are available to read the file.')
Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
[root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123313
[xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
[root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
Last server error 3011 ('No servers are available to read the file.')
Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
[root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
Last server error 3011 ('No servers are available to read the file.')
Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
[root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
Last server error 3011 ('No servers are available to read the file.')
Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
[root@c131 ~]# tail -f /var/log/xrootd/cms.log
091216 17:45:52 3103 manager.0:[log in to unmask] XrdLink: Setting ref to 2+-1 post=0
091216 17:45:55 3103 Pander trying to connect to lvl 0 atlas-bkp1.cs.wisc.edu:3121
091216 17:45:55 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
091216 17:45:55 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
091216 17:45:55 3103 ManTree: Now connected to 3 root node(s)
091216 17:45:55 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
091216 17:45:55 3103 Dispatch manager.0:17@atlas-bkp1.cs.wisc.edu for try dlen=3
091216 17:45:55 3103 manager.0:[log in to unmask] do_Try:
091216 17:45:55 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.95
091216 17:45:55 3103 Manager: manager.0:[log in to unmask] removed; redirected
091216 17:46:04 3103 Pander trying to connect to lvl 0 atlas-bkp1.cs.wisc.edu:3121
091216 17:46:04 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
091216 17:46:04 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
091216 17:46:04 3103 ManTree: Now connected to 3 root node(s)
091216 17:46:04 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
091216 17:46:04 3103 Dispatch manager.0:17@atlas-bkp1.cs.wisc.edu for try dlen=3
091216 17:46:04 3103 Protocol: No buffers to serve atlas-bkp1.cs.wisc.edu
091216 17:46:04 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.96
091216 17:46:04 3103 Manager: manager.0:[log in to unmask] removed; insufficient buffers
091216 17:46:11 3103 Dispatch manager.0:19@atlas-bkp2.cs.wisc.edu for state dlen=169
091216 17:46:11 3103 manager.0:[log in to unmask] XrdLink: Setting ref to 1+1 post=0

Thanks
Wen

On Thu, Dec 17, 2009 at 12:10 AM, wen guan <[log in to unmask]> wrote:

Hi Andy,

> OK, I understand. As for stalling, too many nodes were deemed to be in trouble for the manager to allow service resumption.
>
> Please make sure that all of the nodes in the cluster receive the new cmsd, as they will drop off with the old one and you'll see the same kind of activity. Perhaps the best way to know that you succeeded in putting everything in sync is to start with 63 data nodes plus one supervisor. Once all connections are established, adding an additional server should simply send it to the supervisor.

I will do it.

You said to start 63 data servers and one supervisor. Does that mean the supervisor is managed under the same policy? If there are 64 data servers connected before the supervisor, will the supervisor be dropped, or does the supervisor have high priority to be added to the manager? I mean, if there are already 64 data servers and a supervisor comes in, will the supervisor be accepted and a data server be redirected to the supervisor?

Thanks
Wen
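For reference, the role layout being tested boils down to one role directive per node type plus the manager declaration (a sketch; the directives are the standard cms role and manager declarations, while the hostnames and port 3121 come from the logs in this thread):

    all.role manager        # on atlas-bkp1 (and its alternates bkp2/bkp3)
    all.role supervisor     # on higgs07
    all.role server         # on each data server (c0xx, c1xx, higgsNN)
    all.manager atlas-bkp1.cs.wisc.edu:3121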
Hi Andrew,

But when I tried to xrdcp a file to it, it doesn't respond. In atlas-bkp1-xrd.log.20091213 it always prints "stalling client for 10 sec", but in cms.log I can't find any message about the file.

> I don't see why you say it doesn't work. With the debugging level set so high, the noise may make it look like something is going wrong, but that isn't necessarily the case.
>
> 1) The 'too many subscribers' is correct. The manager was simply redirecting them because there were already 64 servers. However, in your case the supervisor wasn't started until almost 30 minutes after everyone else (i.e., 10:42 AM). Why was that? I'm not surprised about the flurry of messages with a critical component missing for 30 minutes.

Because the manager is a 64-bit machine but the supervisor is a 32-bit machine, I had to recompile it. At that time, I was interrupted by something else.

> 2) Once the supervisor started, it started accepting the redirected servers.
>
> 3) Then 10 seconds later (10:42:10) the supervisor was restarted. That would cause a flurry of activity to occur, as there is no backup supervisor to take over.
>
> 4) This happened again at 10:42:34 AM and then again at 10:48:49. Is the supervisor crashing? Is there a core file?
>
> 5) At 11:11 AM the manager restarted. Again, is there a core file here or was this a manual action?
>
> During the course of all of this, all nodes connected were operating properly and files were being located.
>
> So, the two big questions are:
>
> a) Why was the supervisor not started until 30 minutes after the system was started?
>
> b) Is there an explanation of the restarts? If this was a crash then we need a core file to figure out what happened.

It's not a crash. There are some reasons that I restarted some daemons.
(1) I thought that if a data server tried many times to connect to a redirector but failed, it would not try to connect to the redirector again. The supervisor was missing for a long time, so maybe some data servers would no longer try to connect to atlas-bkp1. To reactivate those data servers, I restarted some servers.
(2) When I tried to xrdcp, it hung for a long time. I thought maybe the manager was affected by something else, so I restarted the manager to see whether a restart could make the xrdcp work.

Thanks
Wen

> Andy

----- Original Message ----- From: "wen guan" <[log in to unmask]>
To: "Andrew Hanushevsky" <[log in to unmask]>
Cc: <[log in to unmask]>
Sent: Wednesday, December 16, 2009 9:38 AM
Subject: Re: xrootd with more than 65 machines

Hi Andrew,

It still doesn't work. The log files are in higgs03.cs.wisc.edu/wguan/; the names are *.20091216. The manager complains there are too many subscribers and then removes nodes.
(*)
Add server.10040:[log in to unmask] redirected; too many subscribers.

Wen

On Wed, Dec 16, 2009 at 4:25 AM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Wen,

It will be easier for me to retrofit, as the changes were pretty minor. Please lift the new XrdCmsNode.cc file from

http://www.slac.stanford.edu/~abh/cmsd

Andy

----- Original Message ----- From: "wen guan" <[log in to unmask]>
To: "Andrew Hanushevsky" <[log in to unmask]>
Cc: <[log in to unmask]>
Sent: Tuesday, December 15, 2009 5:12 PM
Subject: Re: xrootd with more than 65 machines

Hi Andy,

I can switch to 20091104-1102; then you don't need to patch another version. How can I download v20091104-1102?

Thanks
Wen

On Wed, Dec 16, 2009 at 12:52 AM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Wen,

Ah yes, I see that now. The file I gave you is based on v20091104-1102. Let me see if I can retrofit the patch for you.

Andy

----- Original Message ----- From: "wen guan" <[log in to unmask]>
To: "Andrew Hanushevsky" <[log in to unmask]>
Cc: <[log in to unmask]>
Sent: Tuesday, December 15, 2009 1:04 PM
Subject: Re: xrootd with more than 65 machines

Hi Andy,

Which xrootd version are you using? XrdCmsConfig.hh is different. My XrdCmsConfig.hh was downloaded from http://xrootd.slac.stanford.edu/download/20091028-1003/.

[root@c121 xrootd]# md5sum src/XrdCms/XrdCmsNode.cc
6fb3ae40fe4e10bdd4d372818a341f2c src/XrdCms/XrdCmsNode.cc
[root@c121 xrootd]# md5sum src/XrdCms/XrdCmsConfig.hh
7d57753847d9448186c718f98e963cbe src/XrdCms/XrdCmsConfig.hh

Thanks
Wen

On Tue, Dec 15, 2009 at 10:50 PM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Wen,

Just compiled on Linux and it was clean. Something is really wrong with your source files, specifically XrdCmsConfig.cc

The MD5 checksums on the relevant files are:

MD5 (XrdCmsNode.cc) = 6fb3ae40fe4e10bdd4d372818a341f2c
MD5 (XrdCmsConfig.hh) = 4a7d655582a7cd43b098947d0676924b

Andy

----- Original Message ----- From: "wen guan" <[log in to unmask]>
To: "Andrew Hanushevsky" <[log in to unmask]>
Cc: <[log in to unmask]>
Sent: Tuesday, December 15, 2009 4:24 AM
Subject: Re: xrootd with more than 65 machines

Hi Andy,

No problem. Thanks for the fix. But it cannot be compiled. The version I am using is http://xrootd.slac.stanford.edu/download/20091028-1003/.
Making cms component...
Compiling XrdCmsNode.cc
XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Chmod(XrdCmsRRData&)':
XrdCmsNode.cc:268: error: `fsExec' was not declared in this scope
XrdCmsNode.cc:268: warning: unused variable 'fsExec'
XrdCmsNode.cc:269: error: 'class XrdCmsConfig' has no member named 'ossFS'
XrdCmsNode.cc:273: error: `fsFail' was not declared in this scope
XrdCmsNode.cc:273: warning: unused variable 'fsFail'
XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mkdir(XrdCmsRRData&)':
XrdCmsNode.cc:600: error: `fsExec' was not declared in this scope
XrdCmsNode.cc:600: warning: unused variable 'fsExec'
XrdCmsNode.cc:601: error: 'class XrdCmsConfig' has no member named 'ossFS'
XrdCmsNode.cc:605: error: `fsFail' was not declared in this scope
XrdCmsNode.cc:605: warning: unused variable 'fsFail'
XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mkpath(XrdCmsRRData&)':
XrdCmsNode.cc:640: error: `fsExec' was not declared in this scope
XrdCmsNode.cc:640: warning: unused variable 'fsExec'
XrdCmsNode.cc:641: error: 'class XrdCmsConfig' has no member named 'ossFS'
XrdCmsNode.cc:645: error: `fsFail' was not declared in this scope
XrdCmsNode.cc:645: warning: unused variable 'fsFail'
XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mv(XrdCmsRRData&)':
XrdCmsNode.cc:704: error: `fsExec' was not declared in this scope
XrdCmsNode.cc:704: warning: unused variable 'fsExec'
XrdCmsNode.cc:705: error: 'class XrdCmsConfig' has no member named 'ossFS'
XrdCmsNode.cc:709: error: `fsFail' was not declared in this scope
XrdCmsNode.cc:709: warning: unused variable 'fsFail'
XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Rm(XrdCmsRRData&)':
XrdCmsNode.cc:831: error: `fsExec' was not declared in this scope
XrdCmsNode.cc:831: warning: unused variable 'fsExec'
XrdCmsNode.cc:832: error: 'class XrdCmsConfig' has no member named 'ossFS'
XrdCmsNode.cc:836: error: `fsFail' was not declared in this scope
XrdCmsNode.cc:836: warning: unused variable 'fsFail'
XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Rmdir(XrdCmsRRData&)':
XrdCmsNode.cc:873: error: `fsExec' was not declared in this scope
XrdCmsNode.cc:873: warning: unused variable 'fsExec'
XrdCmsNode.cc:874: error: 'class XrdCmsConfig' has no member named 'ossFS'
XrdCmsNode.cc:878: error: `fsFail' was not declared in this scope
XrdCmsNode.cc:878: warning: unused variable 'fsFail'
XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Trunc(XrdCmsRRData&)':
XrdCmsNode.cc:1377: error: `fsExec' was not declared in this scope
XrdCmsNode.cc:1377: warning: unused variable 'fsExec'
XrdCmsNode.cc:1378: error: 'class XrdCmsConfig' has no member named 'ossFS'
XrdCmsNode.cc:1382: error: `fsFail' was not declared in this scope
XrdCmsNode.cc:1382: warning: unused variable 'fsFail'
XrdCmsNode.cc: At global scope:
XrdCmsNode.cc:1524: error: no `int XrdCmsNode::fsExec(XrdOucProg*, char*, char*)' member function declared in class `XrdCmsNode'
XrdCmsNode.cc: In member function `int XrdCmsNode::fsExec(XrdOucProg*, char*, char*)':
XrdCmsNode.cc:1533: error: `fsL2PFail1' was not declared in this scope
XrdCmsNode.cc:1533: warning: unused variable 'fsL2PFail1'
XrdCmsNode.cc:1537: error: `fsL2PFail2' was not declared in this scope
XrdCmsNode.cc:1537: warning: unused variable 'fsL2PFail2'
XrdCmsNode.cc: At global scope:
XrdCmsNode.cc:1553: error: no `const char* XrdCmsNode::fsFail(const char*, const char*, const char*, int)' member function declared in class `XrdCmsNode'
XrdCmsNode.cc: In member function `const char* XrdCmsNode::fsFail(const char*, const char*, const char*, int)':
XrdCmsNode.cc:1559: error: `fsL2PFail1' was not declared in this scope
XrdCmsNode.cc:1559: warning: unused variable 'fsL2PFail1'
XrdCmsNode.cc:1560: error: `fsL2PFail2' was not declared in this scope
XrdCmsNode.cc:1560: warning: unused variable 'fsL2PFail2'
XrdCmsNode.cc: In static member function `static int XrdCmsNode::isOnline(char*, int)':
XrdCmsNode.cc:1608: error: 'class XrdCmsConfig' has no member named 'ossFS'
make[4]: *** [../../obj/XrdCmsNode.o] Error 1
make[3]: *** [Linuxall] Error 2
make[2]: *** [all] Error 2
make[1]: *** [XrdCms] Error 2
make: *** [all] Error 2

Wen
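The errors above all point one way: the replacement XrdCmsNode.cc references members (fsExec, fsFail, and the Config ossFS member) that the 20091028-1003 headers never declare, so the .cc and .hh files come from different releases. A quick check along these lines (a sketch; treating XrdCmsNode.hh as the matching header is an assumption about the source layout):

    cd xrootd/src/XrdCms
    grep -n 'fsExec\|fsFail' XrdCmsNode.hh     # declarations the new .cc expects
    grep -n 'ossFS' XrdCmsConfig.hh            # absent in the 20091028-1003 headers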
On Tue, Dec 15, 2009 at 2:08 AM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Wen,

I have developed a permanent fix. You will find the source files in

http://www.slac.stanford.edu/~abh/cmsd/

There are three files: XrdCmsCluster.cc XrdCmsNode.cc XrdCmsProtocol.cc

Please do a source replacement and recompile. Unfortunately, the cmsd will need to be replaced on each node regardless of role. My apologies for the disruption. Please let me know how it goes.

Andy
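A sketch of that replacement, plus a way to prove every node got the rebuilt binary (the file names and URL are from Andy's message; the build command and the installed cmsd path are assumptions, so adjust to however you normally build and deploy):

    cd xrootd/src/XrdCms
    for f in XrdCmsCluster.cc XrdCmsNode.cc XrdCmsProtocol.cc; do
        wget -O "$f" "http://www.slac.stanford.edu/~abh/cmsd/$f"
    done
    cd ../.. && make                  # rebuild, then reinstall cmsd everywhere
    md5sum /path/to/installed/cmsd    # compare this checksum on every node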
----- Original Message ----- From: "wen guan" <[log in to unmask]>
To: "Andrew Hanushevsky" <[log in to unmask]>
Cc: <[log in to unmask]>
Sent: Sunday, December 13, 2009 7:04 AM
Subject: Re: xrootd with more than 65 machines

Hi Andrew,

Thanks. I used the new cmsd at the atlas-bkp1 manager, but it's still dropping nodes, and in the supervisor's log I cannot find any data server registering to it.

The new logs are in http://higgs03.cs.wisc.edu/wguan/*.20091213. The manager was patched at 091213 08:38:15.

Wen

On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Wen

You will find the source replacement at:

http://www.slac.stanford.edu/~abh/cmsd/

It's XrdCmsCluster.cc and it replaces xrootd/src/XrdCms/XrdCmsCluster.cc

I'm stepping out for a couple of hours but will be back to see how things went. Sorry for the issues :-(

Andy

On Sun, 13 Dec 2009, wen guan wrote:

Hi Andrew,

I prefer a source replacement. Then I can compile it.

Thanks
Wen

> I can do one of two things here:
>
> 1) Supply a source replacement and then you would recompile, or
>
> 2) Give me the uname -a of where the cmsd will run and I'll supply a binary replacement for you.
>
> Your choice.
>
> Andy
>
> On Sun, 13 Dec 2009, wen guan wrote:
>
>> Hi Andrew
>>
>> The problem is found. Great. Thanks.
>>
>> Where can I find the patched cmsd?
>>
>> Wen
>>
>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>
>>> Hi Wen,
>>>
>>> I found the problem. Looks like a regression from way back when. There is a missing flag on the redirect. This will require a patched cmsd, but you need only replace the redirector's cmsd, as this only affects the redirector. How would you like to proceed?
>>>
>>> Andy
On Sat, 12 Dec 2009, wen guan wrote:

Hi Andrew,

It doesn't work. The atlas-bkp1 manager is still dropping nodes. In the supervisor, I still haven't seen any data server registered. I said "I updated the ntp" because you said "the log timestamps do not overlap".

Wen

On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Wen,

Do you mean that everything is now working? It could be that you removed the xrd.timeout directive. That really could cause problems. As for the delays, that is normal when the redirector thinks something is going wrong. The strategy is to delay clients until it can get back to a stable configuration. This usually prevents jobs from crashing during stressful periods.

Andy

On Sat, 12 Dec 2009, wen guan wrote:

Hi Andrew,

I restarted it to do the supervisor test, and also because the xrootd manager frequently doesn't respond. (*) is the cms.log; the file select is delayed again and again. After a restart, everything is fine. Now I am trying to find a clue about it.
(*)
091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
091212 00:00:19 21318 XrdSched: running redirector inq=0

There is no core file. I copied new copies of the logs to the link below.
http://higgs03.cs.wisc.edu/wguan/

Wen

On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Wen,

I see in the server log that it is restarting often. Could you take a look on c193 to see if you have any core files? Also please make sure that core files are enabled, as Linux defaults the size to 0. The first step here is to find out why your servers are restarting.

Andy
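Enabling cores is plain Linux, nothing xrootd-specific (a sketch; put the ulimit wherever the xrootd and cmsd daemons are launched):

    ulimit -c unlimited                  # lift the default core size of 0
    cat /proc/sys/kernel/core_pattern    # shows where core files will land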
On Sat, 12 Dec 2009, wen guan wrote:

Hi Andrew,

The logs can be found here. From the logs you can see the atlas-bkp1 manager dropping, again and again, the nodes that try to connect to it.
http://higgs03.cs.wisc.edu/wguan/

> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>
> Hi Wen, Could you start everything up and provide me a pointer to the manager log file, supervisor log file, and one data server log file, all of which cover the same time-frame (from start to some point where you think things are working or not)? That way I can see what is happening. At the moment I only see two "bad" things in the config file:
>
> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager, but you claim, via the all.manager directive, that there are three (bkp2 and bkp3). While it should work, the log file will be dense with error messages. Please correct this to be consistent and make it easier to see real errors.

This is not a problem for me, because this config is used on the data servers. In the manager's config, I changed the "if atlas-bkp1.cs.wisc.edu" to atlas-bkp2 and so on. It is a historical artifact: at first only atlas-bkp1 was used; atlas-bkp2 and atlas-bkp3 were added later.
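A consistent declaration, along the lines of Andy's point 1, would name every manager on every node (a sketch; the hostnames are from this thread and port 3121 is from the Pander lines in the c131 log):

    all.manager atlas-bkp1.cs.wisc.edu:3121
    all.manager atlas-bkp2.cs.wisc.edu:3121
    all.manager atlas-bkp3.cs.wisc.edu:3121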
> 2) Please use cms.space not olb.space (for historical reasons the latter is still accepted and over-rides the former, but that will soon end), and please use only one (the config file uses both directives).

Yes, I should remove that line; in fact cms.space is in the cfg too.

Thanks
Wen

> The xrootd has an internal mechanism to connect servers with supervisors to allow for maximum reliability. You cannot change that algorithm and there is no need to do so. You should *never* tell anyone to directly connect to a supervisor. If you do, you will likely get unreachable nodes.
>
> As for dropping data servers, it would appear to me, given the flurry of such activity, that something either crashed or was restarted. That's why it would be good to see the complete log of each one of the entities.
>
> Andy

On Fri, 11 Dec 2009, wen guan wrote:

Hi Andrew,

I read the document and wrote a config file (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg). Using my conf, I can see the manager dispatching messages to the supervisor, but I cannot see any data server try to connect to the supervisor. At the same time, in the manager's log, I can see some data servers being dropped.

How does xrootd decide which data server will connect to the supervisor? Should I specify some data servers to connect to the supervisor?
(*) supervisor log
091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: /atlas/xrootd/users/wguan/test/test131141
091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find failed for state /atlas/xrootd/users/wguan/test/test131141

(*) manager log
091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5693644MB MinFR=57218MB Util=0
091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
091211 04:13:24 15661 server.10585:[log in to unmask]:1094 do_Space: 5696231MB free; 0% util
091211 04:13:24 15661 Protocol: server.10585:[log in to unmask]:1094 logged in.
091211 04:13:24 001 XrdInet: Accepted connection from [log in to unmask]
091211 04:13:24 15661 XrdSched: running ?:[log in to unmask] inq=0
091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79 attached to poller 2; num=22
091211 04:13:24 15661 Add server.21739:[log in to unmask] bumps server.15905:[log in to unmask]:1094 #63
091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
091211 04:13:24 15661 Drop_Node: server.15905:[log in to unmask]:1094 dropped.
091211 04:13:24 15661 Add Shoved server.21739:[log in to unmask]:1094 to cluster; id=63.78; num=64; min=51
091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5721854MB MinFR=57218MB Util=0
091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w /atlas
091211 04:13:24 15661 server.21739:[log in to unmask]:1094 do_Space: 5721854MB free; 0% util
091211 04:13:24 15661 Protocol: server.21739:[log in to unmask]:1094 logged in.
091211 04:13:24 15661 XrdLink: Unable to recieve from c187.chtc.wisc.edu; connection reset by peer
091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
091211 04:13:24 15661 XrdSched: scheduling drop node in 60 seconds
091211 04:13:24 15661 Remove_Node server.21739:[log in to unmask]:1094 node 63.78
091211 04:13:24 15661 Protocol: server.21739:[log in to unmask] logged out.
091211 04:13:24 15661 server.21739:[log in to unmask] XrdPoll: FD 79 detached from poller 2; num=21
091211 04:13:27 15661 Dispatch server.24718:[log in to unmask]:1094 for status dlen=0
091211 04:13:27 15661 server.24718:[log in to unmask]:1094 do_Status: suspend
091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended
091211 04:13:27 15661 XrdLink: No RecvAll() data from c177.chtc.wisc.edu FD=16
091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
091211 04:13:27 15661 Remove_Node server.24718:[log in to unmask]:1094 node 0.3
091211 04:13:27 15661 Protocol: server.21656:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>> 16 detached from poller 2; num=20 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() >>>>>>>>>>>>>>>>>>>>>>>>> data >>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>> c179.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> FD=21 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>>> server.17065:[log in to unmask]:1094 node >>>>>>>>>>>>>>>>>>>>>>>>> 1.4 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>>> server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>> 21 >>>>>>>>>>>>>>>>>>>>>>>>> detached from poller 1; num=21 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to >>>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to >>>>>>>>>>>>>>>>>>>>>>>>> redirector.15656:14@atlas-bkp2 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() >>>>>>>>>>>>>>>>>>>>>>>>> data >>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>> c182.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> FD=19 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 node >>>>>>>>>>>>>>>>>>>>>>>>> 7.10 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>> 19 detached from poller 2; num=19 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() >>>>>>>>>>>>>>>>>>>>>>>>> data >>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>> c178.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> FD=15 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 node >>>>>>>>>>>>>>>>>>>>>>>>> 9.12 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>> 15 detached from poller 1; num=20 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() >>>>>>>>>>>>>>>>>>>>>>>>> data >>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>> c181.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> FD=17 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 node >>>>>>>>>>>>>>>>>>>>>>>>> 5.8 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>> 17 detached from poller 0; num=21 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() >>>>>>>>>>>>>>>>>>>>>>>>> data >>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>> c183.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> FD=22 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 node >>>>>>>>>>>>>>>>>>>>>>>>> 8.11 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>> 22 detached from poller 2; num=18 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() >>>>>>>>>>>>>>>>>>>>>>>>> data >>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>> c184.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> FD=20 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>>> server.4131:[log in to unmask]:1094 node >>>>>>>>>>>>>>>>>>>>>>>>> 3.6 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>> 20 detached from poller 0; num=20 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() >>>>>>>>>>>>>>>>>>>>>>>>> data >>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>> c185.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> FD=23 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 node >>>>>>>>>>>>>>>>>>>>>>>>> 6.9 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>>> server.8524:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.8524:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>> 23 >>>>>>>>>>>>>>>>>>>>>>>>> detached from poller 0; num=19 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() >>>>>>>>>>>>>>>>>>>>>>>>> data >>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>> c180.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> FD=18 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 node >>>>>>>>>>>>>>>>>>>>>>>>> 4.7 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>>> server.14636:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.14636:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>> 18 detached from poller 1; num=19 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() >>>>>>>>>>>>>>>>>>>>>>>>> data >>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>> c186.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> FD=24 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 node >>>>>>>>>>>>>>>>>>>>>>>>> 2.5 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>>> server.7849:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.7849:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>> 24 >>>>>>>>>>>>>>>>>>>>>>>>> detached from poller 1; num=18 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling >>>>>>>>>>>>>>>>>>>>>>>>> drop node >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling >>>>>>>>>>>>>>>>>>>>>>>>> drop node >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling >>>>>>>>>>>>>>>>>>>>>>>>> 
drop node >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling >>>>>>>>>>>>>>>>>>>>>>>>> drop node >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling >>>>>>>>>>>>>>>>>>>>>>>>> drop node >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling >>>>>>>>>>>>>>>>>>>>>>>>> drop node >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling >>>>>>>>>>>>>>>>>>>>>>>>> drop node >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling >>>>>>>>>>>>>>>>>>>>>>>>> drop node >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling >>>>>>>>>>>>>>>>>>>>>>>>> drop node >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling >>>>>>>>>>>>>>>>>>>>>>>>> drop node >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=1 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=1 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 >>>>>>>>>>>>>>>>>>>>>>>>> cancelled. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=1 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 >>>>>>>>>>>>>>>>>>>>>>>>> cancelled. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=1 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 >>>>>>>>>>>>>>>>>>>>>>>>> cancelled. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 >>>>>>>>>>>>>>>>>>>>>>>>> cancelled. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=1 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 >>>>>>>>>>>>>>>>>>>>>>>>> cancelled. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=1 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 >>>>>>>>>>>>>>>>>>>>>>>>> cancelled. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=1 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 >>>>>>>>>>>>>>>>>>>>>>>>> cancelled. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=1 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 >>>>>>>>>>>>>>>>>>>>>>>>> cancelled. 
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> To go past 64 data servers you will need to set up one or more
>>>>>>>>>>>>>>>>>>>>>>>>>> supervisors. This does not logically change the current
>>>>>>>>>>>>>>>>>>>>>>>>>> configuration you have. You only need to configure one or more
>>>>>>>>>>>>>>>>>>>>>>>>>> *new* servers (or at least xrootd processes) whose role is
>>>>>>>>>>>>>>>>>>>>>>>>>> supervisor. We'd like them to run on separate machines for
>>>>>>>>>>>>>>>>>>>>>>>>>> reliability purposes, but they could run on the manager node as
>>>>>>>>>>>>>>>>>>>>>>>>>> long as you give each one a unique instance name (i.e., the -n
>>>>>>>>>>>>>>>>>>>>>>>>>> option).
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> The front part of the cmsd reference explains how to do this.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Is there any way to configure xrootd with more than 65
>>>>>>>>>>>>>>>>>>>>>>>>>>> machines? I used the configuration below but it doesn't work.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Should I configure some machines' manager to be a supervisor?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Wen
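---

For reference, a minimal sketch of the layout Andy describes, using directives documented in the cms_config reference he links above. The host names reuse machines already mentioned in this thread, and the instance name "super" is an arbitrary example, so treat this as an illustration of the idea rather than the actual site configuration:

    # One configuration file can be shared by every node in the cluster.
    all.export /atlas

    # Every xrootd/cmsd locates the cluster manager's cmsd here
    # (3121 is the customary cmsd port; adjust as needed).
    all.manager atlas-bkp1.cs.wisc.edu:3121

    # Roles are selected per host via the if clause; everything that
    # matches none of the manager/supervisor patterns is a data server.
    all.role manager    if atlas-bkp1.cs.wisc.edu
    all.role supervisor if higgs07.cs.wisc.edu
    all.role server     if c*.chtc.wisc.edu

A supervisor that shares a machine with another role would then be started under its own instance name, e.g.:

    xrootd -n super -c /path/to/xrdcluster.cfg &
    cmsd   -n super -c /path/to/xrdcluster.cfg &

With a supervisor present, no data server needs to be pointed at it explicitly; the manager redirects servers beyond its 64-subscriber limit to the supervisor, which is what the "bumps"/"Add Shoved" lines in the manager log above show happening at the 64-node boundary.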