Hi,

I just wanted to say that this is not a recommended configuration. A meta-manager makes a manager nothing more than a specialized supervisor. Internally, it is exactly the same configuration as a manager and a supervisor, except for the fact that a supervisor-less configuration will never be able to globally federate. I am surprised that there were problems with a supervisor. BNL runs 500 production nodes with supervisors and has no problems, at least none that they have reported. I strongly discourage using a meta manager to get beyond 64 servers.

Andy
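For comparison, a supervisor-based cluster needs nothing beyond the roles already used in this thread. A minimal sketch (host names taken from Wen's setup, 1213 being the usual cmsd port; treat this as an illustration, not a drop-in config):

if atlas-bkp1.cs.wisc.edu
all.role manager
else if higgs07.cs.wisc.edu
all.role supervisor
else
all.role server
fi
all.manager atlas-bkp1.cs.wisc.edu 1213

Every node, the supervisor included, points at the manager; once the manager has 64 direct subscribers, the cmsd redirects additional data servers to the supervisor on its own.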
On Sun, 27 Jun 2010, wen guan wrote:

> Hi Rob,
>
> Our xrootd pool has more than 64 dataservers, but I chose to use a meta
> manager instead of a supervisor, because when using a supervisor some
> data servers seemed to get lost and it was not easy to fix them.
> Restarting a data server would cause some problems too.
> Below is the redirector cfg. I chose 50 dataservers to connect to the
> meta manager on port 3121, and the other 25 to connect to the manager
> on port 4121. At the same time, the manager connects to the meta manager.
>
> cheers
> Wen
>
> (*)
> if named meta
> all.role meta manager
> xrd.port 1094
>
> #xrootd.manager atlas-bkp2.cs.wisc.edu 4121
> all.manager meta atlas-bkp2.cs.wisc.edu 3121
> #all.manager atlas-bkp3.cs.wisc.edu 3121
> ofs.forward 3way atlas-bkp1.cs.wisc.edu:1095 mv rm rmdir trunc
> else if atlas-bkp1.cs.wisc.edu atlas-bkp2.cs.wisc.edu atlas-bkp3.cs.wisc.edu
> all.role manager
> xrd.port 4094
> #
> # 3way forward: redirect the client to the CNS, and forward mv rm rmdir and
> # trunc to the data servers.
> #
> #ofs.forward 3way atlas-bkp1.cs.wisc.edu:1095 mv rm rmdir trunc
> all.manager atlas-bkp2.cs.wisc.edu 4121
> all.manager meta atlas-bkp2.cs.wisc.edu 3121
> fi
>
> On Sun, Jun 27, 2010 at 4:44 PM, Rob Gardner <[log in to unmask]> wrote:
>> Wen,
>>
>> I was wondering if you finally did succeed in getting >64 data server
>> nodes working using a supervisor, etc.
>>
>> thanks,
>>
>> Rob
>>
>> On Dec 18, 2009, at 8:58 AM, wen guan wrote:
>>
>>> Hi Andy,
>>>
>>> I am sure I am using the right cmsd code. Today I compiled and
>>> reinstalled all the cmsd and xrootd daemons, but it still doesn't
>>> work. I will create an account for you so that you can log in to
>>> these machines and check what happened.
>>>
>>> In fact, today while doing some restarts I saw some machines register
>>> themselves to higgs07. But unfortunately the logs were cleaned during
>>> the reinstall.
>>>
>>> I found that the supervisor goes to the "suspend" state a while after
>>> it is started; will that cause the supervisor to fail to get some
>>> information?
>>>
>>> Wen
>>>
>>> On Fri, Dec 18, 2009 at 3:05 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>
>>>> Hi Wen,
>>>>
>>>> Something is really going wrong with your data servers. For instance,
>>>> c109 is quite happy from midnight to 7:23am. Then it dropped the
>>>> connection. It reconnected at 7:24:03 and was again happy until
>>>> 12:37:20, but here it reported that its xrootd died, and then the
>>>> cmsd promptly killed its connection afterward. This appears as if
>>>> someone restarted the xrootd followed by the cmsd on c109. It
>>>> continued like this until 12:43:00 (i.e., connect, suspend, die,
>>>> repeat). All your servers, in fact, started doing this from 12:36:41
>>>> to 12:42:51, causing a massive swap of servers. New servers were
>>>> added, and old ones reconnecting were redirected to the supervisor.
>>>> However, it would appear that those machines could not connect there,
>>>> as they kept coming back to atlas-bkp1. I can't tell you anything
>>>> about what was happening on higgs07. As far as I can tell it was
>>>> happily connected to the redirector cmsd. The reason is that there is
>>>> no log for higgs07 on the web site for 12/17 starting at midnight.
>>>> Perhaps you can put one there.
>>>>
>>>> So,
>>>>
>>>> 1) Are you *absolutely* sure that *all* your (data, etc) servers are
>>>> running the corrected cmsd?
>>>>
>>>> 2) Please provide the higgs07 log for 12/17.
>>>>
>>>> 3) Please provide logs for a sampling of data servers, say c0109,
>>>> c094, higgs15, and higgs13, between 12/17 12:00:00 and 15:44.
>>>>
>>>> I have never seen a situation like yours, so something is very wrong
>>>> here. In the meantime I will add more debugging information to the
>>>> redirector and supervisor and let you know when that is available.
>>>>
>>>> Andy
>>>>
>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>> To: "Fabrizio Furano" <[log in to unmask]>
>>>> Cc: "Andrew Hanushevsky" <[log in to unmask]>; <[log in to unmask]>
>>>> Sent: Thursday, December 17, 2009 3:12 PM
>>>> Subject: Re: xrootd with more than 65 machines
>>>>
>>>> Hi Fabrizio,
>>>>
>>>> This is the xrdcp debug message.
>>>> ClientHeader.header.dlen = 41
>>>> =================== END CLIENT HEADER DUMPING ===================
>>>>
>>>> 091217 16:47:54 15961 Xrd: WriteRaw: Writing 24 bytes to physical connection
>>>> 091217 16:47:54 15961 Xrd: WriteRaw: Writing to substreamid 0
>>>> 091217 16:47:54 15961 Xrd: WriteRaw: Writing 41 bytes to physical connection
>>>> 091217 16:47:54 15961 Xrd: WriteRaw: Writing to substreamid 0
>>>> 091217 16:47:54 15961 Xrd: ReadPartialAnswer: Reading a XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
>>>> 091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: sid: 1, IsAttn: 0, substreamid: 0
>>>> 091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4 bytes) from substream 0
>>>> 091217 16:47:54 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
>>>> 091217 16:47:54 15961 Xrd: BuildMessage: posting id 1
>>>> 091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8 bytes).
>>>> 091217 16:47:54 15961 Xrd: ReadRaw: Reading from >>>> atlas-bkp1.cs.wisc.edu:1094 >>>> >>>> >>>> ======== DUMPING SERVER RESPONSE HEADER ======== >>>> ServerHeader.streamid = 0x01 0x00 >>>> ServerHeader.status = kXR_wait (4005) >>>> ServerHeader.dlen = 4 >>>> ========== END DUMPING SERVER HEADER =========== >>>> >>>> 091217 16:47:54 15961 Xrd: ReadPartialAnswer: Server >>>> [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005) >>>> 091217 16:47:54 15961 Xrd: CheckErrorStatus: Server >>>> [atlas-bkp1.cs.wisc.edu:1094] requested 10 seconds of wait >>>> 091217 16:48:04 15961 Xrd: DumpPhyConn: Phyconn entry, >>>> [log in to unmask]:1094', LogCnt=1 Valid >>>> 091217 16:48:04 15961 Xrd: SendGenCommand: Sending command Open >>>> >>>> >>>> ================= DUMPING CLIENT REQUEST HEADER ================= >>>> ClientHeader.streamid = 0x01 0x00 >>>> ClientHeader.requestid = kXR_open (3010) >>>> ClientHeader.open.mode = 0x00 0x00 >>>> ClientHeader.open.options = 0x40 0x04 >>>> ClientHeader.open.reserved = 0 repeated 12 times >>>> ClientHeader.header.dlen = 41 >>>> =================== END CLIENT HEADER DUMPING =================== >>>> >>>> 091217 16:48:04 15961 Xrd: WriteRaw: Writing 24 bytes to physical >>>> connection >>>> 091217 16:48:04 15961 Xrd: WriteRaw: Writing to substreamid 0 >>>> 091217 16:48:04 15961 Xrd: WriteRaw: Writing 41 bytes to physical >>>> connection >>>> 091217 16:48:04 15961 Xrd: WriteRaw: Writing to substreamid 0 >>>> 091217 16:48:04 15961 Xrd: ReadPartialAnswer: Reading a >>>> XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]... >>>> 091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: sid: 1, IsAttn: >>>> 0, substreamid: 0 >>>> 091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4 >>>> bytes) from substream 0 >>>> 091217 16:48:04 15961 Xrd: ReadRaw: Reading from >>>> atlas-bkp1.cs.wisc.edu:1094 >>>> 091217 16:48:04 15961 Xrd: BuildMessage: posting id 1 >>>> 091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8 >>>> bytes). >>>> 091217 16:48:04 15961 Xrd: ReadRaw: Reading from >>>> atlas-bkp1.cs.wisc.edu:1094 >>>> >>>> >>>> ======== DUMPING SERVER RESPONSE HEADER ======== >>>> ServerHeader.streamid = 0x01 0x00 >>>> ServerHeader.status = kXR_wait (4005) >>>> ServerHeader.dlen = 4 >>>> ========== END DUMPING SERVER HEADER =========== >>>> >>>> 091217 16:48:04 15961 Xrd: ReadPartialAnswer: Server >>>> [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005) >>>> 091217 16:48:04 15961 Xrd: CheckErrorStatus: Server >>>> [atlas-bkp1.cs.wisc.edu:1094] requested 10 seconds of wait >>>> 091217 16:48:14 15961 Xrd: SendGenCommand: Sending command Open >>>> >>>> >>>> ================= DUMPING CLIENT REQUEST HEADER ================= >>>> ClientHeader.streamid = 0x01 0x00 >>>> ClientHeader.requestid = kXR_open (3010) >>>> ClientHeader.open.mode = 0x00 0x00 >>>> ClientHeader.open.options = 0x40 0x04 >>>> ClientHeader.open.reserved = 0 repeated 12 times >>>> ClientHeader.header.dlen = 41 >>>> =================== END CLIENT HEADER DUMPING =================== >>>> >>>> 091217 16:48:14 15961 Xrd: WriteRaw: Writing 24 bytes to physical >>>> connection >>>> 091217 16:48:14 15961 Xrd: WriteRaw: Writing to substreamid 0 >>>> 091217 16:48:14 15961 Xrd: WriteRaw: Writing 41 bytes to physical >>>> connection >>>> 091217 16:48:14 15961 Xrd: WriteRaw: Writing to substreamid 0 >>>> 091217 16:48:14 15961 Xrd: ReadPartialAnswer: Reading a >>>> XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]... 
>>>> 091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: sid: 1, IsAttn: 0, substreamid: 0
>>>> 091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4 bytes) from substream 0
>>>> 091217 16:48:14 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
>>>> 091217 16:48:14 15961 Xrd: BuildMessage: posting id 1
>>>> 091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8 bytes).
>>>> 091217 16:48:14 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
>>>>
>>>> ======== DUMPING SERVER RESPONSE HEADER ========
>>>> ServerHeader.streamid = 0x01 0x00
>>>> ServerHeader.status = kXR_wait (4005)
>>>> ServerHeader.dlen = 4
>>>> ========== END DUMPING SERVER HEADER ===========
>>>>
>>>> 091217 16:48:14 15961 Xrd: ReadPartialAnswer: Server [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
>>>> 091217 16:48:14 15961 Xrd: SendGenCommand: Max time limit elapsed for request kXR_open. Aborting command.
>>>> Last server error 10000 ('')
>>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>
>>>> Wen
>>>>
>>>> On Thu, Dec 17, 2009 at 11:27 PM, Fabrizio Furano <[log in to unmask]> wrote:
>>>>>
>>>>> Hi Wen,
>>>>>
>>>>> I see that you are getting error 10000, which means "generic error
>>>>> before any interaction". Could you please run the same command with
>>>>> debug level 3 and post the log with the same kind of issue? Something like
>>>>>
>>>>> xrdcp -d 3 ....
>>>>>
>>>>> Most likely this time the problem is different. I may be wrong here,
>>>>> but a possible reason for that error is that the servers require
>>>>> authentication and xrdcp does not find some library in the LD_LIBRARY_PATH.
>>>>>
>>>>> Fabrizio
>>>>>
>>>>> wen guan wrote:
>>>>>>
>>>>>> Hi Andy,
>>>>>>
>>>>>> I put new logs on the web.
>>>>>>
>>>>>> It still doesn't work. I cannot copy files in or out.
>>>>>>
>>>>>> It seems the xrootd daemon at atlas-bkp1 hasn't talked with the cmsd.
>>>>>> Normally, when the xrootd daemon tries to copy a file, I should see
>>>>>> "do_Select: filename" in the cms.log. But in this cms.log there is
>>>>>> nothing from atlas-bkp1.
>>>>>>
>>>>>> (*)
>>>>>> [root@atlas-bkp1 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>>>>> Last server error 10000 ('')
>>>>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>>> [root@atlas-bkp1 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123133
>>>>>>
>>>>>> Wen
>>>>>>
>>>>>> On Thu, Dec 17, 2009 at 10:54 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>
>>>>>>> Hi Wen,
>>>>>>>
>>>>>>> I reviewed the log file. The only oddity is the redirect of c131 at
>>>>>>> 17:47:25, which I can't comment on because its logs on the web site
>>>>>>> do not overlap with the manager or supervisor. Unless all the logs
>>>>>>> include the full time in question I can't say much of anything. Can
>>>>>>> you provide me with inclusive logs?
>>>>>>>
>>>>>>> atlas-bkp1 cms: 17:20:57 to 17:42:19 xrd: 17:20:57 to 17:40:57
>>>>>>> higgs07 cms & xrd 17:22:33 to 17:42:33
>>>>>>> c131 cms & xrd 17:31:57 to 17:47:28
>>>>>>>
>>>>>>> That said, it certainly looks like things were working and files
>>>>>>> were being accessed and discovered on all the machines.
>>>>>>> You even were able to open
>>>>>>> /atlas/xrootd/users/wguan/test/test98123313
>>>>>>> though not
>>>>>>> /atlas/xrootd/users/wguan/test/test123131. The other issue is that
>>>>>>> you did not specify a stable adminpath, and the adminpath defaults
>>>>>>> to /tmp. If you have a "cleanup" script that runs periodically for
>>>>>>> /tmp, then eventually your cluster will go catatonic as important
>>>>>>> (but not often used) files are deleted by that script. Could you
>>>>>>> please find a stable home for the adminpath?
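>>>>>>> For example (an illustrative choice; any directory that your
>>>>>>> cleanup scripts leave alone will do):
>>>>>>>
>>>>>>> all.adminpath /var/spool/xrootd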
>>>>>>>
>>>>>>> I reran my tests here and things worked as expected. I will ramp up
>>>>>>> some more tests. So, what is your status today?
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>> Cc: <[log in to unmask]>
>>>>>>> Sent: Thursday, December 17, 2009 5:05 AM
>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>
>>>>>>> Hi Andy,
>>>>>>>
>>>>>>> Yes. I am using the file downloaded from
>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/, which I compiled yesterday.
>>>>>>> I just now compiled it again and compared it with the one I compiled
>>>>>>> yesterday; they are the same (same md5sum).
>>>>>>>
>>>>>>> Wen
>>>>>>>
>>>>>>> On Thu, Dec 17, 2009 at 2:09 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>
>>>>>>>> Hi Wen,
>>>>>>>>
>>>>>>>> If c131 cannot connect, then either c131 does not have the new cmsd
>>>>>>>> or atlas-bkp1 does not have the new cmsd, as that is what would
>>>>>>>> happen if either were true. Looking at the log on c131, it would
>>>>>>>> appear that atlas-bkp1 is still using the old cmsd, as the response
>>>>>>>> data length is wrong. Could you please verify?
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>> Sent: Wednesday, December 16, 2009 3:58 PM
>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>
>>>>>>>> Hi Andy,
>>>>>>>>
>>>>>>>> I tried it, but there are still some problems. I put the logs in
>>>>>>>> higgs03.cs.wisc.edu/wguan/
>>>>>>>>
>>>>>>>> In my test, c131 is the 65th node to be added to the manager, and I
>>>>>>>> can copy files into the pool through the manager. But I cannot copy
>>>>>>>> out a file that is on c131.
>>>>>>>>
>>>>>>>> In c131's cms.log, I see "Manager:
>>>>>>>> manager.0:[log in to unmask] removed; redirected" again and
>>>>>>>> again, and I cannot see anything about c131 in higgs07's
>>>>>>>> (supervisor) log. Does it mean the manager tries to redirect it to
>>>>>>>> higgs07, but c131 hasn't tried to connect to higgs07? It only tries
>>>>>>>> to connect to the manager again.
>>>>>>>>
>>>>>>>> (*)
>>>>>>>> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
>>>>>>>> Last server error 10000 ('')
>>>>>>>> Error accessing path/file for root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
>>>>>>>> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123311
>>>>>>>> [xrootd] Total 0.06 MB |====================| 100.00 % [3.1 MB/s]
>>>>>>>> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123312
>>>>>>>> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
>>>>>>>> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/
>>>>>>>> test123131
>>>>>>>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>>>>>>> Last server error 3011 ('No servers are available to read the file.')
>>>>>>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>>>>> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/test123131
>>>>>>>> /atlas/xrootd/users/wguan/test/test123131
>>>>>>>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>>>>>>> Last server error 3011 ('No servers are available to read the file.')
>>>>>>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>>>>> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123313
>>>>>>>> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
>>>>>>>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>>>>>>> Last server error 3011 ('No servers are available to read the file.')
>>>>>>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>>>>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>>>>>>> Last server error 3011 ('No servers are available to read the file.')
>>>>>>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>>>>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>>>>>>> Last server error 3011 ('No servers are available to read the file.')
>>>>>>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>>>>> [root@c131 ~]# tail -f /var/log/xrootd/cms.log
>>>>>>>> 091216 17:45:52 3103 manager.0:[log in to unmask] XrdLink: Setting ref to 2+-1 post=0
>>>>>>>> 091216 17:45:55 3103 Pander trying to connect to lvl 0 atlas-bkp1.cs.wisc.edu:3121
>>>>>>>> 091216 17:45:55 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
>>>>>>>> 091216 17:45:55 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
>>>>>>>> 091216 17:45:55 3103 ManTree: Now connected to 3 root node(s)
>>>>>>>> 091216 17:45:55 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
>>>>>>>> 091216 17:45:55 3103 Dispatch manager.0:[log in to unmask] for try dlen=3
>>>>>>>> 091216 17:45:55 3103 manager.0:[log in to unmask] do_Try:
>>>>>>>> 091216 17:45:55 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.95
>>>>>>>> 091216 17:45:55 3103 Manager: manager.0:[log in to unmask] removed; redirected
>>>>>>>> 091216 17:46:04 3103 Pander trying to connect to lvl 0 atlas-bkp1.cs.wisc.edu:3121
>>>>>>>> 091216 17:46:04 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
>>>>>>>> 091216 17:46:04 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
>>>>>>>> 091216 17:46:04 3103 ManTree: Now connected to 3 root node(s)
>>>>>>>> 091216 17:46:04 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
>>>>>>>> 091216 17:46:04 3103 Dispatch manager.0:[log in to unmask] for try dlen=3
>>>>>>>> 091216 17:46:04 3103 Protocol: No buffers to serve atlas-bkp1.cs.wisc.edu
>>>>>>>> 091216 17:46:04 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.96
>>>>>>>> 091216 17:46:04 3103 Manager: manager.0:[log in to unmask] removed; insufficient buffers
>>>>>>>> 091216 17:46:11 3103 Dispatch manager.0:[log in to unmask] for state dlen=169
>>>>>>>> 091216 17:46:11 3103 manager.0:[log in to unmask] XrdLink: Setting ref to 1+1 post=0
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Wen
>>>>>>>>
>>>>>>>> On Thu, Dec 17, 2009 at 12:10 AM, wen guan <[log in to unmask]> wrote:
>>>>>>>>>
>>>>>>>>> Hi Andy,
>>>>>>>>>
>>>>>>>>>> OK, I understand. As for stalling, too many nodes were deemed to
>>>>>>>>>> be in trouble for the manager to allow service resumption.
>>>>>>>>>>
>>>>>>>>>> Please make sure that all of the nodes in the cluster receive the
>>>>>>>>>> new cmsd, as they will drop off with the old one and you'll see
>>>>>>>>>> the same kind of activity. Perhaps the best way to know that you
>>>>>>>>>> succeeded in putting everything in sync is to start with 63 data
>>>>>>>>>> nodes plus one supervisor. Once all connections are established,
>>>>>>>>>> adding an additional server should simply send it to the supervisor.
>>>>>>>>>
>>>>>>>>> I will do it.
>>>>>>>>> You said to start 63 data servers and one supervisor. Does it mean
>>>>>>>>> the supervisor is managed using the same policy? If there are 64
>>>>>>>>> dataservers connected before the supervisor, will the supervisor
>>>>>>>>> be dropped? Does the supervisor have high priority to be added to
>>>>>>>>> the manager? I mean, if there are already 64 dataservers and a
>>>>>>>>> supervisor comes in, will the supervisor be accepted and a
>>>>>>>>> dataserver be redirected to the supervisor?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Wen
>>>>>>>>>
>>>>>>>>>> Hi Andrew,
>>>>>>>>>>
>>>>>>>>>> But when I tried to xrdcp a file to it, it didn't respond. In
>>>>>>>>>> atlas-bkp1-xrd.log.20091213, it always prints "stalling client
>>>>>>>>>> for 10 sec", but in the cms.log I cannot find any message about
>>>>>>>>>> the file.
>>>>>>>>>>
>>>>>>>>>>> I don't see why you say it doesn't work. With the debugging
>>>>>>>>>>> level set so high, the noise may make it look like something is
>>>>>>>>>>> going wrong, but that isn't necessarily the case.
>>>>>>>>>>>
>>>>>>>>>>> 1) The 'too many subscribers' is correct. The manager was simply
>>>>>>>>>>> redirecting them because there were already 64 servers.
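>>>>>>>>>>> (A cmsd accepts at most 64 direct subscribers. A supervisor
>>>>>>>>>>> takes one of those slots and itself accepts another 64, so one
>>>>>>>>>>> supervisor gives you 63 + 64 = 127 data servers, and in
>>>>>>>>>>> principle a full layer of supervisors scales to 64 x 64 = 4096.)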
>>>>>>>>>>> However, in your case the supervisor wasn't started until
>>>>>>>>>>> almost 30 minutes after everyone else (i.e., 10:42 AM). Why was
>>>>>>>>>>> that? I'm not surprised about the flurry of messages with a
>>>>>>>>>>> critical component missing for 30 minutes.
>>>>>>>>>>
>>>>>>>>>> Because the manager is a 64-bit machine but the supervisor is a
>>>>>>>>>> 32-bit machine, I had to recompile it. At that time, I was
>>>>>>>>>> interrupted by something else.
>>>>>>>>>>
>>>>>>>>>>> 2) Once the supervisor started, it started accepting the
>>>>>>>>>>> redirected servers.
>>>>>>>>>>>
>>>>>>>>>>> 3) Then 10 seconds (10:42:10) later the supervisor was
>>>>>>>>>>> restarted. So, that would cause a flurry of activity to occur as
>>>>>>>>>>> there is no backup supervisor to take over.
>>>>>>>>>>>
>>>>>>>>>>> 4) This happened again at 10:42:34 AM and then again at
>>>>>>>>>>> 10:48:49. Is the supervisor crashing? Is there a core file?
>>>>>>>>>>>
>>>>>>>>>>> 5) At 11:11 AM the manager restarted. Again, is there a core
>>>>>>>>>>> file here or was this a manual action?
>>>>>>>>>>>
>>>>>>>>>>> During the course of all of this, all nodes connected were
>>>>>>>>>>> operating properly and files were being located.
>>>>>>>>>>>
>>>>>>>>>>> So, the two big questions are:
>>>>>>>>>>>
>>>>>>>>>>> a) Why was the supervisor not started until 30 minutes after the
>>>>>>>>>>> system was started?
>>>>>>>>>>>
>>>>>>>>>>> b) Is there an explanation of the restarts? If this was a crash
>>>>>>>>>>> then we need a core file to figure out what happened.
>>>>>>>>>>
>>>>>>>>>> It's not a crash. There are some reasons that I restarted some
>>>>>>>>>> daemons.
>>>>>>>>>> (1) I thought that if a dataserver tried many times to connect to
>>>>>>>>>> a redirector and failed, the dataserver would not try to connect
>>>>>>>>>> to the redirector again. The supervisor was missing for a long
>>>>>>>>>> time, so maybe some dataservers would not try to connect to
>>>>>>>>>> atlas-bkp1 again. To reactivate these dataservers, I restarted
>>>>>>>>>> the servers.
>>>>>>>>>> (2) When I tried to xrdcp, it hung for a long time. I thought
>>>>>>>>>> maybe the manager was affected by something else, so I restarted
>>>>>>>>>> the manager to see whether a restart could make the xrdcp work.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Wen
>>>>>>>>>>
>>>>>>>>>>> Andy
>>>>>>>>>>>
>>>>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>>> Sent: Wednesday, December 16, 2009 9:38 AM
>>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>>
>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>
>>>>>>>>>>> It still doesn't work.
>>>>>>>>>>> The log files are in higgs03.cs.wisc.edu/wguan/. The names are *.20091216.
>>>>>>>>>>> The manager complains there are too many subscribers and then removes nodes.
>>>>>>>>>>>
>>>>>>>>>>> (*)
>>>>>>>>>>> Add server.10040:[log in to unmask] redirected; too many subscribers.
>>>>>>>>>>>
>>>>>>>>>>> Wen
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Dec 16, 2009 at 4:25 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>
>>>>>>>>>>>> It will be easier for me to retrofit, as the changes were
>>>>>>>>>>>> pretty minor. Please lift the new XrdCmsNode.cc file from
>>>>>>>>>>>>
>>>>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd
>>>>>>>>>>>>
>>>>>>>>>>>> Andy
>>>>>>>>>>>>
>>>>>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>>>> Sent: Tuesday, December 15, 2009 5:12 PM
>>>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Andy,
>>>>>>>>>>>>
>>>>>>>>>>>> I can switch to 20091104-1102. Then you don't need to patch
>>>>>>>>>>>> another version. How can I download v20091104-1102?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Wen
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Dec 16, 2009 at 12:52 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ah yes, I see that now. The file I gave you is based on
>>>>>>>>>>>>> v20091104-1102. Let me see if I can retrofit the patch for you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>
>>>>>>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>>>>> Sent: Tuesday, December 15, 2009 1:04 PM
>>>>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Andy,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Which xrootd version are you using? XrdCmsConfig.hh is
>>>>>>>>>>>>> different. XrdCmsConfig.hh is downloaded from
>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsNode.cc
>>>>>>>>>>>>> 6fb3ae40fe4e10bdd4d372818a341f2c src/XrdCms/XrdCmsNode.cc
>>>>>>>>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsConfig.hh
>>>>>>>>>>>>> 7d57753847d9448186c718f98e963cbe src/XrdCms/XrdCmsConfig.hh
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Dec 15, 2009 at 10:50 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just compiled on Linux and it was clean. Something is really
>>>>>>>>>>>>>> wrong with your source files, specifically XrdCmsConfig.cc
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The MD5 checksums on the relevant files are:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> MD5 (XrdCmsNode.cc) = 6fb3ae40fe4e10bdd4d372818a341f2c
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> MD5 (XrdCmsConfig.hh) = 4a7d655582a7cd43b098947d0676924b
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>>>>>> Sent: Tuesday, December 15, 2009 4:24 AM
>>>>>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Andy,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> No problem. Thanks for the fix. But it cannot be compiled.
The >>>>>>>>>>>>>> version I am using is >>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Making cms component... >>>>>>>>>>>>>> Compiling XrdCmsNode.cc >>>>>>>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>>>>>>> XrdCmsNode::do_Chmod(XrdCmsRRData&)': >>>>>>>>>>>>>> XrdCmsNode.cc:268: error: `fsExec' was not declared in this >>>>>>>>>>>>>> scope >>>>>>>>>>>>>> XrdCmsNode.cc:268: warning: unused variable 'fsExec' >>>>>>>>>>>>>> XrdCmsNode.cc:269: error: 'class XrdCmsConfig' has no member >>>>>>>>>>>>>> named >>>>>>>>>>>>>> 'ossFS' >>>>>>>>>>>>>> XrdCmsNode.cc:273: error: `fsFail' was not declared in this >>>>>>>>>>>>>> scope >>>>>>>>>>>>>> XrdCmsNode.cc:273: warning: unused variable 'fsFail' >>>>>>>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>>>>>>> XrdCmsNode::do_Mkdir(XrdCmsRRData&)': >>>>>>>>>>>>>> XrdCmsNode.cc:600: error: `fsExec' was not declared in this >>>>>>>>>>>>>> scope >>>>>>>>>>>>>> XrdCmsNode.cc:600: warning: unused variable 'fsExec' >>>>>>>>>>>>>> XrdCmsNode.cc:601: error: 'class XrdCmsConfig' has no member >>>>>>>>>>>>>> named >>>>>>>>>>>>>> 'ossFS' >>>>>>>>>>>>>> XrdCmsNode.cc:605: error: `fsFail' was not declared in this >>>>>>>>>>>>>> scope >>>>>>>>>>>>>> XrdCmsNode.cc:605: warning: unused variable 'fsFail' >>>>>>>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>>>>>>> XrdCmsNode::do_Mkpath(XrdCmsRRData&)': >>>>>>>>>>>>>> XrdCmsNode.cc:640: error: `fsExec' was not declared in this >>>>>>>>>>>>>> scope >>>>>>>>>>>>>> XrdCmsNode.cc:640: warning: unused variable 'fsExec' >>>>>>>>>>>>>> XrdCmsNode.cc:641: error: 'class XrdCmsConfig' has no member >>>>>>>>>>>>>> named >>>>>>>>>>>>>> 'ossFS' >>>>>>>>>>>>>> XrdCmsNode.cc:645: error: `fsFail' was not declared in this >>>>>>>>>>>>>> scope >>>>>>>>>>>>>> XrdCmsNode.cc:645: warning: unused variable 'fsFail' >>>>>>>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>>>>>>> XrdCmsNode::do_Mv(XrdCmsRRData&)': >>>>>>>>>>>>>> XrdCmsNode.cc:704: error: `fsExec' was not declared in this >>>>>>>>>>>>>> scope >>>>>>>>>>>>>> XrdCmsNode.cc:704: warning: unused variable 'fsExec' >>>>>>>>>>>>>> XrdCmsNode.cc:705: error: 'class XrdCmsConfig' has no member >>>>>>>>>>>>>> named >>>>>>>>>>>>>> 'ossFS' >>>>>>>>>>>>>> XrdCmsNode.cc:709: error: `fsFail' was not declared in this >>>>>>>>>>>>>> scope >>>>>>>>>>>>>> XrdCmsNode.cc:709: warning: unused variable 'fsFail' >>>>>>>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>>>>>>> XrdCmsNode::do_Rm(XrdCmsRRData&)': >>>>>>>>>>>>>> XrdCmsNode.cc:831: error: `fsExec' was not declared in this >>>>>>>>>>>>>> scope >>>>>>>>>>>>>> XrdCmsNode.cc:831: warning: unused variable 'fsExec' >>>>>>>>>>>>>> XrdCmsNode.cc:832: error: 'class XrdCmsConfig' has no member >>>>>>>>>>>>>> named >>>>>>>>>>>>>> 'ossFS' >>>>>>>>>>>>>> XrdCmsNode.cc:836: error: `fsFail' was not declared in this >>>>>>>>>>>>>> scope >>>>>>>>>>>>>> XrdCmsNode.cc:836: warning: unused variable 'fsFail' >>>>>>>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>>>>>>> XrdCmsNode::do_Rmdir(XrdCmsRRData&)': >>>>>>>>>>>>>> XrdCmsNode.cc:873: error: `fsExec' was not declared in this >>>>>>>>>>>>>> scope >>>>>>>>>>>>>> XrdCmsNode.cc:873: warning: unused variable 'fsExec' >>>>>>>>>>>>>> XrdCmsNode.cc:874: error: 'class XrdCmsConfig' has no member >>>>>>>>>>>>>> named >>>>>>>>>>>>>> 'ossFS' >>>>>>>>>>>>>> XrdCmsNode.cc:878: error: `fsFail' was not declared in this >>>>>>>>>>>>>> scope >>>>>>>>>>>>>> XrdCmsNode.cc:878: warning: 
unused variable 'fsFail' >>>>>>>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>>>>>>> XrdCmsNode::do_Trunc(XrdCmsRRData&)': >>>>>>>>>>>>>> XrdCmsNode.cc:1377: error: `fsExec' was not declared in this >>>>>>>>>>>>>> scope >>>>>>>>>>>>>> XrdCmsNode.cc:1377: warning: unused variable 'fsExec' >>>>>>>>>>>>>> XrdCmsNode.cc:1378: error: 'class XrdCmsConfig' has no member >>>>>>>>>>>>>> named >>>>>>>>>>>>>> 'ossFS' >>>>>>>>>>>>>> XrdCmsNode.cc:1382: error: `fsFail' was not declared in this >>>>>>>>>>>>>> scope >>>>>>>>>>>>>> XrdCmsNode.cc:1382: warning: unused variable 'fsFail' >>>>>>>>>>>>>> XrdCmsNode.cc: At global scope: >>>>>>>>>>>>>> XrdCmsNode.cc:1524: error: no `int >>>>>>>>>>>>>> XrdCmsNode::fsExec(XrdOucProg*, >>>>>>>>>>>>>> char*, char*)' member function declared in class `XrdCmsNode' >>>>>>>>>>>>>> XrdCmsNode.cc: In member function `int >>>>>>>>>>>>>> XrdCmsNode::fsExec(XrdOucProg*, >>>>>>>>>>>>>> char*, char*)': >>>>>>>>>>>>>> XrdCmsNode.cc:1533: error: `fsL2PFail1' was not declared in >>>>>>>>>>>>>> this >>>>>>>>>>>>>> scope >>>>>>>>>>>>>> XrdCmsNode.cc:1533: warning: unused variable 'fsL2PFail1' >>>>>>>>>>>>>> XrdCmsNode.cc:1537: error: `fsL2PFail2' was not declared in >>>>>>>>>>>>>> this >>>>>>>>>>>>>> scope >>>>>>>>>>>>>> XrdCmsNode.cc:1537: warning: unused variable 'fsL2PFail2' >>>>>>>>>>>>>> XrdCmsNode.cc: At global scope: >>>>>>>>>>>>>> XrdCmsNode.cc:1553: error: no `const char* >>>>>>>>>>>>>> XrdCmsNode::fsFail(const >>>>>>>>>>>>>> char*, const char*, const char*, int)' member function declared >>>>>>>>>>>>>> in >>>>>>>>>>>>>> class `XrdCmsNode' >>>>>>>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>>>>>>> XrdCmsNode::fsFail(const char*, const char*, const char*, >>>>>>>>>>>>>> int)': >>>>>>>>>>>>>> XrdCmsNode.cc:1559: error: `fsL2PFail1' was not declared in >>>>>>>>>>>>>> this >>>>>>>>>>>>>> scope >>>>>>>>>>>>>> XrdCmsNode.cc:1559: warning: unused variable 'fsL2PFail1' >>>>>>>>>>>>>> XrdCmsNode.cc:1560: error: `fsL2PFail2' was not declared in >>>>>>>>>>>>>> this >>>>>>>>>>>>>> scope >>>>>>>>>>>>>> XrdCmsNode.cc:1560: warning: unused variable 'fsL2PFail2' >>>>>>>>>>>>>> XrdCmsNode.cc: In static member function `static int >>>>>>>>>>>>>> XrdCmsNode::isOnline(char*, int)': >>>>>>>>>>>>>> XrdCmsNode.cc:1608: error: 'class XrdCmsConfig' has no member >>>>>>>>>>>>>> named >>>>>>>>>>>>>> 'ossFS' >>>>>>>>>>>>>> make[4]: *** [../../obj/XrdCmsNode.o] Error 1 >>>>>>>>>>>>>> make[3]: *** [Linuxall] Error 2 >>>>>>>>>>>>>> make[2]: *** [all] Error 2 >>>>>>>>>>>>>> make[1]: *** [XrdCms] Error 2 >>>>>>>>>>>>>> make: *** [all] Error 2 >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Wen >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Tue, Dec 15, 2009 at 2:08 AM, Andrew Hanushevsky >>>>>>>>>>>>>> <[log in to unmask]> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Wen, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I have developed a permanent fix. You will find the source >>>>>>>>>>>>>>> files >>>>>>>>>>>>>>> in >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/ >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> There are three files: XrdCmsCluster.cc XrdCmsNode.cc >>>>>>>>>>>>>>> XrdCmsProtocol.cc >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Please do a source replacement and recompile. Unfortunately, >>>>>>>>>>>>>>> the >>>>>>>>>>>>>>> cmsd >>>>>>>>>>>>>>> will >>>>>>>>>>>>>>> need to be replaced on each node regardless of role. My >>>>>>>>>>>>>>> apologies >>>>>>>>>>>>>>> for >>>>>>>>>>>>>>> the >>>>>>>>>>>>>>> disruption. Please let me know how it goes. 
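>>>>>>>>>>>>>>> Roughly, on each node (a sketch only; use whatever build
>>>>>>>>>>>>>>> procedure you normally use):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> cd xrootd/src/XrdCms
>>>>>>>>>>>>>>> wget http://www.slac.stanford.edu/~abh/cmsd/XrdCmsCluster.cc
>>>>>>>>>>>>>>> wget http://www.slac.stanford.edu/~abh/cmsd/XrdCmsNode.cc
>>>>>>>>>>>>>>> wget http://www.slac.stanford.edu/~abh/cmsd/XrdCmsProtocol.cc
>>>>>>>>>>>>>>> cd ../.. && make
>>>>>>>>>>>>>>> md5sum src/XrdCms/XrdCmsNode.cc   # compare across nodes to be sure everyone has the same source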
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>>>>>>> Sent: Sunday, December 13, 2009 7:04 AM
>>>>>>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>> I used the new cmsd on the atlas-bkp1 manager, but it's still
>>>>>>>>>>>>>>> dropping nodes, and in the supervisor's log I cannot find any
>>>>>>>>>>>>>>> dataserver registering to it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The new logs are in http://higgs03.cs.wisc.edu/wguan/*.20091213.
>>>>>>>>>>>>>>> The manager was patched at 091213 08:38:15.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You will find the source replacement at:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's XrdCmsCluster.cc and it replaces xrootd/src/XrdCms/XrdCmsCluster.cc
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm stepping out for a couple of hours but will be back to
>>>>>>>>>>>>>>>> see how things went. Sorry for the issues :-(
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I prefer a source replacement. Then I can compile it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I can do one of two things here:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1) Supply a source replacement and then you would recompile, or
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 2) Give me the uname -a of where the cmsd will run and
>>>>>>>>>>>>>>>>>> I'll supply a binary replacement for you.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Your choice.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The problem is found. Great. Thanks.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Where can I find the patched cmsd?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I found the problem. Looks like a regression from way
>>>>>>>>>>>>>>>>>>>> back when. There is a missing flag on the redirect. This
>>>>>>>>>>>>>>>>>>>> will require a patched cmsd, but you need only to
>>>>>>>>>>>>>>>>>>>> replace the redirector's cmsd as this only affects the
>>>>>>>>>>>>>>>>>>>> redirector. How would you like to proceed?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> It doesn't work. The atlas-bkp1 manager is still
>>>>>>>>>>>>>>>>>>>>> dropping nodes. In the supervisor, I still haven't seen
>>>>>>>>>>>>>>>>>>>>> any dataserver registered. I said "I updated the ntp"
>>>>>>>>>>>>>>>>>>>>> because you said the log timestamps do not overlap.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Do you mean that everything is now working? It could
>>>>>>>>>>>>>>>>>>>>>> be that you removed the xrd.timeout directive. That
>>>>>>>>>>>>>>>>>>>>>> really could cause problems. As for the delays, that
>>>>>>>>>>>>>>>>>>>>>> is normal when the redirector thinks something is
>>>>>>>>>>>>>>>>>>>>>> going wrong. The strategy is to delay clients until it
>>>>>>>>>>>>>>>>>>>>>> can get back to a stable configuration. This usually
>>>>>>>>>>>>>>>>>>>>>> prevents jobs from crashing during stressful periods.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I restarted it to do the supervisor test, and also
>>>>>>>>>>>>>>>>>>>>>>> because the xrootd manager frequently doesn't
>>>>>>>>>>>>>>>>>>>>>>> respond. (*) is the cms.log; the file select is
>>>>>>>>>>>>>>>>>>>>>>> delayed again and again. After a restart, all things
>>>>>>>>>>>>>>>>>>>>>>> are fine. Now I am trying to find a clue about it.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> (*)
>>>>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> There is no core file. I copied new copies of the logs to the link below.
>>>>>>>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I see in the server log that it is restarting often.
>>>>>>>>>>>>>>>>>>>>>>>> Could you take a look on c193 to see if you have any
>>>>>>>>>>>>>>>>>>>>>>>> core files? Also please make sure that core files
>>>>>>>>>>>>>>>>>>>>>>>> are enabled, as Linux defaults the size to 0.
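>>>>>>>>>>>>>>>>>>>>>>>> For instance, in the shell that starts the daemons:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> ulimit -c            # 0 means core files are disabled
>>>>>>>>>>>>>>>>>>>>>>>> ulimit -c unlimited  # enable them before restarting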
>>>>>>>>>>>>>>>>>>>>>>>> The first step here is to find out why your servers are restarting.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> The logs can be found here. From the logs you can
>>>>>>>>>>>>>>>>>>>>>>>>> see the atlas-bkp1 manager dropping nodes that try
>>>>>>>>>>>>>>>>>>>>>>>>> to connect to it, again and again.
>>>>>>>>>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Wen, Could you start everything up and provide
>>>>>>>>>>>>>>>>>>>>>>>>>> me a pointer to the manager log file, supervisor
>>>>>>>>>>>>>>>>>>>>>>>>>> log file, and one data server logfile, all of
>>>>>>>>>>>>>>>>>>>>>>>>>> which cover the same time-frame (from start to
>>>>>>>>>>>>>>>>>>>>>>>>>> some point where you think things are working or
>>>>>>>>>>>>>>>>>>>>>>>>>> not). That way I can see what is happening. At the
>>>>>>>>>>>>>>>>>>>>>>>>>> moment I only see two "bad" things in the config file:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a
>>>>>>>>>>>>>>>>>>>>>>>>>> manager, but you claim, via the all.manager
>>>>>>>>>>>>>>>>>>>>>>>>>> directive, that there are three (bkp2 and bkp3).
>>>>>>>>>>>>>>>>>>>>>>>>>> While it should work, the log file will be dense
>>>>>>>>>>>>>>>>>>>>>>>>>> with error messages. Please correct this to be
>>>>>>>>>>>>>>>>>>>>>>>>>> consistent and make it easier to see real errors.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> This is not a problem for me, because this config
>>>>>>>>>>>>>>>>>>>>>>>>> is used on the dataservers. On the manager, I
>>>>>>>>>>>>>>>>>>>>>>>>> updated the "if atlas-bkp1.cs.wisc.edu" to
>>>>>>>>>>>>>>>>>>>>>>>>> atlas-bkp2, etc. This is a historical issue: at
>>>>>>>>>>>>>>>>>>>>>>>>> first only atlas-bkp1 was used; atlas-bkp2 and
>>>>>>>>>>>>>>>>>>>>>>>>> atlas-bkp3 were added later.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Please use cms.space not olb.space (for
>>>>>>>>>>>>>>>>>>>>>>>>>> historical reasons the latter is still accepted
>>>>>>>>>>>>>>>>>>>>>>>>>> and overrides the former, but that will soon end),
>>>>>>>>>>>>>>>>>>>>>>>>>> and please use only one (the config file uses both
>>>>>>>>>>>>>>>>>>>>>>>>>> directives).
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I should remove this line. In fact cms.space is in the cfg too.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> The xrootd has an internal mechanism to connect
>>>>>>>>>>>>>>>>>>>>>>>>>> servers with supervisors to allow for maximum
>>>>>>>>>>>>>>>>>>>>>>>>>> reliability. You cannot change that algorithm and
>>>>>>>>>>>>>>>>>>>>>>>>>> there is no need to do so. You should *never* tell
>>>>>>>>>>>>>>>>>>>>>>>>>> anyone to directly connect to a supervisor. If you
>>>>>>>>>>>>>>>>>>>>>>>>>> do, you will likely get unreachable nodes.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> As for dropping data servers, it would appear to
>>>>>>>>>>>>>>>>>>>>>>>>>> me, given the flurry of such activity, that
>>>>>>>>>>>>>>>>>>>>>>>>>> something either crashed or was restarted. That's
>>>>>>>>>>>>>>>>>>>>>>>>>> why it would be good to see the complete log of
>>>>>>>>>>>>>>>>>>>>>>>>>> each one of the entities.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I read the document and wrote a config file
>>>>>>>>>>>>>>>>>>>>>>>>>>> (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>>>>>>>>>>>>>>>>>>>>> Using my conf, I can see the manager dispatching
>>>>>>>>>>>>>>>>>>>>>>>>>>> messages to the supervisor, but I cannot see any
>>>>>>>>>>>>>>>>>>>>>>>>>>> dataserver trying to connect to the supervisor.
>>>>>>>>>>>>>>>>>>>>>>>>>>> At the same time, in the manager's log, I can see
>>>>>>>>>>>>>>>>>>>>>>>>>>> that some dataservers are dropped.
>>>>>>>>>>>>>>>>>>>>>>>>>>> How does xrootd decide which dataservers will
>>>>>>>>>>>>>>>>>>>>>>>>>>> connect to the supervisor? Should I specify some
>>>>>>>>>>>>>>>>>>>>>>>>>>> dataservers to connect to the supervisor?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> (*) supervisor log
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find failed for state /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> (*) manager log
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5693644MB MinFR=57218MB Util=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094 do_Space: 5696231MB free; 0% util
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.10585:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection >>>>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>>>> [log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running >>>>>>>>>>>>>>>>>>>>>>>>>>> ?:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched >>>>>>>>>>>>>>>>>>>>>>>>>>> protocol >>>>>>>>>>>>>>>>>>>>>>>>>>> cmsd >>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>>>> 79 >>>>>>>>>>>>>>>>>>>>>>>>>>> attached >>>>>>>>>>>>>>>>>>>>>>>>>>> to poller 2; num=22 >>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add >>>>>>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>>>> bumps >>>>>>>>>>>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 #63 >>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 >>>>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node: >>>>>>>>>>>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 dropped. >>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved >>>>>>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 to >>>>>>>>>>>>>>>>>>>>>>>>>>> cluster; >>>>>>>>>>>>>>>>>>>>>>>>>>> id=63.78; >>>>>>>>>>>>>>>>>>>>>>>>>>> num=64; >>>>>>>>>>>>>>>>>>>>>>>>>>> min=51 >>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 >>>>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>>>> TSpace=5587GB >>>>>>>>>>>>>>>>>>>>>>>>>>> NumFS=1 >>>>>>>>>>>>>>>>>>>>>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0 >>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>>>> adding >>>>>>>>>>>>>>>>>>>>>>>>>>> path: >>>>>>>>>>>>>>>>>>>>>>>>>>> w >>>>>>>>>>>>>>>>>>>>>>>>>>> /atlas >>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>>>> do_Space: 5721854MB free; 0% util >>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 logged in. >>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve >>>>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>>>> c187.chtc.wisc.edu; connection reset by peer >>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 >>>>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop >>>>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>>>> 60 >>>>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 node 63.78 >>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>>>>> out. 
> 091211 04:13:24 15661 server.21739:[log in to unmask] XrdPoll: FD 79 detached from poller 2; num=21
> 091211 04:13:27 15661 Dispatch server.24718:[log in to unmask]:1094 for status dlen=0
> 091211 04:13:27 15661 server.24718:[log in to unmask]:1094 do_Status: suspend
> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended
> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c177.chtc.wisc.edu FD=16
> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
> 091211 04:13:27 15661 Remove_Node server.24718:[log in to unmask]:1094 node 0.3
> 091211 04:13:27 15661 Protocol: server.21656:[log in to unmask] logged out.
> 091211 04:13:27 15661 server.21656:[log in to unmask] XrdPoll: FD 16 detached from poller 2; num=20
> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c179.chtc.wisc.edu FD=21
> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:27 15661 Remove_Node server.17065:[log in to unmask]:1094 node 1.4
> 091211 04:13:27 15661 Protocol: server.7978:[log in to unmask] logged out.
> 091211 04:13:27 15661 server.7978:[log in to unmask] XrdPoll: FD 21 detached from poller 1; num=21
> 091211 04:13:27 15661 State: Status changed to suspended
> 091211 04:13:27 15661 Send status to redirector.15656:14@atlas-bkp2
> 091211 04:13:27 15661 Dispatch server.12937:[log in to unmask]:1094 for status dlen=0
> 091211 04:13:27 15661 server.12937:[log in to unmask]:1094 do_Status: suspend
> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended
> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c182.chtc.wisc.edu FD=19
> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
> 091211 04:13:27 15661 Remove_Node server.12937:[log in to unmask]:1094 node 7.10
> 091211 04:13:27 15661 Protocol: server.26620:[log in to unmask] logged out.
> 091211 04:13:27 15661 server.26620:[log in to unmask] XrdPoll: FD 19 detached from poller 2; num=19
> 091211 04:13:27 15661 Dispatch server.10842:[log in to unmask]:1094 for status dlen=0
> 091211 04:13:27 15661 server.10842:[log in to unmask]:1094 do_Status: suspend
> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service suspended
> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c178.chtc.wisc.edu FD=15
> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
> 091211 04:13:27 15661 Remove_Node server.10842:[log in to unmask]:1094 node 9.12
> 091211 04:13:27 15661 Protocol: server.11901:[log in to unmask] logged out.
> 091211 04:13:27 15661 server.11901:[log in to unmask] XrdPoll: FD 15 detached from poller 1; num=20
> 091211 04:13:27 15661 Dispatch server.5535:[log in to unmask]:1094 for status dlen=0
> 091211 04:13:27 15661 server.5535:[log in to unmask]:1094 do_Status: suspend
> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service suspended
> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c181.chtc.wisc.edu FD=17
> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
> 091211 04:13:27 15661 Remove_Node server.5535:[log in to unmask]:1094 node 5.8
> 091211 04:13:27 15661 Protocol: server.13984:[log in to unmask] logged out.
> 091211 04:13:27 15661 server.13984:[log in to unmask] XrdPoll: FD 17 detached from poller 0; num=21
> 091211 04:13:27 15661 Dispatch server.23711:[log in to unmask]:1094 for status dlen=0
> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094 do_Status: suspend
> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended
> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c183.chtc.wisc.edu FD=22
> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
> 091211 04:13:27 15661 Remove_Node server.23711:[log in to unmask]:1094 node 8.11
> 091211 04:13:27 15661 Protocol: server.27735:[log in to unmask] logged out.
> 091211 04:13:27 15661 server.27735:[log in to unmask] XrdPoll: FD 22 detached from poller 2; num=18
> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c184.chtc.wisc.edu FD=20
> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:27 15661 Remove_Node server.4131:[log in to unmask]:1094 node 3.6
> 091211 04:13:27 15661 Protocol: server.26787:[log in to unmask] logged out.
> 091211 04:13:27 15661 server.26787:[log in to unmask] XrdPoll: FD 20 detached from poller 0; num=20
> 091211 04:13:27 15661 Dispatch server.10585:[log in to unmask]:1094 for status dlen=0
> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094 do_Status: suspend
> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended
> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c185.chtc.wisc.edu FD=23
> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
> 091211 04:13:27 15661 Remove_Node server.10585:[log in to unmask]:1094 node 6.9
> 091211 04:13:27 15661 Protocol: server.8524:[log in to unmask] logged out.
> 091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll: FD 23 detached from poller 0; num=19
> 091211 04:13:27 15661 Dispatch server.20264:[log in to unmask]:1094 for status dlen=0
> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094 do_Status: suspend
> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended
> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c180.chtc.wisc.edu FD=18
> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
> 091211 04:13:27 15661 Remove_Node server.20264:[log in to unmask]:1094 node 4.7
> 091211 04:13:27 15661 Protocol: server.14636:[log in to unmask] logged out.
> 091211 04:13:27 15661 server.14636:[log in to unmask] XrdPoll: FD 18 detached from poller 1; num=19
> 091211 04:13:27 15661 Dispatch server.1656:[log in to unmask]:1094 for status dlen=0
> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094 do_Status: suspend
> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c186.chtc.wisc.edu FD=24
> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
> 091211 04:13:27 15661 Remove_Node server.1656:[log in to unmask]:1094 node 2.5
> 091211 04:13:27 15661 Protocol: server.7849:[log in to unmask] logged out.
> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll: FD 24 detached from poller 1; num=18
> 091211 04:14:14 15661 XrdSched: running drop node inq=0
> 091211 04:14:14 15661 XrdSched: running drop node inq=0
> 091211 04:14:14 15661 XrdSched: running drop node inq=0
> 091211 04:14:14 15661 XrdSched: running drop node inq=0
> 091211 04:14:14 15661 XrdSched: running drop node inq=0
> 091211 04:14:14 15661 XrdSched: running drop node inq=0
> 091211 04:14:14 15661 XrdSched: running drop node inq=0
> 091211 04:14:14 15661 XrdSched: running drop node inq=0
> 091211 04:14:14 15661 XrdSched: running drop node inq=0
> 091211 04:14:14 15661 XrdSched: running drop node inq=0
> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
> 091211 04:14:24 15661 XrdSched: running drop node inq=1
> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
> 091211 04:14:24 15661 XrdSched: Now have 68 workers
> 091211 04:14:24 15661 XrdSched: running drop node inq=0
> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>
> Wen
>
> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>> Hi Wen,
>>
>> To go past 64 data servers you will need to set up one or more
>> supervisors. This does not logically change the current configuration
>> you have. You only need to configure one or more *new* servers (or at
>> least xrootd processes) whose role is supervisor.
We'd like them to run >>>>>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>>>>> separate >>>>>>>>>>>>>>>>>>>>>>>>>>>> machines >>>>>>>>>>>>>>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>>>>>>>>>>>>>> reliability purposes, but they could run on the >>>>>>>>>>>>>>>>>>>>>>>>>>>> manager >>>>>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>>>>> as >>>>>>>>>>>>>>>>>>>>>>>>>>>> long >>>>>>>>>>>>>>>>>>>>>>>>>>>> as >>>>>>>>>>>>>>>>>>>>>>>>>>>> you >>>>>>>>>>>>>>>>>>>>>>>>>>>> give each one a unique instance name (i.e., -n >>>>>>>>>>>>>>>>>>>>>>>>>>>> option). >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> The front part of the cmsd reference explains how >>>>>>>>>>>>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>>>>>>>>>>>> do >>>>>>>>>>>>>>>>>>>>>>>>>>>> this. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Andy >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Is there any change to configure xrootd with >>>>>>>>>>>>>>>>>>>>>>>>>>>>> more >>>>>>>>>>>>>>>>>>>>>>>>>>>>> than >>>>>>>>>>>>>>>>>>>>>>>>>>>>> 65 >>>>>>>>>>>>>>>>>>>>>>>>>>>>> machines? I used the configure below but it >>>>>>>>>>>>>>>>>>>>>>>>>>>>> doesn't >>>>>>>>>>>>>>>>>>>>>>>>>>>>> work. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Should >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I >>>>>>>>>>>>>>>>>>>>>>>>>>>>> configure some machines' manager to be >>>>>>>>>>>>>>>>>>>>>>>>>>>>> supvervisor? >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Wen >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>> >>>> >>>> >>>> >> >> >