Wen,

I was wondering if you finally did succeed in getting >64 data server nodes working using a supervisor, etc.

thanks,
Rob

On Dec 18, 2009, at 8:58 AM, wen guan wrote:

Hi Andy,

I am sure I am using the right cmsd code. Today I compiled and reinstalled all of cmsd and xrootd, but it still doesn't work. I will create an account for you so you can log in to these machines and check what happened.

In fact, today while doing some restarts I saw some machines register themselves to higgs07. Unfortunately the logs were cleaned during the reinstall.

I also found that the supervisor goes to the "suspend" state a while after it is started. Could that cause the supervisor to fail to get some information?

Wen

On Fri, Dec 18, 2009 at 3:05 AM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Wen,

Something is really going wrong with your data servers. For instance, c109 is quite happy from midnight to 7:23am. Then it dropped the connection. Then it reconnected at 7:24:03 and was again happy until 12:37:20, but here it reported that its xrootd died, and then the cmsd promptly killed its connection afterward. This appears as if someone restarted the xrootd followed by the cmsd on c109. It continued like this until 12:43:00 (i.e., connect, suspend, die, repeat). All your servers, in fact, started doing this between 12:36:41 and 12:42:51, causing a massive swap of servers. New servers were added, and old ones reconnecting were redirected to the supervisor. However, it would appear that those machines could not connect there, as they kept coming back to atlas-bkp1. I can't tell you anything about what was happening on higgs07. As far as I can tell it was happily connected to the redirector cmsd. The reason is that there is no log for higgs07 on the web site for 12/17 starting at midnight. Perhaps you can put one there.

So,

1) Are you *absolutely* sure that *all* your (data, etc.) servers are running the corrected cmsd?

2) Please provide the higgs07 log for 12/17.

3) Please provide logs for a sampling of data servers, say c0109, c094, higgs15, and higgs13, for 12/17 between 12:00:00 and 15:44.

I have never seen a situation like yours, so something is very wrong here. In the meantime I will add more debugging information to the redirector and supervisor and let you know when that is available.

Andy

----- Original Message ----- From: "wen guan" <[log in to unmask]>
To: "Fabrizio Furano" <[log in to unmask]>
Cc: "Andrew Hanushevsky" <[log in to unmask]>; <[log in to unmask]>
Sent: Thursday, December 17, 2009 3:12 PM
Subject: Re: xrootd with more than 65 machines

Hi Fabrizio,

This is the xrdcp debug message.

ClientHeader.header.dlen = 41
=================== END CLIENT HEADER DUMPING ===================

091217 16:47:54 15961 Xrd: WriteRaw: Writing 24 bytes to physical connection
091217 16:47:54 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:47:54 15961 Xrd: WriteRaw: Writing 41 bytes to physical connection
091217 16:47:54 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:47:54 15961 Xrd: ReadPartialAnswer: Reading a XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: sid: 1, IsAttn: 0, substreamid: 0
091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4 bytes) from substream 0
091217 16:47:54 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
091217 16:47:54 15961 Xrd: BuildMessage: posting id 1
091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8 bytes).
091217 16:47:54 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094

======== DUMPING SERVER RESPONSE HEADER ========
ServerHeader.streamid = 0x01 0x00
ServerHeader.status = kXR_wait (4005)
ServerHeader.dlen = 4
========== END DUMPING SERVER HEADER ===========

091217 16:47:54 15961 Xrd: ReadPartialAnswer: Server [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
091217 16:47:54 15961 Xrd: CheckErrorStatus: Server [atlas-bkp1.cs.wisc.edu:1094] requested 10 seconds of wait
091217 16:48:04 15961 Xrd: DumpPhyConn: Phyconn entry, [log in to unmask]:1094', LogCnt=1 Valid
091217 16:48:04 15961 Xrd: SendGenCommand: Sending command Open

================= DUMPING CLIENT REQUEST HEADER =================
ClientHeader.streamid = 0x01 0x00
ClientHeader.requestid = kXR_open (3010)
ClientHeader.open.mode = 0x00 0x00
ClientHeader.open.options = 0x40 0x04
ClientHeader.open.reserved = 0 repeated 12 times
ClientHeader.header.dlen = 41
=================== END CLIENT HEADER DUMPING ===================

091217 16:48:04 15961 Xrd: WriteRaw: Writing 24 bytes to physical connection
091217 16:48:04 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:48:04 15961 Xrd: WriteRaw: Writing 41 bytes to physical connection
091217 16:48:04 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:48:04 15961 Xrd: ReadPartialAnswer: Reading a XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: sid: 1, IsAttn: 0, substreamid: 0
091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4 bytes) from substream 0
091217 16:48:04 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
091217 16:48:04 15961 Xrd: BuildMessage: posting id 1
091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8 bytes).
091217 16:48:04 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094

======== DUMPING SERVER RESPONSE HEADER ========
ServerHeader.streamid = 0x01 0x00
ServerHeader.status = kXR_wait (4005)
ServerHeader.dlen = 4
========== END DUMPING SERVER HEADER ===========

091217 16:48:04 15961 Xrd: ReadPartialAnswer: Server [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
091217 16:48:04 15961 Xrd: CheckErrorStatus: Server [atlas-bkp1.cs.wisc.edu:1094] requested 10 seconds of wait
091217 16:48:14 15961 Xrd: SendGenCommand: Sending command Open

================= DUMPING CLIENT REQUEST HEADER =================
ClientHeader.streamid = 0x01 0x00
ClientHeader.requestid = kXR_open (3010)
ClientHeader.open.mode = 0x00 0x00
ClientHeader.open.options = 0x40 0x04
ClientHeader.open.reserved = 0 repeated 12 times
ClientHeader.header.dlen = 41
=================== END CLIENT HEADER DUMPING ===================

091217 16:48:14 15961 Xrd: WriteRaw: Writing 24 bytes to physical connection
091217 16:48:14 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:48:14 15961 Xrd: WriteRaw: Writing 41 bytes to physical connection
091217 16:48:14 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:48:14 15961 Xrd: ReadPartialAnswer: Reading a XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: sid: 1, IsAttn: 0, substreamid: 0
091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4 bytes) from substream 0
091217 16:48:14 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
091217 16:48:14 15961 Xrd: BuildMessage: posting id 1
091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8 bytes).
091217 16:48:14 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094

======== DUMPING SERVER RESPONSE HEADER ========
ServerHeader.streamid = 0x01 0x00
ServerHeader.status = kXR_wait (4005)
ServerHeader.dlen = 4
========== END DUMPING SERVER HEADER ===========

091217 16:48:14 15961 Xrd: ReadPartialAnswer: Server [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
091217 16:48:14 15961 Xrd: SendGenCommand: Max time limit elapsed for request kXR_open. Aborting command.
Last server error 10000 ('')
Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131

Wen

On Thu, Dec 17, 2009 at 11:27 PM, Fabrizio Furano <[log in to unmask]> wrote:

Hi Wen,

I see that you are getting error 10000, which means "generic error before any interaction". Could you please run the same command with debug level 3 and post the log showing the same kind of issue? Something like

xrdcp -d 3 ....

Most likely the problem is different this time. I may be wrong here, but a possible reason for that error is that the servers require authentication and xrdcp does not find some library in the LD_LIBRARY_PATH.

Fabrizio

wen guan wrote:

Hi Andy,

I put new logs on the web.

It still doesn't work. I cannot copy files in or out.

It seems the xrootd daemon at atlas-bkp1 hasn't talked with the cmsd. Normally, when xrootd tries to copy a file, I should see "do_Select: filename" in the cms.log. But in this cms.log there is nothing from atlas-bkp1.

(*)
[root@atlas-bkp1 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
Last server error 10000 ('')
Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
[root@atlas-bkp1 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123133

Wen
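As a side note, a minimal way to check for that xrootd-to-cmsd handshake (a sketch; the cms.log path is taken from the tail command used later in this thread, so adjust to your layout):

    # Every open that xrootd forwards to its cmsd should leave a do_Select
    # line naming the requested file; silence here means the two daemons
    # on this host are not talking to each other.
    grep do_Select /var/log/xrootd/cms.log | tail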
On Thu, Dec 17, 2009 at 10:54 PM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Wen,

I reviewed the log file. Other than the odd redirect of c131 at 17:47:25, which I can't comment on because its logs on the web site do not overlap with the manager or supervisor, I can't say much of anything unless all the logs include the full time in question. Can you provide me with inclusive logs?

atlas-bkp1 cms: 17:20:57 to 17:42:19, xrd: 17:20:57 to 17:40:57
higgs07 cms & xrd 17:22:33 to 17:42:33
c131 cms & xrd 17:31:57 to 17:47:28

That said, it certainly looks like things were working and files were being accessed and discovered on all the machines. You even were able to open /atlas/xrootd/users/wguan/test/test98123313, though not /atlas/xrootd/users/wguan/test/test123131. The other issue is that you did not specify a stable adminpath, and the adminpath defaults to /tmp. If you have a "cleanup" script that runs periodically for /tmp, then eventually your cluster will go catatonic as important (but not often used) files are deleted by that script. Could you please find a stable home for the adminpath?

I reran my tests here and things worked as expected. I will ramp up some more tests. So, what is your status today?

Andy
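A stable adminpath is a one-line change in the config (a sketch; /var/spool/xrootd is only an example of a persistent local directory, not a value from this thread):

    all.adminpath /var/spool/xrootd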
----- Original Message ----- From: "wen guan" <[log in to unmask]>
To: "Andrew Hanushevsky" <[log in to unmask]>
Cc: <[log in to unmask]>
Sent: Thursday, December 17, 2009 5:05 AM
Subject: Re: xrootd with more than 65 machines

Hi Andy,

Yes. I am using the file downloaded from http://www.slac.stanford.edu/~abh/cmsd/, which I compiled yesterday. I just now compiled it again and compared it with the one I compiled yesterday; they are the same (same md5sum).

Wen

On Thu, Dec 17, 2009 at 2:09 AM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Wen,

If c131 cannot connect, then either c131 does not have the new cmsd or atlas-bkp1 does not have the new cmsd, as that is what would happen if either were true. Looking at the log on c131, it would appear that atlas-bkp1 is still using the old cmsd, as the response data length is wrong. Could you verify, please?

Andy

----- Original Message ----- From: "wen guan" <[log in to unmask]>
To: "Andrew Hanushevsky" <[log in to unmask]>
Cc: <[log in to unmask]>
Sent: Wednesday, December 16, 2009 3:58 PM
Subject: Re: xrootd with more than 65 machines

Hi Andy,

I tried it, but there are still some problems. I put the logs in higgs03.cs.wisc.edu/wguan/

In my test, c131 is the 65th node to be added to the manager, and I can copy a file into the pool through the manager. But I cannot copy out a file that is on c131.

In c131's cms.log, I see "Manager: manager.0:[log in to unmask] removed; redirected" again and again, and I cannot see anything about c131 in higgs07's (supervisor) log. Does it mean the manager tries to redirect it to higgs07, but c131 doesn't try to connect to higgs07? It only tries to connect to the manager again.
(*)
[root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
Last server error 10000 ('')
Error accessing path/file for root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
[root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123311
[xrootd] Total 0.06 MB |====================| 100.00 % [3.1 MB/s]
[root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123312
[xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
[root@c131 ~]# ls /atlas/xrootd/users/wguan/test/
test123131
[root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
Last server error 3011 ('No servers are available to read the file.')
Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
[root@c131 ~]# ls /atlas/xrootd/users/wguan/test/test123131
/atlas/xrootd/users/wguan/test/test123131
[root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
Last server error 3011 ('No servers are available to read the file.')
Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
[root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123313
[xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
[root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
Last server error 3011 ('No servers are available to read the file.')
Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
[root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
Last server error 3011 ('No servers are available to read the file.')
Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
[root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
Last server error 3011 ('No servers are available to read the file.')
Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
[root@c131 ~]# tail -f /var/log/xrootd/cms.log
091216 17:45:52 3103 manager.0:[log in to unmask] XrdLink: Setting ref to 2+-1 post=0
091216 17:45:55 3103 Pander trying to connect to lvl 0 atlas-bkp1.cs.wisc.edu:3121
091216 17:45:55 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
091216 17:45:55 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
091216 17:45:55 3103 ManTree: Now connected to 3 root node(s)
091216 17:45:55 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
091216 17:45:55 3103 Dispatch manager.0:17@atlas-bkp1.cs.wisc.edu for try dlen=3
091216 17:45:55 3103 manager.0:[log in to unmask] do_Try:
091216 17:45:55 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.95
091216 17:45:55 3103 Manager: manager.0:[log in to unmask] removed; redirected
091216 17:46:04 3103 Pander trying to connect to lvl 0 atlas-bkp1.cs.wisc.edu:3121
091216 17:46:04 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
091216 17:46:04 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
091216 17:46:04 3103 ManTree: Now connected to 3 root node(s)
091216 17:46:04 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
091216 17:46:04 3103 Dispatch manager.0:17@atlas-bkp1.cs.wisc.edu for try dlen=3
091216 17:46:04 3103 Protocol: No buffers to serve atlas-bkp1.cs.wisc.edu
091216 17:46:04 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.96
091216 17:46:04 3103 Manager: manager.0:[log in to unmask] removed; insufficient buffers
091216 17:46:11 3103 Dispatch manager.0:19@atlas-bkp2.cs.wisc.edu for state dlen=169
091216 17:46:11 3103 manager.0:[log in to unmask] XrdLink: Setting ref to 1+1 post=0

Thanks
Wen

On Thu, Dec 17, 2009 at 12:10 AM, wen guan <[log in to unmask]> wrote:

Hi Andy,

> OK, I understand. As for stalling, too many nodes were deemed to be in trouble for the manager to allow service resumption.
>
> Please make sure that all of the nodes in the cluster receive the new cmsd, as they will drop off with the old one and you'll see the same kind of activity. Perhaps the best way to know that you succeeded in putting everything in sync is to start with 63 data nodes plus one supervisor. Once all connections are established, adding an additional server should simply send it to the supervisor.

I will do it.

You said to start 63 data servers and one supervisor. Does that mean the supervisor is managed under the same policy? If there are 64 data servers connected before the supervisor, will the supervisor be dropped, or does the supervisor have high priority to be added to the manager? I mean, if there are already 64 data servers and a supervisor comes in, will the supervisor be accepted and a data server be redirected to the supervisor?

Thanks
Wen
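For reference, the role layout being tested boils down to one role directive per node type plus the manager declaration (a sketch; the directives are the standard cms role and manager declarations, while the hostnames and port 3121 come from the logs in this thread):

    all.role manager        # on atlas-bkp1 (and its alternates bkp2/bkp3)
    all.role supervisor     # on higgs07
    all.role server         # on each data server (c0xx, c1xx, higgsNN)
    all.manager atlas-bkp1.cs.wisc.edu:3121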
Hi Andrew,

But when I tried to xrdcp a file to it, it doesn't respond. In atlas-bkp1-xrd.log.20091213 it always prints "stalling client for 10 sec", but in cms.log I can't find any message about the file.

> I don't see why you say it doesn't work. With the debugging level set so high, the noise may make it look like something is going wrong, but that isn't necessarily the case.
>
> 1) The 'too many subscribers' is correct. The manager was simply redirecting them because there were already 64 servers. However, in your case the supervisor wasn't started until almost 30 minutes after everyone else (i.e., 10:42 AM). Why was that? I'm not surprised about the flurry of messages with a critical component missing for 30 minutes.

Because the manager is a 64-bit machine but the supervisor is a 32-bit machine, I had to recompile it. At that time, I was interrupted by something else.

> 2) Once the supervisor started, it started accepting the redirected servers.
>
> 3) Then 10 seconds later (10:42:10) the supervisor was restarted. That would cause a flurry of activity to occur, as there is no backup supervisor to take over.
>
> 4) This happened again at 10:42:34 AM and then again at 10:48:49. Is the supervisor crashing? Is there a core file?
>
> 5) At 11:11 AM the manager restarted. Again, is there a core file here or was this a manual action?
>
> During the course of all of this, all nodes connected were operating properly and files were being located.
>
> So, the two big questions are:
>
> a) Why was the supervisor not started until 30 minutes after the system was started?
>
> b) Is there an explanation of the restarts? If this was a crash then we need a core file to figure out what happened.

It's not a crash. There are some reasons that I restarted some daemons.
(1) I thought that if a data server tried many times to connect to a redirector but failed, it would not try to connect to the redirector again. The supervisor was missing for a long time, so maybe some data servers would no longer try to connect to atlas-bkp1. To reactivate those data servers, I restarted some servers.
(2) When I tried to xrdcp, it hung for a long time. I thought maybe the manager was affected by something else, so I restarted the manager to see whether a restart could make the xrdcp work.

Thanks
Wen

> Andy

----- Original Message ----- From: "wen guan" <[log in to unmask]>
To: "Andrew Hanushevsky" <[log in to unmask]>
Cc: <[log in to unmask]>
Sent: Wednesday, December 16, 2009 9:38 AM
Subject: Re: xrootd with more than 65 machines

Hi Andrew,

It still doesn't work. The log files are in higgs03.cs.wisc.edu/wguan/; the names are *.20091216. The manager complains there are too many subscribers and then removes nodes.
(*)
Add server.10040:[log in to unmask] redirected; too many subscribers.

Wen

On Wed, Dec 16, 2009 at 4:25 AM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Wen,

It will be easier for me to retrofit, as the changes were pretty minor. Please lift the new XrdCmsNode.cc file from

http://www.slac.stanford.edu/~abh/cmsd

Andy

----- Original Message ----- From: "wen guan" <[log in to unmask]>
To: "Andrew Hanushevsky" <[log in to unmask]>
Cc: <[log in to unmask]>
Sent: Tuesday, December 15, 2009 5:12 PM
Subject: Re: xrootd with more than 65 machines

Hi Andy,

I can switch to 20091104-1102; then you don't need to patch another version. How can I download v20091104-1102?

Thanks
Wen

On Wed, Dec 16, 2009 at 12:52 AM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Wen,

Ah yes, I see that now. The file I gave you is based on v20091104-1102. Let me see if I can retrofit the patch for you.

Andy

----- Original Message ----- From: "wen guan" <[log in to unmask]>
To: "Andrew Hanushevsky" <[log in to unmask]>
Cc: <[log in to unmask]>
Sent: Tuesday, December 15, 2009 1:04 PM
Subject: Re: xrootd with more than 65 machines

Hi Andy,

Which xrootd version are you using? XrdCmsConfig.hh is different. My XrdCmsConfig.hh was downloaded from http://xrootd.slac.stanford.edu/download/20091028-1003/.

[root@c121 xrootd]# md5sum src/XrdCms/XrdCmsNode.cc
6fb3ae40fe4e10bdd4d372818a341f2c src/XrdCms/XrdCmsNode.cc
[root@c121 xrootd]# md5sum src/XrdCms/XrdCmsConfig.hh
7d57753847d9448186c718f98e963cbe src/XrdCms/XrdCmsConfig.hh

Thanks
Wen

On Tue, Dec 15, 2009 at 10:50 PM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Wen,

Just compiled on Linux and it was clean. Something is really wrong with your source files, specifically XrdCmsConfig.cc

The MD5 checksums on the relevant files are:

MD5 (XrdCmsNode.cc) = 6fb3ae40fe4e10bdd4d372818a341f2c
MD5 (XrdCmsConfig.hh) = 4a7d655582a7cd43b098947d0676924b

Andy

----- Original Message ----- From: "wen guan" <[log in to unmask]>
To: "Andrew Hanushevsky" <[log in to unmask]>
Cc: <[log in to unmask]>
Sent: Tuesday, December 15, 2009 4:24 AM
Subject: Re: xrootd with more than 65 machines

Hi Andy,

No problem. Thanks for the fix. But it cannot be compiled. The version I am using is http://xrootd.slac.stanford.edu/download/20091028-1003/.
Making cms component...
Compiling XrdCmsNode.cc
XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Chmod(XrdCmsRRData&)':
XrdCmsNode.cc:268: error: `fsExec' was not declared in this scope
XrdCmsNode.cc:268: warning: unused variable 'fsExec'
XrdCmsNode.cc:269: error: 'class XrdCmsConfig' has no member named 'ossFS'
XrdCmsNode.cc:273: error: `fsFail' was not declared in this scope
XrdCmsNode.cc:273: warning: unused variable 'fsFail'
XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mkdir(XrdCmsRRData&)':
XrdCmsNode.cc:600: error: `fsExec' was not declared in this scope
XrdCmsNode.cc:600: warning: unused variable 'fsExec'
XrdCmsNode.cc:601: error: 'class XrdCmsConfig' has no member named 'ossFS'
XrdCmsNode.cc:605: error: `fsFail' was not declared in this scope
XrdCmsNode.cc:605: warning: unused variable 'fsFail'
XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mkpath(XrdCmsRRData&)':
XrdCmsNode.cc:640: error: `fsExec' was not declared in this scope
XrdCmsNode.cc:640: warning: unused variable 'fsExec'
XrdCmsNode.cc:641: error: 'class XrdCmsConfig' has no member named 'ossFS'
XrdCmsNode.cc:645: error: `fsFail' was not declared in this scope
XrdCmsNode.cc:645: warning: unused variable 'fsFail'
XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mv(XrdCmsRRData&)':
XrdCmsNode.cc:704: error: `fsExec' was not declared in this scope
XrdCmsNode.cc:704: warning: unused variable 'fsExec'
XrdCmsNode.cc:705: error: 'class XrdCmsConfig' has no member named 'ossFS'
XrdCmsNode.cc:709: error: `fsFail' was not declared in this scope
XrdCmsNode.cc:709: warning: unused variable 'fsFail'
XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Rm(XrdCmsRRData&)':
XrdCmsNode.cc:831: error: `fsExec' was not declared in this scope
XrdCmsNode.cc:831: warning: unused variable 'fsExec'
XrdCmsNode.cc:832: error: 'class XrdCmsConfig' has no member named 'ossFS'
XrdCmsNode.cc:836: error: `fsFail' was not declared in this scope
XrdCmsNode.cc:836: warning: unused variable 'fsFail'
XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Rmdir(XrdCmsRRData&)':
XrdCmsNode.cc:873: error: `fsExec' was not declared in this scope
XrdCmsNode.cc:873: warning: unused variable 'fsExec'
XrdCmsNode.cc:874: error: 'class XrdCmsConfig' has no member named 'ossFS'
XrdCmsNode.cc:878: error: `fsFail' was not declared in this scope
XrdCmsNode.cc:878: warning: unused variable 'fsFail'
XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Trunc(XrdCmsRRData&)':
XrdCmsNode.cc:1377: error: `fsExec' was not declared in this scope
XrdCmsNode.cc:1377: warning: unused variable 'fsExec'
XrdCmsNode.cc:1378: error: 'class XrdCmsConfig' has no member named 'ossFS'
XrdCmsNode.cc:1382: error: `fsFail' was not declared in this scope
XrdCmsNode.cc:1382: warning: unused variable 'fsFail'
XrdCmsNode.cc: At global scope:
XrdCmsNode.cc:1524: error: no `int XrdCmsNode::fsExec(XrdOucProg*, char*, char*)' member function declared in class `XrdCmsNode'
XrdCmsNode.cc: In member function `int XrdCmsNode::fsExec(XrdOucProg*, char*, char*)':
XrdCmsNode.cc:1533: error: `fsL2PFail1' was not declared in this scope
XrdCmsNode.cc:1533: warning: unused variable 'fsL2PFail1'
XrdCmsNode.cc:1537: error: `fsL2PFail2' was not declared in this scope
XrdCmsNode.cc:1537: warning: unused variable 'fsL2PFail2'
XrdCmsNode.cc: At global scope:
XrdCmsNode.cc:1553: error: no `const char* XrdCmsNode::fsFail(const char*, const char*, const char*, int)' member function declared in class `XrdCmsNode'
XrdCmsNode.cc: In member function `const char* XrdCmsNode::fsFail(const char*, const char*, const char*, int)':
XrdCmsNode.cc:1559: error: `fsL2PFail1' was not declared in this scope
XrdCmsNode.cc:1559: warning: unused variable 'fsL2PFail1'
XrdCmsNode.cc:1560: error: `fsL2PFail2' was not declared in this scope
XrdCmsNode.cc:1560: warning: unused variable 'fsL2PFail2'
XrdCmsNode.cc: In static member function `static int XrdCmsNode::isOnline(char*, int)':
XrdCmsNode.cc:1608: error: 'class XrdCmsConfig' has no member named 'ossFS'
make[4]: *** [../../obj/XrdCmsNode.o] Error 1
make[3]: *** [Linuxall] Error 2
make[2]: *** [all] Error 2
make[1]: *** [XrdCms] Error 2
make: *** [all] Error 2

Wen
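The errors above all point one way: the replacement XrdCmsNode.cc references members (fsExec, fsFail, and the Config ossFS member) that the 20091028-1003 headers never declare, so the .cc and .hh files come from different releases. A quick check along these lines (a sketch; treating XrdCmsNode.hh as the matching header is an assumption about the source layout):

    cd xrootd/src/XrdCms
    grep -n 'fsExec\|fsFail' XrdCmsNode.hh     # declarations the new .cc expects
    grep -n 'ossFS' XrdCmsConfig.hh            # absent in the 20091028-1003 headers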
On Tue, Dec 15, 2009 at 2:08 AM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Wen,

I have developed a permanent fix. You will find the source files in

http://www.slac.stanford.edu/~abh/cmsd/

There are three files: XrdCmsCluster.cc XrdCmsNode.cc XrdCmsProtocol.cc

Please do a source replacement and recompile. Unfortunately, the cmsd will need to be replaced on each node regardless of role. My apologies for the disruption. Please let me know how it goes.

Andy
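A sketch of that replacement, plus a way to prove every node got the rebuilt binary (the file names and URL are from Andy's message; the build command and the installed cmsd path are assumptions, so adjust to however you normally build and deploy):

    cd xrootd/src/XrdCms
    for f in XrdCmsCluster.cc XrdCmsNode.cc XrdCmsProtocol.cc; do
        wget -O "$f" "http://www.slac.stanford.edu/~abh/cmsd/$f"
    done
    cd ../.. && make                  # rebuild, then reinstall cmsd everywhere
    md5sum /path/to/installed/cmsd    # compare this checksum on every node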
----- Original Message ----- From: "wen guan" <[log in to unmask]>
To: "Andrew Hanushevsky" <[log in to unmask]>
Cc: <[log in to unmask]>
Sent: Sunday, December 13, 2009 7:04 AM
Subject: Re: xrootd with more than 65 machines

Hi Andrew,

Thanks. I used the new cmsd at the atlas-bkp1 manager, but it's still dropping nodes, and in the supervisor's log I cannot find any data server registering to it.

The new logs are in http://higgs03.cs.wisc.edu/wguan/*.20091213. The manager was patched at 091213 08:38:15.

Wen

On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Wen

You will find the source replacement at:

http://www.slac.stanford.edu/~abh/cmsd/

It's XrdCmsCluster.cc and it replaces xrootd/src/XrdCms/XrdCmsCluster.cc

I'm stepping out for a couple of hours but will be back to see how things went. Sorry for the issues :-(

Andy

On Sun, 13 Dec 2009, wen guan wrote:

Hi Andrew,

I prefer a source replacement. Then I can compile it.

Thanks
Wen

> I can do one of two things here:
>
> 1) Supply a source replacement and then you would recompile, or
>
> 2) Give me the uname -a of where the cmsd will run and I'll supply a binary replacement for you.
>
> Your choice.
>
> Andy
>
> On Sun, 13 Dec 2009, wen guan wrote:
>
>> Hi Andrew
>>
>> The problem is found. Great. Thanks.
>>
>> Where can I find the patched cmsd?
>>
>> Wen
>>
>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>
>>> Hi Wen,
>>>
>>> I found the problem. Looks like a regression from way back when. There is a missing flag on the redirect. This will require a patched cmsd, but you need only replace the redirector's cmsd, as this only affects the redirector. How would you like to proceed?
>>>
>>> Andy
On Sat, 12 Dec 2009, wen guan wrote:

Hi Andrew,

It doesn't work. The atlas-bkp1 manager is still dropping nodes. In the supervisor, I still haven't seen any data server registered. I said "I updated the ntp" because you said "the log timestamps do not overlap".

Wen

On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Wen,

Do you mean that everything is now working? It could be that you removed the xrd.timeout directive. That really could cause problems. As for the delays, that is normal when the redirector thinks something is going wrong. The strategy is to delay clients until it can get back to a stable configuration. This usually prevents jobs from crashing during stressful periods.

Andy

On Sat, 12 Dec 2009, wen guan wrote:

Hi Andrew,

I restarted it to do the supervisor test, and also because the xrootd manager frequently doesn't respond. (*) is the cms.log; the file select is delayed again and again. After a restart, everything is fine. Now I am trying to find a clue about it.
(*)
091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
091212 00:00:19 21318 XrdSched: running redirector inq=0

There is no core file. I copied new copies of the logs to the link below.
http://higgs03.cs.wisc.edu/wguan/

Wen

On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky <[log in to unmask]> wrote:

Hi Wen,

I see in the server log that it is restarting often. Could you take a look on c193 to see if you have any core files? Also please make sure that core files are enabled, as Linux defaults the size to 0. The first step here is to find out why your servers are restarting.

Andy
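Enabling cores is plain Linux, nothing xrootd-specific (a sketch; put the ulimit wherever the xrootd and cmsd daemons are launched):

    ulimit -c unlimited                  # lift the default core size of 0
    cat /proc/sys/kernel/core_pattern    # shows where core files will land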
On Sat, 12 Dec 2009, wen guan wrote:

Hi Andrew,

The logs can be found here. From the logs you can see the atlas-bkp1 manager dropping, again and again, the nodes that try to connect to it.
http://higgs03.cs.wisc.edu/wguan/

> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>
> Hi Wen, Could you start everything up and provide me a pointer to the manager log file, supervisor log file, and one data server log file, all of which cover the same time-frame (from start to some point where you think things are working or not)? That way I can see what is happening. At the moment I only see two "bad" things in the config file:
>
> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager, but you claim, via the all.manager directive, that there are three (bkp2 and bkp3). While it should work, the log file will be dense with error messages. Please correct this to be consistent and make it easier to see real errors.

This is not a problem for me, because this config is used on the data servers. In the manager's config, I changed the "if atlas-bkp1.cs.wisc.edu" to atlas-bkp2 and so on. It is a historical artifact: at first only atlas-bkp1 was used; atlas-bkp2 and atlas-bkp3 were added later.
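A consistent declaration, along the lines of Andy's point 1, would name every manager on every node (a sketch; the hostnames are from this thread and port 3121 is from the Pander lines in the c131 log):

    all.manager atlas-bkp1.cs.wisc.edu:3121
    all.manager atlas-bkp2.cs.wisc.edu:3121
    all.manager atlas-bkp3.cs.wisc.edu:3121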
> 2) Please use cms.space not olb.space (for historical reasons the latter is still accepted and over-rides the former, but that will soon end), and please use only one (the config file uses both directives).

Yes, I should remove that line; in fact cms.space is in the cfg too.

Thanks
Wen

> The xrootd has an internal mechanism to connect servers with supervisors to allow for maximum reliability. You cannot change that algorithm and there is no need to do so. You should *never* tell anyone to directly connect to a supervisor. If you do, you will likely get unreachable nodes.
>
> As for dropping data servers, it would appear to me, given the flurry of such activity, that something either crashed or was restarted. That's why it would be good to see the complete log of each one of the entities.
>
> Andy

On Fri, 11 Dec 2009, wen guan wrote:

Hi Andrew,

I read the document and wrote a config file (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg). Using my conf, I can see the manager dispatching messages to the supervisor, but I cannot see any data server try to connect to the supervisor. At the same time, in the manager's log, I can see some data servers being dropped.

How does xrootd decide which data server will connect to the supervisor? Should I specify some data servers to connect to the supervisor?
(*) supervisor log
091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: /atlas/xrootd/users/wguan/test/test131141
091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find failed for state /atlas/xrootd/users/wguan/test/test131141

(*) manager log
091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5693644MB MinFR=57218MB Util=0
091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
091211 04:13:24 15661 server.10585:[log in to unmask]:1094 do_Space: 5696231MB free; 0% util
091211 04:13:24 15661 Protocol: server.10585:[log in to unmask]:1094 logged in.
091211 04:13:24 001 XrdInet: Accepted connection from [log in to unmask]
091211 04:13:24 15661 XrdSched: running ?:[log in to unmask] inq=0
091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79 attached to poller 2; num=22
091211 04:13:24 15661 Add server.21739:[log in to unmask] bumps server.15905:[log in to unmask]:1094 #63
091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
091211 04:13:24 15661 Drop_Node: server.15905:[log in to unmask]:1094 dropped.
091211 04:13:24 15661 Add Shoved server.21739:[log in to unmask]:1094 to cluster; id=63.78; num=64; min=51
091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5721854MB MinFR=57218MB Util=0
091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w /atlas
091211 04:13:24 15661 server.21739:[log in to unmask]:1094 do_Space: 5721854MB free; 0% util
091211 04:13:24 15661 Protocol: server.21739:[log in to unmask]:1094 logged in.
091211 04:13:24 15661 XrdLink: Unable to recieve from c187.chtc.wisc.edu; connection reset by peer
091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
091211 04:13:24 15661 XrdSched: scheduling drop node in 60 seconds
091211 04:13:24 15661 Remove_Node server.21739:[log in to unmask]:1094 node 63.78
091211 04:13:24 15661 Protocol: server.21739:[log in to unmask] logged out.
091211 04:13:24 15661 server.21739:[log in to unmask] XrdPoll: FD 79 detached from poller 2; num=21
091211 04:13:27 15661 Dispatch server.24718:[log in to unmask]:1094 for status dlen=0
091211 04:13:27 15661 server.24718:[log in to unmask]:1094 do_Status: suspend
091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended
091211 04:13:27 15661 XrdLink: No RecvAll() data from c177.chtc.wisc.edu FD=16
091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
091211 04:13:27 15661 Remove_Node server.24718:[log in to unmask]:1094 node 0.3
091211 04:13:27 15661 Protocol: server.21656:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>> 16 detached from poller 2; num=20 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() >>>>>>>>>>>>>>>>>>>>>>>>> data >>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>> c179.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> FD=21 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>>> server.17065:[log in to unmask]:1094 node >>>>>>>>>>>>>>>>>>>>>>>>> 1.4 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>>> server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>> 21 >>>>>>>>>>>>>>>>>>>>>>>>> detached from poller 1; num=21 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to >>>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to >>>>>>>>>>>>>>>>>>>>>>>>> redirector.15656:14@atlas-bkp2 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() >>>>>>>>>>>>>>>>>>>>>>>>> data >>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>> c182.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> FD=19 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 node >>>>>>>>>>>>>>>>>>>>>>>>> 7.10 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>> 19 detached from poller 2; num=19 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() >>>>>>>>>>>>>>>>>>>>>>>>> data >>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>> c178.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> FD=15 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 node >>>>>>>>>>>>>>>>>>>>>>>>> 9.12 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>> 15 detached from poller 1; num=20 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() >>>>>>>>>>>>>>>>>>>>>>>>> data >>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>> c181.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> FD=17 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 node >>>>>>>>>>>>>>>>>>>>>>>>> 5.8 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>> 17 detached from poller 0; num=21 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() >>>>>>>>>>>>>>>>>>>>>>>>> data >>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>> c183.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> FD=22 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 node >>>>>>>>>>>>>>>>>>>>>>>>> 8.11 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>> 22 detached from poller 2; num=18 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() >>>>>>>>>>>>>>>>>>>>>>>>> data >>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>> c184.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> FD=20 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>>> server.4131:[log in to unmask]:1094 node >>>>>>>>>>>>>>>>>>>>>>>>> 3.6 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>> 20 detached from poller 0; num=20 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() >>>>>>>>>>>>>>>>>>>>>>>>> data >>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>> c185.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> FD=23 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 node >>>>>>>>>>>>>>>>>>>>>>>>> 6.9 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>>> server.8524:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.8524:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>> 23 >>>>>>>>>>>>>>>>>>>>>>>>> detached from poller 0; num=19 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() >>>>>>>>>>>>>>>>>>>>>>>>> data >>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>> c180.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> FD=18 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 node >>>>>>>>>>>>>>>>>>>>>>>>> 4.7 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>>> server.14636:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.14636:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>> 18 detached from poller 1; num=19 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() >>>>>>>>>>>>>>>>>>>>>>>>> data >>>>>>>>>>>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>>>>>>>>>>> c186.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>>>> FD=24 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 >>>>>>>>>>>>>>>>>>>>>>>>> Parm2=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 node >>>>>>>>>>>>>>>>>>>>>>>>> 2.5 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>>>> server.7849:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>>>> server.7849:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>>>> 24 >>>>>>>>>>>>>>>>>>>>>>>>> detached from poller 1; num=18 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling >>>>>>>>>>>>>>>>>>>>>>>>> drop node >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling >>>>>>>>>>>>>>>>>>>>>>>>> drop node >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling >>>>>>>>>>>>>>>>>>>>>>>>> 
drop node >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling >>>>>>>>>>>>>>>>>>>>>>>>> drop node >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling >>>>>>>>>>>>>>>>>>>>>>>>> drop node >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling >>>>>>>>>>>>>>>>>>>>>>>>> drop node >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling >>>>>>>>>>>>>>>>>>>>>>>>> drop node >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling >>>>>>>>>>>>>>>>>>>>>>>>> drop node >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling >>>>>>>>>>>>>>>>>>>>>>>>> drop node >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling >>>>>>>>>>>>>>>>>>>>>>>>> drop node >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=1 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=1 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 >>>>>>>>>>>>>>>>>>>>>>>>> cancelled. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=1 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 >>>>>>>>>>>>>>>>>>>>>>>>> cancelled. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=1 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 >>>>>>>>>>>>>>>>>>>>>>>>> cancelled. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 >>>>>>>>>>>>>>>>>>>>>>>>> cancelled. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=1 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 >>>>>>>>>>>>>>>>>>>>>>>>> cancelled. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=1 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 >>>>>>>>>>>>>>>>>>>>>>>>> cancelled. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=1 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 >>>>>>>>>>>>>>>>>>>>>>>>> cancelled. >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop >>>>>>>>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>>>>>>>> inq=1 >>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 >>>>>>>>>>>>>>>>>>>>>>>>> cancelled. 
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> To go past 64 data servers you will need to set up one or more
>>>>>>>>>>>>>>>>>>>>>>>>>> supervisors. This does not logically change the current
>>>>>>>>>>>>>>>>>>>>>>>>>> configuration you have. You only need to configure one or more
>>>>>>>>>>>>>>>>>>>>>>>>>> *new* servers (or at least xrootd processes) whose role is
>>>>>>>>>>>>>>>>>>>>>>>>>> supervisor. We'd like them to run on separate machines for
>>>>>>>>>>>>>>>>>>>>>>>>>> reliability purposes, but they could run on the manager node as
>>>>>>>>>>>>>>>>>>>>>>>>>> long as you give each one a unique instance name (i.e., the -n
>>>>>>>>>>>>>>>>>>>>>>>>>> option).
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> The front part of the cmsd reference explains how to do this.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Is there any way to configure xrootd with more than 65
>>>>>>>>>>>>>>>>>>>>>>>>>>> machines? I used the configuration below but it doesn't work.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Should I configure some machines' manager to be a supervisor?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Wen
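---

For reference, a minimal sketch of the layout Andy describes, using directives documented in the cms_config reference he links above. The host names reuse machines already mentioned in this thread, and the instance name "super" is an arbitrary example, so treat this as an illustration of the idea rather than the actual site configuration:

    # One configuration file can be shared by every node in the cluster.
    all.export /atlas

    # Every xrootd/cmsd locates the cluster manager's cmsd here
    # (3121 is the customary cmsd port; adjust as needed).
    all.manager atlas-bkp1.cs.wisc.edu:3121

    # Roles are selected per host via the if clause; everything that
    # matches none of the manager/supervisor patterns is a data server.
    all.role manager    if atlas-bkp1.cs.wisc.edu
    all.role supervisor if higgs07.cs.wisc.edu
    all.role server     if c*.chtc.wisc.edu

A supervisor that shares a machine with another role would then be started under its own instance name, e.g.:

    xrootd -n super -c /path/to/xrdcluster.cfg &
    cmsd   -n super -c /path/to/xrdcluster.cfg &

With a supervisor present, no data server needs to be pointed at it explicitly; the manager redirects servers beyond its 64-subscriber limit to the supervisor, which is what the "bumps"/"Add Shoved" lines in the manager log above show happening at the 64-node boundary.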