Hi,

I just wanted to say that this is not a recommended configuration. A
meta-manager makes a manager nothing more than a specialized supervisor.
Internally, it is exactly the same configuration as a manager plus a
supervisor, except that a supervisor-less configuration will never be able
to federate globally. I am surprised that there were problems with a
supervisor. BNL runs 500 production nodes with supervisors and has no
problems, at least none that they have reported. I strongly discourage
using a meta manager to get beyond 64 servers.
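
For reference, a minimal sketch of the supervisor-based layout I am
suggesting (hostnames are reused from your mail purely as placeholders;
adjust ports and paths for your site) would look something like:

(*)
     # Sketch only: one manager, one supervisor, everything else a server.
     if atlas-bkp1.cs.wisc.edu
          all.role manager
     else if higgs07.cs.wisc.edu
          all.role supervisor
     else
          all.role server
     fi
     all.manager atlas-bkp1.cs.wisc.edu 3121

Every node, supervisors included, points at the same all.manager. Once the
manager has 64 subscribers it automatically redirects additional data
servers to a supervisor, so nothing needs to be partitioned by hand.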

Andy

On Sun, 27 Jun 2010, wen guan wrote:

> Hi Rob,
>
>    Our xrootd pool has more than 64 data servers, but I chose to use a
> meta manager instead of a supervisor. When using a supervisor, some data
> servers seemed to get lost and it was not easy to control (fix) them, and
> restarting a data server caused some problems too.
>    Below is the redirector cfg. I chose 50 data servers to connect to the
> meta manager on port 3121, and the other 25 to connect to the manager on
> port 4121. At the same time, the manager connects to the meta manager.
>
> cheers
> Wen
>
> (*)
> if  named meta
>      all.role meta manager
>      xrd.port 1094
>
>      #xrootd.manager atlas-bkp2.cs.wisc.edu 4121
>      all.manager meta atlas-bkp2.cs.wisc.edu 3121
>      #all.manager atlas-bkp3.cs.wisc.edu 3121
>      ofs.forward 3way atlas-bkp1.cs.wisc.edu:1095 mv rm rmdir trunc
> else if atlas-bkp1.cs.wisc.edu atlas-bkp2.cs.wisc.edu atlas-bkp3.cs.wisc.edu
>      all.role manager
>      xrd.port 4094
>      #
>      # 3way forward: redirect the client to the CNS, and forward mv
> rm rmdir and
>      #               trunc to the data servers.
>      #
>      #ofs.forward 3way atlas-bkp1.cs.wisc.edu:1095 mv rm rmdir trunc
>      all.manager atlas-bkp2.cs.wisc.edu 4121
>      all.manager meta atlas-bkp2.cs.wisc.edu 3121
> fi
>
> On Sun, Jun 27, 2010 at 4:44 PM, Rob Gardner <[log in to unmask]> wrote:
>> Wen,
>>
>> I was wondering if you finally did succeed in getting >64 data server
>> nodes working using a supervisor, etc.
>>
>> thanks,
>>
>> Rob
>>
>>
>> On Dec 18, 2009, at 8:58 AM, wen guan wrote:
>>
>>> Hi Andy,
>>>
>>>  I am sure I am using the right cmsd code. Today I compiled and
>>> reinstalled all cmsd and xrootd, but it still doesn't work. I will
>>> create an account for you so you can log in to these machines and
>>> check what happened.
>>>
>>>  In fact, while doing some restarts today I saw some machines register
>>> themselves to higgs07. Unfortunately the logs were cleaned during the
>>> reinstall.
>>>
>>>  I also found that the supervisor goes into the "suspend" state a while
>>> after it is started. Could that cause the supervisor to fail to get some
>>> information?
>>>
>>> Wen
>>>
>>>
>>> On Fri, Dec 18, 2009 at 3:05 AM, Andrew Hanushevsky <[log in to unmask]>
>>> wrote:
>>>>
>>>> Hi Wen,
>>>>
>>>> Something is really going wrong with your data servers. For instance,
>>>> c109 is quite happy from midnight to 7:23am. Then it dropped the
>>>> connection, reconnected at 7:24:03, and was again happy until 12:37:20,
>>>> but at that point it reported that its xrootd died and the cmsd
>>>> promptly killed its connection afterward. This appears as if someone
>>>> restarted the xrootd followed by the cmsd on c109. It continued like
>>>> this until 12:43:00 (i.e., connect, suspend, die, repeat). All your
>>>> servers, in fact, started doing this between 12:36:41 and 12:42:51,
>>>> causing a massive swap of servers. New servers were added and old ones
>>>> reconnecting were redirected to the supervisor. However, it would
>>>> appear that those machines could not connect there, as they kept coming
>>>> back to atlas-bkp1. I can't tell you anything about what was happening
>>>> on higgs07. As far as I can tell it was happily connected to the
>>>> redirector cmsd. The reason is that there is no log for higgs07 on the
>>>> web site for 12/17 starting at midnight. Perhaps you can put one there.
>>>>
>>>> So,
>>>>
>>>> 1) Are you *absolutely* sure that *all* your (data, etc) servers are
>>>> running
>>>> the corrected cmsd?
>>>> 2) Please provide the higgs07 log for 12/17.
>>>>
>>>> 3) Please provide logs for a sampling of data servers say c0109, c094,
>>>> higgs15, and higgs13 between 1/17 12:00:00 to 15:44.
>>>>
>>>> I have never seen a situation like yours, so something is very wrong
>>>> here. In the meantime I will add more debugging information to the
>>>> redirector and supervisor and let you know when that is available.
>>>>
>>>> Andy
>>>>
>>>>
>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>> To: "Fabrizio Furano" <[log in to unmask]>
>>>> Cc: "Andrew Hanushevsky" <[log in to unmask]>; <[log in to unmask]>
>>>> Sent: Thursday, December 17, 2009 3:12 PM
>>>> Subject: Re: xrootd with more than 65 machines
>>>>
>>>>
>>>> Hi Fabrizio,
>>>>
>>>>  This is the xrdcp debug message.
>>>>           ClientHeader.header.dlen = 41
>>>> =================== END CLIENT HEADER DUMPING ===================
>>>>
>>>> 091217 16:47:54 15961 Xrd: WriteRaw: Writing 24 bytes to physical
>>>> connection
>>>> 091217 16:47:54 15961 Xrd: WriteRaw: Writing to substreamid 0
>>>> 091217 16:47:54 15961 Xrd: WriteRaw: Writing 41 bytes to physical
>>>> connection
>>>> 091217 16:47:54 15961 Xrd: WriteRaw: Writing to substreamid 0
>>>> 091217 16:47:54 15961 Xrd: ReadPartialAnswer: Reading a
>>>> XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
>>>> 091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw:  sid: 1, IsAttn:
>>>> 0, substreamid: 0
>>>> 091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4
>>>> bytes) from substream 0
>>>> 091217 16:47:54 15961 Xrd: ReadRaw: Reading from
>>>> atlas-bkp1.cs.wisc.edu:1094
>>>> 091217 16:47:54 15961 Xrd: BuildMessage:  posting id 1
>>>> 091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8
>>>> bytes).
>>>> 091217 16:47:54 15961 Xrd: ReadRaw: Reading from
>>>> atlas-bkp1.cs.wisc.edu:1094
>>>>
>>>>
>>>> ======== DUMPING SERVER RESPONSE HEADER ========
>>>>    ServerHeader.streamid = 0x01 0x00
>>>>      ServerHeader.status = kXR_wait (4005)
>>>>        ServerHeader.dlen = 4
>>>> ========== END DUMPING SERVER HEADER ===========
>>>>
>>>> 091217 16:47:54 15961 Xrd: ReadPartialAnswer: Server
>>>> [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
>>>> 091217 16:47:54 15961 Xrd: CheckErrorStatus: Server
>>>> [atlas-bkp1.cs.wisc.edu:1094] requested 10 seconds of wait
>>>> 091217 16:48:04 15961 Xrd: DumpPhyConn: Phyconn entry,
>>>> [log in to unmask]:1094', LogCnt=1 Valid
>>>> 091217 16:48:04 15961 Xrd: SendGenCommand: Sending command Open
>>>>
>>>>
>>>> ================= DUMPING CLIENT REQUEST HEADER =================
>>>>              ClientHeader.streamid = 0x01 0x00
>>>>             ClientHeader.requestid = kXR_open (3010)
>>>>             ClientHeader.open.mode = 0x00 0x00
>>>>          ClientHeader.open.options = 0x40 0x04
>>>>         ClientHeader.open.reserved = 0 repeated 12 times
>>>>           ClientHeader.header.dlen = 41
>>>> =================== END CLIENT HEADER DUMPING ===================
>>>>
>>>> 091217 16:48:04 15961 Xrd: WriteRaw: Writing 24 bytes to physical
>>>> connection
>>>> 091217 16:48:04 15961 Xrd: WriteRaw: Writing to substreamid 0
>>>> 091217 16:48:04 15961 Xrd: WriteRaw: Writing 41 bytes to physical
>>>> connection
>>>> 091217 16:48:04 15961 Xrd: WriteRaw: Writing to substreamid 0
>>>> 091217 16:48:04 15961 Xrd: ReadPartialAnswer: Reading a
>>>> XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
>>>> 091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw:  sid: 1, IsAttn:
>>>> 0, substreamid: 0
>>>> 091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4
>>>> bytes) from substream 0
>>>> 091217 16:48:04 15961 Xrd: ReadRaw: Reading from
>>>> atlas-bkp1.cs.wisc.edu:1094
>>>> 091217 16:48:04 15961 Xrd: BuildMessage:  posting id 1
>>>> 091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8
>>>> bytes).
>>>> 091217 16:48:04 15961 Xrd: ReadRaw: Reading from
>>>> atlas-bkp1.cs.wisc.edu:1094
>>>>
>>>>
>>>> ======== DUMPING SERVER RESPONSE HEADER ========
>>>>    ServerHeader.streamid = 0x01 0x00
>>>>      ServerHeader.status = kXR_wait (4005)
>>>>        ServerHeader.dlen = 4
>>>> ========== END DUMPING SERVER HEADER ===========
>>>>
>>>> 091217 16:48:04 15961 Xrd: ReadPartialAnswer: Server
>>>> [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
>>>> 091217 16:48:04 15961 Xrd: CheckErrorStatus: Server
>>>> [atlas-bkp1.cs.wisc.edu:1094] requested 10 seconds of wait
>>>> 091217 16:48:14 15961 Xrd: SendGenCommand: Sending command Open
>>>>
>>>>
>>>> ================= DUMPING CLIENT REQUEST HEADER =================
>>>>              ClientHeader.streamid = 0x01 0x00
>>>>             ClientHeader.requestid = kXR_open (3010)
>>>>             ClientHeader.open.mode = 0x00 0x00
>>>>          ClientHeader.open.options = 0x40 0x04
>>>>         ClientHeader.open.reserved = 0 repeated 12 times
>>>>           ClientHeader.header.dlen = 41
>>>> =================== END CLIENT HEADER DUMPING ===================
>>>>
>>>> 091217 16:48:14 15961 Xrd: WriteRaw: Writing 24 bytes to physical
>>>> connection
>>>> 091217 16:48:14 15961 Xrd: WriteRaw: Writing to substreamid 0
>>>> 091217 16:48:14 15961 Xrd: WriteRaw: Writing 41 bytes to physical
>>>> connection
>>>> 091217 16:48:14 15961 Xrd: WriteRaw: Writing to substreamid 0
>>>> 091217 16:48:14 15961 Xrd: ReadPartialAnswer: Reading a
>>>> XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
>>>> 091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw:  sid: 1, IsAttn:
>>>> 0, substreamid: 0
>>>> 091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4
>>>> bytes) from substream 0
>>>> 091217 16:48:14 15961 Xrd: ReadRaw: Reading from
>>>> atlas-bkp1.cs.wisc.edu:1094
>>>> 091217 16:48:14 15961 Xrd: BuildMessage:  posting id 1
>>>> 091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8
>>>> bytes).
>>>> 091217 16:48:14 15961 Xrd: ReadRaw: Reading from
>>>> atlas-bkp1.cs.wisc.edu:1094
>>>>
>>>>
>>>> ======== DUMPING SERVER RESPONSE HEADER ========
>>>>    ServerHeader.streamid = 0x01 0x00
>>>>      ServerHeader.status = kXR_wait (4005)
>>>>        ServerHeader.dlen = 4
>>>> ========== END DUMPING SERVER HEADER ===========
>>>>
>>>> 091217 16:48:14 15961 Xrd: ReadPartialAnswer: Server
>>>> [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
>>>> 091217 16:48:14 15961 Xrd: SendGenCommand: Max time limit elapsed for
>>>> request  kXR_open. Aborting command.
>>>> Last server error 10000 ('')
>>>> Error accessing path/file for
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>
>>>>
>>>> Wen
>>>>
>>>> On Thu, Dec 17, 2009 at 11:27 PM, Fabrizio Furano <[log in to unmask]> wrote:
>>>>>
>>>>> Hi Wen,
>>>>>
>>>>> I see that you are getting error 10000, which means "generic error
>>>>> before
>>>>> any interaction". Could you please run the same command with debug level
>>>>> 3
>>>>> and post the log with the same kind of issue? Something like
>>>>>
>>>>> xrdcp -d 3 ....
>>>>>
>>>>> Most likely this time the problem is different. I may be wrong here, but
>>>>> a
>>>>> possible reason for that error is that the servers require
>>>>> authentication
>>>>> and xrdcp does not find some library in the LD_LIBRARY_PATH.
>>>>>
>>>>> Fabrizio
>>>>>
>>>>>
>>>>> wen guan ha scritto:
>>>>>>
>>>>>> Hi Andy,
>>>>>>
>>>>>> I put new logs in web.
>>>>>>
>>>>>> It still doesn't work. I cannot copy files in and out.
>>>>>>
>>>>>> It seems the xrootd daemon at atlas-bkp1 hasn't talked to the cmsd.
>>>>>> Normally when the xrootd daemon tries to copy a file I should see
>>>>>> "do_Select: filename" in the cms.log, but in this cms.log there is
>>>>>> nothing from atlas-bkp1.
>>>>>>
>>>>>> (*)
>>>>>> [root@atlas-bkp1 ~]# xrdcp
>>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>>> /tmp/
>>>>>> Last server error 10000 ('')
>>>>>> Error accessing path/file for
>>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>>> [root@atlas-bkp1 ~]# xrdcp /bin/mv
>>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123
>>>>>> 133
>>>>>>
>>>>>>
>>>>>> Wen
>>>>>>
>>>>>> On Thu, Dec 17, 2009 at 10:54 PM, Andrew Hanushevsky <[log in to unmask]>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Wen,
>>>>>>>
>>>>>>> I reviewed the log file. The only oddity is the redirect of c131 at
>>>>>>> 17:47:25, which I can't comment on because its logs on the web site
>>>>>>> do not overlap with the manager or supervisor logs. Unless all the
>>>>>>> logs cover the full time in question I can't say much of anything.
>>>>>>> Can you provide me with inclusive logs?
>>>>>>>
>>>>>>> atlas-bkp1 cms: 17:20:57 to 17:42:19 xrd: 17:20:57 to 17:40:57
>>>>>>> higgs07 cms & xrd 17:22:33 to 17:42:33
>>>>>>> c131 cms & xrd 17:31:57 to 17:47:28
>>>>>>>
>>>>>>> That said, it certainly looks like things were working and files were
>>>>>>> being accessed and discovered on all the machines. You even were able
>>>>>>> to open /atlas/xrootd/users/wguan/test/test98123313 though not
>>>>>>> /atlas/xrootd/users/wguan/test/test123131. The other issue is that
>>>>>>> you did not specify a stable adminpath, and the adminpath defaults to
>>>>>>> /tmp. If you have a "cleanup" script that runs periodically on /tmp
>>>>>>> then eventually your cluster will go catatonic as important (but not
>>>>>>> often used) files are deleted by that script. Could you please find a
>>>>>>> stable home for the adminpath?
>>>>>>>
>>>>>>> I reran my tests here and things worked as expected. I will ramp up
>>>>>>> some
>>>>>>> more tests. So, what is your status today?
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>> Cc: <[log in to unmask]>
>>>>>>> Sent: Thursday, December 17, 2009 5:05 AM
>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>
>>>>>>>
>>>>>>> Hi Andy,
>>>>>>>
>>>>>>> Yes. I am using the file downloaded from
>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/ which I compiled yesterday. I
>>>>>>> just now compiled it again and compared it with the one I compiled
>>>>>>> yesterday; they are the same (same md5sum).
>>>>>>>
>>>>>>> Wen
>>>>>>>
>>>>>>> On Thu, Dec 17, 2009 at 2:09 AM, Andrew Hanushevsky <[log in to unmask]>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Wen,
>>>>>>>>
>>>>>>>> If c131 cannot connect then either c131 or atlas-bkp1 does not have
>>>>>>>> the new cmsd, as that is what would happen if either were true.
>>>>>>>> Looking at the log on c131, it would appear that atlas-bkp1 is still
>>>>>>>> using the old cmsd as the response data length is wrong. Could you
>>>>>>>> please verify?
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> ----- Original Message ----- From: "wen guan"
>>>>>>>> <[log in to unmask]>
>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>> Sent: Wednesday, December 16, 2009 3:58 PM
>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Andy,
>>>>>>>>
>>>>>>>> I tried it, but there are still some problems. I put the logs in
>>>>>>>> higgs03.cs.wisc.edu/wguan/
>>>>>>>>
>>>>>>>> In my test, c131 is the 65th node to be added to the manager.
>>>>>>>> I can copy a file into the pool through the manager, but I cannot
>>>>>>>> copy out a file that is on c131.
>>>>>>>>
>>>>>>>> In c131's cms.log, I see "Manager:
>>>>>>>> manager.0:[log in to unmask] removed; redirected" again and
>>>>>>>> again, and I cannot see anything about c131 in higgs07's
>>>>>>>> (supervisor) log. Does that mean the manager tries to redirect it to
>>>>>>>> higgs07, but c131 never tries to connect to higgs07 and only tries
>>>>>>>> to connect to the manager again?
>>>>>>>>
>>>>>>>> (*)
>>>>>>>> [root@c131 ~]# xrdcp /bin/mv
>>>>>>>> root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
>>>>>>>> Last server error 10000 ('')
>>>>>>>> Error accessing path/file for
>>>>>>>> root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
>>>>>>>> [root@c131 ~]# xrdcp /bin/mv
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123311
>>>>>>>> [xrootd] Total 0.06 MB |====================| 100.00 % [3.1 MB/s]
>>>>>>>> [root@c131 ~]# xrdcp /bin/mv
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123312
>>>>>>>> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
>>>>>>>> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/
>>>>>>>> test123131
>>>>>>>> [root@c131 ~]# xrdcp
>>>>>>>>
>>>>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>>>>> /tmp/
>>>>>>>> Last server error 3011 ('No servers are available to read the file.')
>>>>>>>> Error accessing path/file for
>>>>>>>>
>>>>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>>>>> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/test123131
>>>>>>>> /atlas/xrootd/users/wguan/test/test123131
>>>>>>>> [root@c131 ~]# xrdcp
>>>>>>>>
>>>>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>>>>> /tmp/
>>>>>>>> Last server error 3011 ('No servers are available to read the file.')
>>>>>>>> Error accessing path/file for
>>>>>>>>
>>>>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>>>>> [root@c131 ~]# xrdcp /bin/mv
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123313
>>>>>>>> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
>>>>>>>> [root@c131 ~]# xrdcp
>>>>>>>>
>>>>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>>>>> /tmp/
>>>>>>>> Last server error 3011 ('No servers are available to read the file.')
>>>>>>>> Error accessing path/file for
>>>>>>>>
>>>>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>>>>> [root@c131 ~]# xrdcp
>>>>>>>>
>>>>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>>>>> /tmp/
>>>>>>>> Last server error 3011 ('No servers are available to read the file.')
>>>>>>>> Error accessing path/file for
>>>>>>>>
>>>>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>>>>> [root@c131 ~]# xrdcp
>>>>>>>>
>>>>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>>>>> /tmp/
>>>>>>>> Last server error 3011 ('No servers are available to read the file.')
>>>>>>>> Error accessing path/file for
>>>>>>>>
>>>>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>>>>> [root@c131 ~]# tail -f /var/log/xrootd/cms.log
>>>>>>>> 091216 17:45:52 3103 manager.0:[log in to unmask] XrdLink:
>>>>>>>> Setting ref to 2+-1 post=0
>>>>>>>> 091216 17:45:55 3103 Pander trying to connect to lvl 0
>>>>>>>> atlas-bkp1.cs.wisc.edu:3121
>>>>>>>> 091216 17:45:55 3103 XrdInet: Connected to
>>>>>>>> atlas-bkp1.cs.wisc.edu:3121
>>>>>>>> 091216 17:45:55 3103 Add atlas-bkp1.cs.wisc.edu to manager config;
>>>>>>>> id=0
>>>>>>>> 091216 17:45:55 3103 ManTree: Now connected to 3 root node(s)
>>>>>>>> 091216 17:45:55 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
>>>>>>>> 091216 17:45:55 3103 Dispatch manager.0:[log in to unmask] for
>>>>>>>> try
>>>>>>>> dlen=3
>>>>>>>> 091216 17:45:55 3103 manager.0:[log in to unmask] do_Try:
>>>>>>>> 091216 17:45:55 3103 Remove completed atlas-bkp1.cs.wisc.edu manager
>>>>>>>> 0.95
>>>>>>>> 091216 17:45:55 3103 Manager: manager.0:[log in to unmask]
>>>>>>>> removed; redirected
>>>>>>>> 091216 17:46:04 3103 Pander trying to connect to lvl 0
>>>>>>>> atlas-bkp1.cs.wisc.edu:3121
>>>>>>>> 091216 17:46:04 3103 XrdInet: Connected to
>>>>>>>> atlas-bkp1.cs.wisc.edu:3121
>>>>>>>> 091216 17:46:04 3103 Add atlas-bkp1.cs.wisc.edu to manager config;
>>>>>>>> id=0
>>>>>>>> 091216 17:46:04 3103 ManTree: Now connected to 3 root node(s)
>>>>>>>> 091216 17:46:04 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
>>>>>>>> 091216 17:46:04 3103 Dispatch manager.0:[log in to unmask] for
>>>>>>>> try
>>>>>>>> dlen=3
>>>>>>>> 091216 17:46:04 3103 Protocol: No buffers to serve
>>>>>>>> atlas-bkp1.cs.wisc.edu
>>>>>>>> 091216 17:46:04 3103 Remove completed atlas-bkp1.cs.wisc.edu manager
>>>>>>>> 0.96
>>>>>>>> 091216 17:46:04 3103 Manager: manager.0:[log in to unmask]
>>>>>>>> removed; insufficient buffers
>>>>>>>> 091216 17:46:11 3103 Dispatch manager.0:[log in to unmask] for
>>>>>>>> state dlen=169
>>>>>>>> 091216 17:46:11 3103 manager.0:[log in to unmask] XrdLink:
>>>>>>>> Setting ref to 1+1 post=0
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Wen
>>>>>>>>
>>>>>>>> On Thu, Dec 17, 2009 at 12:10 AM, wen guan <[log in to unmask]>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Andy,
>>>>>>>>>
>>>>>>>>>> OK, I understand. As for stalling, too many nodes were deemed to
>>>>>>>>>> be in trouble for the manager to allow service resumption.
>>>>>>>>>>
>>>>>>>>>> Please make sure that all of the nodes in the cluster receive the
>>>>>>>>>> new cmsd, as they will drop off with the old one and you'll see the
>>>>>>>>>> same kind of activity. Perhaps the best way to know that you
>>>>>>>>>> succeeded in putting everything in sync is to start with 63 data
>>>>>>>>>> nodes plus one supervisor. Once all connections are established,
>>>>>>>>>> adding an additional server should simply send it to the
>>>>>>>>>> supervisor.
>>>>>>>>>
>>>>>>>>> I will do it.
>>>>>>>>> You said to start 63 data servers and one supervisor. Does that mean
>>>>>>>>> the supervisor is managed using the same policy? If 64 data servers
>>>>>>>>> connect before the supervisor, will the supervisor be dropped? Or
>>>>>>>>> does the supervisor have a higher priority to be added to the
>>>>>>>>> manager? I mean, if there are already 64 data servers and a
>>>>>>>>> supervisor comes in, will the supervisor be accepted and a data
>>>>>>>>> server be redirected to the supervisor?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Wen
>>>>>>>>>
>>>>>>>>>> Hi Andrew,
>>>>>>>>>>
>>>>>>>>>> But when I tried to xrdcp a file to it, it doesn't respond. In
>>>>>>>>>> atlas-bkp1-xrd.log.20091213, it always prints "stalling client for
>>>>>>>>>> 10 sec". But in cms.log, I cannot find any message about the file.
>>>>>>>>>>
>>>>>>>>>>> I don't see why you say it doesn't work. With the debugging level
>>>>>>>>>>> set
>>>>>>>>>>> so
>>>>>>>>>>> high the noise may make it look like something is going wrong but
>>>>>>>>>>> that
>>>>>>>>>>> isn't
>>>>>>>>>>> necessarily the case.
>>>>>>>>>>>
>>>>>>>>>>> 1) The 'too many subscribers' is correct. The manager was simply
>>>>>>>>>>> redirecting
>>>>>>>>>>> them because there were already 64 servers. However, in your case
>>>>>>>>>>> the
>>>>>>>>>>> supervisor wasn't started until almost 30 minutes after everyone
>>>>>>>>>>> else
>>>>>>>>>>> (i.e.,
>>>>>>>>>>> 10:42 AM). Why was that? I'm not surprised about the flurry of
>>>>>>>>>>> messages with a critical component missing for 30 minutes.
>>>>>>>>>>
>>>>>>>>>> Because the manager is a 64-bit machine but the supervisor is a
>>>>>>>>>> 32-bit machine, I had to recompile it. At that point I was
>>>>>>>>>> interrupted by something else.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> 2) Once the supervisor started, it started accepting the
>>>>>>>>>>> redirected
>>>>>>>>>>> servers.
>>>>>>>>>>>
>>>>>>>>>>> 3) Then 10 seconds (10:42:10) later the supervisor was restarted.
>>>>>>>>>>> So,
>>>>>>>>>>> that
>>>>>>>>>>> would cause a flurry of activity to occur as there is no backup
>>>>>>>>>>> supervisor
>>>>>>>>>>> to take over.
>>>>>>>>>>>
>>>>>>>>>>> 4) This happened again at 10:42:34 AM then again at 10:48:49. Is
>>>>>>>>>>> the
>>>>>>>>>>> supervisor crashing? Is there a core file?
>>>>>>>>>>>
>>>>>>>>>>> 5) At 11:11 AM the manager restarted. Again, is there a core file
>>>>>>>>>>> here
>>>>>>>>>>> or
>>>>>>>>>>> was this a manual action?
>>>>>>>>>>>
>>>>>>>>>>> During the course of all of this, all connected nodes were
>>>>>>>>>>> operating properly and files were being located.
>>>>>>>>>>>
>>>>>>>>>>> So, the two big questions are:
>>>>>>>>>>>
>>>>>>>>>>> a) Why was the supervisor not started until 30 minutes after the
>>>>>>>>>>> system
>>>>>>>>>>> was
>>>>>>>>>>> started?
>>>>>>>>>>>
>>>>>>>>>>> b) Is there an explanation of the restarts? If this was a crash
>>>>>>>>>>> then
>>>>>>>>>>> we
>>>>>>>>>>> need
>>>>>>>>>>> a core file to figure out what happened.
>>>>>>>>>>
>>>>>>>>>> It's not a crash. There are a couple of reasons I restarted some
>>>>>>>>>> daemons.
>>>>>>>>>> (1) I thought that if a data server tried many times to connect to
>>>>>>>>>> a redirector and failed, it would not try to connect to the
>>>>>>>>>> redirector again. The supervisor was missing for a long time, so
>>>>>>>>>> maybe some data servers would not try to connect to atlas-bkp1
>>>>>>>>>> again. To reactivate these data servers, I restarted the servers.
>>>>>>>>>> (2) When I tried xrdcp, it hung for a long time. I thought maybe
>>>>>>>>>> the manager was affected by something else, so I restarted the
>>>>>>>>>> manager to see whether a restart could make this xrdcp work.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Wen
>>>>>>>>>>
>>>>>>>>>>> Andy
>>>>>>>>>>>
>>>>>>>>>>> ----- Original Message ----- From: "wen guan"
>>>>>>>>>>> <[log in to unmask]>
>>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>>> Sent: Wednesday, December 16, 2009 9:38 AM
>>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>
>>>>>>>>>>> It still doesn't work.
>>>>>>>>>>> The log files are in higgs03.cs.wisc.edu/wguan/; the names are
>>>>>>>>>>> *.20091216.
>>>>>>>>>>> The manager complains there are too many subscribers and then
>>>>>>>>>>> removes nodes.
>>>>>>>>>>>
>>>>>>>>>>> (*)
>>>>>>>>>>> Add server.10040:[log in to unmask] redirected; too many
>>>>>>>>>>> subscribers.
>>>>>>>>>>>
>>>>>>>>>>> Wen
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Dec 16, 2009 at 4:25 AM, Andrew Hanushevsky
>>>>>>>>>>> <[log in to unmask]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>
>>>>>>>>>>>> It will be easier for me to retrofit as the changes were pretty
>>>>>>>>>>>> minor.
>>>>>>>>>>>> Please
>>>>>>>>>>>> lift the new XrdCmsNode.cc file from
>>>>>>>>>>>>
>>>>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd
>>>>>>>>>>>>
>>>>>>>>>>>> Andy
>>>>>>>>>>>>
>>>>>>>>>>>> ----- Original Message ----- From: "wen guan"
>>>>>>>>>>>> <[log in to unmask]>
>>>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>>>> Sent: Tuesday, December 15, 2009 5:12 PM
>>>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Andy,
>>>>>>>>>>>>
>>>>>>>>>>>> I can switch to 20091104-1102. Then you don't need to patch
>>>>>>>>>>>> another version. How can I download v20091104-1102?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Wen
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Dec 16, 2009 at 12:52 AM, Andrew Hanushevsky
>>>>>>>>>>>> <[log in to unmask]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ah yes, I see that now. The file I gave you is based on
>>>>>>>>>>>>> v20091104-1102.
>>>>>>>>>>>>> Let
>>>>>>>>>>>>> me see if I can retrofit the patch for you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>
>>>>>>>>>>>>> ----- Original Message ----- From: "wen guan"
>>>>>>>>>>>>> <[log in to unmask]>
>>>>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>>>>> Sent: Tuesday, December 15, 2009 1:04 PM
>>>>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Andy,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Which xrootd version are you using? My XrdCmsConfig.hh is
>>>>>>>>>>>>> different; it comes from
>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsNode.cc
>>>>>>>>>>>>> 6fb3ae40fe4e10bdd4d372818a341f2c src/XrdCms/XrdCmsNode.cc
>>>>>>>>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsConfig.hh
>>>>>>>>>>>>> 7d57753847d9448186c718f98e963cbe src/XrdCms/XrdCmsConfig.hh
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Dec 15, 2009 at 10:50 PM, Andrew Hanushevsky
>>>>>>>>>>>>> <[log in to unmask]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just compiled on Linux and it was clean. Something is really
>>>>>>>>>>>>>> wrong
>>>>>>>>>>>>>> with
>>>>>>>>>>>>>> your
>>>>>>>>>>>>>> source files, specifically XrdCmsConfig.cc
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The MD5 checksums on the relevant files are:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> MD5 (XrdCmsNode.cc) = 6fb3ae40fe4e10bdd4d372818a341f2c
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> MD5 (XrdCmsConfig.hh) = 4a7d655582a7cd43b098947d0676924b
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ----- Original Message ----- From: "wen guan"
>>>>>>>>>>>>>> <[log in to unmask]>
>>>>>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>>>>>> Sent: Tuesday, December 15, 2009 4:24 AM
>>>>>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Andy,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> No problem. Thanks for the fix. But it cannot be compiled. The
>>>>>>>>>>>>>> version I am using is
>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Making cms component...
>>>>>>>>>>>>>> Compiling XrdCmsNode.cc
>>>>>>>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>>>>>>>> XrdCmsNode::do_Chmod(XrdCmsRRData&)':
>>>>>>>>>>>>>> XrdCmsNode.cc:268: error: `fsExec' was not declared in this
>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>> XrdCmsNode.cc:268: warning: unused variable 'fsExec'
>>>>>>>>>>>>>> XrdCmsNode.cc:269: error: 'class XrdCmsConfig' has no member
>>>>>>>>>>>>>> named
>>>>>>>>>>>>>> 'ossFS'
>>>>>>>>>>>>>> XrdCmsNode.cc:273: error: `fsFail' was not declared in this
>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>> XrdCmsNode.cc:273: warning: unused variable 'fsFail'
>>>>>>>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>>>>>>>> XrdCmsNode::do_Mkdir(XrdCmsRRData&)':
>>>>>>>>>>>>>> XrdCmsNode.cc:600: error: `fsExec' was not declared in this
>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>> XrdCmsNode.cc:600: warning: unused variable 'fsExec'
>>>>>>>>>>>>>> XrdCmsNode.cc:601: error: 'class XrdCmsConfig' has no member
>>>>>>>>>>>>>> named
>>>>>>>>>>>>>> 'ossFS'
>>>>>>>>>>>>>> XrdCmsNode.cc:605: error: `fsFail' was not declared in this
>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>> XrdCmsNode.cc:605: warning: unused variable 'fsFail'
>>>>>>>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>>>>>>>> XrdCmsNode::do_Mkpath(XrdCmsRRData&)':
>>>>>>>>>>>>>> XrdCmsNode.cc:640: error: `fsExec' was not declared in this
>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>> XrdCmsNode.cc:640: warning: unused variable 'fsExec'
>>>>>>>>>>>>>> XrdCmsNode.cc:641: error: 'class XrdCmsConfig' has no member
>>>>>>>>>>>>>> named
>>>>>>>>>>>>>> 'ossFS'
>>>>>>>>>>>>>> XrdCmsNode.cc:645: error: `fsFail' was not declared in this
>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>> XrdCmsNode.cc:645: warning: unused variable 'fsFail'
>>>>>>>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>>>>>>>> XrdCmsNode::do_Mv(XrdCmsRRData&)':
>>>>>>>>>>>>>> XrdCmsNode.cc:704: error: `fsExec' was not declared in this
>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>> XrdCmsNode.cc:704: warning: unused variable 'fsExec'
>>>>>>>>>>>>>> XrdCmsNode.cc:705: error: 'class XrdCmsConfig' has no member
>>>>>>>>>>>>>> named
>>>>>>>>>>>>>> 'ossFS'
>>>>>>>>>>>>>> XrdCmsNode.cc:709: error: `fsFail' was not declared in this
>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>> XrdCmsNode.cc:709: warning: unused variable 'fsFail'
>>>>>>>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>>>>>>>> XrdCmsNode::do_Rm(XrdCmsRRData&)':
>>>>>>>>>>>>>> XrdCmsNode.cc:831: error: `fsExec' was not declared in this
>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>> XrdCmsNode.cc:831: warning: unused variable 'fsExec'
>>>>>>>>>>>>>> XrdCmsNode.cc:832: error: 'class XrdCmsConfig' has no member
>>>>>>>>>>>>>> named
>>>>>>>>>>>>>> 'ossFS'
>>>>>>>>>>>>>> XrdCmsNode.cc:836: error: `fsFail' was not declared in this
>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>> XrdCmsNode.cc:836: warning: unused variable 'fsFail'
>>>>>>>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>>>>>>>> XrdCmsNode::do_Rmdir(XrdCmsRRData&)':
>>>>>>>>>>>>>> XrdCmsNode.cc:873: error: `fsExec' was not declared in this
>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>> XrdCmsNode.cc:873: warning: unused variable 'fsExec'
>>>>>>>>>>>>>> XrdCmsNode.cc:874: error: 'class XrdCmsConfig' has no member
>>>>>>>>>>>>>> named
>>>>>>>>>>>>>> 'ossFS'
>>>>>>>>>>>>>> XrdCmsNode.cc:878: error: `fsFail' was not declared in this
>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>> XrdCmsNode.cc:878: warning: unused variable 'fsFail'
>>>>>>>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>>>>>>>> XrdCmsNode::do_Trunc(XrdCmsRRData&)':
>>>>>>>>>>>>>> XrdCmsNode.cc:1377: error: `fsExec' was not declared in this
>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>> XrdCmsNode.cc:1377: warning: unused variable 'fsExec'
>>>>>>>>>>>>>> XrdCmsNode.cc:1378: error: 'class XrdCmsConfig' has no member
>>>>>>>>>>>>>> named
>>>>>>>>>>>>>> 'ossFS'
>>>>>>>>>>>>>> XrdCmsNode.cc:1382: error: `fsFail' was not declared in this
>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>> XrdCmsNode.cc:1382: warning: unused variable 'fsFail'
>>>>>>>>>>>>>> XrdCmsNode.cc: At global scope:
>>>>>>>>>>>>>> XrdCmsNode.cc:1524: error: no `int
>>>>>>>>>>>>>> XrdCmsNode::fsExec(XrdOucProg*,
>>>>>>>>>>>>>> char*, char*)' member function declared in class `XrdCmsNode'
>>>>>>>>>>>>>> XrdCmsNode.cc: In member function `int
>>>>>>>>>>>>>> XrdCmsNode::fsExec(XrdOucProg*,
>>>>>>>>>>>>>> char*, char*)':
>>>>>>>>>>>>>> XrdCmsNode.cc:1533: error: `fsL2PFail1' was not declared in
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>> XrdCmsNode.cc:1533: warning: unused variable 'fsL2PFail1'
>>>>>>>>>>>>>> XrdCmsNode.cc:1537: error: `fsL2PFail2' was not declared in
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>> XrdCmsNode.cc:1537: warning: unused variable 'fsL2PFail2'
>>>>>>>>>>>>>> XrdCmsNode.cc: At global scope:
>>>>>>>>>>>>>> XrdCmsNode.cc:1553: error: no `const char*
>>>>>>>>>>>>>> XrdCmsNode::fsFail(const
>>>>>>>>>>>>>> char*, const char*, const char*, int)' member function declared
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>> class `XrdCmsNode'
>>>>>>>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>>>>>>>> XrdCmsNode::fsFail(const char*, const char*, const char*,
>>>>>>>>>>>>>> int)':
>>>>>>>>>>>>>> XrdCmsNode.cc:1559: error: `fsL2PFail1' was not declared in
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>> XrdCmsNode.cc:1559: warning: unused variable 'fsL2PFail1'
>>>>>>>>>>>>>> XrdCmsNode.cc:1560: error: `fsL2PFail2' was not declared in
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>> scope
>>>>>>>>>>>>>> XrdCmsNode.cc:1560: warning: unused variable 'fsL2PFail2'
>>>>>>>>>>>>>> XrdCmsNode.cc: In static member function `static int
>>>>>>>>>>>>>> XrdCmsNode::isOnline(char*, int)':
>>>>>>>>>>>>>> XrdCmsNode.cc:1608: error: 'class XrdCmsConfig' has no member
>>>>>>>>>>>>>> named
>>>>>>>>>>>>>> 'ossFS'
>>>>>>>>>>>>>> make[4]: *** [../../obj/XrdCmsNode.o] Error 1
>>>>>>>>>>>>>> make[3]: *** [Linuxall] Error 2
>>>>>>>>>>>>>> make[2]: *** [all] Error 2
>>>>>>>>>>>>>> make[1]: *** [XrdCms] Error 2
>>>>>>>>>>>>>> make: *** [all] Error 2
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Dec 15, 2009 at 2:08 AM, Andrew Hanushevsky
>>>>>>>>>>>>>> <[log in to unmask]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have developed a permanent fix. You will find the source
>>>>>>>>>>>>>>> files
>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There are three files: XrdCmsCluster.cc XrdCmsNode.cc
>>>>>>>>>>>>>>> XrdCmsProtocol.cc
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please do a source replacement and recompile. Unfortunately,
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> cmsd
>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>> need to be replaced on each node regardless of role. My
>>>>>>>>>>>>>>> apologies
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> disruption. Please let me know how it goes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ----- Original Message ----- From: "wen guan"
>>>>>>>>>>>>>>> <[log in to unmask]>
>>>>>>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>>>>>>> Sent: Sunday, December 13, 2009 7:04 AM
>>>>>>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>> I used the new cmsd on the atlas-bkp1 manager, but it's still
>>>>>>>>>>>>>>> dropping nodes. And in the supervisor's log, I cannot find any
>>>>>>>>>>>>>>> data server registering to it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The new logs are in
>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/*.20091213.
>>>>>>>>>>>>>>> The manager is patched at 091213 08:38:15.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky
>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Wen
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You will find the source replacement at:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's XrdCmsCluster.cc and it replaces
>>>>>>>>>>>>>>>> xrootd/src/XrdCms/XrdCmsCluster.cc
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm stepping out for a couple of hours but will be back to
>>>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>> how
>>>>>>>>>>>>>>>> things
>>>>>>>>>>>>>>>> went. Sorry for the issues :-(
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I prefer a source replacement. Then I can compile it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I can do one of two things here:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1) Supply a source replacement and then you would
>>>>>>>>>>>>>>>>>> recompile,
>>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 2) Give me the uname -a of where the cmsd will run and I'll
>>>>>>>>>>>>>>>>>> supply
>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>> binary
>>>>>>>>>>>>>>>>>> replacement for you.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Your choice.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Andrew
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The problem is found. Great. Thanks.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Where can I find the patched cmsd?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky
>>>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I found the problem. Looks like a regression from way
>>>>>>>>>>>>>>>>>>>> back
>>>>>>>>>>>>>>>>>>>> when.
>>>>>>>>>>>>>>>>>>>> There
>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>> missing flag on the redirect. This will require a patched
>>>>>>>>>>>>>>>>>>>> cmsd
>>>>>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>>>>>> only to replace the redirector's cmsd as this only
>>>>>>>>>>>>>>>>>>>> affects
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> redirector.
>>>>>>>>>>>>>>>>>>>> How would you like to proceed?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> It doesn't work. The atlas-bkp1 manager is still
>>>>>>>>>>>>>>>>>>>>> dropping nodes. On the supervisor, I still haven't seen
>>>>>>>>>>>>>>>>>>>>> any data server registered. I said "I updated the ntp"
>>>>>>>>>>>>>>>>>>>>> because you said "the log timestamps do not overlap".
>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky
>>>>>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Do you mean that everything is now working? It could be
>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>>>>> removed
>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> xrd.timeout directive. That really could cause
>>>>>>>>>>>>>>>>>>>>>> problems.
>>>>>>>>>>>>>>>>>>>>>> As
>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> delays,
>>>>>>>>>>>>>>>>>>>>>> that is normal when the redirector thinks something is
>>>>>>>>>>>>>>>>>>>>>> going
>>>>>>>>>>>>>>>>>>>>>> wrong.
>>>>>>>>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>>>>>>> strategy is to delay clients until it can get back to a
>>>>>>>>>>>>>>>>>>>>>> stable
>>>>>>>>>>>>>>>>>>>>>> configuration. This usually prevents jobs from crashing
>>>>>>>>>>>>>>>>>>>>>> during
>>>>>>>>>>>>>>>>>>>>>> stressful
>>>>>>>>>>>>>>>>>>>>>> periods.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I restarted it to do the supervisor test, and also
>>>>>>>>>>>>>>>>>>>>>>> because the xrootd manager frequently doesn't respond.
>>>>>>>>>>>>>>>>>>>>>>> (*) is the cms.log; the file select is delayed again
>>>>>>>>>>>>>>>>>>>>>>> and again. After a restart, everything is fine. Now I
>>>>>>>>>>>>>>>>>>>>>>> am trying to find a clue about it.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> (*)
>>>>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc
>>>>>>>>>>>>>>>>>>>>>>> /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Select seeking
>>>>>>>>>>>>>>>>>>>>>>> /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1
>>>>>>>>>>>>>>>>>>>>>>> path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5
>>>>>>>>>>>>>>>>>>>>>>> /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> There is no core file. I copied new copies of the logs
>>>>>>>>>>>>>>>>>>>>>>> to the link below.
>>>>>>>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky
>>>>>>>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I see in the server log that it is restarting often.
>>>>>>>>>>>>>>>>>>>>>>>> Could
>>>>>>>>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>>>>>>> take
>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>> look
>>>>>>>>>>>>>>>>>>>>>>>> in the c193 to see if you have any core files? Also
>>>>>>>>>>>>>>>>>>>>>>>> please
>>>>>>>>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>>>>>>>>> sure
>>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>> core files are enabled as Linux defaults the size to
>>>>>>>>>>>>>>>>>>>>>>>> 0.
>>>>>>>>>>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>>>>>>>>> first
>>>>>>>>>>>>>>>>>>>>>>>> step
>>>>>>>>>>>>>>>>>>>>>>>> here
>>>>>>>>>>>>>>>>>>>>>>>> is to find out why your servers are restarting.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> The logs can be found here. From the log you can see
>>>>>>>>>>>>>>>>>>>>>>>>> that the atlas-bkp1 manager is dropping, again and
>>>>>>>>>>>>>>>>>>>>>>>>> again, the nodes that try to connect to it.
>>>>>>>>>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
>>>>>>>>>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Wen, Could you start everything up and provide
>>>>>>>>>>>>>>>>>>>>>>>>>> me
>>>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>> pointer
>>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> manager log file, supervisor log file, and one data
>>>>>>>>>>>>>>>>>>>>>>>>>> server
>>>>>>>>>>>>>>>>>>>>>>>>>> logfile
>>>>>>>>>>>>>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>> which cover the same time-frame (from start to some
>>>>>>>>>>>>>>>>>>>>>>>>>> point
>>>>>>>>>>>>>>>>>>>>>>>>>> where
>>>>>>>>>>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>>>>>>> things are working or not). That way I can see what
>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>> happening.
>>>>>>>>>>>>>>>>>>>>>>>>>> At
>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> moment I only see two "bad" things in the config
>>>>>>>>>>>>>>>>>>>>>>>>>> file:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a
>>>>>>>>>>>>>>>>>>>>>>>>>> manager
>>>>>>>>>>>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>>>>>>>>> claim,
>>>>>>>>>>>>>>>>>>>>>>>>>> via
>>>>>>>>>>>>>>>>>>>>>>>>>> the all.manager directive, that there are three
>>>>>>>>>>>>>>>>>>>>>>>>>> (bkp2
>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>> bkp3).
>>>>>>>>>>>>>>>>>>>>>>>>>> While
>>>>>>>>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>>>>>>> should work, the log file will be dense with error
>>>>>>>>>>>>>>>>>>>>>>>>>> messages.
>>>>>>>>>>>>>>>>>>>>>>>>>> Please
>>>>>>>>>>>>>>>>>>>>>>>>>> correct
>>>>>>>>>>>>>>>>>>>>>>>>>> this to be consistent and make it easier to see
>>>>>>>>>>>>>>>>>>>>>>>>>> real
>>>>>>>>>>>>>>>>>>>>>>>>>> errors.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> This is not a problem for me, because that config is
>>>>>>>>>>>>>>>>>>>>>>>>> used on the data servers. On the managers, I change
>>>>>>>>>>>>>>>>>>>>>>>>> the "if atlas-bkp1.cs.wisc.edu" to atlas-bkp2 and so
>>>>>>>>>>>>>>>>>>>>>>>>> on. This is a historical issue: at first only
>>>>>>>>>>>>>>>>>>>>>>>>> atlas-bkp1 was used; atlas-bkp2 and atlas-bkp3 were
>>>>>>>>>>>>>>>>>>>>>>>>> added later.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Please use cms.space not olb.space (for
>>>>>>>>>>>>>>>>>>>>>>>>>> historical
>>>>>>>>>>>>>>>>>>>>>>>>>> reasons
>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> latter
>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>> still accepted and over-rides the former, but that
>>>>>>>>>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>>>>>>> soon
>>>>>>>>>>>>>>>>>>>>>>>>>> end),
>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>> please use only one (the config file uses both
>>>>>>>>>>>>>>>>>>>>>>>>>> directives).
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I should remove this line; in fact cms.space is
>>>>>>>>>>>>>>>>>>>>>>>>> in the cfg too.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> The xrootd has an internal mechanism to connect
>>>>>>>>>>>>>>>>>>>>>>>>>> servers
>>>>>>>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>>>> supervisors
>>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>> allow for maximum reliability. You cannot change
>>>>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>>> algorithm
>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>>>>> no need to do so. You should *never* tell anyone to
>>>>>>>>>>>>>>>>>>>>>>>>>> directly
>>>>>>>>>>>>>>>>>>>>>>>>>> connect
>>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>> supervisor. If you do, you will likely get
>>>>>>>>>>>>>>>>>>>>>>>>>> unreachable
>>>>>>>>>>>>>>>>>>>>>>>>>> nodes.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> As for dropping data servers, it would appear to
>>>>>>>>>>>>>>>>>>>>>>>>>> me,
>>>>>>>>>>>>>>>>>>>>>>>>>> given
>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> flurry
>>>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>> such activity, that something either crashed or was
>>>>>>>>>>>>>>>>>>>>>>>>>> restarted.
>>>>>>>>>>>>>>>>>>>>>>>>>> That's
>>>>>>>>>>>>>>>>>>>>>>>>>> why
>>>>>>>>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>>>>>>> would be good to see the complete log of each one
>>>>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>> entities.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I read the document and wrote a config file
>>>>>>>>>>>>>>>>>>>>>>>>>>> (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>>>>>>>>>>>>>>>>>>>>> With my conf, I can see the manager dispatching
>>>>>>>>>>>>>>>>>>>>>>>>>>> messages to the supervisor, but I cannot see any
>>>>>>>>>>>>>>>>>>>>>>>>>>> data server trying to connect to the supervisor. At
>>>>>>>>>>>>>>>>>>>>>>>>>>> the same time, in the manager's log I can see that
>>>>>>>>>>>>>>>>>>>>>>>>>>> some data servers are dropped.
>>>>>>>>>>>>>>>>>>>>>>>>>>> How does xrootd decide which data servers will
>>>>>>>>>>>>>>>>>>>>>>>>>>> connect to the supervisor? Should I specify some
>>>>>>>>>>>>>>>>>>>>>>>>>>> data servers to connect to the supervisor?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> (*) supervisor log
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch
>>>>>>>>>>>>>>>>>>>>>>>>>>> manager.0:20@atlas-bkp2
>>>>>>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>>>>> state
>>>>>>>>>>>>>>>>>>>>>>>>>>> dlen=42
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2
>>>>>>>>>>>>>>>>>>>>>>>>>>> do_State:
>>>>>>>>>>>>>>>>>>>>>>>>>>> /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2
>>>>>>>>>>>>>>>>>>>>>>>>>>> do_StateFWD:
>>>>>>>>>>>>>>>>>>>>>>>>>>> Path
>>>>>>>>>>>>>>>>>>>>>>>>>>> find
>>>>>>>>>>>>>>>>>>>>>>>>>>> failed for state
>>>>>>>>>>>>>>>>>>>>>>>>>>> /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> (*) manager log
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5693644MB MinFR=57218MB Util=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094 do_Space: 5696231MB free; 0% util
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.10585:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from [log in to unmask]
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running ?:[log in to unmask] inq=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79 attached to poller 2; num=22
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add server.21739:[log in to unmask] bumps server.15905:[log in to unmask]:1094 #63
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node: server.15905:[log in to unmask]:1094 dropped.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved server.21739:[log in to unmask]:1094 to cluster; id=63.78; num=64; min=51
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5721854MB MinFR=57218MB Util=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w /atlas
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask]:1094 do_Space: 5721854MB free; 0% util
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from c187.chtc.wisc.edu; connection reset by peer
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60 seconds
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node server.21739:[log in to unmask]:1094 node 63.78
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask] XrdPoll: FD 79 detached from poller 2; num=21
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.24718:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.24718:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c177.chtc.wisc.edu FD=16
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.24718:[log in to unmask]:1094 node 0.3
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.21656:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.21656:[log in to unmask] XrdPoll: FD 16 detached from poller 2; num=20
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c179.chtc.wisc.edu FD=21
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.17065:[log in to unmask]:1094 node 1.4
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.7978:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] XrdPoll: FD 21 detached from poller 1; num=21
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to suspended
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to redirector.15656:14@atlas-bkp2
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.12937:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.12937:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c182.chtc.wisc.edu FD=19
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.12937:[log in to unmask]:1094 node 7.10
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.26620:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.26620:[log in to unmask] XrdPoll: FD 19 detached from poller 2; num=19
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.10842:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.10842:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c178.chtc.wisc.edu FD=15
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.10842:[log in to unmask]:1094 node 9.12
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.11901:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.11901:[log in to unmask] XrdPoll: FD 15 detached from poller 1; num=20
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.5535:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.5535:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c181.chtc.wisc.edu FD=17
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.5535:[log in to unmask]:1094 node 5.8
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.13984:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.13984:[log in to unmask] XrdPoll: FD 17 detached from poller 0; num=21
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.23711:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c183.chtc.wisc.edu FD=22
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.23711:[log in to unmask]:1094 node 8.11
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.27735:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.27735:[log in to unmask] XrdPoll: FD 22 detached from poller 2; num=18
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c184.chtc.wisc.edu FD=20
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.4131:[log in to unmask]:1094 node 3.6
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.26787:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.26787:[log in to unmask] XrdPoll: FD 20 detached from poller 0; num=20
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.10585:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c185.chtc.wisc.edu FD=23
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.10585:[log in to unmask]:1094 node 6.9
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.8524:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll: FD 23 detached from poller 0; num=19
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.20264:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c180.chtc.wisc.edu FD=18
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.20264:[log in to unmask]:1094 node 4.7
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.14636:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.14636:[log in to unmask] XrdPoll: FD 18 detached from poller 1; num=19
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.1656:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c186.chtc.wisc.edu FD=24
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.1656:[log in to unmask]:1094 node 2.5
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.7849:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll: FD 24 detached from poller 1; num=18
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> To go past 64 data servers you will need to set up one or more supervisors.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> This does not logically change the current configuration you have. You only
>>>>>>>>>>>>>>>>>>>>>>>>>>>> need to configure one or more *new* servers (or at least xrootd processes)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> whose role is supervisor. We'd like them to run on separate machines for
>>>>>>>>>>>>>>>>>>>>>>>>>>>> reliability purposes, but they could run on the manager node as long as you
>>>>>>>>>>>>>>>>>>>>>>>>>>>> give each one a unique instance name (i.e., the -n option).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> The front part of the cmsd reference explains how to do this.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Is there any way to configure xrootd with more than 65 machines? I used
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the configuration below but it doesn't work. Should I configure some
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> machines' manager to be a supervisor?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
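
A minimal sketch of the supervisor setup described in Andy's reply above, written in the same if/fi style as the configuration quoted earlier in the thread. The instance name "super1", port 1095, and the config path are illustrative assumptions rather than values taken from the actual xrdcluster.cfg:

     # Hypothetical supervisor instance; "super1" and port 1095 are assumptions.
     if named super1
          all.role supervisor
          xrd.port 1095
     fi

     # The supervisor, like the data servers, keeps pointing at the existing
     # manager; substitute the host and port already used in the cluster.
     all.manager atlas-bkp2.cs.wisc.edu 4121

The supervisor pair would then be started with its own instance name via the -n option Andy mentions, for example:

     xrootd -n super1 -c /path/to/xrdcluster.cfg &
     cmsd   -n super1 -c /path/to/xrdcluster.cfg &

Because the data-server sections are unchanged, the servers continue to contact the manager, which redirects the overflow beyond 64 to the supervisor on its own.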