Hi Fabrizio,

This is the xrdcp debug message.

ClientHeader.header.dlen = 41
=================== END CLIENT HEADER DUMPING ===================
091217 16:47:54 15961 Xrd: WriteRaw: Writing 24 bytes to physical connection
091217 16:47:54 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:47:54 15961 Xrd: WriteRaw: Writing 41 bytes to physical connection
091217 16:47:54 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:47:54 15961 Xrd: ReadPartialAnswer: Reading a XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: sid: 1, IsAttn: 0, substreamid: 0
091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4 bytes) from substream 0
091217 16:47:54 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
091217 16:47:54 15961 Xrd: BuildMessage: posting id 1
091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8 bytes).
091217 16:47:54 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
======== DUMPING SERVER RESPONSE HEADER ========
ServerHeader.streamid = 0x01 0x00
ServerHeader.status = kXR_wait (4005)
ServerHeader.dlen = 4
========== END DUMPING SERVER HEADER ===========
091217 16:47:54 15961 Xrd: ReadPartialAnswer: Server [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
091217 16:47:54 15961 Xrd: CheckErrorStatus: Server [atlas-bkp1.cs.wisc.edu:1094] requested 10 seconds of wait
091217 16:48:04 15961 Xrd: DumpPhyConn: Phyconn entry, [log in to unmask]:1094', LogCnt=1 Valid
091217 16:48:04 15961 Xrd: SendGenCommand: Sending command Open
================= DUMPING CLIENT REQUEST HEADER =================
ClientHeader.streamid = 0x01 0x00
ClientHeader.requestid = kXR_open (3010)
ClientHeader.open.mode = 0x00 0x00
ClientHeader.open.options = 0x40 0x04
ClientHeader.open.reserved = 0 repeated 12 times
ClientHeader.header.dlen = 41
=================== END CLIENT HEADER DUMPING ===================
091217 16:48:04 15961 Xrd: WriteRaw: Writing 24 bytes to physical connection
091217 16:48:04 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:48:04 15961 Xrd: WriteRaw: Writing 41 bytes to physical connection
091217 16:48:04 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:48:04 15961 Xrd: ReadPartialAnswer: Reading a XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: sid: 1, IsAttn: 0, substreamid: 0
091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4 bytes) from substream 0
091217 16:48:04 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
091217 16:48:04 15961 Xrd: BuildMessage: posting id 1
091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8 bytes).
091217 16:48:04 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
======== DUMPING SERVER RESPONSE HEADER ========
ServerHeader.streamid = 0x01 0x00
ServerHeader.status = kXR_wait (4005)
ServerHeader.dlen = 4
========== END DUMPING SERVER HEADER ===========
091217 16:48:04 15961 Xrd: ReadPartialAnswer: Server [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
091217 16:48:04 15961 Xrd: CheckErrorStatus: Server [atlas-bkp1.cs.wisc.edu:1094] requested 10 seconds of wait
091217 16:48:14 15961 Xrd: SendGenCommand: Sending command Open
================= DUMPING CLIENT REQUEST HEADER =================
ClientHeader.streamid = 0x01 0x00
ClientHeader.requestid = kXR_open (3010)
ClientHeader.open.mode = 0x00 0x00
ClientHeader.open.options = 0x40 0x04
ClientHeader.open.reserved = 0 repeated 12 times
ClientHeader.header.dlen = 41
=================== END CLIENT HEADER DUMPING ===================
091217 16:48:14 15961 Xrd: WriteRaw: Writing 24 bytes to physical connection
091217 16:48:14 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:48:14 15961 Xrd: WriteRaw: Writing 41 bytes to physical connection
091217 16:48:14 15961 Xrd: WriteRaw: Writing to substreamid 0
091217 16:48:14 15961 Xrd: ReadPartialAnswer: Reading a XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: sid: 1, IsAttn: 0, substreamid: 0
091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4 bytes) from substream 0
091217 16:48:14 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
091217 16:48:14 15961 Xrd: BuildMessage: posting id 1
091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8 bytes).
091217 16:48:14 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
======== DUMPING SERVER RESPONSE HEADER ========
ServerHeader.streamid = 0x01 0x00
ServerHeader.status = kXR_wait (4005)
ServerHeader.dlen = 4
========== END DUMPING SERVER HEADER ===========
091217 16:48:14 15961 Xrd: ReadPartialAnswer: Server [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
091217 16:48:14 15961 Xrd: SendGenCommand: Max time limit elapsed for request kXR_open. Aborting command.
Last server error 10000 ('')
Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131

Wen

On Thu, Dec 17, 2009 at 11:27 PM, Fabrizio Furano <[log in to unmask]> wrote:
> Hi Wen,
>
> I see that you are getting error 10000, which means "generic error before
> any interaction". Could you please run the same command with debug level 3
> and post the log with the same kind of issue? Something like
>
> xrdcp -d 3 ....
>
> Most likely this time the problem is different. I may be wrong here, but a
> possible reason for that error is that the servers require authentication
> and xrdcp does not find some library in the LD_LIBRARY_PATH.
>
> Fabrizio
>
> wen guan wrote:
>>
>> Hi Andy,
>>
>> I put the new logs on the web.
>>
>> It still doesn't work. I cannot copy files in or out.
>>
>> It seems the xrootd daemon at atlas-bkp1 hasn't talked with the cmsd.
>> Normally, when the xrootd daemon tries to copy a file, I should see
>> "do_Select: filename" in the cms.log. But in this cms.log there is
>> nothing from atlas-bkp1.
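(An aside on Fabrizio's LD_LIBRARY_PATH theory quoted above: one quick way to
check it on the client host is sketched below. The libXrdSec* name pattern is
only illustrative of the xrootd security plug-ins; none of these lines are
taken from the logs in this thread.

  echo $LD_LIBRARY_PATH
  for d in $(echo $LD_LIBRARY_PATH | tr ':' ' '); do
      ls $d/libXrdSec* 2>/dev/null    # list any security plug-ins the client can see
  done

If the servers require authentication and no such library is visible to
xrdcp, the copy can fail before any real interaction with the server, which
matches the "generic error before any interaction" meaning of error 10000.)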
>>
>> (*)
>> [root@atlas-bkp1 ~]# xrdcp
>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>> Last server error 10000 ('')
>> Error accessing path/file for
>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>> [root@atlas-bkp1 ~]# xrdcp /bin/mv
>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123133
>>
>>
>> Wen
>>
>> On Thu, Dec 17, 2009 at 10:54 PM, Andrew Hanushevsky <[log in to unmask]>
>> wrote:
>>>
>>> Hi Wen,
>>>
>>> I reviewed the log file, other than the odd redirect of c131 at
>>> 17:47:25, which I can't comment on because its logs on the web site do
>>> not overlap with the manager or supervisor. Unless all the logs include
>>> the full time in question I can't say much of anything. Can you provide
>>> me with inclusive logs?
>>>
>>> atlas-bkp1 cms: 17:20:57 to 17:42:19  xrd: 17:20:57 to 17:40:57
>>> higgs07    cms & xrd 17:22:33 to 17:42:33
>>> c131       cms & xrd 17:31:57 to 17:47:28
>>>
>>> That said, it certainly looks like things were working and files were
>>> being accessed and discovered on all the machines. You even were able
>>> to open /atlas/xrootd/users/wguan/test/test98123313 though not
>>> /atlas/xrootd/users/wguan/test/test123131. The other issue is that you
>>> did not specify a stable adminpath, and the adminpath defaults to /tmp.
>>> If you have a "cleanup" script that runs periodically for /tmp then
>>> eventually your cluster will go catatonic as important (but not often
>>> used) files are deleted by that script. Could you please find a stable
>>> home for the adminpath?
>>>
>>> I reran my tests here and things worked as expected. I will ramp up
>>> some more tests. So, what is your status today?
>>>
>>> Andy
>>>
>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>> Cc: <[log in to unmask]>
>>> Sent: Thursday, December 17, 2009 5:05 AM
>>> Subject: Re: xrootd with more than 65 machines
>>>
>>>
>>> Hi Andy,
>>>
>>> Yes. I am using the file downloaded from
>>> http://www.slac.stanford.edu/~abh/cmsd/ which I compiled yesterday. I
>>> just now compiled it again and compared it with the one I compiled
>>> yesterday; they are the same (same md5sum).
>>>
>>> Wen
>>>
>>> On Thu, Dec 17, 2009 at 2:09 AM, Andrew Hanushevsky <[log in to unmask]>
>>> wrote:
>>>>
>>>> Hi Wen,
>>>>
>>>> If c131 cannot connect then either c131 does not have the new cmsd or
>>>> atlas-bkp1 does not have the new cmsd, as that is what would happen if
>>>> either were true. Looking at the log on c131 it would appear that
>>>> atlas-bkp1 is still using the old cmsd as the response data length is
>>>> wrong. Could you verify, please?
>>>>
>>>> Andy
>>>>
>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>> Cc: <[log in to unmask]>
>>>> Sent: Wednesday, December 16, 2009 3:58 PM
>>>> Subject: Re: xrootd with more than 65 machines
>>>>
>>>>
>>>> Hi Andy,
>>>>
>>>> I tried it, but there are still some problems. I put the logs in
>>>> higgs03.cs.wisc.edu/wguan/
>>>>
>>>> In my test, c131 is the 65th node to be added to the manager. I can
>>>> copy files into the pool through the manager, but I cannot copy out a
>>>> file which is on c131.
>>>>
>>>> In c131's cms.log, I see "Manager:
>>>> manager.0:[log in to unmask] removed; redirected" again and
>>>> again, and I cannot see anything about c131 in higgs07's log (the
>>>> supervisor).
>>>> Does it mean the manager tries to redirect it to higgs07, but c131
>>>> hasn't tried to connect to higgs07? It only tries to connect to the
>>>> manager again.
>>>>
>>>> (*)
>>>> [root@c131 ~]# xrdcp /bin/mv
>>>> root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
>>>> Last server error 10000 ('')
>>>> Error accessing path/file for
>>>> root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
>>>> [root@c131 ~]# xrdcp /bin/mv
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123311
>>>> [xrootd] Total 0.06 MB |====================| 100.00 % [3.1 MB/s]
>>>> [root@c131 ~]# xrdcp /bin/mv
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123312
>>>> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
>>>> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/
>>>> test123131
>>>> [root@c131 ~]# xrdcp
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>>> Last server error 3011 ('No servers are available to read the file.')
>>>> Error accessing path/file for
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/test123131
>>>> /atlas/xrootd/users/wguan/test/test123131
>>>> [root@c131 ~]# xrdcp
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>>> Last server error 3011 ('No servers are available to read the file.')
>>>> Error accessing path/file for
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>> [root@c131 ~]# xrdcp /bin/mv
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123313
>>>> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
>>>> [root@c131 ~]# xrdcp
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>>> Last server error 3011 ('No servers are available to read the file.')
>>>> Error accessing path/file for
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>> [root@c131 ~]# xrdcp
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>>> Last server error 3011 ('No servers are available to read the file.')
>>>> Error accessing path/file for
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>> [root@c131 ~]# xrdcp
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>>> Last server error 3011 ('No servers are available to read the file.')
>>>> Error accessing path/file for
>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>> [root@c131 ~]# tail -f /var/log/xrootd/cms.log
>>>> 091216 17:45:52 3103 manager.0:[log in to unmask] XrdLink: Setting ref to 2+-1 post=0
>>>> 091216 17:45:55 3103 Pander trying to connect to lvl 0 atlas-bkp1.cs.wisc.edu:3121
>>>> 091216 17:45:55 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
>>>> 091216 17:45:55 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
>>>> 091216 17:45:55 3103 ManTree: Now connected to 3 root node(s)
>>>> 091216 17:45:55 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
>>>> 091216 17:45:55 3103 Dispatch manager.0:[log in to unmask] for try dlen=3
>>>> 091216 17:45:55 3103 manager.0:[log in to unmask] do_Try:
>>>> 091216 17:45:55 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.95
>>>> 091216 17:45:55 3103 Manager: manager.0:[log in to unmask] removed; redirected
>>>> 091216 17:46:04 3103 Pander trying to connect to lvl 0 atlas-bkp1.cs.wisc.edu:3121
>>>> 091216 17:46:04 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
>>>> 091216 17:46:04 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
>>>> 091216 17:46:04 3103 ManTree: Now connected to 3 root node(s)
>>>> 091216 17:46:04 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
>>>> 091216 17:46:04 3103 Dispatch manager.0:[log in to unmask] for try dlen=3
>>>> 091216 17:46:04 3103 Protocol: No buffers to serve atlas-bkp1.cs.wisc.edu
>>>> 091216 17:46:04 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.96
>>>> 091216 17:46:04 3103 Manager: manager.0:[log in to unmask] removed; insufficient buffers
>>>> 091216 17:46:11 3103 Dispatch manager.0:[log in to unmask] for state dlen=169
>>>> 091216 17:46:11 3103 manager.0:[log in to unmask] XrdLink: Setting ref to 1+1 post=0
>>>>
>>>> Thanks
>>>> Wen
>>>>
>>>> On Thu, Dec 17, 2009 at 12:10 AM, wen guan <[log in to unmask]>
>>>> wrote:
>>>>>
>>>>> Hi Andy,
>>>>>
>>>>>> OK, I understand. As for stalling, too many nodes were deemed to be
>>>>>> in trouble for the manager to allow service resumption.
>>>>>>
>>>>>> Please make sure that all of the nodes in the cluster receive the
>>>>>> new cmsd, as they will drop off with the old one and you'll see the
>>>>>> same kind of activity. Perhaps the best way to know that you
>>>>>> succeeded in putting everything in sync is to start with 63 data
>>>>>> nodes plus one supervisor. Once all connections are established,
>>>>>> adding an additional server should simply send it to the supervisor.
>>>>>
>>>>> I will do it.
>>>>> You said to start 63 data servers and one supervisor. Does it mean
>>>>> the supervisor is managed using the same policy? If there are 64 data
>>>>> servers connected before the supervisor, will the supervisor be
>>>>> dropped? Does the supervisor have high priority to be added to the
>>>>> manager? I mean, if there are already 64 data servers and a
>>>>> supervisor comes in, will the supervisor be accepted and a data
>>>>> server be redirected to the supervisor?
>>>>>
>>>>> Thanks
>>>>> Wen
>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>> But when I tried to xrdcp a file to it, it doesn't respond. In
>>>>>> atlas-bkp1-xrd.log.20091213, it always prints "stalling client for
>>>>>> 10 sec", but in the cms.log I can't find any message about the file.
>>>>>>
>>>>>>> I don't see why you say it doesn't work. With the debugging level
>>>>>>> set so high the noise may make it look like something is going
>>>>>>> wrong but that isn't necessarily the case.
>>>>>>>
>>>>>>> 1) The 'too many subscribers' is correct. The manager was simply
>>>>>>> redirecting them because there were already 64 servers. However, in
>>>>>>> your case the supervisor wasn't started until almost 30 minutes
>>>>>>> after everyone else (i.e., 10:42 AM). Why was that? I'm not
>>>>>>> surprised about the flurry of messages with a critical component
>>>>>>> missing for 30 minutes.
>>>>>>
>>>>>> Because the manager is a 64-bit machine but the supervisor is a
>>>>>> 32-bit machine, I had to recompile it. At that time, I was
>>>>>> interrupted by something else.
>>>>>>
>>>>>>> 2) Once the supervisor started, it started accepting the redirected
>>>>>>> servers.
>>>>>>>
>>>>>>> 3) Then 10 seconds (10:42:10) later the supervisor was restarted.
>>>>>>> So, that would cause a flurry of activity to occur as there is no
>>>>>>> backup supervisor to take over.
>>>>>>>
>>>>>>> 4) This happened again at 10:42:34 AM and then again at 10:48:49.
>>>>>>> Is the supervisor crashing? Is there a core file?
>>>>>>>
>>>>>>> 5) At 11:11 AM the manager restarted. Again, is there a core file
>>>>>>> here or was this a manual action?
>>>>>>>
>>>>>>> During the course of all of this, all connected nodes were
>>>>>>> operating properly and files were being located.
>>>>>>>
>>>>>>> So, the two big questions are:
>>>>>>>
>>>>>>> a) Why was the supervisor not started until 30 minutes after the
>>>>>>> system was started?
>>>>>>>
>>>>>>> b) Is there an explanation of the restarts? If this was a crash
>>>>>>> then we need a core file to figure out what happened.
>>>>>>
>>>>>> It's not a crash. There are some reasons why I restarted some
>>>>>> daemons.
>>>>>> (1) I thought that if a data server tried many times to connect to a
>>>>>> redirector but failed, it would not try to connect to the redirector
>>>>>> again. The supervisor was missing for a long time, so maybe some
>>>>>> data servers would not try to connect to atlas-bkp1 again. To
>>>>>> reactivate these data servers, I restarted the servers.
>>>>>> (2) When I tried to xrdcp, it was hanging for a long time. I thought
>>>>>> maybe the manager was affected by something else, so I restarted the
>>>>>> manager to see whether a restart could make the xrdcp work.
>>>>>>
>>>>>> Thanks
>>>>>> Wen
>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>> Cc: <[log in to unmask]>
>>>>>>> Sent: Wednesday, December 16, 2009 9:38 AM
>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>
>>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> It still doesn't work.
>>>>>>> The log files are in higgs03.cs.wisc.edu/wguan/; the names are
>>>>>>> *.20091216.
>>>>>>> The manager complains there are too many subscribers and then
>>>>>>> removes nodes.
>>>>>>>
>>>>>>> (*)
>>>>>>> Add server.10040:[log in to unmask] redirected; too many
>>>>>>> subscribers.
>>>>>>>
>>>>>>> Wen
>>>>>>>
>>>>>>> On Wed, Dec 16, 2009 at 4:25 AM, Andrew Hanushevsky
>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>
>>>>>>>> Hi Wen,
>>>>>>>>
>>>>>>>> It will be easier for me to retrofit as the changes were pretty
>>>>>>>> minor. Please lift the new XrdCmsNode.cc file from
>>>>>>>>
>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>> Sent: Tuesday, December 15, 2009 5:12 PM
>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Andy,
>>>>>>>>
>>>>>>>> I can switch to 20091104-1102; then you don't need to patch
>>>>>>>> another version. How can I download v20091104-1102?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Wen
>>>>>>>>
>>>>>>>> On Wed, Dec 16, 2009 at 12:52 AM, Andrew Hanushevsky
>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>
>>>>>>>>> Hi Wen,
>>>>>>>>>
>>>>>>>>> Ah yes, I see that now. The file I gave you is based on
>>>>>>>>> v20091104-1102. Let me see if I can retrofit the patch for you.
>>>>>>>>>
>>>>>>>>> Andy
>>>>>>>>>
>>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>> Sent: Tuesday, December 15, 2009 1:04 PM
>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Andy,
>>>>>>>>>
>>>>>>>>> Which xrootd version are you using? My XrdCmsConfig.hh is
>>>>>>>>> different; it was downloaded from
>>>>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/.
>>>>>>>>>
>>>>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsNode.cc
>>>>>>>>> 6fb3ae40fe4e10bdd4d372818a341f2c src/XrdCms/XrdCmsNode.cc
>>>>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsConfig.hh
>>>>>>>>> 7d57753847d9448186c718f98e963cbe src/XrdCms/XrdCmsConfig.hh
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Wen
>>>>>>>>>
>>>>>>>>> On Tue, Dec 15, 2009 at 10:50 PM, Andrew Hanushevsky
>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Wen,
>>>>>>>>>>
>>>>>>>>>> Just compiled on Linux and it was clean. Something is really
>>>>>>>>>> wrong with your source files, specifically XrdCmsConfig.cc.
>>>>>>>>>>
>>>>>>>>>> The MD5 checksums on the relevant files are:
>>>>>>>>>>
>>>>>>>>>> MD5 (XrdCmsNode.cc) = 6fb3ae40fe4e10bdd4d372818a341f2c
>>>>>>>>>>
>>>>>>>>>> MD5 (XrdCmsConfig.hh) = 4a7d655582a7cd43b098947d0676924b
>>>>>>>>>>
>>>>>>>>>> Andy
>>>>>>>>>>
>>>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>> Sent: Tuesday, December 15, 2009 4:24 AM
>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Andy,
>>>>>>>>>>
>>>>>>>>>> No problem. Thanks for the fix. But it cannot be compiled. The
>>>>>>>>>> version I am using is
>>>>>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/.
>>>>>>>>>>
>>>>>>>>>> Making cms component...
>>>>>>>>>> Compiling XrdCmsNode.cc
>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Chmod(XrdCmsRRData&)':
>>>>>>>>>> XrdCmsNode.cc:268: error: `fsExec' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:268: warning: unused variable 'fsExec'
>>>>>>>>>> XrdCmsNode.cc:269: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>> XrdCmsNode.cc:273: error: `fsFail' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:273: warning: unused variable 'fsFail'
>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mkdir(XrdCmsRRData&)':
>>>>>>>>>> XrdCmsNode.cc:600: error: `fsExec' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:600: warning: unused variable 'fsExec'
>>>>>>>>>> XrdCmsNode.cc:601: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>> XrdCmsNode.cc:605: error: `fsFail' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:605: warning: unused variable 'fsFail'
>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mkpath(XrdCmsRRData&)':
>>>>>>>>>> XrdCmsNode.cc:640: error: `fsExec' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:640: warning: unused variable 'fsExec'
>>>>>>>>>> XrdCmsNode.cc:641: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>> XrdCmsNode.cc:645: error: `fsFail' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:645: warning: unused variable 'fsFail'
>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Mv(XrdCmsRRData&)':
>>>>>>>>>> XrdCmsNode.cc:704: error: `fsExec' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:704: warning: unused variable 'fsExec'
>>>>>>>>>> XrdCmsNode.cc:705: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>> XrdCmsNode.cc:709: error: `fsFail' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:709: warning: unused variable 'fsFail'
>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Rm(XrdCmsRRData&)':
>>>>>>>>>> XrdCmsNode.cc:831: error: `fsExec' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:831: warning: unused variable 'fsExec'
>>>>>>>>>> XrdCmsNode.cc:832: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>> XrdCmsNode.cc:836: error: `fsFail' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:836: warning: unused variable 'fsFail'
>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Rmdir(XrdCmsRRData&)':
>>>>>>>>>> XrdCmsNode.cc:873: error: `fsExec' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:873: warning: unused variable 'fsExec'
>>>>>>>>>> XrdCmsNode.cc:874: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>> XrdCmsNode.cc:878: error: `fsFail' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:878: warning: unused variable 'fsFail'
>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::do_Trunc(XrdCmsRRData&)':
>>>>>>>>>> XrdCmsNode.cc:1377: error: `fsExec' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:1377: warning: unused variable 'fsExec'
>>>>>>>>>> XrdCmsNode.cc:1378: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>> XrdCmsNode.cc:1382: error: `fsFail' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:1382: warning: unused variable 'fsFail'
>>>>>>>>>> XrdCmsNode.cc: At global scope:
>>>>>>>>>> XrdCmsNode.cc:1524: error: no `int XrdCmsNode::fsExec(XrdOucProg*, char*, char*)' member function declared in class `XrdCmsNode'
>>>>>>>>>> XrdCmsNode.cc: In member function `int XrdCmsNode::fsExec(XrdOucProg*, char*, char*)':
>>>>>>>>>> XrdCmsNode.cc:1533: error: `fsL2PFail1' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:1533: warning: unused variable 'fsL2PFail1'
>>>>>>>>>> XrdCmsNode.cc:1537: error: `fsL2PFail2' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:1537: warning: unused variable 'fsL2PFail2'
>>>>>>>>>> XrdCmsNode.cc: At global scope:
>>>>>>>>>> XrdCmsNode.cc:1553: error: no `const char* XrdCmsNode::fsFail(const char*, const char*, const char*, int)' member function declared in class `XrdCmsNode'
>>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::fsFail(const char*, const char*, const char*, int)':
>>>>>>>>>> XrdCmsNode.cc:1559: error: `fsL2PFail1' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:1559: warning: unused variable 'fsL2PFail1'
>>>>>>>>>> XrdCmsNode.cc:1560: error: `fsL2PFail2' was not declared in this scope
>>>>>>>>>> XrdCmsNode.cc:1560: warning: unused variable 'fsL2PFail2'
>>>>>>>>>> XrdCmsNode.cc: In static member function `static int XrdCmsNode::isOnline(char*, int)':
>>>>>>>>>> XrdCmsNode.cc:1608: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>>> make[4]: *** [../../obj/XrdCmsNode.o] Error 1
>>>>>>>>>> make[3]: *** [Linuxall] Error 2
>>>>>>>>>> make[2]: *** [all] Error 2
>>>>>>>>>> make[1]: *** [XrdCms] Error 2
>>>>>>>>>> make: *** [all] Error 2
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Wen
>>>>>>>>>>
>>>>>>>>>> On Tue, Dec 15, 2009 at 2:08 AM, Andrew Hanushevsky
>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>
>>>>>>>>>>> I have developed a permanent fix. You will find the source
>>>>>>>>>>> files in
>>>>>>>>>>>
>>>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>>>>
>>>>>>>>>>> There are three files: XrdCmsCluster.cc XrdCmsNode.cc
>>>>>>>>>>> XrdCmsProtocol.cc
>>>>>>>>>>>
>>>>>>>>>>> Please do a source replacement and recompile. Unfortunately,
>>>>>>>>>>> the cmsd will need to be replaced on each node regardless of
>>>>>>>>>>> role. My apologies for the disruption. Please let me know how
>>>>>>>>>>> it goes.
>>>>>>>>>>>
>>>>>>>>>>> Andy
>>>>>>>>>>>
>>>>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>>> Sent: Sunday, December 13, 2009 7:04 AM
>>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>> I used the new cmsd at the atlas-bkp1 manager, but it's still
>>>>>>>>>>> dropping nodes, and in the supervisor's log I cannot see any
>>>>>>>>>>> data server registering with it.
>>>>>>>>>>>
>>>>>>>>>>> The new logs are in http://higgs03.cs.wisc.edu/wguan/*.20091213.
>>>>>>>>>>> The manager was patched at 091213 08:38:15.
>>>>>>>>>>>
>>>>>>>>>>> Wen
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky
>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Wen
>>>>>>>>>>>>
>>>>>>>>>>>> You will find the source replacement at:
>>>>>>>>>>>>
>>>>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>>>>>
>>>>>>>>>>>> It's XrdCmsCluster.cc and it replaces
>>>>>>>>>>>> xrootd/src/XrdCms/XrdCmsCluster.cc
>>>>>>>>>>>>
>>>>>>>>>>>> I'm stepping out for a couple of hours but will be back to see
>>>>>>>>>>>> how things went. Sorry for the issues :-(
>>>>>>>>>>>>
>>>>>>>>>>>> Andy
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I prefer a source replacement. Then I can compile it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>> I can do one of two things here:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) Supply a source replacement and then you would recompile,
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2) Give me the uname -a of where the cmsd will run and I'll
>>>>>>>>>>>>>> supply a binary replacement for you.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Your choice.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Andrew
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The problem is found. Great. Thanks.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Where can I find the patched cmsd?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky
>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I found the problem. Looks like a regression from way back
>>>>>>>>>>>>>>>> when. There is a missing flag on the redirect. This will
>>>>>>>>>>>>>>>> require a patched cmsd, but you need only to replace the
>>>>>>>>>>>>>>>> redirector's cmsd as this only affects the redirector. How
>>>>>>>>>>>>>>>> would you like to proceed?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It doesn't work. The atlas-bkp1 manager is still dropping
>>>>>>>>>>>>>>>>> nodes again. In the supervisor, I still haven't seen any
>>>>>>>>>>>>>>>>> data server registered. I said "I updated the ntp"
>>>>>>>>>>>>>>>>> because you said "the log timestamps do not overlap".
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky
>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Do you mean that everything is now working? It could be
>>>>>>>>>>>>>>>>>> that you removed the xrd.timeout directive. That really
>>>>>>>>>>>>>>>>>> could cause problems. As for the delays, that is normal
>>>>>>>>>>>>>>>>>> when the redirector thinks something is going wrong. The
>>>>>>>>>>>>>>>>>> strategy is to delay clients until it can get back to a
>>>>>>>>>>>>>>>>>> stable configuration. This usually prevents jobs from
>>>>>>>>>>>>>>>>>> crashing during stressful periods.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I restarted it to do the supervisor test, and also
>>>>>>>>>>>>>>>>>>> because the xrootd manager frequently doesn't respond.
>>>>>>>>>>>>>>>>>>> (*) is the cms.log; the file select is delayed again
>>>>>>>>>>>>>>>>>>> and again. When I do a restart, all things are fine.
>>>>>>>>>>>>>>>>>>> Now I am trying to find a clue about it.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (*)
>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> There is no core file. I copied new copies of the logs
>>>>>>>>>>>>>>>>>>> to the link below.
>>>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky
>>>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I see in the server log that it is restarting often.
>>>>>>>>>>>>>>>>>>>> Could you take a look at c193 to see if you have any
>>>>>>>>>>>>>>>>>>>> core files? Also please make sure that core files are
>>>>>>>>>>>>>>>>>>>> enabled as Linux defaults the size to 0. The first
>>>>>>>>>>>>>>>>>>>> step here is to find out why your servers are
>>>>>>>>>>>>>>>>>>>> restarting.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The logs can be found here. From the logs you can see
>>>>>>>>>>>>>>>>>>>>> the atlas-bkp1 manager is dropping, again and again,
>>>>>>>>>>>>>>>>>>>>> nodes which try to connect to it.
>>>>>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
>>>>>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Wen, Could you start everything up and provide me
>>>>>>>>>>>>>>>>>>>>>> a pointer to the manager log file, supervisor log
>>>>>>>>>>>>>>>>>>>>>> file, and one data server log file, all of which
>>>>>>>>>>>>>>>>>>>>>> cover the same time-frame (from start to some point
>>>>>>>>>>>>>>>>>>>>>> where you think things are working or not). That way
>>>>>>>>>>>>>>>>>>>>>> I can see what is happening. At the moment I only
>>>>>>>>>>>>>>>>>>>>>> see two "bad" things in the config file:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a
>>>>>>>>>>>>>>>>>>>>>> manager but you claim, via the all.manager
>>>>>>>>>>>>>>>>>>>>>> directive, that there are three (bkp2 and bkp3).
>>>>>>>>>>>>>>>>>>>>>> While it should work, the log file will be dense
>>>>>>>>>>>>>>>>>>>>>> with error messages. Please correct this to be
>>>>>>>>>>>>>>>>>>>>>> consistent and make it easier to see real errors.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> This is not a problem for me, because this config is
>>>>>>>>>>>>>>>>>>>>> used on the data servers. In the manager's config, I
>>>>>>>>>>>>>>>>>>>>> updated the "if atlas-bkp1.cs.wisc.edu" to atlas-bkp2
>>>>>>>>>>>>>>>>>>>>> or something. This is a historical problem: at first
>>>>>>>>>>>>>>>>>>>>> only atlas-bkp1 was used; atlas-bkp2 and atlas-bkp3
>>>>>>>>>>>>>>>>>>>>> were added later.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 2) Please use cms.space not olb.space (for
>>>>>>>>>>>>>>>>>>>>>> historical reasons the latter is still accepted and
>>>>>>>>>>>>>>>>>>>>>> over-rides the former, but that will soon end), and
>>>>>>>>>>>>>>>>>>>>>> please use only one (the config file uses both
>>>>>>>>>>>>>>>>>>>>>> directives).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Yes, I should remove this line; in fact cms.space is
>>>>>>>>>>>>>>>>>>>>> in the cfg too.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The xrootd has an internal mechanism to connect
>>>>>>>>>>>>>>>>>>>>>> servers with supervisors to allow for maximum
>>>>>>>>>>>>>>>>>>>>>> reliability. You cannot change that algorithm and
>>>>>>>>>>>>>>>>>>>>>> there is no need to do so. You should *never* tell
>>>>>>>>>>>>>>>>>>>>>> anyone to directly connect to a supervisor. If you
>>>>>>>>>>>>>>>>>>>>>> do, you will likely get unreachable nodes.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> As for dropping data servers, it would appear to me,
>>>>>>>>>>>>>>>>>>>>>> given the flurry of such activity, that something
>>>>>>>>>>>>>>>>>>>>>> either crashed or was restarted. That's why it would
>>>>>>>>>>>>>>>>>>>>>> be good to see the complete log of each one of the
>>>>>>>>>>>>>>>>>>>>>> entities.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I read the document and wrote a config file
>>>>>>>>>>>>>>>>>>>>>>> (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>>>>>>>>>>>>>>>>> With my conf, I can see the manager dispatching
>>>>>>>>>>>>>>>>>>>>>>> messages to the supervisor, but I cannot see any
>>>>>>>>>>>>>>>>>>>>>>> data server try to connect to the supervisor. At
>>>>>>>>>>>>>>>>>>>>>>> the same time, in the manager's log, I can see some
>>>>>>>>>>>>>>>>>>>>>>> data servers being dropped.
>>>>>>>>>>>>>>>>>>>>>>> How does xrootd decide which data server will
>>>>>>>>>>>>>>>>>>>>>>> connect to the supervisor? Should I specify some
>>>>>>>>>>>>>>>>>>>>>>> data servers to connect to the supervisor?
>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> (*) supervisor log >>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch >>>>>>>>>>>>>>>>>>>>>>> manager.0:20@atlas-bkp2 >>>>>>>>>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>>>>>>>>> state >>>>>>>>>>>>>>>>>>>>>>> dlen=42 >>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 >>>>>>>>>>>>>>>>>>>>>>> do_State: >>>>>>>>>>>>>>>>>>>>>>> /atlas/xrootd/users/wguan/test/test131141 >>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 >>>>>>>>>>>>>>>>>>>>>>> do_StateFWD: >>>>>>>>>>>>>>>>>>>>>>> Path >>>>>>>>>>>>>>>>>>>>>>> find >>>>>>>>>>>>>>>>>>>>>>> failed for state >>>>>>>>>>>>>>>>>>>>>>> /atlas/xrootd/users/wguan/test/test131141 >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> (*)manager log >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> TSpace=5587GB >>>>>>>>>>>>>>>>>>>>>>> NumFS=1 >>>>>>>>>>>>>>>>>>>>>>> FSpace=5693644MB MinFR=57218MB Util=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding >>>>>>>>>>>>>>>>>>>>>>> path: >>>>>>>>>>>>>>>>>>>>>>> w >>>>>>>>>>>>>>>>>>>>>>> /atlas >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> do_Space: 5696231MB free; 0% util >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 logged in. >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from >>>>>>>>>>>>>>>>>>>>>>> [log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running >>>>>>>>>>>>>>>>>>>>>>> ?:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol >>>>>>>>>>>>>>>>>>>>>>> cmsd >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>> 79 >>>>>>>>>>>>>>>>>>>>>>> attached >>>>>>>>>>>>>>>>>>>>>>> to poller 2; num=22 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add >>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> bumps >>>>>>>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 #63 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node: >>>>>>>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 dropped. >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved >>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 to cluster; >>>>>>>>>>>>>>>>>>>>>>> id=63.78; >>>>>>>>>>>>>>>>>>>>>>> num=64; >>>>>>>>>>>>>>>>>>>>>>> min=51 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> TSpace=5587GB >>>>>>>>>>>>>>>>>>>>>>> NumFS=1 >>>>>>>>>>>>>>>>>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding >>>>>>>>>>>>>>>>>>>>>>> path: >>>>>>>>>>>>>>>>>>>>>>> w >>>>>>>>>>>>>>>>>>>>>>> /atlas >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> do_Space: 5721854MB free; 0% util >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 logged in. 
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from >>>>>>>>>>>>>>>>>>>>>>> c187.chtc.wisc.edu; connection reset by peer >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node >>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>> 60 >>>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 node 63.78 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>> 79 detached from poller 2; num=21 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>>> c177.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> FD=16 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 node 0.3 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>> 16 detached from poller 2; num=20 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>>> c179.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> FD=21 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>> server.17065:[log in to unmask]:1094 node 1.4 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>> 21 >>>>>>>>>>>>>>>>>>>>>>> detached from poller 1; num=21 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to >>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to >>>>>>>>>>>>>>>>>>>>>>> redirector.15656:14@atlas-bkp2 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>>> c182.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> FD=19 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 node 7.10 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>> 19 detached from poller 2; num=19 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>>> c178.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> FD=15 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 node 9.12 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>> 15 detached from poller 1; num=20 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>>> c181.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> FD=17 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 node 5.8 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>> 17 detached from poller 0; num=21 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>>> c183.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> FD=22 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 node 8.11 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>> 22 detached from poller 2; num=18 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>>> c184.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> FD=20 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>> server.4131:[log in to unmask]:1094 node 3.6 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>> 20 detached from poller 0; num=20 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>>> c185.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> FD=23 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 node 6.9 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.8524:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.8524:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>>> 23 >>>>>>>>>>>>>>>>>>>>>>> detached from poller 0; num=19 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>>> c180.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>>> FD=18 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 node 4.7 >>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>>> server.14636:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.14636:[log in to unmask] XrdPoll: FD 18 detached from poller 1; num=19
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.1656:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c186.chtc.wisc.edu FD=24
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.1656:[log in to unmask]:1094 node 2.5
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.7849:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll: FD 24 detached from poller 1; num=18
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> To go past 64 data servers you will need to set up one or more supervisors.
>>>>>>>>>>>>>>>>>>>>>>>> This does not logically change the current configuration you have. You
>>>>>>>>>>>>>>>>>>>>>>>> only need to configure one or more *new* servers (or at least xrootd
>>>>>>>>>>>>>>>>>>>>>>>> processes) whose role is supervisor. We'd like them to run on separate
>>>>>>>>>>>>>>>>>>>>>>>> machines for reliability purposes, but they could run on the manager
>>>>>>>>>>>>>>>>>>>>>>>> node as long as you give each one a unique instance name (i.e., the -n
>>>>>>>>>>>>>>>>>>>>>>>> option).
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The front part of the cmsd reference explains how to do this:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Is there any way to configure xrootd with more than 65 machines? I
>>>>>>>>>>>>>>>>>>>>>>>>> used the configuration below, but it doesn't work. Should I configure
>>>>>>>>>>>>>>>>>>>>>>>>> some machines' role to be supervisor?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Wen
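
For what it's worth, Andy's recipe amounts to one role line per host class in
the shared config file, plus a unique instance name for every extra cmsd/xrootd
pair. Below is a minimal sketch, not a tested configuration: the host names
head.example.org and super1.example.org and the config path are hypothetical,
and the cms_config reference above remains the authoritative syntax.

  # xrdcluster.cfg -- shared by the manager, supervisor, and data servers
  # (head.example.org and super1.example.org are hypothetical names)
  all.manager head.example.org:3121

  all.role manager    if head.example.org
  all.role supervisor if super1.example.org
  all.role server

With something like that in place, data servers can subscribe to a supervisor
rather than directly to the manager, which is how a cluster grows past the 64
direct subscribers a single cmsd accepts. If a supervisor shares a machine with
the manager, start it under its own instance name:

  cmsd   -n super1 -c /opt/xrootd/etc/xrdcluster.cfg &
  xrootd -n super1 -c /opt/xrootd/etc/xrdcluster.cfg &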