Hi Wen,

I reviewed the log files. Other than the odd redirect of c131 at 17:47:25, 
which I can't comment on because its logs on the web site do not overlap 
with the manager or supervisor logs, there is little to flag. Unless all the 
logs cover the full time period in question I can't say much of anything. 
Can you provide me with logs that cover the same interval?

atlas-bkp1: cms 17:20:57 to 17:42:19, xrd 17:20:57 to 17:40:57
higgs07:    cms & xrd 17:22:33 to 17:42:33
c131:       cms & xrd 17:31:57 to 17:47:28

That said, it certainly looks like things were working and files were being 
accessed and discovered on all the machines. You were even able to open
/atlas/xrootd/users/wguan/test/test98123313
though not
/atlas/xrootd/users/wguan/test/test123131

The other issue is that you did not specify a stable adminpath, and the 
adminpath defaults to /tmp. If you have a "cleanup" script that runs 
periodically on /tmp then eventually your cluster will go catatonic as 
important (but not often used) files are deleted by that script. Could you 
please find a stable home for the adminpath?
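For example, something along these lines in the config file should do (the 
directory shown is only an illustration; any persistent local path that is 
not swept by the /tmp cleanup will work):

   all.adminpath /var/spool/xrootd/admin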

I reran my tests here and things worked as expected. I will ramp up some 
more tests. So, what is your status today?

Andy

----- Original Message ----- 
From: "wen guan" <[log in to unmask]>
To: "Andrew Hanushevsky" <[log in to unmask]>
Cc: <[log in to unmask]>
Sent: Thursday, December 17, 2009 5:05 AM
Subject: Re: xrootd with more than 65 machines


Hi Andy,

    Yes, I am using the file downloaded from
http://www.slac.stanford.edu/~abh/cmsd/ which I compiled yesterday.  I
just now compiled it again and compared it with the one I compiled
yesterday; they are the same (same md5sum).

Wen

On Thu, Dec 17, 2009 at 2:09 AM, Andrew Hanushevsky <[log in to unmask]> 
wrote:
> Hi Wen,
>
> If c131 cannot connect, then either c131 does not have the new cmsd or
> atlas-bkp1 does not have the new cmsd, as that is exactly what would happen
> if either were true. Looking at the log on c131, it would appear that
> atlas-bkp1 is still using the old cmsd, as the response data length is
> wrong. Could you verify, please?
>
> Andy
>
> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
> To: "Andrew Hanushevsky" <[log in to unmask]>
> Cc: <[log in to unmask]>
> Sent: Wednesday, December 16, 2009 3:58 PM
> Subject: Re: xrootd with more than 65 machines
>
>
> Hi Andy,
>
> I tried it, but there are still some problems. I put the logs in
> higgs03.cs.wisc.edu/wguan/
>
> In my test, c131 is the 65th node to be added to the manager. I can
> copy a file into the pool through the manager, but I cannot copy out a
> file that is on c131.
>
> In c131's cms.log, I see "Manager:
> manager.0:[log in to unmask] removed; redirected" again and
> again, and I cannot see anything about c131 in higgs07's (supervisor)
> log. Does that mean the manager tries to redirect it to higgs07, but
> c131 never tries to connect to higgs07? It only tries to connect to the
> manager again.
>
> (*)
> [root@c131 ~]# xrdcp /bin/mv
> root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
> Last server error 10000 ('')
> Error accessing path/file for
> root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
> [root@c131 ~]# xrdcp /bin/mv
> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123311
> [xrootd] Total 0.06 MB |====================| 100.00 % [3.1 MB/s]
> [root@c131 ~]# xrdcp /bin/mv
> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123312
> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/
> test123131
> [root@c131 ~]# xrdcp
> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
> /tmp/
> Last server error 3011 ('No servers are available to read the file.')
> Error accessing path/file for
> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/test123131
> /atlas/xrootd/users/wguan/test/test123131
> [root@c131 ~]# xrdcp
> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
> /tmp/
> Last server error 3011 ('No servers are available to read the file.')
> Error accessing path/file for
> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
> [root@c131 ~]# xrdcp /bin/mv
> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123313
> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
> [root@c131 ~]# xrdcp
> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
> /tmp/
> Last server error 3011 ('No servers are available to read the file.')
> Error accessing path/file for
> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
> [root@c131 ~]# xrdcp
> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
> /tmp/
> Last server error 3011 ('No servers are available to read the file.')
> Error accessing path/file for
> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
> [root@c131 ~]# xrdcp
> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
> /tmp/
> Last server error 3011 ('No servers are available to read the file.')
> Error accessing path/file for
> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
> [root@c131 ~]# tail -f /var/log/xrootd/cms.log
> 091216 17:45:52 3103 manager.0:[log in to unmask] XrdLink:
> Setting ref to 2+-1 post=0
> 091216 17:45:55 3103 Pander trying to connect to lvl 0
> atlas-bkp1.cs.wisc.edu:3121
> 091216 17:45:55 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
> 091216 17:45:55 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
> 091216 17:45:55 3103 ManTree: Now connected to 3 root node(s)
> 091216 17:45:55 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
> 091216 17:45:55 3103 Dispatch manager.0:[log in to unmask] for try
> dlen=3
> 091216 17:45:55 3103 manager.0:[log in to unmask] do_Try:
> 091216 17:45:55 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.95
> 091216 17:45:55 3103 Manager: manager.0:[log in to unmask]
> removed; redirected
> 091216 17:46:04 3103 Pander trying to connect to lvl 0
> atlas-bkp1.cs.wisc.edu:3121
> 091216 17:46:04 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
> 091216 17:46:04 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
> 091216 17:46:04 3103 ManTree: Now connected to 3 root node(s)
> 091216 17:46:04 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
> 091216 17:46:04 3103 Dispatch manager.0:[log in to unmask] for try
> dlen=3
> 091216 17:46:04 3103 Protocol: No buffers to serve atlas-bkp1.cs.wisc.edu
> 091216 17:46:04 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.96
> 091216 17:46:04 3103 Manager: manager.0:[log in to unmask]
> removed; insufficient buffers
> 091216 17:46:11 3103 Dispatch manager.0:[log in to unmask] for
> state dlen=169
> 091216 17:46:11 3103 manager.0:[log in to unmask] XrdLink:
> Setting ref to 1+1 post=0
>
> Thanks
> Wen
>
> On Thu, Dec 17, 2009 at 12:10 AM, wen guan <[log in to unmask]> wrote:
>>
>> Hi Andy,
>>
>>> OK, I understand. As for stalling, too many nodes were deemed to be in
>>> trouble for the manager to allow service resumption.
>>>
>>> Please make sure that all of the nodes in the cluster receive the new
>>> cmsd, as they will drop off with the old one and you'll see the same
>>> kind of activity. Perhaps the best way to know that you succeeded in
>>> putting everything in sync is to start with 63 data nodes plus one
>>> supervisor. Once all connections are established, adding an additional
>>> server should simply send it to the supervisor.
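>>>
>>> (As a rough sketch only, with the exact directives depending on the rest
>>> of your config: the supervisor's config differs from a data server's
>>> mainly in its role line, e.g.,
>>>
>>>    all.role supervisor
>>>    all.manager atlas-bkp1.cs.wisc.edu 3121
>>>
>>> where the host and port are taken from your existing setup.)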
>>
>> I will do it.
>> You said to start 63 data servers and one supervisor. Does that mean the
>> supervisor is managed using the same policy? If there are 64 data servers
>> connected before the supervisor, will the supervisor be dropped? Does the
>> supervisor have higher priority to be added to the manager? I mean, if
>> there are already 64 data servers and a supervisor comes in, will the
>> supervisor be accepted and a data server be redirected to the supervisor?
>>
>> Thanks
>> Wen
>>
>>>
>>> Hi Andrew,
>>>
>>> But when I tried to xrdcp a file to it, it doesn't respond. In
>>> atlas-bkp1-xrd.log.20091213, it always prints "stalling client for 10
>>> sec", but in cms.log I cannot find any message about the file.
>>>
>>>> I don't see why you say it doesn't work. With the debugging level set 
>>>> so
>>>> high the noise may make it look like something is going wrong but that
>>>> isn't
>>>> necessarily the case.
>>>>
>>>> 1) The 'too many subscribers' is correct. The manager was simply
>>>> redirecting
>>>> them because there were already 64 servers. However, in your case the
>>>> supervisor wasn't started until almost 30 minutes after everyone else
>>>> (i.e.,
>>>> 10:42 AM). Why was that? I'm not surprised about the flurry of messages
>>>> given that a critical component was missing for 30 minutes.
>>>
>>> Because the manager is a 64-bit machine but the supervisor is a 32-bit
>>> machine, I had to recompile it. At that time, I was interrupted by
>>> something else.
>>>
>>>
>>>> 2) Once the supervisor started, it started accepting the redirected
>>>> servers.
>>>>
>>>> 3) Then 10 seconds (10:42:10) later the supervisor was restarted. So,
>>>> that
>>>> would cause a flurry of activity to occur as there is no backup
>>>> supervisor
>>>> to take over.
>>>>
>>>> 4) This happened again at 10:42:34 AM then again at 10:48:49. Is the
>>>> supervisor crashing? Is there a core file?
>>>>
>>>> 5) At 11:11 AM the manager restarted. Again, is there a core file here
>>>> or
>>>> was this a manual action?
>>>>
>>>> During the course of all of this, all connected nodes were operating
>>>> properly and files were being located.
>>>>
>>>> So, the two big questions are:
>>>>
>>>> a) Why was the supervisor not started until 30 minutes after the system
>>>> was
>>>> started?
>>>>
>>>> b) Is there an explanation of the restarts? If this was a crash then we
>>>> need
>>>> a core file to figure out what happened.
>>>
>>> It's not a crash. There are a couple of reasons why I restarted some
>>> daemons.
>>> (1) I thought that if a data server tried many times to connect to a
>>> redirector but failed, it would not try to connect to the redirector
>>> again. The supervisor was missing for a long time, so maybe some data
>>> servers would no longer try to connect to atlas-bkp1. To reactivate
>>> these data servers, I restarted those servers.
>>> (2) When I tried to xrdcp, it was hanging for a long time. I thought
>>> maybe the manager was affected by something else, so I restarted the
>>> manager to see whether a restart could make the xrdcp work.
>>>
>>>
>>> Thanks
>>> Wen
>>>
>>>> Andy
>>>>
>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>> Cc: <[log in to unmask]>
>>>> Sent: Wednesday, December 16, 2009 9:38 AM
>>>> Subject: Re: xrootd with more than 65 machines
>>>>
>>>>
>>>> Hi Andrew,
>>>>
>>>> It still doesn't work.
>>>> The log files are in higgs03.cs.wisc.edu/wguan/; the names are *.20091216.
>>>> The manager complains that there are too many subscribers and then
>>>> removes nodes.
>>>>
>>>> (*)
>>>> Add server.10040:[log in to unmask] redirected; too many 
>>>> subscribers.
>>>>
>>>> Wen
>>>>
>>>> On Wed, Dec 16, 2009 at 4:25 AM, Andrew Hanushevsky <[log in to unmask]>
>>>> wrote:
>>>>>
>>>>> Hi Wen,
>>>>>
>>>>> It will be easier for me to retrofit, as the changes were pretty minor.
>>>>> Please lift the new XrdCmsNode.cc file from
>>>>>
>>>>> http://www.slac.stanford.edu/~abh/cmsd
>>>>>
>>>>> Andy
>>>>>
>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>> Cc: <[log in to unmask]>
>>>>> Sent: Tuesday, December 15, 2009 5:12 PM
>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>
>>>>>
>>>>> Hi Andy,
>>>>>
>>>>> I can switch to 20091104-1102. Then you don't need to patch
>>>>> another version. How can I download v20091104-1102?
>>>>>
>>>>> Thanks
>>>>> Wen
>>>>>
>>>>> On Wed, Dec 16, 2009 at 12:52 AM, Andrew Hanushevsky 
>>>>> <[log in to unmask]>
>>>>> wrote:
>>>>>>
>>>>>> Hi Wen,
>>>>>>
>>>>>> Ah yes, I see that now. The file I gave you is based on
>>>>>> v20091104-1102.
>>>>>> Let
>>>>>> me see if I can retrofit the patch for you.
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> ----- Original Message ----- From: "wen guan" 
>>>>>> <[log in to unmask]>
>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>> Cc: <[log in to unmask]>
>>>>>> Sent: Tuesday, December 15, 2009 1:04 PM
>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>
>>>>>>
>>>>>> Hi Andy,
>>>>>>
>>>>>> Which xrootd version are you using? My XrdCmsConfig.hh is different;
>>>>>> it is the one downloaded from
>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/.
>>>>>>
>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsNode.cc
>>>>>> 6fb3ae40fe4e10bdd4d372818a341f2c src/XrdCms/XrdCmsNode.cc
>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsConfig.hh
>>>>>> 7d57753847d9448186c718f98e963cbe src/XrdCms/XrdCmsConfig.hh
>>>>>>
>>>>>> Thanks
>>>>>> Wen
>>>>>>
>>>>>> On Tue, Dec 15, 2009 at 10:50 PM, Andrew Hanushevsky
>>>>>> <[log in to unmask]>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Wen,
>>>>>>>
>>>>>>> Just compiled on Linux and it was clean. Something is really wrong
>>>>>>> with
>>>>>>> your
>>>>>>> source files, specifically XrdCmsConfig.cc
>>>>>>>
>>>>>>> The MD5 checksums on the relevant files are:
>>>>>>>
>>>>>>> MD5 (XrdCmsNode.cc) = 6fb3ae40fe4e10bdd4d372818a341f2c
>>>>>>>
>>>>>>> MD5 (XrdCmsConfig.hh) = 4a7d655582a7cd43b098947d0676924b
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> ----- Original Message ----- From: "wen guan"
>>>>>>> <[log in to unmask]>
>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>> Cc: <[log in to unmask]>
>>>>>>> Sent: Tuesday, December 15, 2009 4:24 AM
>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>
>>>>>>>
>>>>>>> Hi Andy,
>>>>>>>
>>>>>>> No problem. Thanks for the fix. But it cannot be compiled. The
>>>>>>> version I am using is
>>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/.
>>>>>>>
>>>>>>> Making cms component...
>>>>>>> Compiling XrdCmsNode.cc
>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>> XrdCmsNode::do_Chmod(XrdCmsRRData&)':
>>>>>>> XrdCmsNode.cc:268: error: `fsExec' was not declared in this scope
>>>>>>> XrdCmsNode.cc:268: warning: unused variable 'fsExec'
>>>>>>> XrdCmsNode.cc:269: error: 'class XrdCmsConfig' has no member named
>>>>>>> 'ossFS'
>>>>>>> XrdCmsNode.cc:273: error: `fsFail' was not declared in this scope
>>>>>>> XrdCmsNode.cc:273: warning: unused variable 'fsFail'
>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>> XrdCmsNode::do_Mkdir(XrdCmsRRData&)':
>>>>>>> XrdCmsNode.cc:600: error: `fsExec' was not declared in this scope
>>>>>>> XrdCmsNode.cc:600: warning: unused variable 'fsExec'
>>>>>>> XrdCmsNode.cc:601: error: 'class XrdCmsConfig' has no member named
>>>>>>> 'ossFS'
>>>>>>> XrdCmsNode.cc:605: error: `fsFail' was not declared in this scope
>>>>>>> XrdCmsNode.cc:605: warning: unused variable 'fsFail'
>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>> XrdCmsNode::do_Mkpath(XrdCmsRRData&)':
>>>>>>> XrdCmsNode.cc:640: error: `fsExec' was not declared in this scope
>>>>>>> XrdCmsNode.cc:640: warning: unused variable 'fsExec'
>>>>>>> XrdCmsNode.cc:641: error: 'class XrdCmsConfig' has no member named
>>>>>>> 'ossFS'
>>>>>>> XrdCmsNode.cc:645: error: `fsFail' was not declared in this scope
>>>>>>> XrdCmsNode.cc:645: warning: unused variable 'fsFail'
>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>> XrdCmsNode::do_Mv(XrdCmsRRData&)':
>>>>>>> XrdCmsNode.cc:704: error: `fsExec' was not declared in this scope
>>>>>>> XrdCmsNode.cc:704: warning: unused variable 'fsExec'
>>>>>>> XrdCmsNode.cc:705: error: 'class XrdCmsConfig' has no member named
>>>>>>> 'ossFS'
>>>>>>> XrdCmsNode.cc:709: error: `fsFail' was not declared in this scope
>>>>>>> XrdCmsNode.cc:709: warning: unused variable 'fsFail'
>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>> XrdCmsNode::do_Rm(XrdCmsRRData&)':
>>>>>>> XrdCmsNode.cc:831: error: `fsExec' was not declared in this scope
>>>>>>> XrdCmsNode.cc:831: warning: unused variable 'fsExec'
>>>>>>> XrdCmsNode.cc:832: error: 'class XrdCmsConfig' has no member named
>>>>>>> 'ossFS'
>>>>>>> XrdCmsNode.cc:836: error: `fsFail' was not declared in this scope
>>>>>>> XrdCmsNode.cc:836: warning: unused variable 'fsFail'
>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>> XrdCmsNode::do_Rmdir(XrdCmsRRData&)':
>>>>>>> XrdCmsNode.cc:873: error: `fsExec' was not declared in this scope
>>>>>>> XrdCmsNode.cc:873: warning: unused variable 'fsExec'
>>>>>>> XrdCmsNode.cc:874: error: 'class XrdCmsConfig' has no member named
>>>>>>> 'ossFS'
>>>>>>> XrdCmsNode.cc:878: error: `fsFail' was not declared in this scope
>>>>>>> XrdCmsNode.cc:878: warning: unused variable 'fsFail'
>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>> XrdCmsNode::do_Trunc(XrdCmsRRData&)':
>>>>>>> XrdCmsNode.cc:1377: error: `fsExec' was not declared in this scope
>>>>>>> XrdCmsNode.cc:1377: warning: unused variable 'fsExec'
>>>>>>> XrdCmsNode.cc:1378: error: 'class XrdCmsConfig' has no member named
>>>>>>> 'ossFS'
>>>>>>> XrdCmsNode.cc:1382: error: `fsFail' was not declared in this scope
>>>>>>> XrdCmsNode.cc:1382: warning: unused variable 'fsFail'
>>>>>>> XrdCmsNode.cc: At global scope:
>>>>>>> XrdCmsNode.cc:1524: error: no `int XrdCmsNode::fsExec(XrdOucProg*,
>>>>>>> char*, char*)' member function declared in class `XrdCmsNode'
>>>>>>> XrdCmsNode.cc: In member function `int
>>>>>>> XrdCmsNode::fsExec(XrdOucProg*,
>>>>>>> char*, char*)':
>>>>>>> XrdCmsNode.cc:1533: error: `fsL2PFail1' was not declared in this
>>>>>>> scope
>>>>>>> XrdCmsNode.cc:1533: warning: unused variable 'fsL2PFail1'
>>>>>>> XrdCmsNode.cc:1537: error: `fsL2PFail2' was not declared in this
>>>>>>> scope
>>>>>>> XrdCmsNode.cc:1537: warning: unused variable 'fsL2PFail2'
>>>>>>> XrdCmsNode.cc: At global scope:
>>>>>>> XrdCmsNode.cc:1553: error: no `const char* XrdCmsNode::fsFail(const
>>>>>>> char*, const char*, const char*, int)' member function declared in
>>>>>>> class `XrdCmsNode'
>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>> XrdCmsNode::fsFail(const char*, const char*, const char*, int)':
>>>>>>> XrdCmsNode.cc:1559: error: `fsL2PFail1' was not declared in this
>>>>>>> scope
>>>>>>> XrdCmsNode.cc:1559: warning: unused variable 'fsL2PFail1'
>>>>>>> XrdCmsNode.cc:1560: error: `fsL2PFail2' was not declared in this
>>>>>>> scope
>>>>>>> XrdCmsNode.cc:1560: warning: unused variable 'fsL2PFail2'
>>>>>>> XrdCmsNode.cc: In static member function `static int
>>>>>>> XrdCmsNode::isOnline(char*, int)':
>>>>>>> XrdCmsNode.cc:1608: error: 'class XrdCmsConfig' has no member named
>>>>>>> 'ossFS'
>>>>>>> make[4]: *** [../../obj/XrdCmsNode.o] Error 1
>>>>>>> make[3]: *** [Linuxall] Error 2
>>>>>>> make[2]: *** [all] Error 2
>>>>>>> make[1]: *** [XrdCms] Error 2
>>>>>>> make: *** [all] Error 2
>>>>>>>
>>>>>>>
>>>>>>> Wen
>>>>>>>
>>>>>>> On Tue, Dec 15, 2009 at 2:08 AM, Andrew Hanushevsky
>>>>>>> <[log in to unmask]>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Wen,
>>>>>>>>
>>>>>>>> I have developed a permanent fix. You will find the source files in
>>>>>>>>
>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>
>>>>>>>> There are three files: XrdCmsCluster.cc XrdCmsNode.cc
>>>>>>>> XrdCmsProtocol.cc
>>>>>>>>
>>>>>>>> Please do a source replacement and recompile. Unfortunately, the
>>>>>>>> cmsd
>>>>>>>> will
>>>>>>>> need to be replaced on each node regardless of role. My apologies
>>>>>>>> for
>>>>>>>> the
>>>>>>>> disruption. Please let me know how it goes.
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> ----- Original Message ----- From: "wen guan"
>>>>>>>> <[log in to unmask]>
>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>> Sent: Sunday, December 13, 2009 7:04 AM
>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Andrew,
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>> I used the new cmsd on the atlas-bkp1 manager, but it is still
>>>>>>>> dropping nodes. And in the supervisor's log, I cannot find any data
>>>>>>>> server registering with it.
>>>>>>>>
>>>>>>>> The new logs are in http://higgs03.cs.wisc.edu/wguan/*.20091213.
>>>>>>>> The manager is patched at 091213 08:38:15.
>>>>>>>>
>>>>>>>> Wen
>>>>>>>>
>>>>>>>> On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky
>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>
>>>>>>>>> Hi Wen
>>>>>>>>>
>>>>>>>>> You will find the source replacement at:
>>>>>>>>>
>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>>
>>>>>>>>> It's XrdCmsCluster.cc and it replaces
>>>>>>>>> xrootd/src/XrdCms/XrdCmsCluster.cc
>>>>>>>>>
>>>>>>>>> I'm stepping out for a couple of hours but will be back to see how
>>>>>>>>> things
>>>>>>>>> went. Sorry for the issues :-(
>>>>>>>>>
>>>>>>>>> Andy
>>>>>>>>>
>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>
>>>>>>>>>> Hi Andrew,
>>>>>>>>>>
>>>>>>>>>> I prefer a source replacement. Then I can compile it.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Wen
>>>>>>>>>>>
>>>>>>>>>>> I can do one of two things here:
>>>>>>>>>>>
>>>>>>>>>>> 1) Supply a source replacement and then you would recompile, or
>>>>>>>>>>>
>>>>>>>>>>> 2) Give me the uname -a of where the cmsd will run and I'll
>>>>>>>>>>> supply
>>>>>>>>>>> a
>>>>>>>>>>> binary
>>>>>>>>>>> replacement for you.
>>>>>>>>>>>
>>>>>>>>>>> Your choice.
>>>>>>>>>>>
>>>>>>>>>>> Andy
>>>>>>>>>>>
>>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Andrew
>>>>>>>>>>>>
>>>>>>>>>>>> The problem is found. Great. Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>> Where can I find the patched cmsd?
>>>>>>>>>>>>
>>>>>>>>>>>> Wen
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky
>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I found the problem. Looks like a regression from way back
>>>>>>>>>>>>> when.
>>>>>>>>>>>>> There
>>>>>>>>>>>>> is
>>>>>>>>>>>>> a
>>>>>>>>>>>>> missing flag on the redirect. This will require a patched cmsd
>>>>>>>>>>>>> but
>>>>>>>>>>>>> you
>>>>>>>>>>>>> need
>>>>>>>>>>>>> only to replace the redirector's cmsd as this only affects the
>>>>>>>>>>>>> redirector.
>>>>>>>>>>>>> How would you like to proceed?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It doesn't work. The atlas-bkp1 manager is still dropping
>>>>>>>>>>>>>> nodes. In the supervisor, I still haven't seen any data server
>>>>>>>>>>>>>> registered. I said "I updated the ntp" because you said "the
>>>>>>>>>>>>>> log timestamps do not overlap".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky
>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do you mean that everything is now working? It could be that
>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>> removed
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> xrd.timeout directive. That really could cause problems. As
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> delays,
>>>>>>>>>>>>>>> that is normal when the redirector thinks something is going
>>>>>>>>>>>>>>> wrong.
>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>> strategy is to delay clients until it can get back to a
>>>>>>>>>>>>>>> stable
>>>>>>>>>>>>>>> configuration. This usually prevents jobs from crashing
>>>>>>>>>>>>>>> during
>>>>>>>>>>>>>>> stressful
>>>>>>>>>>>>>>> periods.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I restarted it to do the supervisor test, and also because
>>>>>>>>>>>>>>>> the xrootd manager frequently doesn't respond. (*) is the
>>>>>>>>>>>>>>>> cms.log; the file select is delayed again and again. After a
>>>>>>>>>>>>>>>> restart, all things are fine. Now I am trying to find a clue
>>>>>>>>>>>>>>>> about it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (*)
>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There is no core file. I copied a new set of the logs to the
>>>>>>>>>>>>>>>> link below.
>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky
>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I see in the server log that it is restarting often. Could
>>>>>>>>>>>>>>>>> you take a look on c193 to see if you have any core files?
>>>>>>>>>>>>>>>>> Also please make sure that core files are enabled, as Linux
>>>>>>>>>>>>>>>>> defaults the size to 0. The first step here is to find out
>>>>>>>>>>>>>>>>> why your servers are restarting.
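>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (For example, the usual shell setting, run in the shell or
>>>>>>>>>>>>>>>>> startup script that launches the daemons, is simply:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    ulimit -c unlimited
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> nothing xrootd-specific is required.)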
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The logs can be found here; from them you can see the
>>>>>>>>>>>>>>>>>> atlas-bkp1 manager dropping, again and again, nodes that
>>>>>>>>>>>>>>>>>> try to connect to it.
>>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
>>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Wen, Could you start everything up and provide me a
>>>>>>>>>>>>>>>>>>> pointer
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> manager log file, supervisor log file, and one data
>>>>>>>>>>>>>>>>>>> server
>>>>>>>>>>>>>>>>>>> logfile
>>>>>>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> which cover the same time-frame (from start to some 
>>>>>>>>>>>>>>>>>>> point
>>>>>>>>>>>>>>>>>>> where
>>>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>> things are working or not). That way I can see what is
>>>>>>>>>>>>>>>>>>> happening.
>>>>>>>>>>>>>>>>>>> At
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> moment I only see two "bad" things in the config file:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a 
>>>>>>>>>>>>>>>>>>> manager
>>>>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>> claim,
>>>>>>>>>>>>>>>>>>> via
>>>>>>>>>>>>>>>>>>> the all.manager directive, that there are three (bkp2 
>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> bkp3).
>>>>>>>>>>>>>>>>>>> While
>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>> should work, the log file will be dense with error
>>>>>>>>>>>>>>>>>>> messages.
>>>>>>>>>>>>>>>>>>> Please
>>>>>>>>>>>>>>>>>>> correct
>>>>>>>>>>>>>>>>>>> this to be consistent and make it easier to see real
>>>>>>>>>>>>>>>>>>> errors.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This is not a problem for me, because this config is used
>>>>>>>>>>>>>>>>>> on the data servers. On the manager, I changed the "if
>>>>>>>>>>>>>>>>>> atlas-bkp1.cs.wisc.edu" to atlas-bkp2 and so on. This is a
>>>>>>>>>>>>>>>>>> historical artifact: at first only atlas-bkp1 was used;
>>>>>>>>>>>>>>>>>> atlas-bkp2 and atlas-bkp3 were added later.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2) Please use cms.space not olb.space (for historical
>>>>>>>>>>>>>>>>>>> reasons
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> latter
>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> still accepted and over-rides the former, but that will
>>>>>>>>>>>>>>>>>>> soon
>>>>>>>>>>>>>>>>>>> end),
>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> please use only one (the config file uses both
>>>>>>>>>>>>>>>>>>> directives).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Yes, I should remove this line. In fact, cms.space is in
>>>>>>>>>>>>>>>>>> the cfg too.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The xrootd has an internal mechanism to connect servers
>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>> supervisors
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> allow for maximum reliability. You cannot change that
>>>>>>>>>>>>>>>>>>> algorithm
>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> no need to do so. You should *never* tell anyone to
>>>>>>>>>>>>>>>>>>> directly
>>>>>>>>>>>>>>>>>>> connect
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>> supervisor. If you do, you will likely get unreachable
>>>>>>>>>>>>>>>>>>> nodes.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> As for dropping data servers, it would appear to me,
>>>>>>>>>>>>>>>>>>> given
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> flurry
>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>> such activity, that something either crashed or was
>>>>>>>>>>>>>>>>>>> restarted.
>>>>>>>>>>>>>>>>>>> That's
>>>>>>>>>>>>>>>>>>> why
>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>> would be good to see the complete log of each one of the
>>>>>>>>>>>>>>>>>>> entities.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I read the document and wrote a config
>>>>>>>>>>>>>>>>>>>> file (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>>>>>>>>>>>>>> With my conf, I can see the manager dispatching messages
>>>>>>>>>>>>>>>>>>>> to the supervisor, but I cannot see any data server try
>>>>>>>>>>>>>>>>>>>> to connect to the supervisor. At the same time, in the
>>>>>>>>>>>>>>>>>>>> manager's log, I can see some data servers being Dropped.
>>>>>>>>>>>>>>>>>>>> How does xrootd decide which data server will connect to
>>>>>>>>>>>>>>>>>>>> the supervisor? Should I specify some data servers to
>>>>>>>>>>>>>>>>>>>> connect to the supervisor?
>>>>>>>>>>>>>>>>>>>> supervisor?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (*) supervisor log
>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2
>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>> state
>>>>>>>>>>>>>>>>>>>> dlen=42
>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State:
>>>>>>>>>>>>>>>>>>>> /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2
>>>>>>>>>>>>>>>>>>>> do_StateFWD:
>>>>>>>>>>>>>>>>>>>> Path
>>>>>>>>>>>>>>>>>>>> find
>>>>>>>>>>>>>>>>>>>> failed for state
>>>>>>>>>>>>>>>>>>>> /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (*)manager log
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>> TSpace=5587GB
>>>>>>>>>>>>>>>>>>>> NumFS=1
>>>>>>>>>>>>>>>>>>>> FSpace=5693644MB MinFR=57218MB Util=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding
>>>>>>>>>>>>>>>>>>>> path:
>>>>>>>>>>>>>>>>>>>> w
>>>>>>>>>>>>>>>>>>>> /atlas
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661
>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>> do_Space: 5696231MB free; 0% util
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from
>>>>>>>>>>>>>>>>>>>> [log in to unmask]
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running
>>>>>>>>>>>>>>>>>>>> ?:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> inq=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol 
>>>>>>>>>>>>>>>>>>>> cmsd
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll:
>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>> 79
>>>>>>>>>>>>>>>>>>>> attached
>>>>>>>>>>>>>>>>>>>> to poller 2; num=22
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add
>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> bumps
>>>>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 #63
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node:
>>>>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 dropped.
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved
>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 to cluster;
>>>>>>>>>>>>>>>>>>>> id=63.78;
>>>>>>>>>>>>>>>>>>>> num=64;
>>>>>>>>>>>>>>>>>>>> min=51
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>> TSpace=5587GB
>>>>>>>>>>>>>>>>>>>> NumFS=1
>>>>>>>>>>>>>>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding
>>>>>>>>>>>>>>>>>>>> path:
>>>>>>>>>>>>>>>>>>>> w
>>>>>>>>>>>>>>>>>>>> /atlas
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661
>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>> do_Space: 5721854MB free; 0% util
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from
>>>>>>>>>>>>>>>>>>>> c187.chtc.wisc.edu; connection reset by peer
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>> 60
>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 node 63.78
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 
>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>> 79 detached from poller 2; num=21
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service
>>>>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>>>>> c177.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>> FD=16
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 node 0.3
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 
>>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>> 16 detached from poller 2; num=20
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>>>>> c179.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>> FD=21
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>> server.17065:[log in to unmask]:1094 node 1.4
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>> server.7978:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>> 21
>>>>>>>>>>>>>>>>>>>> detached from poller 1; num=21
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to 
>>>>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to
>>>>>>>>>>>>>>>>>>>> redirector.15656:14@atlas-bkp2
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service
>>>>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>>>>> c182.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>> FD=19
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 node 7.10
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 
>>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>> 19 detached from poller 2; num=19
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service
>>>>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>>>>> c178.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>> FD=15
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 node 9.12
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 
>>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>> 15 detached from poller 1; num=20
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service
>>>>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>>>>> c181.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>> FD=17
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 node 5.8
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 
>>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>> 17 detached from poller 0; num=21
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service
>>>>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>>>>> c183.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>> FD=22
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 node 8.11
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>> server.27735:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 
>>>>>>>>>>>>>>>>>>>> server.27735:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>> 22 detached from poller 2; num=18
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>>>>> c184.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>> FD=20
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>> server.4131:[log in to unmask]:1094 node 3.6
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>> server.26787:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 
>>>>>>>>>>>>>>>>>>>> server.26787:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>> 20 detached from poller 0; num=20
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service
>>>>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>>>>> c185.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>> FD=23
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 node 6.9
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>> server.8524:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>> 23
>>>>>>>>>>>>>>>>>>>> detached from poller 0; num=19
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service
>>>>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>>>>> c180.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>> FD=18
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 node 4.7
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>> server.14636:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 
>>>>>>>>>>>>>>>>>>>> server.14636:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>> 18 detached from poller 1; num=19
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service
>>>>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>>>>> c186.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>> FD=24
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 node 2.5
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>> server.7849:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>> 24
>>>>>>>>>>>>>>>>>>>> detached from poller 1; num=18
>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky
>>>>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> To go past 64 data servers you will need to setup one
>>>>>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>>> supervisors.
>>>>>>>>>>>>>>>>>>>>> This does not logically change the current
>>>>>>>>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>>>> have.
>>>>>>>>>>>>>>>>>>>>> You
>>>>>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>>> need to configure one or more *new* servers (or at
>>>>>>>>>>>>>>>>>>>>> least
>>>>>>>>>>>>>>>>>>>>> xrootd
>>>>>>>>>>>>>>>>>>>>> processes)
>>>>>>>>>>>>>>>>>>>>> whose role is supervisor. We'd like them to run in
>>>>>>>>>>>>>>>>>>>>> separate
>>>>>>>>>>>>>>>>>>>>> machines
>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>> reliability purposes, but they could run on the 
>>>>>>>>>>>>>>>>>>>>> manager
>>>>>>>>>>>>>>>>>>>>> node
>>>>>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>>> long
>>>>>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>>>> give each one a unique instance name (i.e., -n 
>>>>>>>>>>>>>>>>>>>>> option).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The front part of the cmsd reference explains how to 
>>>>>>>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>>>>>>>>>>>>>>>>>>>
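>>>>>>>>>>>>>>>>>>>>> (As a rough illustration only, with the instance name
>>>>>>>>>>>>>>>>>>>>> and config path as placeholders, running a supervisor
>>>>>>>>>>>>>>>>>>>>> instance on the manager node would look something like:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>    cmsd   -n super -c /path/to/xrdcluster.cfg &
>>>>>>>>>>>>>>>>>>>>>    xrootd -n super -c /path/to/xrdcluster.cfg &
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> with the role for that instance set to supervisor in
>>>>>>>>>>>>>>>>>>>>> the config file.)
>>>>>>>>>>>>>>>>>>>>>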
>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Is there anything special I need to do to configure
>>>>>>>>>>>>>>>>>>>>>> xrootd with more than 65 machines? I used the
>>>>>>>>>>>>>>>>>>>>>> configuration below but it doesn't work. Should I
>>>>>>>>>>>>>>>>>>>>>> configure some machines' manager to be a supervisor?
>>>>>>>>>>>>>>>>>>>>>> configure some machines' manager to be supvervisor?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>>