Hi Andy,

    I put the new logs on the web.

It still doesn't work. I cannot copy files in or out.

It seems the xrootd daemon at atlas-bkp1 hasn't talked to cmsd.
Normally, when the xrootd daemon tries to copy a file, I should see
"do_Select: filename" in the cms.log. But in this cms.log there is
nothing from atlas-bkp1.

(*)
[root@atlas-bkp1 ~]# xrdcp
root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
/tmp/
Last server error 10000 ('')
Error accessing path/file for
root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
[root@atlas-bkp1 ~]# xrdcp /bin/mv
root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123133


Wen

On Thu, Dec 17, 2009 at 10:54 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
> Hi Wen,
>
> I reviewed the log files. The one thing I can't comment on is the odd
> redirect of c131 at 17:47:25, because its logs on the web site do not overlap
> with the manager or supervisor logs. Unless all the logs cover the full time
> in question I can't say much of anything. Can you provide me with inclusive
> logs?
>
> atlas-bkp1 cms: 17:20:57 to 17:42:19 xrd: 17:20:57 to 17:40:57
> higgs07 cms & xrd 17:22:33 to 17:42:33
> c131 cms & xrd 17:31:57 to 17:47:28
>
> That said, it certainly looks like things were working and files were being
> accessed and discovered on all the machines. You even were able to open
> /atlas/xrootd/users/wguan/test/test98123313
> though not
> /atlas/xrootd/users/wguan/test/test123131
>
> The other issue is that you did not specify a stable adminpath and the
> adminpath defaults to /tmp. If you have a "cleanup" script that runs
> periodically for /tmp then eventually your cluster will go catatonic as
> important (but not often used) files are deleted by that script. Could you
> please find a stable home for the adminpath?
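>
> For what it's worth, one line in the config would do; the directive shown
> here assumes the usual all.adminpath form and the path is only a
> placeholder, any stable local directory works:
>
>    all.adminpath /var/adm/xrootd
>
> That keeps the admin files out of /tmp and away from the cleanup script.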
>
> I reran my tests here and things worked as expected. I will ramp up some
> more tests. So, what is your status today?
>
> Andy
>
> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
> To: "Andrew Hanushevsky" <[log in to unmask]>
> Cc: <[log in to unmask]>
> Sent: Thursday, December 17, 2009 5:05 AM
> Subject: Re: xrootd with more than 65 machines
>
>
> Hi Andy,
>
>   Yes. I am using the file downloaded from
> http://www.slac.stanford.edu/~abh/cmsd/, which I compiled yesterday. I
> just now compiled it again and compared it with the one I compiled
> yesterday; they are the same (same md5sum).
>
> Wen
>
> On Thu, Dec 17, 2009 at 2:09 AM, Andrew Hanushevsky <[log in to unmask]>
> wrote:
>>
>> Hi Wen,
>>
>> If c131 cannot connect then either c131 does not have the new cms or
>> atlas-bkp1 does not have the new cms as that would be what would happen if
>> either were true. Looking at the log on c131 it would appear that
>> atlas-bkp1
>> is still using the old cmsd as the response data length is wrong. Could
>> you
>> verify please.
>>
>> Andy
>>
>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>> To: "Andrew Hanushevsky" <[log in to unmask]>
>> Cc: <[log in to unmask]>
>> Sent: Wednesday, December 16, 2009 3:58 PM
>> Subject: Re: xrootd with more than 65 machines
>>
>>
>> Hi Andy,
>>
>> I tried it, but there are still some problems. I put the logs in
>> higgs03.cs.wisc.edu/wguan/
>>
>> In my test, c131 is the 65th node to be added to the manager.
>> I can copy a file into the pool through the manager, but I cannot
>> copy out a file that is on c131.
>>
>> In c131's cms.log, I see "Manager:
>> manager.0:[log in to unmask] removed; redirected" again and
>> again, and I cannot see anything about c131 in higgs07's
>> log (supervisor). Does that mean the manager tries to redirect it to
>> higgs07, but c131 never tries to connect to higgs07 and only tries to
>> connect to the manager again?
>>
>> (*)
>> [root@c131 ~]# xrdcp /bin/mv
>> root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
>> Last server error 10000 ('')
>> Error accessing path/file for
>> root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
>> [root@c131 ~]# xrdcp /bin/mv
>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123311
>> [xrootd] Total 0.06 MB |====================| 100.00 % [3.1 MB/s]
>> [root@c131 ~]# xrdcp /bin/mv
>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123312
>> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
>> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/
>> test123131
>> [root@c131 ~]# xrdcp
>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>> /tmp/
>> Last server error 3011 ('No servers are available to read the file.')
>> Error accessing path/file for
>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/test123131
>> /atlas/xrootd/users/wguan/test/test123131
>> [root@c131 ~]# xrdcp
>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>> /tmp/
>> Last server error 3011 ('No servers are available to read the file.')
>> Error accessing path/file for
>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>> [root@c131 ~]# xrdcp /bin/mv
>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123313
>> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
>> [root@c131 ~]# xrdcp
>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>> /tmp/
>> Last server error 3011 ('No servers are available to read the file.')
>> Error accessing path/file for
>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>> [root@c131 ~]# xrdcp
>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>> /tmp/
>> Last server error 3011 ('No servers are available to read the file.')
>> Error accessing path/file for
>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>> [root@c131 ~]# xrdcp
>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>> /tmp/
>> Last server error 3011 ('No servers are available to read the file.')
>> Error accessing path/file for
>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>> [root@c131 ~]# tail -f /var/log/xrootd/cms.log
>> 091216 17:45:52 3103 manager.0:[log in to unmask] XrdLink:
>> Setting ref to 2+-1 post=0
>> 091216 17:45:55 3103 Pander trying to connect to lvl 0
>> atlas-bkp1.cs.wisc.edu:3121
>> 091216 17:45:55 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
>> 091216 17:45:55 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
>> 091216 17:45:55 3103 ManTree: Now connected to 3 root node(s)
>> 091216 17:45:55 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
>> 091216 17:45:55 3103 Dispatch manager.0:[log in to unmask] for try
>> dlen=3
>> 091216 17:45:55 3103 manager.0:[log in to unmask] do_Try:
>> 091216 17:45:55 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.95
>> 091216 17:45:55 3103 Manager: manager.0:[log in to unmask]
>> removed; redirected
>> 091216 17:46:04 3103 Pander trying to connect to lvl 0
>> atlas-bkp1.cs.wisc.edu:3121
>> 091216 17:46:04 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
>> 091216 17:46:04 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
>> 091216 17:46:04 3103 ManTree: Now connected to 3 root node(s)
>> 091216 17:46:04 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
>> 091216 17:46:04 3103 Dispatch manager.0:[log in to unmask] for try
>> dlen=3
>> 091216 17:46:04 3103 Protocol: No buffers to serve atlas-bkp1.cs.wisc.edu
>> 091216 17:46:04 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.96
>> 091216 17:46:04 3103 Manager: manager.0:[log in to unmask]
>> removed; insufficient buffers
>> 091216 17:46:11 3103 Dispatch manager.0:[log in to unmask] for
>> state dlen=169
>> 091216 17:46:11 3103 manager.0:[log in to unmask] XrdLink:
>> Setting ref to 1+1 post=0
>>
>> Thanks
>> Wen
>>
>> On Thu, Dec 17, 2009 at 12:10 AM, wen guan <[log in to unmask]> wrote:
>>>
>>> Hi Andy,
>>>
>>>> OK, I understand. As for stalling, too many nodes were deemed to be in
>>>> trouble for the manager to allow service resumption.
>>>>
>>>> Please make sure that all of the nodes in the cluster receive the new
>>>> cmsd, as they will drop off with the old one and you'll see the same kind
>>>> of activity. Perhaps the best way to know that you succeeded in putting
>>>> everything in sync is to start with 63 data nodes plus one supervisor.
>>>> Once all connections are established, adding an additional server should
>>>> simply send it to the supervisor.
>>>
>>> I will do it.
>>> You said to start 63 data servers and one supervisor. Does that mean the
>>> supervisor is managed under the same policy? If there are 64 dataservers
>>> connected before the supervisor, will the supervisor be dropped? Or does
>>> the supervisor have higher priority to be added to the manager? I mean,
>>> if there are already 64 dataservers and a supervisor comes in, will the
>>> supervisor be accepted and a dataserver be redirected to the supervisor?
>>>
>>> Thanks
>>> Wen
>>>
>>>>
>>>> Hi Andrew,
>>>>
>>>> But when I tried to xrdcp a file to it, it doesn't respond. In
>>>> atlas-bkp1-xrd.log.20091213, it always prints "stalling client for 10
>>>> sec". But in cms.log, I cannot find any message about the file.
>>>>
>>>>> I don't see why you say it doesn't work. With the debugging level set
>>>>> so
>>>>> high the noise may make it look like something is going wrong but that
>>>>> isn't
>>>>> necessarily the case.
>>>>>
>>>>> 1) The 'too many subscribers' is correct. The manager was simply
>>>>> redirecting
>>>>> them because there were already 64 servers. However, in your case the
>>>>> supervisor wasn't started until almost 30 minutes after everyone else
>>>>> (i.e.,
>>>>> 10:42 AM). Why was that? I'm not surprised about the flurry of messages
>>>>> with a critical component missing for 30 minutes.
>>>>
>>>> Because the manager is a 64-bit machine but the supervisor is a 32-bit
>>>> machine, I had to recompile it. At that time, I was interrupted by
>>>> something else.
>>>>
>>>>
>>>>> 2) Once the supervisor started, it started accepting the redirected
>>>>> servers.
>>>>>
>>>>> 3) Then 10 seconds (10:42:10) later the supervisor was restarted. So,
>>>>> that
>>>>> would cause a flurry of activity to occur as there is no backup
>>>>> supervisor
>>>>> to take over.
>>>>>
>>>>> 4) This happened again at 10:42:34 AM then again at 10:48:49. Is the
>>>>> supervisor crashing? Is there a core file?
>>>>>
>>>>> 5) At 11:11 AM the manager restarted. Again, is there a core file here
>>>>> or
>>>>> was this a manual action?
>>>>>
>>>>> During the course of all of this, all connected nodes were operating
>>>>> properly and files were being located.
>>>>>
>>>>> So, the two big questions are:
>>>>>
>>>>> a) Why was the supervisor not started until 30 minutes after the system
>>>>> was
>>>>> started?
>>>>>
>>>>> b) Is there an explanation of the restarts? If this was a crash then we
>>>>> need
>>>>> a core file to figure out what happened.
>>>>
>>>> It's not a crash. There are some reasons why I restarted some daemons.
>>>> (1) I thought that if a dataserver tried many times to connect to a
>>>> redirector but failed, the dataserver would not try to connect to the
>>>> redirector again. The supervisor was missing for a long time, so maybe
>>>> some dataservers would not try to connect to atlas-bkp1 again. To
>>>> reactivate those dataservers, I restarted those servers.
>>>> (2) When I tried to xrdcp, it was hanging for a long time. I thought
>>>> maybe the manager was affected by something else, so I restarted the
>>>> manager to see whether a restart would make the xrdcp work.
>>>>
>>>>
>>>> Thanks
>>>> Wen
>>>>
>>>>> Andy
>>>>>
>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>> Cc: <[log in to unmask]>
>>>>> Sent: Wednesday, December 16, 2009 9:38 AM
>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>
>>>>>
>>>>> Hi Andrew,
>>>>>
>>>>> It still doesn't work.
>>>>> The log file is in higgs03.cs.wisc.edu/wguan/. The name is *.20091216
>>>>> The manager complains there are too many subscribers and then removes
>>>>> nodes.
>>>>>
>>>>> (*)
>>>>> Add server.10040:[log in to unmask] redirected; too many
>>>>> subscribers.
>>>>>
>>>>> Wen
>>>>>
>>>>> On Wed, Dec 16, 2009 at 4:25 AM, Andrew Hanushevsky <[log in to unmask]>
>>>>> wrote:
>>>>>>
>>>>>> Hi Wen,
>>>>>>
>>>>>> It will be easier for me to retrofit as the changes were pretty minor.
>>>>>> Please
>>>>>> lift the new XrdCmsNode.cc file from
>>>>>>
>>>>>> http://www.slac.stanford.edu/~abh/cmsd
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>> Cc: <[log in to unmask]>
>>>>>> Sent: Tuesday, December 15, 2009 5:12 PM
>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>
>>>>>>
>>>>>> Hi Andy,
>>>>>>
>>>>>> I can switch to 20091104-1102. Then you don't need to patch
>>>>>> another version. How can I download v20091104-1102?
>>>>>>
>>>>>> Thanks
>>>>>> Wen
>>>>>>
>>>>>> On Wed, Dec 16, 2009 at 12:52 AM, Andrew Hanushevsky
>>>>>> <[log in to unmask]>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Wen,
>>>>>>>
>>>>>>> Ah yes, I see that now. The file I gave you is based on
>>>>>>> v20091104-1102.
>>>>>>> Let
>>>>>>> me see if I can retrofit the patch for you.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> ----- Original Message ----- From: "wen guan"
>>>>>>> <[log in to unmask]>
>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>> Cc: <[log in to unmask]>
>>>>>>> Sent: Tuesday, December 15, 2009 1:04 PM
>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>
>>>>>>>
>>>>>>> Hi Andy,
>>>>>>>
>>>>>>> Which xrootd version are you using? My XrdCmsConfig.hh is different; it
>>>>>>> was downloaded from
>>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/.
>>>>>>>
>>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsNode.cc
>>>>>>> 6fb3ae40fe4e10bdd4d372818a341f2c src/XrdCms/XrdCmsNode.cc
>>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsConfig.hh
>>>>>>> 7d57753847d9448186c718f98e963cbe src/XrdCms/XrdCmsConfig.hh
>>>>>>>
>>>>>>> Thanks
>>>>>>> Wen
>>>>>>>
>>>>>>> On Tue, Dec 15, 2009 at 10:50 PM, Andrew Hanushevsky
>>>>>>> <[log in to unmask]>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Wen,
>>>>>>>>
>>>>>>>> Just compiled on Linux and it was clean. Something is really wrong
>>>>>>>> with
>>>>>>>> your
>>>>>>>> source files, specifically XrdCmsConfig.cc
>>>>>>>>
>>>>>>>> The MD5 checksums on the relevant files are:
>>>>>>>>
>>>>>>>> MD5 (XrdCmsNode.cc) = 6fb3ae40fe4e10bdd4d372818a341f2c
>>>>>>>>
>>>>>>>> MD5 (XrdCmsConfig.hh) = 4a7d655582a7cd43b098947d0676924b
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> ----- Original Message ----- From: "wen guan"
>>>>>>>> <[log in to unmask]>
>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>> Sent: Tuesday, December 15, 2009 4:24 AM
>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Andy,
>>>>>>>>
>>>>>>>> No problem. Thanks for the fix. But it cannot be compiled. The
>>>>>>>> version I am using is
>>>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/.
>>>>>>>>
>>>>>>>> Making cms component...
>>>>>>>> Compiling XrdCmsNode.cc
>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>> XrdCmsNode::do_Chmod(XrdCmsRRData&)':
>>>>>>>> XrdCmsNode.cc:268: error: `fsExec' was not declared in this scope
>>>>>>>> XrdCmsNode.cc:268: warning: unused variable 'fsExec'
>>>>>>>> XrdCmsNode.cc:269: error: 'class XrdCmsConfig' has no member named
>>>>>>>> 'ossFS'
>>>>>>>> XrdCmsNode.cc:273: error: `fsFail' was not declared in this scope
>>>>>>>> XrdCmsNode.cc:273: warning: unused variable 'fsFail'
>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>> XrdCmsNode::do_Mkdir(XrdCmsRRData&)':
>>>>>>>> XrdCmsNode.cc:600: error: `fsExec' was not declared in this scope
>>>>>>>> XrdCmsNode.cc:600: warning: unused variable 'fsExec'
>>>>>>>> XrdCmsNode.cc:601: error: 'class XrdCmsConfig' has no member named
>>>>>>>> 'ossFS'
>>>>>>>> XrdCmsNode.cc:605: error: `fsFail' was not declared in this scope
>>>>>>>> XrdCmsNode.cc:605: warning: unused variable 'fsFail'
>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>> XrdCmsNode::do_Mkpath(XrdCmsRRData&)':
>>>>>>>> XrdCmsNode.cc:640: error: `fsExec' was not declared in this scope
>>>>>>>> XrdCmsNode.cc:640: warning: unused variable 'fsExec'
>>>>>>>> XrdCmsNode.cc:641: error: 'class XrdCmsConfig' has no member named
>>>>>>>> 'ossFS'
>>>>>>>> XrdCmsNode.cc:645: error: `fsFail' was not declared in this scope
>>>>>>>> XrdCmsNode.cc:645: warning: unused variable 'fsFail'
>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>> XrdCmsNode::do_Mv(XrdCmsRRData&)':
>>>>>>>> XrdCmsNode.cc:704: error: `fsExec' was not declared in this scope
>>>>>>>> XrdCmsNode.cc:704: warning: unused variable 'fsExec'
>>>>>>>> XrdCmsNode.cc:705: error: 'class XrdCmsConfig' has no member named
>>>>>>>> 'ossFS'
>>>>>>>> XrdCmsNode.cc:709: error: `fsFail' was not declared in this scope
>>>>>>>> XrdCmsNode.cc:709: warning: unused variable 'fsFail'
>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>> XrdCmsNode::do_Rm(XrdCmsRRData&)':
>>>>>>>> XrdCmsNode.cc:831: error: `fsExec' was not declared in this scope
>>>>>>>> XrdCmsNode.cc:831: warning: unused variable 'fsExec'
>>>>>>>> XrdCmsNode.cc:832: error: 'class XrdCmsConfig' has no member named
>>>>>>>> 'ossFS'
>>>>>>>> XrdCmsNode.cc:836: error: `fsFail' was not declared in this scope
>>>>>>>> XrdCmsNode.cc:836: warning: unused variable 'fsFail'
>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>> XrdCmsNode::do_Rmdir(XrdCmsRRData&)':
>>>>>>>> XrdCmsNode.cc:873: error: `fsExec' was not declared in this scope
>>>>>>>> XrdCmsNode.cc:873: warning: unused variable 'fsExec'
>>>>>>>> XrdCmsNode.cc:874: error: 'class XrdCmsConfig' has no member named
>>>>>>>> 'ossFS'
>>>>>>>> XrdCmsNode.cc:878: error: `fsFail' was not declared in this scope
>>>>>>>> XrdCmsNode.cc:878: warning: unused variable 'fsFail'
>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>> XrdCmsNode::do_Trunc(XrdCmsRRData&)':
>>>>>>>> XrdCmsNode.cc:1377: error: `fsExec' was not declared in this scope
>>>>>>>> XrdCmsNode.cc:1377: warning: unused variable 'fsExec'
>>>>>>>> XrdCmsNode.cc:1378: error: 'class XrdCmsConfig' has no member named
>>>>>>>> 'ossFS'
>>>>>>>> XrdCmsNode.cc:1382: error: `fsFail' was not declared in this scope
>>>>>>>> XrdCmsNode.cc:1382: warning: unused variable 'fsFail'
>>>>>>>> XrdCmsNode.cc: At global scope:
>>>>>>>> XrdCmsNode.cc:1524: error: no `int XrdCmsNode::fsExec(XrdOucProg*,
>>>>>>>> char*, char*)' member function declared in class `XrdCmsNode'
>>>>>>>> XrdCmsNode.cc: In member function `int
>>>>>>>> XrdCmsNode::fsExec(XrdOucProg*,
>>>>>>>> char*, char*)':
>>>>>>>> XrdCmsNode.cc:1533: error: `fsL2PFail1' was not declared in this
>>>>>>>> scope
>>>>>>>> XrdCmsNode.cc:1533: warning: unused variable 'fsL2PFail1'
>>>>>>>> XrdCmsNode.cc:1537: error: `fsL2PFail2' was not declared in this
>>>>>>>> scope
>>>>>>>> XrdCmsNode.cc:1537: warning: unused variable 'fsL2PFail2'
>>>>>>>> XrdCmsNode.cc: At global scope:
>>>>>>>> XrdCmsNode.cc:1553: error: no `const char* XrdCmsNode::fsFail(const
>>>>>>>> char*, const char*, const char*, int)' member function declared in
>>>>>>>> class `XrdCmsNode'
>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>> XrdCmsNode::fsFail(const char*, const char*, const char*, int)':
>>>>>>>> XrdCmsNode.cc:1559: error: `fsL2PFail1' was not declared in this
>>>>>>>> scope
>>>>>>>> XrdCmsNode.cc:1559: warning: unused variable 'fsL2PFail1'
>>>>>>>> XrdCmsNode.cc:1560: error: `fsL2PFail2' was not declared in this
>>>>>>>> scope
>>>>>>>> XrdCmsNode.cc:1560: warning: unused variable 'fsL2PFail2'
>>>>>>>> XrdCmsNode.cc: In static member function `static int
>>>>>>>> XrdCmsNode::isOnline(char*, int)':
>>>>>>>> XrdCmsNode.cc:1608: error: 'class XrdCmsConfig' has no member named
>>>>>>>> 'ossFS'
>>>>>>>> make[4]: *** [../../obj/XrdCmsNode.o] Error 1
>>>>>>>> make[3]: *** [Linuxall] Error 2
>>>>>>>> make[2]: *** [all] Error 2
>>>>>>>> make[1]: *** [XrdCms] Error 2
>>>>>>>> make: *** [all] Error 2
>>>>>>>>
>>>>>>>>
>>>>>>>> Wen
>>>>>>>>
>>>>>>>> On Tue, Dec 15, 2009 at 2:08 AM, Andrew Hanushevsky
>>>>>>>> <[log in to unmask]>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Wen,
>>>>>>>>>
>>>>>>>>> I have developed a permanent fix. You will find the source files in
>>>>>>>>>
>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>>
>>>>>>>>> There are three files: XrdCmsCluster.cc XrdCmsNode.cc
>>>>>>>>> XrdCmsProtocol.cc
>>>>>>>>>
>>>>>>>>> Please do a source replacement and recompile. Unfortunately, the
>>>>>>>>> cmsd
>>>>>>>>> will
>>>>>>>>> need to be replaced on each node regardless of role. My apologies
>>>>>>>>> for
>>>>>>>>> the
>>>>>>>>> disruption. Please let me know how it goes.
>>>>>>>>>
>>>>>>>>> Andy
>>>>>>>>>
>>>>>>>>> ----- Original Message ----- From: "wen guan"
>>>>>>>>> <[log in to unmask]>
>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>> Sent: Sunday, December 13, 2009 7:04 AM
>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Andrew,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>> I used the new cmsd on the atlas-bkp1 manager. But it's still dropping
>>>>>>>>> nodes. And in the supervisor's log, I cannot see any dataserver
>>>>>>>>> registering with it.
>>>>>>>>>
>>>>>>>>> The new logs are in http://higgs03.cs.wisc.edu/wguan/*.20091213.
>>>>>>>>> The manager is patched at 091213 08:38:15.
>>>>>>>>>
>>>>>>>>> Wen
>>>>>>>>>
>>>>>>>>> On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky
>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Wen
>>>>>>>>>>
>>>>>>>>>> You will find the source replacement at:
>>>>>>>>>>
>>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>>>
>>>>>>>>>> It's XrdCmsCluster.cc and it replaces
>>>>>>>>>> xrootd/src/XrdCms/XrdCmsCluster.cc
>>>>>>>>>>
>>>>>>>>>> I'm stepping out for a couple of hours but will be back to see how
>>>>>>>>>> things
>>>>>>>>>> went. Sorry for the issues :-(
>>>>>>>>>>
>>>>>>>>>> Andy
>>>>>>>>>>
>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>
>>>>>>>>>>> I prefer a source replacement. Then I can compile it.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Wen
>>>>>>>>>>>>
>>>>>>>>>>>> I can do one of two things here:
>>>>>>>>>>>>
>>>>>>>>>>>> 1) Supply a source replacement and then you would recompile, or
>>>>>>>>>>>>
>>>>>>>>>>>> 2) Give me the uname -a of where the cmsd will run and I'll
>>>>>>>>>>>> supply
>>>>>>>>>>>> a
>>>>>>>>>>>> binary
>>>>>>>>>>>> replacement for you.
>>>>>>>>>>>>
>>>>>>>>>>>> Your choice.
>>>>>>>>>>>>
>>>>>>>>>>>> Andy
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Andrew
>>>>>>>>>>>>>
>>>>>>>>>>>>> The problem is found. Great. Thanks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Where can I find the patched cmsd?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky
>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I found the problem. Looks like a regression from way back when.
>>>>>>>>>>>>>> There is a missing flag on the redirect. This will require a patched
>>>>>>>>>>>>>> cmsd but you need only to replace the redirector's cmsd as this only
>>>>>>>>>>>>>> affects the redirector. How would you like to proceed?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It doesn't work. The atlas-bkp1 manager is still dropping nodes
>>>>>>>>>>>>>>> again. In the supervisor, I still haven't seen any dataserver
>>>>>>>>>>>>>>> registered. I said "I updated the ntp" because you said "the log
>>>>>>>>>>>>>>> timestamps do not overlap".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky
>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Do you mean that everything is now working? It could be that you
>>>>>>>>>>>>>>>> removed the xrd.timeout directive. That really could cause
>>>>>>>>>>>>>>>> problems. As for the delays, that is normal when the redirector
>>>>>>>>>>>>>>>> thinks something is going wrong. The strategy is to delay clients
>>>>>>>>>>>>>>>> until it can get back to a stable configuration. This usually
>>>>>>>>>>>>>>>> prevents jobs from crashing during stressful periods.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I restarted it to do the supervisor test, and also because the
>>>>>>>>>>>>>>>>> xrootd manager frequently doesn't respond. (*) is the cms.log; the
>>>>>>>>>>>>>>>>> file select is delayed again and again. After a restart, all
>>>>>>>>>>>>>>>>> things are fine. Now I am trying to find a clue about it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (*)
>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There is no core file. I copied new copies of the logs to the
>>>>>>>>>>>>>>>>> link below.
>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky
>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I see in the server log that it is restarting often. Could you
>>>>>>>>>>>>>>>>>> take a look in c193 to see if you have any core files? Also
>>>>>>>>>>>>>>>>>> please make sure that core files are enabled as Linux defaults
>>>>>>>>>>>>>>>>>> the size to 0. The first step here is to find out why your
>>>>>>>>>>>>>>>>>> servers are restarting.
>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The logs can be found here. From the log you can see that the
>>>>>>>>>>>>>>>>>>> atlas-bkp1 manager is dropping, again and again, the nodes that
>>>>>>>>>>>>>>>>>>> try to connect to it.
>>>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
>>>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Wen, Could you start everything up and provide me a pointer
>>>>>>>>>>>>>>>>>>>> to the manager log file, supervisor log file, and one data
>>>>>>>>>>>>>>>>>>>> server logfile, all of which cover the same time-frame (from
>>>>>>>>>>>>>>>>>>>> start to some point where you think things are working or not).
>>>>>>>>>>>>>>>>>>>> That way I can see what is happening. At the moment I only see
>>>>>>>>>>>>>>>>>>>> two "bad" things in the config file:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager but
>>>>>>>>>>>>>>>>>>>> you claim, via the all.manager directive, that there are three
>>>>>>>>>>>>>>>>>>>> (bkp2 and bkp3). While it should work, the log file will be
>>>>>>>>>>>>>>>>>>>> dense with error messages. Please correct this to be consistent
>>>>>>>>>>>>>>>>>>>> and make it easier to see real errors.
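>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> As an illustration only (the host and port are simply the ones
>>>>>>>>>>>>>>>>>>>> that appear in your logs, adjust as needed), a consistent
>>>>>>>>>>>>>>>>>>>> single-manager setup would look something like:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>    all.role manager if atlas-bkp1.cs.wisc.edu
>>>>>>>>>>>>>>>>>>>>    all.manager atlas-bkp1.cs.wisc.edu:3121
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> with additional all.role/all.manager pairs only for managers
>>>>>>>>>>>>>>>>>>>> that really exist.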
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This is not a problem for me, because this config is used on the
>>>>>>>>>>>>>>>>>>> dataservers. On the managers, I updated the "if
>>>>>>>>>>>>>>>>>>> atlas-bkp1.cs.wisc.edu" to atlas-bkp2 and so on. This is a
>>>>>>>>>>>>>>>>>>> historical issue: at first only atlas-bkp1 was used; atlas-bkp2
>>>>>>>>>>>>>>>>>>> and atlas-bkp3 were added later.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 2) Please use cms.space not olb.space (for historical reasons
>>>>>>>>>>>>>>>>>>>> the latter is still accepted and over-rides the former, but that
>>>>>>>>>>>>>>>>>>>> will soon end), and please use only one (the config file uses
>>>>>>>>>>>>>>>>>>>> both directives).
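>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> For instance, keep just a single space directive (the limits
>>>>>>>>>>>>>>>>>>>> below are only placeholders, use whatever values you have now):
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>    cms.space min 2% 5%
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> and drop the olb.space line entirely.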
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Yes, I should remove this line. In fact cms.space is in the cfg
>>>>>>>>>>>>>>>>>>> too.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The xrootd has an internal mechanism to connect servers with
>>>>>>>>>>>>>>>>>>>> supervisors to allow for maximum reliability. You cannot change
>>>>>>>>>>>>>>>>>>>> that algorithm and there is no need to do so. You should *never*
>>>>>>>>>>>>>>>>>>>> tell anyone to directly connect to a supervisor. If you do, you
>>>>>>>>>>>>>>>>>>>> will likely get unreachable nodes.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> As for dropping data servers, it would appear to me, given the
>>>>>>>>>>>>>>>>>>>> flurry of such activity, that something either crashed or was
>>>>>>>>>>>>>>>>>>>> restarted. That's why it would be good to see the complete log
>>>>>>>>>>>>>>>>>>>> of each one of the entities.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I read the document and wrote a config
>>>>>>>>>>>>>>>>>>>>> file (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>>>>>>>>>>>>>>> With my conf, I can see the manager dispatching messages to the
>>>>>>>>>>>>>>>>>>>>> supervisor. But I cannot see any dataserver trying to connect
>>>>>>>>>>>>>>>>>>>>> to the supervisor. At the same time, in the manager's log, I
>>>>>>>>>>>>>>>>>>>>> can see some dataservers being Dropped.
>>>>>>>>>>>>>>>>>>>>> How does xrootd decide which dataserver will connect to the
>>>>>>>>>>>>>>>>>>>>> supervisor? Should I specify some dataservers to connect to the
>>>>>>>>>>>>>>>>>>>>> supervisor?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> (*) supervisor log
>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find failed for state /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> (*)manager log
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>>> TSpace=5587GB
>>>>>>>>>>>>>>>>>>>>> NumFS=1
>>>>>>>>>>>>>>>>>>>>> FSpace=5693644MB MinFR=57218MB Util=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding
>>>>>>>>>>>>>>>>>>>>> path:
>>>>>>>>>>>>>>>>>>>>> w
>>>>>>>>>>>>>>>>>>>>> /atlas
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661
>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>> do_Space: 5696231MB free; 0% util
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from
>>>>>>>>>>>>>>>>>>>>> [log in to unmask]
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running
>>>>>>>>>>>>>>>>>>>>> ?:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> inq=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol
>>>>>>>>>>>>>>>>>>>>> cmsd
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll:
>>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>>> 79
>>>>>>>>>>>>>>>>>>>>> attached
>>>>>>>>>>>>>>>>>>>>> to poller 2; num=22
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add
>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> bumps
>>>>>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 #63
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node:
>>>>>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 dropped.
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved
>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 to cluster;
>>>>>>>>>>>>>>>>>>>>> id=63.78;
>>>>>>>>>>>>>>>>>>>>> num=64;
>>>>>>>>>>>>>>>>>>>>> min=51
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>>> TSpace=5587GB
>>>>>>>>>>>>>>>>>>>>> NumFS=1
>>>>>>>>>>>>>>>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding
>>>>>>>>>>>>>>>>>>>>> path:
>>>>>>>>>>>>>>>>>>>>> w
>>>>>>>>>>>>>>>>>>>>> /atlas
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661
>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>> do_Space: 5721854MB free; 0% util
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from
>>>>>>>>>>>>>>>>>>>>> c187.chtc.wisc.edu; connection reset by peer
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>>> 60
>>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 node 63.78
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661
>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>>> 79 detached from poller 2; num=21
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service
>>>>>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>>>>>> c177.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>>> FD=16
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 node 0.3
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>>> 16 detached from poller 2; num=20
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>>>>>> c179.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>>> FD=21
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>>> server.17065:[log in to unmask]:1094 node 1.4
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>>> server.7978:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>>> 21
>>>>>>>>>>>>>>>>>>>>> detached from poller 1; num=21
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to
>>>>>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to
>>>>>>>>>>>>>>>>>>>>> redirector.15656:14@atlas-bkp2
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service
>>>>>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>>>>>> c182.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>>> FD=19
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 node 7.10
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>>> 19 detached from poller 2; num=19
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service
>>>>>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>>>>>> c178.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>>> FD=15
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 node 9.12
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>>> 15 detached from poller 1; num=20
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service
>>>>>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>>>>>> c181.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>>> FD=17
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 node 5.8
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>>> 17 detached from poller 0; num=21
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service
>>>>>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>>>>>> c183.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>>> FD=22
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 node 8.11
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>>> server.27735:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>>> server.27735:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>>> 22 detached from poller 2; num=18
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>>>>>> c184.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>>> FD=20
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>>> server.4131:[log in to unmask]:1094 node 3.6
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>>> server.26787:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>>> server.26787:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>>> 20 detached from poller 0; num=20
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service
>>>>>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>>>>>> c185.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>>> FD=23
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 node 6.9
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>>> server.8524:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>>> 23
>>>>>>>>>>>>>>>>>>>>> detached from poller 0; num=19
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service
>>>>>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>>>>>> c180.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>>> FD=18
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 node 4.7
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>>> server.14636:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>>> server.14636:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>>> 18 detached from poller 1; num=19
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service
>>>>>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>>>>>> c186.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>>> FD=24
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 node 2.5
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>>> server.7849:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>>> 24
>>>>>>>>>>>>>>>>>>>>> detached from poller 1; num=18
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in
>>>>>>>>>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky
>>>>>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> To go past 64 data servers you will need to set up one or more
>>>>>>>>>>>>>>>>>>>>>> supervisors. This does not logically change the current
>>>>>>>>>>>>>>>>>>>>>> configuration you have. You only need to configure one or more
>>>>>>>>>>>>>>>>>>>>>> *new* servers (or at least xrootd processes) whose role is
>>>>>>>>>>>>>>>>>>>>>> supervisor. We'd like them to run on separate machines for
>>>>>>>>>>>>>>>>>>>>>> reliability purposes, but they could run on the manager node
>>>>>>>>>>>>>>>>>>>>>> as long as you give each one a unique instance name (i.e., -n
>>>>>>>>>>>>>>>>>>>>>> option).
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The front part of the cmsd reference explains how to do this.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
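>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> As a rough sketch (the host name here is only an example and
>>>>>>>>>>>>>>>>>>>>>> the manager line should match whatever you already use), a
>>>>>>>>>>>>>>>>>>>>>> supervisor's config can be the same file with a role stanza
>>>>>>>>>>>>>>>>>>>>>> along the lines of:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>    all.role supervisor if higgs07.cs.wisc.edu
>>>>>>>>>>>>>>>>>>>>>>    all.manager atlas-bkp1.cs.wisc.edu:3121
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> started with its own instance name (the -n option) if it
>>>>>>>>>>>>>>>>>>>>>> shares a machine with another xrootd/cmsd.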
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Is there any change needed to configure xrootd with more than
>>>>>>>>>>>>>>>>>>>>>>> 65 machines? I used the configuration below but it doesn't
>>>>>>>>>>>>>>>>>>>>>>> work. Should I configure some machines' role to be supervisor?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>
>
>