Hi Wen, I see that you are getting error 10000, which means "generic error before any interaction". Could you please run the same command with debug level 3 and post the log showing the same kind of issue? Something like xrdcp -d 3 .... Most likely this time the problem is different. I may be wrong here, but a possible reason for that error is that the servers require authentication and xrdcp does not find some library in the LD_LIBRARY_PATH.

Fabrizio
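A full debug invocation of the kind Fabrizio suggests might look like the following sketch, reusing the host and file name from the failing transfer quoted below (substitute your own path):

    xrdcp -d 3 root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/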
wen guan wrote:
> Hi Andy,
>
> I put new logs on the web.
>
> It still doesn't work. I cannot copy files in or out.
>
> It seems the xrootd daemon at atlas-bkp1 hasn't talked to the cmsd.
> Normally, if the xrootd daemon tries to copy a file, in the cms.log I
> should see "do_Select: filename". But in this cms.log, there is
> nothing from atlas-bkp1.
>
> (*)
> [root@atlas-bkp1 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
> Last server error 10000 ('')
> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
> [root@atlas-bkp1 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123133
>
> Wen
>
> On Thu, Dec 17, 2009 at 10:54 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>> Hi Wen,
>>
>> I reviewed the log file. The odd redirect of c131 at 17:47:25 I can't
>> comment on, because its logs on the web site do not overlap with the
>> manager or supervisor. Unless all the logs include the full time in
>> question, I can't say much of anything. Can you provide me with
>> inclusive logs?
>>
>> atlas-bkp1 cms: 17:20:57 to 17:42:19 xrd: 17:20:57 to 17:40:57
>> higgs07 cms & xrd 17:22:33 to 17:42:33
>> c131 cms & xrd 17:31:57 to 17:47:28
>>
>> That said, it certainly looks like things were working and files were
>> being accessed and discovered on all the machines. You were even able to open
>> /atlas/xrootd/users/wguan/test/test98123313 though not
>> /atlas/xrootd/users/wguan/test/test123131. The other issue is that you did not
>> specify a stable adminpath, and the adminpath defaults to /tmp. If you have a
>> "cleanup" script that runs periodically for /tmp then eventually your
>> cluster will go catatonic as important (but not often used) files are deleted
>> by that script. Could you please find a stable home for the adminpath?
>>
>> I reran my tests here and things worked as expected. I will ramp up some
>> more tests. So, what is your status today?
>>
>> Andy
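The stable adminpath Andy asks for above is a single directive in the config file; a minimal sketch, assuming the daemons can write to /var/spool/xrootd (the path is illustrative, not from the thread):

    all.adminpath /var/spool/xrootd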
>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>> To: "Andrew Hanushevsky" <[log in to unmask]>
>> Cc: <[log in to unmask]>
>> Sent: Thursday, December 17, 2009 5:05 AM
>> Subject: Re: xrootd with more than 65 machines
>>
>> Hi Andy,
>>
>> Yes. I am using the file downloaded from
>> http://www.slac.stanford.edu/~abh/cmsd/ which I compiled yesterday. I
>> just now compiled it again and compared it with the one I compiled
>> yesterday. They are the same (same md5sum).
>>
>> Wen
>>
>> On Thu, Dec 17, 2009 at 2:09 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>> Hi Wen,
>>>
>>> If c131 cannot connect, then either c131 does not have the new cmsd or
>>> atlas-bkp1 does not have the new cmsd, as that is what would happen if
>>> either were true. Looking at the log on c131, it would appear that
>>> atlas-bkp1 is still using the old cmsd, as the response data length is
>>> wrong. Could you verify, please?
>>>
>>> Andy
>>>
>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>> Cc: <[log in to unmask]>
>>> Sent: Wednesday, December 16, 2009 3:58 PM
>>> Subject: Re: xrootd with more than 65 machines
>>>
>>> Hi Andy,
>>>
>>> I tried it, but there are still some problems. I put the logs in
>>> higgs03.cs.wisc.edu/wguan/
>>>
>>> In my test, c131 is the 65th node to be added to the manager. I can
>>> copy files to the pool through the manager, but I cannot copy out a
>>> file that is on c131.
>>>
>>> In c131's cms.log, I see "Manager: manager.0:[log in to unmask]
>>> removed; redirected" again and again, and I cannot see anything about
>>> c131 in higgs07's (supervisor) log. Does it mean the manager tries to
>>> redirect c131 to higgs07, but c131 doesn't try to connect to higgs07?
>>> It only tries to connect to the manager again.
>>>
>>> (*)
>>> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
>>> Last server error 10000 ('')
>>> Error accessing path/file for root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
>>> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123311
>>> [xrootd] Total 0.06 MB |====================| 100.00 % [3.1 MB/s]
>>> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123312
>>> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
>>> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/
>>> test123131
>>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>> Last server error 3011 ('No servers are available to read the file.')
>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/test123131
>>> /atlas/xrootd/users/wguan/test/test123131
>>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>> Last server error 3011 ('No servers are available to read the file.')
>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>> [root@c131 ~]# xrdcp /bin/mv root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123313
>>> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
>>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>> Last server error 3011 ('No servers are available to read the file.')
>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>> Last server error 3011 ('No servers are available to read the file.')
>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>> [root@c131 ~]# xrdcp root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/
>>> Last server error 3011 ('No servers are available to read the file.')
>>> Error accessing path/file for root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>> [root@c131 ~]# tail -f /var/log/xrootd/cms.log
>>> 091216 17:45:52 3103 manager.0:[log in to unmask] XrdLink: Setting ref to 2+-1 post=0
>>> 091216 17:45:55 3103 Pander trying to connect to lvl 0 atlas-bkp1.cs.wisc.edu:3121
>>> 091216 17:45:55 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
>>> 091216 17:45:55 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
>>> 091216 17:45:55 3103 ManTree: Now connected to 3 root node(s)
>>> 091216 17:45:55 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
>>> 091216 17:45:55 3103 Dispatch manager.0:[log in to unmask] for try dlen=3
>>> 091216 17:45:55 3103 manager.0:[log in to unmask] do_Try:
>>> 091216 17:45:55 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.95
>>> 091216 17:45:55 3103 Manager: manager.0:[log in to unmask] removed; redirected
>>> 091216 17:46:04 3103 Pander trying to connect to lvl 0 atlas-bkp1.cs.wisc.edu:3121
>>> 091216 17:46:04 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
>>> 091216 17:46:04 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
>>> 091216 17:46:04 3103 ManTree: Now connected to 3 root node(s)
>>> 091216 17:46:04 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
>>> 091216 17:46:04 3103 Dispatch manager.0:[log in to unmask] for try dlen=3
>>> 091216 17:46:04 3103 Protocol: No buffers to serve atlas-bkp1.cs.wisc.edu
>>> 091216 17:46:04 3103 Remove completed atlas-bkp1.cs.wisc.edu manager 0.96
>>> 091216 17:46:04 3103 Manager: manager.0:[log in to unmask] removed; insufficient buffers
>>> 091216 17:46:11 3103 Dispatch manager.0:[log in to unmask] for state dlen=169
>>> 091216 17:46:11 3103 manager.0:[log in to unmask] XrdLink: Setting ref to 1+1 post=0
>>>
>>> Thanks
>>> Wen
>>>
>>> On Thu, Dec 17, 2009 at 12:10 AM, wen guan <[log in to unmask]> wrote:
>>>> Hi Andy,
>>>>
>>>>> OK, I understand. As for stalling, too many nodes were deemed to be in
>>>>> trouble for the manager to allow service resumption.
>>>>>
>>>>> Please make sure that all of the nodes in the cluster receive the new
>>>>> cmsd, as they will drop off with the old one and you'll see the same
>>>>> kind of activity. Perhaps the best way to know that you succeeded in
>>>>> putting everything in sync is to start with 63 data nodes plus one
>>>>> supervisor. Once all connections are established, adding an additional
>>>>> server should simply send it to the supervisor.
>>>> I will do it.
>>>> You said to start 63 data servers and one supervisor. Does it mean the
>>>> supervisor is managed using the same policy? If there are 64 data
>>>> servers connected before the supervisor, will the supervisor be
>>>> dropped? Does the supervisor have high priority to be added to the
>>>> manager? I mean, if there are already 64 data servers and a supervisor
>>>> comes in, will the supervisor be accepted and a data server redirected
>>>> to the supervisor?
>>>>
>>>> Thanks
>>>> Wen
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>> But when I tried to xrdcp a file to it, it doesn't respond. In
>>>>> atlas-bkp1-xrd.log.20091213, it always prints "stalling client for 10
>>>>> sec". But in the cms.log, I can't find any message about the file.
>>>>>
>>>>>> I don't see why you say it doesn't work. With the debugging level set
>>>>>> so high, the noise may make it look like something is going wrong, but
>>>>>> that isn't necessarily the case.
>>>>>>
>>>>>> 1) The 'too many subscribers' message is correct. The manager was simply
>>>>>> redirecting them because there were already 64 servers. However, in your
>>>>>> case the supervisor wasn't started until almost 30 minutes after everyone
>>>>>> else (i.e., 10:42 AM). Why was that?
>>>>>> I'm not surprised about the flurry of messages with a critical component
>>>>>> missing for 30 minutes.
>>>>> Because the manager is a 64-bit machine but the supervisor is a 32-bit
>>>>> machine, I had to recompile it. At that time, I was interrupted by
>>>>> something else.
>>>>>
>>>>>> 2) Once the supervisor started, it started accepting the redirected
>>>>>> servers.
>>>>>>
>>>>>> 3) Then, 10 seconds later (10:42:10), the supervisor was restarted. So,
>>>>>> that would cause a flurry of activity to occur, as there is no backup
>>>>>> supervisor to take over.
>>>>>>
>>>>>> 4) This happened again at 10:42:34 AM and then again at 10:48:49. Is the
>>>>>> supervisor crashing? Is there a core file?
>>>>>>
>>>>>> 5) At 11:11 AM the manager restarted. Again, is there a core file here,
>>>>>> or was this a manual action?
>>>>>>
>>>>>> During the course of all of this, all connected nodes were operating
>>>>>> properly and files were being located.
>>>>>>
>>>>>> So, the two big questions are:
>>>>>>
>>>>>> a) Why was the supervisor not started until 30 minutes after the system
>>>>>> was started?
>>>>>>
>>>>>> b) Is there an explanation of the restarts? If this was a crash then we
>>>>>> need a core file to figure out what happened.
>>>>> It's not a crash. There are some reasons why I restarted some daemons.
>>>>> (1) I thought that if a data server tried many times to connect to a
>>>>> redirector but failed, it would not try to connect to the redirector
>>>>> again. The supervisor was missing for a long time, so maybe some data
>>>>> servers would not try to connect to atlas-bkp1 again. To reactivate
>>>>> these data servers, I restarted them.
>>>>> (2) When I tried to xrdcp, it was hanging for a long time. I thought
>>>>> maybe the manager was affected by something else, so I restarted the
>>>>> manager to see whether a restart could make the xrdcp work.
>>>>>
>>>>> Thanks
>>>>> Wen
>>>>>
>>>>>> Andy
>>>>>>
>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>> Cc: <[log in to unmask]>
>>>>>> Sent: Wednesday, December 16, 2009 9:38 AM
>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>> It still doesn't work.
>>>>>> The log files are in higgs03.cs.wisc.edu/wguan/. The names are *.20091216.
>>>>>> The manager complains that there are too many subscribers and then
>>>>>> removes nodes.
>>>>>>
>>>>>> (*)
>>>>>> Add server.10040:[log in to unmask] redirected; too many subscribers.
>>>>>>
>>>>>> Wen
>>>>>>
>>>>>> On Wed, Dec 16, 2009 at 4:25 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>> Hi Wen,
>>>>>>>
>>>>>>> It will be easier for me to retrofit, as the changes were pretty minor.
>>>>>>> Please lift the new XrdCmsNode.cc file from
>>>>>>>
>>>>>>> http://www.slac.stanford.edu/~abh/cmsd
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>> Cc: <[log in to unmask]>
>>>>>>> Sent: Tuesday, December 15, 2009 5:12 PM
>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>
>>>>>>> Hi Andy,
>>>>>>>
>>>>>>> I can switch to 20091104-1102. Then you don't need to patch
>>>>>>> another version. How can I download v20091104-1102?
>>>>>>> >>>>>>> Thanks >>>>>>> Wen >>>>>>> >>>>>>> On Wed, Dec 16, 2009 at 12:52 AM, Andrew Hanushevsky >>>>>>> <[log in to unmask]> >>>>>>> wrote: >>>>>>>> Hi Wen, >>>>>>>> >>>>>>>> Ah yes, I see that now. The file I gave you is based on >>>>>>>> v20091104-1102. >>>>>>>> Let >>>>>>>> me see if I can retrofit the patch for you. >>>>>>>> >>>>>>>> Andy >>>>>>>> >>>>>>>> ----- Original Message ----- From: "wen guan" >>>>>>>> <[log in to unmask]> >>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]> >>>>>>>> Cc: <[log in to unmask]> >>>>>>>> Sent: Tuesday, December 15, 2009 1:04 PM >>>>>>>> Subject: Re: xrootd with more than 65 machines >>>>>>>> >>>>>>>> >>>>>>>> Hi Andy, >>>>>>>> >>>>>>>> Which xrootd version are you using? XrdCmsConfig.hh is different. >>>>>>>> XrdCmsConfig.hh is downloaded from >>>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/. >>>>>>>> >>>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsNode.cc >>>>>>>> 6fb3ae40fe4e10bdd4d372818a341f2c src/XrdCms/XrdCmsNode.cc >>>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsConfig.hh >>>>>>>> 7d57753847d9448186c718f98e963cbe src/XrdCms/XrdCmsConfig.hh >>>>>>>> >>>>>>>> Thanks >>>>>>>> Wen >>>>>>>> >>>>>>>> On Tue, Dec 15, 2009 at 10:50 PM, Andrew Hanushevsky >>>>>>>> <[log in to unmask]> >>>>>>>> wrote: >>>>>>>>> Hi Wen, >>>>>>>>> >>>>>>>>> Just compiled on Linux and it was clean. Something is really wrong >>>>>>>>> with >>>>>>>>> your >>>>>>>>> source files, specifically XrdCmsConfig.cc >>>>>>>>> >>>>>>>>> The MD5 checksums on the relevant files are: >>>>>>>>> >>>>>>>>> MD5 (XrdCmsNode.cc) = 6fb3ae40fe4e10bdd4d372818a341f2c >>>>>>>>> >>>>>>>>> MD5 (XrdCmsConfig.hh) = 4a7d655582a7cd43b098947d0676924b >>>>>>>>> >>>>>>>>> Andy >>>>>>>>> >>>>>>>>> ----- Original Message ----- From: "wen guan" >>>>>>>>> <[log in to unmask]> >>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]> >>>>>>>>> Cc: <[log in to unmask]> >>>>>>>>> Sent: Tuesday, December 15, 2009 4:24 AM >>>>>>>>> Subject: Re: xrootd with more than 65 machines >>>>>>>>> >>>>>>>>> >>>>>>>>> Hi Andy, >>>>>>>>> >>>>>>>>> No problem. Thanks for the fix. But it cannot be compiled. The >>>>>>>>> version I am using is >>>>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/. >>>>>>>>> >>>>>>>>> Making cms component... 
>>>>>>>>> Compiling XrdCmsNode.cc >>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>> XrdCmsNode::do_Chmod(XrdCmsRRData&)': >>>>>>>>> XrdCmsNode.cc:268: error: `fsExec' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:268: warning: unused variable 'fsExec' >>>>>>>>> XrdCmsNode.cc:269: error: 'class XrdCmsConfig' has no member named >>>>>>>>> 'ossFS' >>>>>>>>> XrdCmsNode.cc:273: error: `fsFail' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:273: warning: unused variable 'fsFail' >>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>> XrdCmsNode::do_Mkdir(XrdCmsRRData&)': >>>>>>>>> XrdCmsNode.cc:600: error: `fsExec' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:600: warning: unused variable 'fsExec' >>>>>>>>> XrdCmsNode.cc:601: error: 'class XrdCmsConfig' has no member named >>>>>>>>> 'ossFS' >>>>>>>>> XrdCmsNode.cc:605: error: `fsFail' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:605: warning: unused variable 'fsFail' >>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>> XrdCmsNode::do_Mkpath(XrdCmsRRData&)': >>>>>>>>> XrdCmsNode.cc:640: error: `fsExec' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:640: warning: unused variable 'fsExec' >>>>>>>>> XrdCmsNode.cc:641: error: 'class XrdCmsConfig' has no member named >>>>>>>>> 'ossFS' >>>>>>>>> XrdCmsNode.cc:645: error: `fsFail' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:645: warning: unused variable 'fsFail' >>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>> XrdCmsNode::do_Mv(XrdCmsRRData&)': >>>>>>>>> XrdCmsNode.cc:704: error: `fsExec' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:704: warning: unused variable 'fsExec' >>>>>>>>> XrdCmsNode.cc:705: error: 'class XrdCmsConfig' has no member named >>>>>>>>> 'ossFS' >>>>>>>>> XrdCmsNode.cc:709: error: `fsFail' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:709: warning: unused variable 'fsFail' >>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>> XrdCmsNode::do_Rm(XrdCmsRRData&)': >>>>>>>>> XrdCmsNode.cc:831: error: `fsExec' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:831: warning: unused variable 'fsExec' >>>>>>>>> XrdCmsNode.cc:832: error: 'class XrdCmsConfig' has no member named >>>>>>>>> 'ossFS' >>>>>>>>> XrdCmsNode.cc:836: error: `fsFail' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:836: warning: unused variable 'fsFail' >>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>> XrdCmsNode::do_Rmdir(XrdCmsRRData&)': >>>>>>>>> XrdCmsNode.cc:873: error: `fsExec' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:873: warning: unused variable 'fsExec' >>>>>>>>> XrdCmsNode.cc:874: error: 'class XrdCmsConfig' has no member named >>>>>>>>> 'ossFS' >>>>>>>>> XrdCmsNode.cc:878: error: `fsFail' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:878: warning: unused variable 'fsFail' >>>>>>>>> XrdCmsNode.cc: In member function `const char* >>>>>>>>> XrdCmsNode::do_Trunc(XrdCmsRRData&)': >>>>>>>>> XrdCmsNode.cc:1377: error: `fsExec' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:1377: warning: unused variable 'fsExec' >>>>>>>>> XrdCmsNode.cc:1378: error: 'class XrdCmsConfig' has no member named >>>>>>>>> 'ossFS' >>>>>>>>> XrdCmsNode.cc:1382: error: `fsFail' was not declared in this scope >>>>>>>>> XrdCmsNode.cc:1382: warning: unused variable 'fsFail' >>>>>>>>> XrdCmsNode.cc: At global scope: >>>>>>>>> XrdCmsNode.cc:1524: error: no `int XrdCmsNode::fsExec(XrdOucProg*, >>>>>>>>> char*, char*)' member function declared in 
class `XrdCmsNode'
>>>>>>>>> XrdCmsNode.cc: In member function `int XrdCmsNode::fsExec(XrdOucProg*, char*, char*)':
>>>>>>>>> XrdCmsNode.cc:1533: error: `fsL2PFail1' was not declared in this scope
>>>>>>>>> XrdCmsNode.cc:1533: warning: unused variable 'fsL2PFail1'
>>>>>>>>> XrdCmsNode.cc:1537: error: `fsL2PFail2' was not declared in this scope
>>>>>>>>> XrdCmsNode.cc:1537: warning: unused variable 'fsL2PFail2'
>>>>>>>>> XrdCmsNode.cc: At global scope:
>>>>>>>>> XrdCmsNode.cc:1553: error: no `const char* XrdCmsNode::fsFail(const char*, const char*, const char*, int)' member function declared in class `XrdCmsNode'
>>>>>>>>> XrdCmsNode.cc: In member function `const char* XrdCmsNode::fsFail(const char*, const char*, const char*, int)':
>>>>>>>>> XrdCmsNode.cc:1559: error: `fsL2PFail1' was not declared in this scope
>>>>>>>>> XrdCmsNode.cc:1559: warning: unused variable 'fsL2PFail1'
>>>>>>>>> XrdCmsNode.cc:1560: error: `fsL2PFail2' was not declared in this scope
>>>>>>>>> XrdCmsNode.cc:1560: warning: unused variable 'fsL2PFail2'
>>>>>>>>> XrdCmsNode.cc: In static member function `static int XrdCmsNode::isOnline(char*, int)':
>>>>>>>>> XrdCmsNode.cc:1608: error: 'class XrdCmsConfig' has no member named 'ossFS'
>>>>>>>>> make[4]: *** [../../obj/XrdCmsNode.o] Error 1
>>>>>>>>> make[3]: *** [Linuxall] Error 2
>>>>>>>>> make[2]: *** [all] Error 2
>>>>>>>>> make[1]: *** [XrdCms] Error 2
>>>>>>>>> make: *** [all] Error 2
>>>>>>>>>
>>>>>>>>> Wen
>>>>>>>>>
>>>>>>>>> On Tue, Dec 15, 2009 at 2:08 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>> Hi Wen,
>>>>>>>>>>
>>>>>>>>>> I have developed a permanent fix. You will find the source files in
>>>>>>>>>>
>>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>>>
>>>>>>>>>> There are three files: XrdCmsCluster.cc XrdCmsNode.cc XrdCmsProtocol.cc
>>>>>>>>>>
>>>>>>>>>> Please do a source replacement and recompile. Unfortunately, the cmsd
>>>>>>>>>> will need to be replaced on each node regardless of role. My apologies
>>>>>>>>>> for the disruption. Please let me know how it goes.
>>>>>>>>>>
>>>>>>>>>> Andy
>>>>>>>>>>
>>>>>>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>> Sent: Sunday, December 13, 2009 7:04 AM
>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>
>>>>>>>>>> Hi Andrew,
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>> I used the new cmsd at the atlas-bkp1 manager, but it's still dropping
>>>>>>>>>> nodes, and in the supervisor's log I cannot see any data server
>>>>>>>>>> registering to it.
>>>>>>>>>>
>>>>>>>>>> The new logs are in http://higgs03.cs.wisc.edu/wguan/*.20091213.
>>>>>>>>>> The manager was patched at 091213 08:38:15.
>>>>>>>>>>
>>>>>>>>>> Wen
>>>>>>>>>>
>>>>>>>>>> On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>> Hi Wen
>>>>>>>>>>>
>>>>>>>>>>> You will find the source replacement at:
>>>>>>>>>>>
>>>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>>>>
>>>>>>>>>>> It's XrdCmsCluster.cc and it replaces xrootd/src/XrdCms/XrdCmsCluster.cc
>>>>>>>>>>>
>>>>>>>>>>> I'm stepping out for a couple of hours but will be back to see how
>>>>>>>>>>> things went. Sorry for the issues :-(
>>>>>>>>>>>
>>>>>>>>>>> Andy
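The replace-and-rebuild step Andy describes above is mechanical; a minimal sketch, assuming the source tree sits in ~/xrootd and builds with the classic make shown in the compile output (paths and the exact build invocation may differ on your system):

    cd ~/xrootd/src/XrdCms
    wget http://www.slac.stanford.edu/~abh/cmsd/XrdCmsCluster.cc   # for the later fix, also XrdCmsNode.cc and XrdCmsProtocol.cc
    cd ~/xrootd
    make
    # then install the rebuilt cmsd on every node (manager, supervisor, servers) and restart the daemons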
>>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>
>>>>>>>>>>>> I prefer a source replacement. Then I can compile it.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Wen
>>>>>>>>>>>>> I can do one of two things here:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) Supply a source replacement and then you would recompile, or
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2) Give me the uname -a of where the cmsd will run and I'll supply a
>>>>>>>>>>>>> binary replacement for you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Your choice.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Andrew
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The problem is found. Great. Thanks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Where can I find the patched cmsd?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I found the problem. Looks like a regression from way back when. There
>>>>>>>>>>>>>>> is a missing flag on the redirect. This will require a patched cmsd, but
>>>>>>>>>>>>>>> you need only to replace the redirector's cmsd, as this only affects the
>>>>>>>>>>>>>>> redirector. How would you like to proceed?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It doesn't work. The atlas-bkp1 manager is still dropping nodes.
>>>>>>>>>>>>>>>> In the supervisor, I still haven't seen any data server registered. I
>>>>>>>>>>>>>>>> said "I updated the ntp" because you said "the log timestamps do not
>>>>>>>>>>>>>>>> overlap".
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Do you mean that everything is now working? It could be that you removed
>>>>>>>>>>>>>>>>> the xrd.timeout directive. That really could cause problems. As for the
>>>>>>>>>>>>>>>>> delays, that is normal when the redirector thinks something is going
>>>>>>>>>>>>>>>>> wrong. The strategy is to delay clients until it can get back to a stable
>>>>>>>>>>>>>>>>> configuration. This usually prevents jobs from crashing during stressful
>>>>>>>>>>>>>>>>> periods.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I restarted it to do the supervisor test, and also because the xrootd
>>>>>>>>>>>>>>>>>> manager frequently doesn't respond. (*) is the cms.log; the file select
>>>>>>>>>>>>>>>>>> is delayed again and again. After a restart, all things are fine.
>>>>>>>>>>>>>>>>>> Now I am trying to find a clue about it.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> (*)
>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> There is no core file. I copied new copies of the logs to the link below.
>>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I see in the server log that it is restarting often. Could you take a
>>>>>>>>>>>>>>>>>>> look on c193 to see if you have any core files? Also, please make sure
>>>>>>>>>>>>>>>>>>> that core files are enabled, as Linux defaults the size to 0. The first
>>>>>>>>>>>>>>>>>>> step here is to find out why your servers are restarting.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Andy
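Enabling the core files Andy asks about above is a one-line shell setting; it must be in effect in the shell that starts the daemons:

    ulimit -c             # show the current core file size limit (0 means no core dumps)
    ulimit -c unlimited   # allow core files of any size for processes started from this shell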
>>>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The logs can be found here. From the log you can see that the atlas-bkp1
>>>>>>>>>>>>>>>>>>>> manager is dropping, again and again, the nodes that try to connect to it.
>>>>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>> Hi Wen, Could you start everything up and provide me with a pointer to
>>>>>>>>>>>>>>>>>>>>> the manager log file, supervisor log file, and one data server log file,
>>>>>>>>>>>>>>>>>>>>> all of which cover the same time-frame (from start to some point where
>>>>>>>>>>>>>>>>>>>>> you think things are working or not)? That way I can see what is
>>>>>>>>>>>>>>>>>>>>> happening. At the moment I only see two "bad" things in the config file:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager, but you claim,
>>>>>>>>>>>>>>>>>>>>> via the all.manager directive, that there are three (bkp2 and bkp3). While
>>>>>>>>>>>>>>>>>>>>> it should work, the log file will be dense with error messages. Please
>>>>>>>>>>>>>>>>>>>>> correct this to be consistent and make it easier to see real errors.
>>>>>>>>>>>>>>>>>>>> This is not a problem for me, because this config is used on the data
>>>>>>>>>>>>>>>>>>>> servers. On the manager, I updated the "if atlas-bkp1.cs.wisc.edu" to
>>>>>>>>>>>>>>>>>>>> atlas-bkp2 and so on. This is a historical problem: at first only
>>>>>>>>>>>>>>>>>>>> atlas-bkp1 was used; atlas-bkp2 and atlas-bkp3 were added later.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 2) Please use cms.space, not olb.space (for historical reasons the latter
>>>>>>>>>>>>>>>>>>>>> is still accepted and overrides the former, but that will soon end), and
>>>>>>>>>>>>>>>>>>>>> please use only one (the config file uses both directives).
>>>>>>>>>>>>>>>>>>>> Yes, I should remove this line; in fact cms.space is in the cfg too.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>> Wen
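A consistent manager designation for Andy's point 1) might look like the following sketch in the shared config file (host names and port are from this thread; the exact if-clause form should be checked against the cmsd reference):

    all.manager atlas-bkp1.cs.wisc.edu:3121
    all.manager atlas-bkp2.cs.wisc.edu:3121
    all.manager atlas-bkp3.cs.wisc.edu:3121
    all.role manager if atlas-bkp1.cs.wisc.edu
    all.role manager if atlas-bkp2.cs.wisc.edu
    all.role manager if atlas-bkp3.cs.wisc.edu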
>>>>>>>>>>>>>>>>>>>>> The xrootd has an internal mechanism to connect servers with supervisors
>>>>>>>>>>>>>>>>>>>>> to allow for maximum reliability. You cannot change that algorithm, and
>>>>>>>>>>>>>>>>>>>>> there is no need to do so. You should *never* tell anyone to directly
>>>>>>>>>>>>>>>>>>>>> connect to a supervisor. If you do, you will likely get unreachable nodes.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> As for dropping data servers, it would appear to me, given the flurry of
>>>>>>>>>>>>>>>>>>>>> such activity, that something either crashed or was restarted. That's why
>>>>>>>>>>>>>>>>>>>>> it would be good to see the complete log of each one of the entities.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I read the document and wrote a config file
>>>>>>>>>>>>>>>>>>>>>> (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>>>>>>>>>>>>>>>> Using my conf, I can see the manager dispatching messages to the
>>>>>>>>>>>>>>>>>>>>>> supervisor, but I cannot see any data server trying to connect to the
>>>>>>>>>>>>>>>>>>>>>> supervisor. At the same time, in the manager's log, I can see that some
>>>>>>>>>>>>>>>>>>>>>> data servers are dropped.
>>>>>>>>>>>>>>>>>>>>>> How does xrootd decide which data servers will connect to the supervisor?
>>>>>>>>>>>>>>>>>>>>>> Should I specify some data servers to connect to the supervisor?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> (*) supervisor log
>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find failed for state /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> (*) manager log
>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5693644MB MinFR=57218MB Util=0
>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094 do_Space: 5696231MB free; 0% util
>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: server.10585:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from >>>>>>>>>>>>>>>>>>>>>> [log in to unmask] >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running >>>>>>>>>>>>>>>>>>>>>> ?:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol >>>>>>>>>>>>>>>>>>>>>> cmsd >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 79 >>>>>>>>>>>>>>>>>>>>>> attached >>>>>>>>>>>>>>>>>>>>>> to poller 2; num=22 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add >>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> bumps >>>>>>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 #63 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node: >>>>>>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 dropped. >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved >>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 to cluster; >>>>>>>>>>>>>>>>>>>>>> id=63.78; >>>>>>>>>>>>>>>>>>>>>> num=64; >>>>>>>>>>>>>>>>>>>>>> min=51 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> TSpace=5587GB >>>>>>>>>>>>>>>>>>>>>> NumFS=1 >>>>>>>>>>>>>>>>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding >>>>>>>>>>>>>>>>>>>>>> path: >>>>>>>>>>>>>>>>>>>>>> w >>>>>>>>>>>>>>>>>>>>>> /atlas >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> do_Space: 5721854MB free; 0% util >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 logged in. >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from >>>>>>>>>>>>>>>>>>>>>> c187.chtc.wisc.edu; connection reset by peer >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 60 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 node 63.78 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 79 detached from poller 2; num=21 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>> c177.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> FD=16 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 node 0.3 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 16 detached from poller 2; num=20 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>> c179.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> FD=21 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.17065:[log in to unmask]:1094 node 1.4 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 21 >>>>>>>>>>>>>>>>>>>>>> detached from poller 1; num=21 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to >>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to >>>>>>>>>>>>>>>>>>>>>> redirector.15656:14@atlas-bkp2 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>> c182.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> FD=19 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 node 7.10 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 19 detached from poller 2; num=19 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>> c178.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> FD=15 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 node 9.12 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 15 detached from poller 1; num=20 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>> c181.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> FD=17 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 node 5.8 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 17 detached from poller 0; num=21 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>> c183.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> FD=22 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 node 8.11 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 22 detached from poller 2; num=18 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>> c184.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> FD=20 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.4131:[log in to unmask]:1094 node 3.6 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 20 detached from poller 0; num=20 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>> c185.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> FD=23 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 node 6.9 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.8524:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 23 >>>>>>>>>>>>>>>>>>>>>> detached from poller 0; num=19 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>> c180.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> FD=18 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 node 4.7 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.14636:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.14636:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 18 detached from poller 1; num=19 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service >>>>>>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>>>>>> c186.chtc.wisc.edu >>>>>>>>>>>>>>>>>>>>>> FD=24 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 node 2.5 >>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>>>>>> server.7849:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] >>>>>>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>>>>>> 24 >>>>>>>>>>>>>>>>>>>>>> detached from poller 1; num=18 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in >>>>>>>>>>>>>>>>>>>>>> 13 >>>>>>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled. >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled. >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled. >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled. >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled. >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled. >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled. 
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> To go past 64 data servers you will need to set up one or more
>>>>>>>>>>>>>>>>>>>>>>> supervisors. This does not logically change the current configuration
>>>>>>>>>>>>>>>>>>>>>>> you have. You only need to configure one or more *new* servers (or at
>>>>>>>>>>>>>>>>>>>>>>> least xrootd processes) whose role is supervisor. We'd like them to run
>>>>>>>>>>>>>>>>>>>>>>> on separate machines for reliability purposes, but they could run on the
>>>>>>>>>>>>>>>>>>>>>>> manager node as long as you give each one a unique instance name (i.e.,
>>>>>>>>>>>>>>>>>>>>>>> the -n option).
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> The front part of the cmsd reference explains how to do this.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Is there any way to configure xrootd with more than 65 machines? I used
>>>>>>>>>>>>>>>>>>>>>>>> the configuration below, but it doesn't work. Should I configure some
>>>>>>>>>>>>>>>>>>>>>>>> machines to be supervisors?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Wen
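Concretely, the recipe Andy describes above amounts to one config line giving a host the supervisor role, plus a cmsd/xrootd pair started with an instance name. A sketch, where the host, manager port, and log locations are from this thread, while the config path and instance name "super" are illustrative (the cmsd reference linked above is authoritative):

    # in the shared config file, make higgs07 a supervisor:
    all.role supervisor if higgs07.cs.wisc.edu
    all.manager atlas-bkp1.cs.wisc.edu:3121

    # on higgs07 (or on the manager node, with a unique -n instance name):
    cmsd -c /etc/xrootd/xrdcluster.cfg -n super -l /var/log/xrootd/cms.log &
    xrootd -c /etc/xrootd/xrdcluster.cfg -n super -l /var/log/xrootd/xrd.log &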