Hi Andrew,

     But when I tried to xrdcp a file to it, it didn't respond. In
atlas-bkp1-xrd.log.20091213 it keeps printing "stalling client for 10
sec", but in cms.log I cannot find any message about the file.

> I don't see why you say it doesn't work. With the debugging level set so
> high the noise may make it look like something is going wrong but that isn't
> necessarily the case.
>
> 1) The 'too many subscribers' is correct. The manager was simply redirecting
> them because there were already 64 servers. However, in your case the
> supervisor wasn't started until almost 30 minutes after everyone else (i.e.,
> 10:42 AM). Why was that? I'm not surprised about the flurry of messages with
> a critical component missing for 30 minutes.
Because the manager is a 64-bit machine but the supervisor is a 32-bit
machine, I had to recompile it. At that time I was interrupted by
something else.


> 2) Once the supervisor started, it started accepting the redirected servers.
>
> 3) Then 10 seconds (10:42:10) later the supervisor was restarted. So, that
> would cause a flurry of activity to occur as there is no backup supervisor
> to take over.
>
> 4) This happened again at 10:42:34 AM then again at 10:48:49. Is the
> supervisor crashing? Is there a core file?
>
> 5) At 11:11 AM the manager restarted. Again, is there a core file here or
> was this a manual action?
>
> During the course of all of this, all nodes connected were operating properly
> and files were being located.
>
> So, the two big questions are:
>
> a) Why was the supervisor not started until 30 minutes after the system was
> started?
>
> b) Is there an explanation of the restarts? If this was a crash then we need
> a core file to figure out what happened.
It's not a crash. There are a couple of reasons why I restarted some daemons.
(1) I thought that if a data server had tried many times to connect to a
redirector and failed, it would stop trying to connect to that redirector.
The supervisor was missing for a long time, so maybe some data servers
would no longer try to connect to atlas-bkp1. To reactivate those data
servers, I restarted them.
(2) When I tried to xrdcp, it hung for a long time. I thought maybe the
manager was affected by something else, so I restarted the manager to see
whether a restart would make the xrdcp work.


Thanks
Wen

> Andy
>
> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
> To: "Andrew Hanushevsky" <[log in to unmask]>
> Cc: <[log in to unmask]>
> Sent: Wednesday, December 16, 2009 9:38 AM
> Subject: Re: xrootd with more than 65 machines
>
>
> Hi Andrew,
>
>   It still doesn't work.
>   The log file is in higgs03.cs.wisc.edu/wguan/.  The name is *.20091216
>   The manager complains there are too many subscribers and the removes
> nodes.
>
> (*)
> Add server.10040:[log in to unmask] redirected; too many subscribers.
>
> Wen
>
> On Wed, Dec 16, 2009 at 4:25 AM, Andrew Hanushevsky <[log in to unmask]>
> wrote:
>>
>> Hi Wen,
>>
>> It will be easier for me to retrofit as the changes were pretty minor.
>> Please lift the new XrdCmsNode.cc file from
>>
>> http://www.slac.stanford.edu/~abh/cmsd
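For reference, a minimal sketch of the replace-and-rebuild step this describes, assuming the xrootd/src/XrdCms layout shown elsewhere in this thread and that wget is available; the exact build command depends on how the tree was originally configured:

    # Sketch only: drop in the replacement source file and rebuild.
    cd xrootd/src/XrdCms
    wget http://www.slac.stanford.edu/~abh/cmsd/XrdCmsNode.cc -O XrdCmsNode.cc
    cd ../..
    make            # rerun the same build command used for the original tree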
>>
>> Andy
>>
>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>> To: "Andrew Hanushevsky" <[log in to unmask]>
>> Cc: <[log in to unmask]>
>> Sent: Tuesday, December 15, 2009 5:12 PM
>> Subject: Re: xrootd with more than 65 machines
>>
>>
>> Hi Andy,
>>
>> I can switch to 20091104-1102. Then you don't need to patch
>> another version. How can I download v20091104-1102?
>>
>> Thanks
>> Wen
>>
>> On Wed, Dec 16, 2009 at 12:52 AM, Andrew Hanushevsky <[log in to unmask]>
>> wrote:
>>>
>>> Hi Wen,
>>>
>>> Ah yes, I see that now. The file I gave you is based on v20091104-1102.
>>> Let
>>> me see if I can retrofit the patch for you.
>>>
>>> Andy
>>>
>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>> Cc: <[log in to unmask]>
>>> Sent: Tuesday, December 15, 2009 1:04 PM
>>> Subject: Re: xrootd with more than 65 machines
>>>
>>>
>>> Hi Andy,
>>>
>>> Which xrootd version are you using? My XrdCmsConfig.hh is different; it
>>> was downloaded from
>>> http://xrootd.slac.stanford.edu/download/20091028-1003/.
>>>
>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsNode.cc
>>> 6fb3ae40fe4e10bdd4d372818a341f2c src/XrdCms/XrdCmsNode.cc
>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsConfig.hh
>>> 7d57753847d9448186c718f98e963cbe src/XrdCms/XrdCmsConfig.hh
>>>
>>> Thanks
>>> Wen
>>>
>>> On Tue, Dec 15, 2009 at 10:50 PM, Andrew Hanushevsky <[log in to unmask]>
>>> wrote:
>>>>
>>>> Hi Wen,
>>>>
>>>> Just compiled on Linux and it was clean. Something is really wrong with
>>>> your
>>>> source files, specifically XrdCmsConfig.cc
>>>>
>>>> The MD5 checksums on the relevant files are:
>>>>
>>>> MD5 (XrdCmsNode.cc) = 6fb3ae40fe4e10bdd4d372818a341f2c
>>>>
>>>> MD5 (XrdCmsConfig.hh) = 4a7d655582a7cd43b098947d0676924b
>>>>
>>>> Andy
>>>>
>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>> Cc: <[log in to unmask]>
>>>> Sent: Tuesday, December 15, 2009 4:24 AM
>>>> Subject: Re: xrootd with more than 65 machines
>>>>
>>>>
>>>> Hi Andy,
>>>>
>>>> No problem. Thanks for the fix. But it cannot be compiled. The
>>>> version I am using is
>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/.
>>>>
>>>> Making cms component...
>>>> Compiling XrdCmsNode.cc
>>>> XrdCmsNode.cc: In member function `const char*
>>>> XrdCmsNode::do_Chmod(XrdCmsRRData&)':
>>>> XrdCmsNode.cc:268: error: `fsExec' was not declared in this scope
>>>> XrdCmsNode.cc:268: warning: unused variable 'fsExec'
>>>> XrdCmsNode.cc:269: error: 'class XrdCmsConfig' has no member named
>>>> 'ossFS'
>>>> XrdCmsNode.cc:273: error: `fsFail' was not declared in this scope
>>>> XrdCmsNode.cc:273: warning: unused variable 'fsFail'
>>>> XrdCmsNode.cc: In member function `const char*
>>>> XrdCmsNode::do_Mkdir(XrdCmsRRData&)':
>>>> XrdCmsNode.cc:600: error: `fsExec' was not declared in this scope
>>>> XrdCmsNode.cc:600: warning: unused variable 'fsExec'
>>>> XrdCmsNode.cc:601: error: 'class XrdCmsConfig' has no member named
>>>> 'ossFS'
>>>> XrdCmsNode.cc:605: error: `fsFail' was not declared in this scope
>>>> XrdCmsNode.cc:605: warning: unused variable 'fsFail'
>>>> XrdCmsNode.cc: In member function `const char*
>>>> XrdCmsNode::do_Mkpath(XrdCmsRRData&)':
>>>> XrdCmsNode.cc:640: error: `fsExec' was not declared in this scope
>>>> XrdCmsNode.cc:640: warning: unused variable 'fsExec'
>>>> XrdCmsNode.cc:641: error: 'class XrdCmsConfig' has no member named
>>>> 'ossFS'
>>>> XrdCmsNode.cc:645: error: `fsFail' was not declared in this scope
>>>> XrdCmsNode.cc:645: warning: unused variable 'fsFail'
>>>> XrdCmsNode.cc: In member function `const char*
>>>> XrdCmsNode::do_Mv(XrdCmsRRData&)':
>>>> XrdCmsNode.cc:704: error: `fsExec' was not declared in this scope
>>>> XrdCmsNode.cc:704: warning: unused variable 'fsExec'
>>>> XrdCmsNode.cc:705: error: 'class XrdCmsConfig' has no member named
>>>> 'ossFS'
>>>> XrdCmsNode.cc:709: error: `fsFail' was not declared in this scope
>>>> XrdCmsNode.cc:709: warning: unused variable 'fsFail'
>>>> XrdCmsNode.cc: In member function `const char*
>>>> XrdCmsNode::do_Rm(XrdCmsRRData&)':
>>>> XrdCmsNode.cc:831: error: `fsExec' was not declared in this scope
>>>> XrdCmsNode.cc:831: warning: unused variable 'fsExec'
>>>> XrdCmsNode.cc:832: error: 'class XrdCmsConfig' has no member named
>>>> 'ossFS'
>>>> XrdCmsNode.cc:836: error: `fsFail' was not declared in this scope
>>>> XrdCmsNode.cc:836: warning: unused variable 'fsFail'
>>>> XrdCmsNode.cc: In member function `const char*
>>>> XrdCmsNode::do_Rmdir(XrdCmsRRData&)':
>>>> XrdCmsNode.cc:873: error: `fsExec' was not declared in this scope
>>>> XrdCmsNode.cc:873: warning: unused variable 'fsExec'
>>>> XrdCmsNode.cc:874: error: 'class XrdCmsConfig' has no member named
>>>> 'ossFS'
>>>> XrdCmsNode.cc:878: error: `fsFail' was not declared in this scope
>>>> XrdCmsNode.cc:878: warning: unused variable 'fsFail'
>>>> XrdCmsNode.cc: In member function `const char*
>>>> XrdCmsNode::do_Trunc(XrdCmsRRData&)':
>>>> XrdCmsNode.cc:1377: error: `fsExec' was not declared in this scope
>>>> XrdCmsNode.cc:1377: warning: unused variable 'fsExec'
>>>> XrdCmsNode.cc:1378: error: 'class XrdCmsConfig' has no member named
>>>> 'ossFS'
>>>> XrdCmsNode.cc:1382: error: `fsFail' was not declared in this scope
>>>> XrdCmsNode.cc:1382: warning: unused variable 'fsFail'
>>>> XrdCmsNode.cc: At global scope:
>>>> XrdCmsNode.cc:1524: error: no `int XrdCmsNode::fsExec(XrdOucProg*,
>>>> char*, char*)' member function declared in class `XrdCmsNode'
>>>> XrdCmsNode.cc: In member function `int XrdCmsNode::fsExec(XrdOucProg*,
>>>> char*, char*)':
>>>> XrdCmsNode.cc:1533: error: `fsL2PFail1' was not declared in this scope
>>>> XrdCmsNode.cc:1533: warning: unused variable 'fsL2PFail1'
>>>> XrdCmsNode.cc:1537: error: `fsL2PFail2' was not declared in this scope
>>>> XrdCmsNode.cc:1537: warning: unused variable 'fsL2PFail2'
>>>> XrdCmsNode.cc: At global scope:
>>>> XrdCmsNode.cc:1553: error: no `const char* XrdCmsNode::fsFail(const
>>>> char*, const char*, const char*, int)' member function declared in
>>>> class `XrdCmsNode'
>>>> XrdCmsNode.cc: In member function `const char*
>>>> XrdCmsNode::fsFail(const char*, const char*, const char*, int)':
>>>> XrdCmsNode.cc:1559: error: `fsL2PFail1' was not declared in this scope
>>>> XrdCmsNode.cc:1559: warning: unused variable 'fsL2PFail1'
>>>> XrdCmsNode.cc:1560: error: `fsL2PFail2' was not declared in this scope
>>>> XrdCmsNode.cc:1560: warning: unused variable 'fsL2PFail2'
>>>> XrdCmsNode.cc: In static member function `static int
>>>> XrdCmsNode::isOnline(char*, int)':
>>>> XrdCmsNode.cc:1608: error: 'class XrdCmsConfig' has no member named
>>>> 'ossFS'
>>>> make[4]: *** [../../obj/XrdCmsNode.o] Error 1
>>>> make[3]: *** [Linuxall] Error 2
>>>> make[2]: *** [all] Error 2
>>>> make[1]: *** [XrdCms] Error 2
>>>> make: *** [all] Error 2
>>>>
>>>>
>>>> Wen
>>>>
>>>> On Tue, Dec 15, 2009 at 2:08 AM, Andrew Hanushevsky <[log in to unmask]>
>>>> wrote:
>>>>>
>>>>> Hi Wen,
>>>>>
>>>>> I have developed a permanent fix. You will find the source files in
>>>>>
>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>
>>>>> There are three files: XrdCmsCluster.cc XrdCmsNode.cc XrdCmsProtocol.cc
>>>>>
>>>>> Please do a source replacement and recompile. Unfortunately, the cmsd will
>>>>> need to be replaced on each node regardless of role. My apologies for the
>>>>> disruption. Please let me know how it goes.
>>>>>
>>>>> Andy
>>>>>
>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>> Cc: <[log in to unmask]>
>>>>> Sent: Sunday, December 13, 2009 7:04 AM
>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>
>>>>>
>>>>> Hi Andrew,
>>>>>
>>>>>
>>>>> Thanks.
>>>>> I used the new cmsd on the atlas-bkp1 manager, but it's still dropping
>>>>> nodes, and in the supervisor's log I cannot see any data server
>>>>> registering with it.
>>>>>
>>>>> The new logs are in http://higgs03.cs.wisc.edu/wguan/*.20091213.
>>>>> The manager was patched at 091213 08:38:15.
>>>>>
>>>>> Wen
>>>>>
>>>>> On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky
>>>>> <[log in to unmask]> wrote:
>>>>>>
>>>>>> Hi Wen
>>>>>>
>>>>>> You will find the source replacement at:
>>>>>>
>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>
>>>>>> It's XrdCmsCluster.cc and it replaces
>>>>>> xrootd/src/XrdCms/XrdCmsCluster.cc
>>>>>>
>>>>>> I'm stepping out for a couple of hours but will be back to see how
>>>>>> things
>>>>>> went. Sorry for the issues :-(
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> I prefer a source replacement. Then I can compile it.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Wen
>>>>>>>>
>>>>>>>> I can do one of two things here:
>>>>>>>>
>>>>>>>> 1) Supply a source replacement and then you would recompile, or
>>>>>>>>
>>>>>>>> 2) Give me the uname -a of where the cmsd will run and I'll supply a
>>>>>>>> binary
>>>>>>>> replacement for you.
>>>>>>>>
>>>>>>>> Your choice.
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>
>>>>>>>>> Hi Andrew
>>>>>>>>>
>>>>>>>>> The problem is found. Great. Thanks.
>>>>>>>>>
>>>>>>>>> Where can I find the patched cmsd?
>>>>>>>>>
>>>>>>>>> Wen
>>>>>>>>>
>>>>>>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky
>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Wen,
>>>>>>>>>>
>>>>>>>>>> I found the problem. Looks like a regression from way back when. There is
>>>>>>>>>> a missing flag on the redirect. This will require a patched cmsd but you
>>>>>>>>>> need only to replace the redirector's cmsd as this only affects the
>>>>>>>>>> redirector. How would you like to proceed?
>>>>>>>>>>
>>>>>>>>>> Andy
>>>>>>>>>>
>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>
>>>>>>>>>>> It doesn't work. The atlas-bkp1 manager is still dropping nodes.
>>>>>>>>>>> In the supervisor, I still haven't seen any data server register. I
>>>>>>>>>>> said "I updated the ntp" because you said "the log timestamps do not
>>>>>>>>>>> overlap".
>>>>>>>>>>>
>>>>>>>>>>> Wen
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky
>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>
>>>>>>>>>>>> Do you mean that everything is now working? It could be that you removed
>>>>>>>>>>>> the xrd.timeout directive. That really could cause problems. As for the
>>>>>>>>>>>> delays, that is normal when the redirector thinks something is going
>>>>>>>>>>>> wrong. The strategy is to delay clients until it can get back to a stable
>>>>>>>>>>>> configuration. This usually prevents jobs from crashing during stressful
>>>>>>>>>>>> periods.
>>>>>>>>>>>>
>>>>>>>>>>>> Andy
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I restarted it to test the supervisor, and also because the xrootd
>>>>>>>>>>>>> manager frequently doesn't respond. (*) is the cms.log; the file select
>>>>>>>>>>>>> is delayed again and again. After a restart, everything is fine. Now I
>>>>>>>>>>>>> am trying to find a clue about it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> (*)
>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: wc /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>> 091212 00:00:19 21318 Select seeking /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] do_Select: delay 5 /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0
>>>>>>>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 for select dlen=166
>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0
>>>>>>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> There is no core file. I copied new copies of the logs to the link
>>>>>>>>>>>>> below.
>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>
>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky
>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see in the server log that it is restarting often. Could you take a
>>>>>>>>>>>>>> look on c193 to see if you have any core files? Also please make sure
>>>>>>>>>>>>>> that core files are enabled, as Linux defaults the size to 0. The first
>>>>>>>>>>>>>> step here is to find out why your servers are restarting.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The logs can be found here. From the log you can see that the
>>>>>>>>>>>>>>> atlas-bkp1 manager is dropping, again and again, the nodes that try
>>>>>>>>>>>>>>> to connect to it.
>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Wen, Could you start everything up and provide me a pointer to the
>>>>>>>>>>>>>>>> manager log file, supervisor log file, and one data server log file, all
>>>>>>>>>>>>>>>> of which cover the same time-frame (from start to some point where you
>>>>>>>>>>>>>>>> think things are working or not). That way I can see what is happening.
>>>>>>>>>>>>>>>> At the moment I only see two "bad" things in the config file:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager but you claim,
>>>>>>>>>>>>>>>> via the all.manager directive, that there are three (bkp2 and bkp3).
>>>>>>>>>>>>>>>> While it should work, the log file will be dense with error messages.
>>>>>>>>>>>>>>>> Please correct this to be consistent and make it easier to see real
>>>>>>>>>>>>>>>> errors.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is not a problem for me, because this config is used on the data
>>>>>>>>>>>>>>> servers. On the managers, I updated the "if atlas-bkp1.cs.wisc.edu" to
>>>>>>>>>>>>>>> atlas-bkp2 and so on. This is a historical issue: at first only
>>>>>>>>>>>>>>> atlas-bkp1 was used; atlas-bkp2 and atlas-bkp3 were added later.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2) Please use cms.space not olb.space (for historical reasons the latter
>>>>>>>>>>>>>>>> is still accepted and over-rides the former, but that will soon end), and
>>>>>>>>>>>>>>>> please use only one (the config file uses both directives).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, I should remove this line. In fact cms.space is in the cfg too.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The xrootd has an internal mechanism to connect servers with supervisors
>>>>>>>>>>>>>>>> to allow for maximum reliability. You cannot change that algorithm and
>>>>>>>>>>>>>>>> there is no need to do so. You should *never* tell anyone to directly
>>>>>>>>>>>>>>>> connect to a supervisor. If you do, you will likely get unreachable nodes.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As for dropping data servers, it would appear to me, given the flurry of
>>>>>>>>>>>>>>>> such activity, that something either crashed or was restarted. That's why
>>>>>>>>>>>>>>>> it would be good to see the complete log of each one of the entities.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I read the document and wrote a config
>>>>>>>>>>>>>>>>> file (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>>>>>>>>>>> With my conf, I can see the manager dispatching messages to the
>>>>>>>>>>>>>>>>> supervisor, but I cannot see any data server trying to connect to
>>>>>>>>>>>>>>>>> the supervisor. At the same time, in the manager's log, I can see
>>>>>>>>>>>>>>>>> that some data servers are dropped.
>>>>>>>>>>>>>>>>> How does xrootd decide which data servers will connect to the
>>>>>>>>>>>>>>>>> supervisor? Should I specify some data servers to connect to the
>>>>>>>>>>>>>>>>> supervisor?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (*) supervisor log
>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for
>>>>>>>>>>>>>>>>> state
>>>>>>>>>>>>>>>>> dlen=42
>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State:
>>>>>>>>>>>>>>>>> /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD:
>>>>>>>>>>>>>>>>> Path
>>>>>>>>>>>>>>>>> find
>>>>>>>>>>>>>>>>> failed for state /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (*)manager log
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu
>>>>>>>>>>>>>>>>> TSpace=5587GB
>>>>>>>>>>>>>>>>> NumFS=1
>>>>>>>>>>>>>>>>> FSpace=5693644MB MinFR=57218MB Util=0
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path:
>>>>>>>>>>>>>>>>> w
>>>>>>>>>>>>>>>>> /atlas
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661
>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094
>>>>>>>>>>>>>>>>> do_Space: 5696231MB free; 0% util
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from
>>>>>>>>>>>>>>>>> [log in to unmask]
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running
>>>>>>>>>>>>>>>>> ?:[log in to unmask]
>>>>>>>>>>>>>>>>> inq=0
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD
>>>>>>>>>>>>>>>>> 79
>>>>>>>>>>>>>>>>> attached
>>>>>>>>>>>>>>>>> to poller 2; num=22
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add
>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]
>>>>>>>>>>>>>>>>> bumps
>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 #63
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node:
>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 dropped.
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved
>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 to cluster;
>>>>>>>>>>>>>>>>> id=63.78;
>>>>>>>>>>>>>>>>> num=64;
>>>>>>>>>>>>>>>>> min=51
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu
>>>>>>>>>>>>>>>>> TSpace=5587GB
>>>>>>>>>>>>>>>>> NumFS=1
>>>>>>>>>>>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path:
>>>>>>>>>>>>>>>>> w
>>>>>>>>>>>>>>>>> /atlas
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661
>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094
>>>>>>>>>>>>>>>>> do_Space: 5721854MB free; 0% util
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from
>>>>>>>>>>>>>>>>> c187.chtc.wisc.edu; connection reset by peer
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60
>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node
>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 node 63.78
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]
>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask]
>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>> 79 detached from poller 2; num=21
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094
>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094
>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service
>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>> c177.chtc.wisc.edu
>>>>>>>>>>>>>>>>> FD=16
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 node 0.3
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>> server.21656:[log in to unmask]
>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.21656:[log in to unmask]
>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>> 16 detached from poller 2; num=20
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>> c179.chtc.wisc.edu
>>>>>>>>>>>>>>>>> FD=21
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>> server.17065:[log in to unmask]:1094 node 1.4
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>> server.7978:[log in to unmask]
>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask]
>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>> 21
>>>>>>>>>>>>>>>>> detached from poller 1; num=21
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to suspended
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to
>>>>>>>>>>>>>>>>> redirector.15656:14@atlas-bkp2
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094
>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094
>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service
>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>> c182.chtc.wisc.edu
>>>>>>>>>>>>>>>>> FD=19
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 node 7.10
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>> server.26620:[log in to unmask]
>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.26620:[log in to unmask]
>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>> 19 detached from poller 2; num=19
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094
>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094
>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service
>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>> c178.chtc.wisc.edu
>>>>>>>>>>>>>>>>> FD=15
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 node 9.12
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>> server.11901:[log in to unmask]
>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.11901:[log in to unmask]
>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>> 15 detached from poller 1; num=20
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094
>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094
>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service
>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>> c181.chtc.wisc.edu
>>>>>>>>>>>>>>>>> FD=17
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 node 5.8
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>> server.13984:[log in to unmask]
>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.13984:[log in to unmask]
>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>> 17 detached from poller 0; num=21
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094
>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094
>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service
>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>> c183.chtc.wisc.edu
>>>>>>>>>>>>>>>>> FD=22
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 node 8.11
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>> server.27735:[log in to unmask]
>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.27735:[log in to unmask]
>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>> 22 detached from poller 2; num=18
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>> c184.chtc.wisc.edu
>>>>>>>>>>>>>>>>> FD=20
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>> server.4131:[log in to unmask]:1094 node 3.6
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>> server.26787:[log in to unmask]
>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.26787:[log in to unmask]
>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>> 20 detached from poller 0; num=20
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094
>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094
>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service
>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>> c185.chtc.wisc.edu
>>>>>>>>>>>>>>>>> FD=23
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 node 6.9
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>> server.8524:[log in to unmask]
>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask]
>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>> 23
>>>>>>>>>>>>>>>>> detached from poller 0; num=19
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094
>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094
>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service
>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>> c180.chtc.wisc.edu
>>>>>>>>>>>>>>>>> FD=18
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 node 4.7
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>> server.14636:[log in to unmask]
>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.14636:[log in to unmask]
>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>> 18 detached from poller 1; num=19
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094
>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094
>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service
>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from
>>>>>>>>>>>>>>>>> c186.chtc.wisc.edu
>>>>>>>>>>>>>>>>> FD=24
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 node 2.5
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>> server.7849:[log in to unmask]
>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask]
>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>> 24
>>>>>>>>>>>>>>>>> detached from poller 1; num=18
>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13
>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky
>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> To go past 64 data servers you will need to setup one or more
>>>>>>>>>>>>>>>>>> supervisors. This does not logically change the current configuration
>>>>>>>>>>>>>>>>>> you have. You only need to configure one or more *new* servers (or at
>>>>>>>>>>>>>>>>>> least xrootd processes) whose role is supervisor. We'd like them to run
>>>>>>>>>>>>>>>>>> in separate machines for reliability purposes, but they could run on the
>>>>>>>>>>>>>>>>>> manager node as long as you give each one a unique instance name (i.e.,
>>>>>>>>>>>>>>>>>> -n option).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The front part of the cmsd reference explains how to do this.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
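For illustration, a minimal sketch of the kind of role assignment described above, using the standard all.role/all.manager directives from the cms_config reference; the supervisor hostname (atlas-sup1.cs.wisc.edu) and the port are placeholders, not values taken from this thread:

    # Sketch only: one shared config file; the role is selected per host.
    all.manager atlas-bkp1.cs.wisc.edu 3121    # cluster manager (port is a placeholder)
    all.export  /atlas

    if atlas-bkp1.cs.wisc.edu
       all.role manager
    else if atlas-sup1.cs.wisc.edu             # hypothetical node added as a supervisor
       all.role supervisor
    else
       all.role server                         # everyone else stays a data server
    fi

If a supervisor has to share a machine with the manager, it would be started as a separate instance with its own name, e.g. "cmsd -n sup1 -c xrdcluster.cfg" and "xrootd -n sup1 -c xrdcluster.cfg", which is the -n option mentioned above.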
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Is there a way to configure xrootd with more than 65
>>>>>>>>>>>>>>>>>>> machines? I used the configuration below but it doesn't work.
>>>>>>>>>>>>>>>>>>> Should I configure some machines' manager to be a supervisor?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>