Hi Andy,

> OK, I understand. As for stalling, too many nodes were deemed to be in
> trouble for the manager to allow service resumption.
>
> Please make sure that all of the nodes in the cluster receive the new cmsd
> as they will drop off with the old one and you'll see the same kind of
> activity. Perhaps the best way to know that you succeeded in putting
> everything in sync is to start with 63 data nodes plus one supervisor. Once
> all connections are established, adding an additional server should simply
> send it to the supervisor.

I will do it. You said to start 63 data servers and one supervisor. Does
that mean the supervisor is managed using the same policy as a data server?
If there are 64 data servers connected before the supervisor, will the
supervisor be dropped? Or does the supervisor have a higher priority when
being added to the manager? I mean, if there are already 64 data servers and
a supervisor comes in, will the supervisor be accepted and a data server
redirected to the supervisor?

Thanks
Wen

> Hi Andrew,
>
> But when I tried to xrdcp a file to it, it doesn't respond. In
> atlas-bkp1-xrd.log.20091213 it keeps printing "stalling client for 10
> sec", but in cms.log I can't find any message about the file.
>
>> I don't see why you say it doesn't work. With the debugging level set so
>> high the noise may make it look like something is going wrong but that
>> isn't necessarily the case.
>>
>> 1) The 'too many subscribers' is correct. The manager was simply
>> redirecting them because there were already 64 servers. However, in your
>> case the supervisor wasn't started until almost 30 minutes after everyone
>> else (i.e., 10:42 AM). Why was that? I'm not surprised about the flurry
>> of messages with a critical component missing for 30 minutes.
>
> Because the manager is a 64-bit machine but the supervisor is a 32-bit
> machine, so I had to recompile it. At that time, I was interrupted by
> something else.
>
>> 2) Once the supervisor started, it started accepting the redirected
>> servers.
>>
>> 3) Then 10 seconds (10:42:10) later the supervisor was restarted. So, that
>> would cause a flurry of activity to occur as there is no backup supervisor
>> to take over.
>>
>> 4) This happened again at 10:42:34 AM then again at 10:48:49. Is the
>> supervisor crashing? Is there a core file?
>>
>> 5) At 11:11 AM the manager restarted. Again, is there a core file here or
>> was this a manual action?
>>
>> During the course of all of this, all connected nodes were operating
>> properly and files were being located.
>>
>> So, the two big questions are:
>>
>> a) Why was the supervisor not started until 30 minutes after the system
>> was started?
>>
>> b) Is there an explanation of the restarts? If this was a crash then we
>> need a core file to figure out what happened.
>
> It's not a crash. There are a couple of reasons why I restarted some
> daemons.
> (1) I thought that if a data server tried many times to connect to a
> redirector and failed, it would not try to connect to that redirector
> again. The supervisor was missing for a long time, so maybe some data
> servers would no longer try to connect to atlas-bkp1. To reactivate these
> data servers, I restarted the servers.
> (2) When I tried to xrdcp, it hung for a long time. I thought maybe the
> manager was affected by something else, so I restarted the manager to see
> whether a restart would make the xrdcp work.
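For reference, the copy test being described is of this form (a sketch put
together from the host and path names that appear in the logs further down
this thread; the local source file name is purely illustrative):

    xrdcp /tmp/testfile \
        root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test131141

When the redirector thinks the cluster is not in a stable state it delays
clients rather than failing them, which is what the repeated "stalling
client for 10 sec" messages in atlas-bkp1-xrd.log correspond to (see Andy's
note on this further down the thread).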
> > > Thanks > Wen > >> Andy >> >> ----- Original Message ----- From: "wen guan" <[log in to unmask]> >> To: "Andrew Hanushevsky" <[log in to unmask]> >> Cc: <[log in to unmask]> >> Sent: Wednesday, December 16, 2009 9:38 AM >> Subject: Re: xrootd with more than 65 machines >> >> >> Hi Andrew, >> >> It still doesn't work. >> The log file is in higgs03.cs.wisc.edu/wguan/. The name is *.20091216 >> The manager complains there are too many subscribers and the removes >> nodes. >> >> (*) >> Add server.10040:[log in to unmask] redirected; too many subscribers. >> >> Wen >> >> On Wed, Dec 16, 2009 at 4:25 AM, Andrew Hanushevsky <[log in to unmask]> >> wrote: >>> >>> Hi Wen, >>> >>> It will be easier for me to retroft as the changes were pretty minor. >>> Please >>> lift the new XrdCmsNode.cc file from >>> >>> http://www.slac.stanford.edu/~abh/cmsd >>> >>> Andy >>> >>> ----- Original Message ----- From: "wen guan" <[log in to unmask]> >>> To: "Andrew Hanushevsky" <[log in to unmask]> >>> Cc: <[log in to unmask]> >>> Sent: Tuesday, December 15, 2009 5:12 PM >>> Subject: Re: xrootd with more than 65 machines >>> >>> >>> Hi Andy, >>> >>> I can switch to 20091104-1102. Then you don't need to patch >>> another version. How can I download v20091104-1102? >>> >>> Thanks >>> Wen >>> >>> On Wed, Dec 16, 2009 at 12:52 AM, Andrew Hanushevsky <[log in to unmask]> >>> wrote: >>>> >>>> Hi Wen, >>>> >>>> Ah yes, I see that now. The file I gave you is based on v20091104-1102. >>>> Let >>>> me see if I can retrofit the patch for you. >>>> >>>> Andy >>>> >>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]> >>>> To: "Andrew Hanushevsky" <[log in to unmask]> >>>> Cc: <[log in to unmask]> >>>> Sent: Tuesday, December 15, 2009 1:04 PM >>>> Subject: Re: xrootd with more than 65 machines >>>> >>>> >>>> Hi Andy, >>>> >>>> Which xrootd version are you using? XrdCmsConfig.hh is different. >>>> XrdCmsConfig.hh is downloaded from >>>> http://xrootd.slac.stanford.edu/download/20091028-1003/. >>>> >>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsNode.cc >>>> 6fb3ae40fe4e10bdd4d372818a341f2c src/XrdCms/XrdCmsNode.cc >>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsConfig.hh >>>> 7d57753847d9448186c718f98e963cbe src/XrdCms/XrdCmsConfig.hh >>>> >>>> Thanks >>>> Wen >>>> >>>> On Tue, Dec 15, 2009 at 10:50 PM, Andrew Hanushevsky <[log in to unmask]> >>>> wrote: >>>>> >>>>> Hi Wen, >>>>> >>>>> Just compiled on Linux and it was clean. Something is really wrong with >>>>> your >>>>> source files, specifically XrdCmsConfig.cc >>>>> >>>>> The MD5 checksums on the relevant files are: >>>>> >>>>> MD5 (XrdCmsNode.cc) = 6fb3ae40fe4e10bdd4d372818a341f2c >>>>> >>>>> MD5 (XrdCmsConfig.hh) = 4a7d655582a7cd43b098947d0676924b >>>>> >>>>> Andy >>>>> >>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]> >>>>> To: "Andrew Hanushevsky" <[log in to unmask]> >>>>> Cc: <[log in to unmask]> >>>>> Sent: Tuesday, December 15, 2009 4:24 AM >>>>> Subject: Re: xrootd with more than 65 machines >>>>> >>>>> >>>>> Hi Andy, >>>>> >>>>> No problem. Thanks for the fix. But it cannot be compiled. The >>>>> version I am using is >>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/. >>>>> >>>>> Making cms component... 
>>>>> Compiling XrdCmsNode.cc >>>>> XrdCmsNode.cc: In member function `const char* >>>>> XrdCmsNode::do_Chmod(XrdCmsRRData&)': >>>>> XrdCmsNode.cc:268: error: `fsExec' was not declared in this scope >>>>> XrdCmsNode.cc:268: warning: unused variable 'fsExec' >>>>> XrdCmsNode.cc:269: error: 'class XrdCmsConfig' has no member named >>>>> 'ossFS' >>>>> XrdCmsNode.cc:273: error: `fsFail' was not declared in this scope >>>>> XrdCmsNode.cc:273: warning: unused variable 'fsFail' >>>>> XrdCmsNode.cc: In member function `const char* >>>>> XrdCmsNode::do_Mkdir(XrdCmsRRData&)': >>>>> XrdCmsNode.cc:600: error: `fsExec' was not declared in this scope >>>>> XrdCmsNode.cc:600: warning: unused variable 'fsExec' >>>>> XrdCmsNode.cc:601: error: 'class XrdCmsConfig' has no member named >>>>> 'ossFS' >>>>> XrdCmsNode.cc:605: error: `fsFail' was not declared in this scope >>>>> XrdCmsNode.cc:605: warning: unused variable 'fsFail' >>>>> XrdCmsNode.cc: In member function `const char* >>>>> XrdCmsNode::do_Mkpath(XrdCmsRRData&)': >>>>> XrdCmsNode.cc:640: error: `fsExec' was not declared in this scope >>>>> XrdCmsNode.cc:640: warning: unused variable 'fsExec' >>>>> XrdCmsNode.cc:641: error: 'class XrdCmsConfig' has no member named >>>>> 'ossFS' >>>>> XrdCmsNode.cc:645: error: `fsFail' was not declared in this scope >>>>> XrdCmsNode.cc:645: warning: unused variable 'fsFail' >>>>> XrdCmsNode.cc: In member function `const char* >>>>> XrdCmsNode::do_Mv(XrdCmsRRData&)': >>>>> XrdCmsNode.cc:704: error: `fsExec' was not declared in this scope >>>>> XrdCmsNode.cc:704: warning: unused variable 'fsExec' >>>>> XrdCmsNode.cc:705: error: 'class XrdCmsConfig' has no member named >>>>> 'ossFS' >>>>> XrdCmsNode.cc:709: error: `fsFail' was not declared in this scope >>>>> XrdCmsNode.cc:709: warning: unused variable 'fsFail' >>>>> XrdCmsNode.cc: In member function `const char* >>>>> XrdCmsNode::do_Rm(XrdCmsRRData&)': >>>>> XrdCmsNode.cc:831: error: `fsExec' was not declared in this scope >>>>> XrdCmsNode.cc:831: warning: unused variable 'fsExec' >>>>> XrdCmsNode.cc:832: error: 'class XrdCmsConfig' has no member named >>>>> 'ossFS' >>>>> XrdCmsNode.cc:836: error: `fsFail' was not declared in this scope >>>>> XrdCmsNode.cc:836: warning: unused variable 'fsFail' >>>>> XrdCmsNode.cc: In member function `const char* >>>>> XrdCmsNode::do_Rmdir(XrdCmsRRData&)': >>>>> XrdCmsNode.cc:873: error: `fsExec' was not declared in this scope >>>>> XrdCmsNode.cc:873: warning: unused variable 'fsExec' >>>>> XrdCmsNode.cc:874: error: 'class XrdCmsConfig' has no member named >>>>> 'ossFS' >>>>> XrdCmsNode.cc:878: error: `fsFail' was not declared in this scope >>>>> XrdCmsNode.cc:878: warning: unused variable 'fsFail' >>>>> XrdCmsNode.cc: In member function `const char* >>>>> XrdCmsNode::do_Trunc(XrdCmsRRData&)': >>>>> XrdCmsNode.cc:1377: error: `fsExec' was not declared in this scope >>>>> XrdCmsNode.cc:1377: warning: unused variable 'fsExec' >>>>> XrdCmsNode.cc:1378: error: 'class XrdCmsConfig' has no member named >>>>> 'ossFS' >>>>> XrdCmsNode.cc:1382: error: `fsFail' was not declared in this scope >>>>> XrdCmsNode.cc:1382: warning: unused variable 'fsFail' >>>>> XrdCmsNode.cc: At global scope: >>>>> XrdCmsNode.cc:1524: error: no `int XrdCmsNode::fsExec(XrdOucProg*, >>>>> char*, char*)' member function declared in class `XrdCmsNode' >>>>> XrdCmsNode.cc: In member function `int XrdCmsNode::fsExec(XrdOucProg*, >>>>> char*, char*)': >>>>> XrdCmsNode.cc:1533: error: `fsL2PFail1' was not declared in this scope >>>>> XrdCmsNode.cc:1533: warning: unused 
variable 'fsL2PFail1' >>>>> XrdCmsNode.cc:1537: error: `fsL2PFail2' was not declared in this scope >>>>> XrdCmsNode.cc:1537: warning: unused variable 'fsL2PFail2' >>>>> XrdCmsNode.cc: At global scope: >>>>> XrdCmsNode.cc:1553: error: no `const char* XrdCmsNode::fsFail(const >>>>> char*, const char*, const char*, int)' member function declared in >>>>> class `XrdCmsNode' >>>>> XrdCmsNode.cc: In member function `const char* >>>>> XrdCmsNode::fsFail(const char*, const char*, const char*, int)': >>>>> XrdCmsNode.cc:1559: error: `fsL2PFail1' was not declared in this scope >>>>> XrdCmsNode.cc:1559: warning: unused variable 'fsL2PFail1' >>>>> XrdCmsNode.cc:1560: error: `fsL2PFail2' was not declared in this scope >>>>> XrdCmsNode.cc:1560: warning: unused variable 'fsL2PFail2' >>>>> XrdCmsNode.cc: In static member function `static int >>>>> XrdCmsNode::isOnline(char*, int)': >>>>> XrdCmsNode.cc:1608: error: 'class XrdCmsConfig' has no member named >>>>> 'ossFS' >>>>> make[4]: *** [../../obj/XrdCmsNode.o] Error 1 >>>>> make[3]: *** [Linuxall] Error 2 >>>>> make[2]: *** [all] Error 2 >>>>> make[1]: *** [XrdCms] Error 2 >>>>> make: *** [all] Error 2 >>>>> >>>>> >>>>> Wen >>>>> >>>>> On Tue, Dec 15, 2009 at 2:08 AM, Andrew Hanushevsky <[log in to unmask]> >>>>> wrote: >>>>>> >>>>>> Hi Wen, >>>>>> >>>>>> I have developed a permanent fix. You will find the source files in >>>>>> >>>>>> http://www.slac.stanford.edu/~abh/cmsd/ >>>>>> >>>>>> There are three files: XrdCmsCluster.cc XrdCmsNode.cc >>>>>> XrdCmsProtocol.cc >>>>>> >>>>>> Please do a source replacement and recompile. Unfortunately, the cmsd >>>>>> will >>>>>> need to be replaced on each node regardless of role. My apologies for >>>>>> the >>>>>> disruption. Please let me know how it goes. >>>>>> >>>>>> Andy >>>>>> >>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]> >>>>>> To: "Andrew Hanushevsky" <[log in to unmask]> >>>>>> Cc: <[log in to unmask]> >>>>>> Sent: Sunday, December 13, 2009 7:04 AM >>>>>> Subject: Re: xrootd with more than 65 machines >>>>>> >>>>>> >>>>>> Hi Andrew, >>>>>> >>>>>> >>>>>> Thanks. >>>>>> I used the new cmsd at atlas-bkp1 manager. But it's still dropping >>>>>> nodes. And in supervisor's log, I cannot find any dataserver to >>>>>> register to it. >>>>>> >>>>>> The new logs are in http://higgs03.cs.wisc.edu/wguan/*.20091213. >>>>>> The manager is patched at 091213 08:38:15. >>>>>> >>>>>> Wen >>>>>> >>>>>> On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky >>>>>> <[log in to unmask]> wrote: >>>>>>> >>>>>>> Hi Wen >>>>>>> >>>>>>> You will find the source replacement at: >>>>>>> >>>>>>> http://www.slac.stanford.edu/~abh/cmsd/ >>>>>>> >>>>>>> It's XrdCmsCluster.cc and it replaces >>>>>>> xrootd/src/XrdCms/XrdCmsCluster.cc >>>>>>> >>>>>>> I'm stepping out for a couple of hours but will be back to see how >>>>>>> things >>>>>>> went. Sorry for the issues :-( >>>>>>> >>>>>>> Andy >>>>>>> >>>>>>> On Sun, 13 Dec 2009, wen guan wrote: >>>>>>> >>>>>>>> Hi Andrew, >>>>>>>> >>>>>>>> I prefer a source replacement. Then I can compile it. >>>>>>>> >>>>>>>> Thanks >>>>>>>> Wen >>>>>>>>> >>>>>>>>> I can do one of two things here: >>>>>>>>> >>>>>>>>> 1) Supply a source replacement and then you would recompile, or >>>>>>>>> >>>>>>>>> 2) Give me the uname -a of where the cmsd will run and I'll supply >>>>>>>>> a >>>>>>>>> binary >>>>>>>>> replacement for you. >>>>>>>>> >>>>>>>>> Your choice. 
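The source-replacement route (option 1 above) comes down to dropping the
replacement files from http://www.slac.stanford.edu/~abh/cmsd/ into the
existing source tree and rebuilding. A rough sketch, assuming a tree that
has already been configured and built once; the exact file list depends on
which fix is being picked up, and, as the compile errors quoted above show,
the replacements must match the installed source version:

    # sketch only: fetch the replacement sources into the cms component
    cd xrootd/src/XrdCms
    wget http://www.slac.stanford.edu/~abh/cmsd/XrdCmsCluster.cc
    wget http://www.slac.stanford.edu/~abh/cmsd/XrdCmsNode.cc
    wget http://www.slac.stanford.edu/~abh/cmsd/XrdCmsProtocol.cc
    cd ../..
    make
    # then restart the rebuilt cmsd on every node that needs it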
>>>>>>>>> >>>>>>>>> Andy >>>>>>>>> >>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote: >>>>>>>>> >>>>>>>>>> Hi Andrew >>>>>>>>>> >>>>>>>>>> The problem is found. Great. Thanks. >>>>>>>>>> >>>>>>>>>> Where can I find the patched cmsd? >>>>>>>>>> >>>>>>>>>> Wen >>>>>>>>>> >>>>>>>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky >>>>>>>>>> <[log in to unmask]> wrote: >>>>>>>>>>> >>>>>>>>>>> Hi Wen, >>>>>>>>>>> >>>>>>>>>>> I found the problem. Looks like a regression from way back when. >>>>>>>>>>> There >>>>>>>>>>> is >>>>>>>>>>> a >>>>>>>>>>> missing flag on the redirect. This will require a patched cmsd >>>>>>>>>>> but >>>>>>>>>>> you >>>>>>>>>>> need >>>>>>>>>>> only to replace the redirector's cmsd as this only affects the >>>>>>>>>>> redirector. >>>>>>>>>>> How would you like to proceed? >>>>>>>>>>> >>>>>>>>>>> Andy >>>>>>>>>>> >>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>> >>>>>>>>>>>> It doesn't work. atlas-bkp1 manager still dropping nodes again. >>>>>>>>>>>> In supervisor, I still haven't seen any dataserver registered. I >>>>>>>>>>>> said >>>>>>>>>>>> "I updated the ntp" because you said "the log timestamp do not >>>>>>>>>>>> overlap". >>>>>>>>>>>> >>>>>>>>>>>> Wen >>>>>>>>>>>> >>>>>>>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky >>>>>>>>>>>> <[log in to unmask]> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Hi Wen, >>>>>>>>>>>>> >>>>>>>>>>>>> Do you mean that everything is now working? It could be that >>>>>>>>>>>>> you >>>>>>>>>>>>> removed >>>>>>>>>>>>> the >>>>>>>>>>>>> xrd.timeout directive. That really could cause problems. As for >>>>>>>>>>>>> the >>>>>>>>>>>>> delays, >>>>>>>>>>>>> that is normal when the redirector thinks something is going >>>>>>>>>>>>> wrong. >>>>>>>>>>>>> The >>>>>>>>>>>>> strategy is to delay clients until it can get back to a stable >>>>>>>>>>>>> configuration. This usually prevents jobs from crashing during >>>>>>>>>>>>> stressful >>>>>>>>>>>>> periods. >>>>>>>>>>>>> >>>>>>>>>>>>> Andy >>>>>>>>>>>>> >>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>>>> >>>>>>>>>>>>>> I restarted it to do supervisor test. Also because xrootd >>>>>>>>>>>>>> manager >>>>>>>>>>>>>> frequently doesn't response. (*) is the cms.log, the file >>>>>>>>>>>>>> select >>>>>>>>>>>>>> is >>>>>>>>>>>>>> delayed again and again. When do a restart, all things are >>>>>>>>>>>>>> fine. >>>>>>>>>>>>>> Now >>>>>>>>>>>>>> I >>>>>>>>>>>>>> am trying to find a clue about it. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> (*) >>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] >>>>>>>>>>>>>> do_Select: >>>>>>>>>>>>>> wc >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2 >>>>>>>>>>>>>> 091212 00:00:19 21318 Select seeking >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2 >>>>>>>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1 >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2 >>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask] >>>>>>>>>>>>>> do_Select: >>>>>>>>>>>>>> delay 5 >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2 >>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1 post=0 >>>>>>>>>>>>>> 091212 00:00:19 21318 Dispatch redirector.21313:14@atlas-bkp2 >>>>>>>>>>>>>> for >>>>>>>>>>>>>> select dlen=166 >>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1 post=0 >>>>>>>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0 >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> There is no core file. I copied a new copies of the logs to >>>>>>>>>>>>>> the >>>>>>>>>>>>>> link >>>>>>>>>>>>>> below. >>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> Wen >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky >>>>>>>>>>>>>> <[log in to unmask]> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Wen, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I see in the server log that it is restarting often. Could >>>>>>>>>>>>>>> you >>>>>>>>>>>>>>> take >>>>>>>>>>>>>>> a >>>>>>>>>>>>>>> look >>>>>>>>>>>>>>> in the c193 to see if you have any core files? Also please >>>>>>>>>>>>>>> make >>>>>>>>>>>>>>> sure >>>>>>>>>>>>>>> that >>>>>>>>>>>>>>> core files are enabled as Linux defaults the size to 0. The >>>>>>>>>>>>>>> first >>>>>>>>>>>>>>> step >>>>>>>>>>>>>>> here >>>>>>>>>>>>>>> is to find out why your servers are restarting. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Andy >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> the logs can be found here. From the log you can see >>>>>>>>>>>>>>>> atlas-bkp1 >>>>>>>>>>>>>>>> manager are dropping nodes again and again which tries to >>>>>>>>>>>>>>>> connect >>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>> it. 
>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/ >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky >>>>>>>>>>>>>>>> <[log in to unmask]> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Wen, Could you start everything up and provide me a >>>>>>>>>>>>>>>>> pointer >>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> manager log file, supervisor log file, and one data server >>>>>>>>>>>>>>>>> logfile >>>>>>>>>>>>>>>>> all >>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>> which cover the same time-frame (from start to some point >>>>>>>>>>>>>>>>> where >>>>>>>>>>>>>>>>> you >>>>>>>>>>>>>>>>> think >>>>>>>>>>>>>>>>> things are working or not). That way I can see what is >>>>>>>>>>>>>>>>> happening. >>>>>>>>>>>>>>>>> At >>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> moment I only see two "bad" things in the config file: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager >>>>>>>>>>>>>>>>> but >>>>>>>>>>>>>>>>> you >>>>>>>>>>>>>>>>> claim, >>>>>>>>>>>>>>>>> via >>>>>>>>>>>>>>>>> the all.manager directive, that there are three (bkp2 and >>>>>>>>>>>>>>>>> bkp3). >>>>>>>>>>>>>>>>> While >>>>>>>>>>>>>>>>> it >>>>>>>>>>>>>>>>> should work, the log file will be dense with error >>>>>>>>>>>>>>>>> messages. >>>>>>>>>>>>>>>>> Please >>>>>>>>>>>>>>>>> correct >>>>>>>>>>>>>>>>> this to be consistent and make it easier to see real >>>>>>>>>>>>>>>>> errors. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This is not a problem for me. Because this config is used in >>>>>>>>>>>>>>>> dataserver. In manager, I updated the if >>>>>>>>>>>>>>>> atlas-bkp1.cs.wisc.edu >>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>> atlas-bkp2 or something. This is a history problem. at first >>>>>>>>>>>>>>>> only >>>>>>>>>>>>>>>> atlas-bkp1 is used. atlas-bkp2 and atlas-bkp3 are added >>>>>>>>>>>>>>>> later. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 2) Please use cms.space not olb.space (for historical >>>>>>>>>>>>>>>>> reasons >>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> latter >>>>>>>>>>>>>>>>> is >>>>>>>>>>>>>>>>> still accepted and over-rides the former, but that will >>>>>>>>>>>>>>>>> soon >>>>>>>>>>>>>>>>> end), >>>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>> please use only one (the config file uses both directives). >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> yes. I should remove this line. in fact cms.space is in the >>>>>>>>>>>>>>>> cfg >>>>>>>>>>>>>>>> too. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>> Wen >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> The xrootd has an internal mechanism to connect servers >>>>>>>>>>>>>>>>> with >>>>>>>>>>>>>>>>> supervisors >>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>> allow for maximum reliability. You cannot change that >>>>>>>>>>>>>>>>> algorithm >>>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>> there >>>>>>>>>>>>>>>>> is >>>>>>>>>>>>>>>>> no need to do so. You should *never* tell anyone to >>>>>>>>>>>>>>>>> directly >>>>>>>>>>>>>>>>> connect >>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>> a >>>>>>>>>>>>>>>>> supervisor. If you do, you will likely get unreachable >>>>>>>>>>>>>>>>> nodes. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> As for dropping data servers, it would appear to me, given >>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> flurry >>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>> such activity, that something either crashed or was >>>>>>>>>>>>>>>>> restarted. >>>>>>>>>>>>>>>>> That's >>>>>>>>>>>>>>>>> why >>>>>>>>>>>>>>>>> it >>>>>>>>>>>>>>>>> would be good to see the complete log of each one of the >>>>>>>>>>>>>>>>> entities. 
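Points 1) and 2) above translate into config changes of roughly this shape
(a sketch against the xrdcluster.cfg linked in this thread; the port, the
threshold, and the decision to declare all three hosts as managers are
illustrative, not taken from the actual file):

    # list every manager host consistently with all.manager ...
    all.manager atlas-bkp1.cs.wisc.edu:3121
    all.manager atlas-bkp2.cs.wisc.edu:3121
    all.manager atlas-bkp3.cs.wisc.edu:3121

    # ... and give those same hosts the manager role
    if atlas-bkp1.cs.wisc.edu
       all.role manager
    else if atlas-bkp2.cs.wisc.edu
       all.role manager
    else if atlas-bkp3.cs.wisc.edu
       all.role manager
    else
       all.role server
    fi

    # keep only cms.space (the old olb.space form overrides it and is being
    # phased out); the actual threshold is site-specific
    cms.space min 2%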
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Andy >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I read the document. and write a config >>>>>>>>>>>>>>>>>> file(http://wisconsin.cern.ch/~wguan/xrdcluster.cfg). >>>>>>>>>>>>>>>>>> I used my conf, I can see manager is dispatch message to >>>>>>>>>>>>>>>>>> supervisor. But I cannot see any dataserver tries to >>>>>>>>>>>>>>>>>> connect >>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>> supervisor. At the same time, in the manager's log, I can >>>>>>>>>>>>>>>>>> see >>>>>>>>>>>>>>>>>> some >>>>>>>>>>>>>>>>>> dataserver are Dropped. >>>>>>>>>>>>>>>>>> How does xrootd decide which dataserver will connect >>>>>>>>>>>>>>>>>> supervisor? >>>>>>>>>>>>>>>>>> should I specify some dataservers to connect the >>>>>>>>>>>>>>>>>> supervisor? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> (*) supervisor log >>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for >>>>>>>>>>>>>>>>>> state >>>>>>>>>>>>>>>>>> dlen=42 >>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: >>>>>>>>>>>>>>>>>> /atlas/xrootd/users/wguan/test/test131141 >>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: >>>>>>>>>>>>>>>>>> Path >>>>>>>>>>>>>>>>>> find >>>>>>>>>>>>>>>>>> failed for state /atlas/xrootd/users/wguan/test/test131141 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> (*)manager log >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu >>>>>>>>>>>>>>>>>> TSpace=5587GB >>>>>>>>>>>>>>>>>> NumFS=1 >>>>>>>>>>>>>>>>>> FSpace=5693644MB MinFR=57218MB Util=0 >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding >>>>>>>>>>>>>>>>>> path: >>>>>>>>>>>>>>>>>> w >>>>>>>>>>>>>>>>>> /atlas >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> do_Space: 5696231MB free; 0% util >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 logged in. >>>>>>>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection from >>>>>>>>>>>>>>>>>> [log in to unmask] >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running >>>>>>>>>>>>>>>>>> ?:[log in to unmask] >>>>>>>>>>>>>>>>>> inq=0 >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD >>>>>>>>>>>>>>>>>> 79 >>>>>>>>>>>>>>>>>> attached >>>>>>>>>>>>>>>>>> to poller 2; num=22 >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add >>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>> bumps >>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 #63 >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node: >>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 dropped. 
>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved >>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 to cluster; >>>>>>>>>>>>>>>>>> id=63.78; >>>>>>>>>>>>>>>>>> num=64; >>>>>>>>>>>>>>>>>> min=51 >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu >>>>>>>>>>>>>>>>>> TSpace=5587GB >>>>>>>>>>>>>>>>>> NumFS=1 >>>>>>>>>>>>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0 >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding >>>>>>>>>>>>>>>>>> path: >>>>>>>>>>>>>>>>>> w >>>>>>>>>>>>>>>>>> /atlas >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 >>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> do_Space: 5721854MB free; 0% util >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 logged in. >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from >>>>>>>>>>>>>>>>>> c187.chtc.wisc.edu; connection reset by peer >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 node 63.78 >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol: >>>>>>>>>>>>>>>>>> server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 server.21739:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 79 detached from poller 2; num=21 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service >>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>> c177.chtc.wisc.edu >>>>>>>>>>>>>>>>>> FD=16 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 node 0.3 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>> server.21656:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.21656:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 16 detached from poller 2; num=20 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>> c179.chtc.wisc.edu >>>>>>>>>>>>>>>>>> FD=21 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.17065:[log in to unmask]:1094 node 1.4 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>> server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 21 >>>>>>>>>>>>>>>>>> detached from poller 1; num=21 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to suspended >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to >>>>>>>>>>>>>>>>>> redirector.15656:14@atlas-bkp2 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service >>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>> c182.chtc.wisc.edu >>>>>>>>>>>>>>>>>> FD=19 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.12937:[log in to unmask]:1094 node 7.10 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>> server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.26620:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 19 detached from poller 2; num=19 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service >>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>> c178.chtc.wisc.edu >>>>>>>>>>>>>>>>>> FD=15 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.10842:[log in to unmask]:1094 node 9.12 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>> server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.11901:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 15 detached from poller 1; num=20 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service >>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>> c181.chtc.wisc.edu >>>>>>>>>>>>>>>>>> FD=17 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.5535:[log in to unmask]:1094 node 5.8 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>> server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.13984:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 17 detached from poller 0; num=21 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service >>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>> c183.chtc.wisc.edu >>>>>>>>>>>>>>>>>> FD=22 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.23711:[log in to unmask]:1094 node 8.11 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>> server.27735:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.27735:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 22 detached from poller 2; num=18 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>> c184.chtc.wisc.edu >>>>>>>>>>>>>>>>>> FD=20 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.4131:[log in to unmask]:1094 node 3.6 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>> server.26787:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.26787:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 20 detached from poller 0; num=20 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service >>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>> c185.chtc.wisc.edu >>>>>>>>>>>>>>>>>> FD=23 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 node 6.9 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>> server.8524:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 23 >>>>>>>>>>>>>>>>>> detached from poller 0; num=19 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service >>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>> c180.chtc.wisc.edu >>>>>>>>>>>>>>>>>> FD=18 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.20264:[log in to unmask]:1094 node 4.7 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>> server.14636:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.14636:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 18 detached from poller 1; num=19 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch >>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> for status dlen=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 >>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 >>>>>>>>>>>>>>>>>> do_Status: >>>>>>>>>>>>>>>>>> suspend >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service >>>>>>>>>>>>>>>>>> suspended >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from >>>>>>>>>>>>>>>>>> c186.chtc.wisc.edu >>>>>>>>>>>>>>>>>> FD=24 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node >>>>>>>>>>>>>>>>>> server.1656:[log in to unmask]:1094 node 2.5 >>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: >>>>>>>>>>>>>>>>>> server.7849:[log in to unmask] >>>>>>>>>>>>>>>>>> logged >>>>>>>>>>>>>>>>>> out. 
>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] >>>>>>>>>>>>>>>>>> XrdPoll: >>>>>>>>>>>>>>>>>> FD >>>>>>>>>>>>>>>>>> 24 >>>>>>>>>>>>>>>>>> detached from poller 1; num=18 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 >>>>>>>>>>>>>>>>>> seconds >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled. 
>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled. >>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Wen >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky >>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hi Wen, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> To go past 64 data servers you will need to setup one or >>>>>>>>>>>>>>>>>>> more >>>>>>>>>>>>>>>>>>> supervisors. >>>>>>>>>>>>>>>>>>> This does not logically change the current configuration >>>>>>>>>>>>>>>>>>> you >>>>>>>>>>>>>>>>>>> have. >>>>>>>>>>>>>>>>>>> You >>>>>>>>>>>>>>>>>>> only >>>>>>>>>>>>>>>>>>> need to configure one or more *new* servers (or at least >>>>>>>>>>>>>>>>>>> xrootd >>>>>>>>>>>>>>>>>>> processes) >>>>>>>>>>>>>>>>>>> whose role is supervisor. We'd like them to run in >>>>>>>>>>>>>>>>>>> separate >>>>>>>>>>>>>>>>>>> machines >>>>>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>>>>> reliability purposes, but they could run on the manager >>>>>>>>>>>>>>>>>>> node >>>>>>>>>>>>>>>>>>> as >>>>>>>>>>>>>>>>>>> long >>>>>>>>>>>>>>>>>>> as >>>>>>>>>>>>>>>>>>> you >>>>>>>>>>>>>>>>>>> give each one a unique instance name (i.e., -n option). >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> The front part of the cmsd reference explains how to do >>>>>>>>>>>>>>>>>>> this. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Andy >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Hi Andrew, >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Is there any change to configure xrootd with more than >>>>>>>>>>>>>>>>>>>> 65 >>>>>>>>>>>>>>>>>>>> machines? I used the configure below but it doesn't >>>>>>>>>>>>>>>>>>>> work. >>>>>>>>>>>>>>>>>>>> Should >>>>>>>>>>>>>>>>>>>> I >>>>>>>>>>>>>>>>>>>> configure some machines' manager to be supvervisor? >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Wen >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> >>>> >>>> >>>> >>> >>> >>> >> >> >> > > >
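Concretely, the supervisor setup Andy describes in his first reply (quoted
just above) has two parts: the new node gets the supervisor role in the
shared config, and if an extra cmsd/xrootd pair is run on an existing
machine it gets its own instance name via -n. A minimal sketch; the
supervisor host name and the file paths here are hypothetical, and the full
details are in the cms configuration reference linked above:

    # in the shared config (e.g. xrdcluster.cfg): one new node takes the
    # supervisor role, everything else keeps its current role
    if supervisor1.cs.wisc.edu
       all.role supervisor
    fi

    # alternatively, start an extra cmsd/xrootd pair on an existing host
    # under its own instance name (-n), pointing at the same config file
    cmsd   -n super -c /path/to/xrdcluster.cfg -l /var/log/xrootd/cmsd.log &
    xrootd -n super -c /path/to/xrdcluster.cfg -l /var/log/xrootd/xrootd.log &

Which data servers end up subscribed to the supervisor is decided internally
by the cluster management protocol; as Andy notes above, servers should
never be pointed at a supervisor directly.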