Hi Wen,

Another thing is that the log timestamps do not overlap:

bkp1    cms-manager    091211 15:05:33 to 15:31:37
bkp1    xrd-manager    091211 15:05:33 to 15:27:40

higgs03 cms-supervisor 091211 17:25:47 to 17:44:17
higgs03 xrd-supervisor 091211 17:25:47 to 17:43:57

c193    cms-server     091211 04:13:14 to 17:41:23
c193    xrd-server     091211 04:13:14 to 17:40:53

As you can see, there is no overlap between the supervisor and the manager 
logs making it impossible to see what the supervisor was doing relative to 
the manager. Could you reclip the supervisor log into the same time-frame?
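
For reference, here is a minimal sketch of how the common time-frame can be checked. It is only an illustration: the timestamps are copied from the table above, the `YYMMDD HH:MM:SS` layout is taken from the log excerpts in this thread, and the names `LOG_WINDOWS` and `common_window` are hypothetical helpers, not part of xrootd.

```python
from datetime import datetime

# Timestamp layout seen in the logs quoted in this thread: YYMMDD HH:MM:SS
STAMP_FORMAT = "%y%m%d %H:%M:%S"

# (host, log) -> (first, last) timestamp, copied from the table above.
LOG_WINDOWS = {
    ("bkp1", "cms-manager"): ("091211 15:05:33", "091211 15:31:37"),
    ("bkp1", "xrd-manager"): ("091211 15:05:33", "091211 15:27:40"),
    ("higgs03", "cms-supervisor"): ("091211 17:25:47", "091211 17:44:17"),
    ("higgs03", "xrd-supervisor"): ("091211 17:25:47", "091211 17:43:57"),
    ("c193", "cms-server"): ("091211 04:13:14", "091211 17:41:23"),
    ("c193", "xrd-server"): ("091211 04:13:14", "091211 17:40:53"),
}


def parse(stamp):
    """Convert a log timestamp string into a datetime."""
    return datetime.strptime(stamp, STAMP_FORMAT)


def common_window(hosts):
    """Return the (start, end) interval covered by every log on the
    given hosts, or None when the logs share no common interval."""
    spans = [span for (host, _), span in LOG_WINDOWS.items() if host in hosts]
    start = max(parse(first) for first, _ in spans)  # latest log start
    end = min(parse(last) for _, last in spans)      # earliest log end
    return (start, end) if start < end else None


manager_vs_supervisor = common_window({"bkp1", "higgs03"})
manager_vs_server = common_window({"bkp1", "c193"})

print("manager/supervisor:", manager_vs_supervisor)
print("manager/server:", manager_vs_server)
```

The latest start and the earliest end bound the interval that all logs cover; when the start is not before the end, there is nothing in common, which is the case for the manager and supervisor logs.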

In any case, why did you specify the xrd.timeout directive? In general, we
prefer to run with the defaults and the particular values you have chosen 
will cause problems in the long run. I'd strongly suggest you remove it.

Andy

  On Sat, 12 Dec 2009, wen guan wrote:

> Hi Andrew,
>
>  The logs can be found here. From the logs you can see that the
> atlas-bkp1 manager keeps dropping the nodes that try to connect to
> it, again and again.
>  http://higgs03.cs.wisc.edu/wguan/
>
>
> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
> <[log in to unmask]> wrote:
>> Hi Wen, Could you start everything up and provide me a pointer to the
>> manager log file, supervisor log file, and one data server logfile all of
>> which cover the same time-frame (from start to some point where you think
>> things are working or not). That way I can see what is happening. At the
>> moment I only see two "bad" things in the config file:
>>
>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager but you claim, via
>> the all.manager directive, that there are three (bkp2 and bkp3). While it
>> should work, the log file will be dense with error messages. Please correct
>> this to be consistent and make it easier to see real errors.
>
> This is not a problem for me, because this config is used on the
> data servers. On the managers, I changed the "if atlas-bkp1.cs.wisc.edu"
> to atlas-bkp2 or similar. This is a historical leftover: at first only
> atlas-bkp1 was used; atlas-bkp2 and atlas-bkp3 were added later.
>
>> 2) Please use cms.space not olb.space (for historical reasons the latter is
>> still accepted and over-rides the former, but that will soon end), and
>> please use only one (the config file uses both directives).
> Yes, I should remove this line. In fact, cms.space is in the cfg too.
>
>
> Thanks
> Wen
>
>> The xrootd has an internal mechanism to connect servers with supervisors to
>> allow for maximum reliability. You cannot change that algorithm and there is
>> no need to do so. You should *never* tell anyone to directly connect to a
>> supervisor. If you do, you will likely get unreachable nodes.
>>
>> As for dropping data servers, it would appear to me, given the flurry of
>> such activity, that something either crashed or was restarted. That's why it
>> would be good to see the complete log of each one of the entities.
>>
>> Andy
>>
>> On Fri, 11 Dec 2009, wen guan wrote:
>>
>>> Hi Andrew,
>>>
>>>    I read the document and wrote a config file
>>> (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>    Using my config, I can see that the manager is dispatching messages
>>> to the supervisor, but I cannot see any data server trying to connect
>>> to the supervisor. At the same time, in the manager's log, I can see
>>> that some data servers are dropped.
>>>   How does xrootd decide which data servers will connect to the
>>> supervisor? Should I tell some data servers to connect to the supervisor?
>>>
>>>
>>> (*) supervisor log
>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State:
>>> /atlas/xrootd/users/wguan/test/test131141
>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find
>>> failed for state /atlas/xrootd/users/wguan/test/test131141
>>>
>>> (*)manager log
>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1
>>> FSpace=5693644MB MinFR=57218MB Util=0
>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
>>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094
>>> do_Space: 5696231MB free; 0% util
>>> 091211 04:13:24 15661 Protocol:
>>> server.10585:[log in to unmask]:1094 logged in.
>>> 091211 04:13:24 001 XrdInet: Accepted connection from
>>> [log in to unmask]
>>> 091211 04:13:24 15661 XrdSched: running ?:[log in to unmask] inq=0
>>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79 attached
>>> to poller 2; num=22
>>> 091211 04:13:24 15661 Add server.21739:[log in to unmask] bumps
>>> server.15905:[log in to unmask]:1094 #63
>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>> 091211 04:13:24 15661 Drop_Node:
>>> server.15905:[log in to unmask]:1094 dropped.
>>> 091211 04:13:24 15661 Add Shoved
>>> server.21739:[log in to unmask]:1094 to cluster; id=63.78; num=64;
>>> min=51
>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB NumFS=1
>>> FSpace=5721854MB MinFR=57218MB Util=0
>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w /atlas
>>> 091211 04:13:24 15661 server.21739:[log in to unmask]:1094
>>> do_Space: 5721854MB free; 0% util
>>> 091211 04:13:24 15661 Protocol:
>>> server.21739:[log in to unmask]:1094 logged in.
>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from
>>> c187.chtc.wisc.edu; connection reset by peer
>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60 seconds
>>> 091211 04:13:24 15661 Remove_Node
>>> server.21739:[log in to unmask]:1094 node 63.78
>>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask] logged
>>> out.
>>> 091211 04:13:24 15661 server.21739:[log in to unmask] XrdPoll: FD
>>> 79 detached from poller 2; num=21
>>> 091211 04:13:27 15661 Dispatch server.24718:[log in to unmask]:1094
>>> for status dlen=0
>>> 091211 04:13:27 15661 server.24718:[log in to unmask]:1094 do_Status:
>>> suspend
>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended
>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c177.chtc.wisc.edu
>>> FD=16
>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>> 091211 04:13:27 15661 Remove_Node
>>> server.24718:[log in to unmask]:1094 node 0.3
>>> 091211 04:13:27 15661 Protocol: server.21656:[log in to unmask] logged
>>> out.
>>> 091211 04:13:27 15661 server.21656:[log in to unmask] XrdPoll: FD
>>> 16 detached from poller 2; num=20
>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c179.chtc.wisc.edu
>>> FD=21
>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>> 091211 04:13:27 15661 Remove_Node
>>> server.17065:[log in to unmask]:1094 node 1.4
>>> 091211 04:13:27 15661 Protocol: server.7978:[log in to unmask] logged
>>> out.
>>> 091211 04:13:27 15661 server.7978:[log in to unmask] XrdPoll: FD 21
>>> detached from poller 1; num=21
>>> 091211 04:13:27 15661 State: Status changed to suspended
>>> 091211 04:13:27 15661 Send status to redirector.15656:14@atlas-bkp2
>>> 091211 04:13:27 15661 Dispatch server.12937:[log in to unmask]:1094
>>> for status dlen=0
>>> 091211 04:13:27 15661 server.12937:[log in to unmask]:1094 do_Status:
>>> suspend
>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended
>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c182.chtc.wisc.edu
>>> FD=19
>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>> 091211 04:13:27 15661 Remove_Node
>>> server.12937:[log in to unmask]:1094 node 7.10
>>> 091211 04:13:27 15661 Protocol: server.26620:[log in to unmask] logged
>>> out.
>>> 091211 04:13:27 15661 server.26620:[log in to unmask] XrdPoll: FD
>>> 19 detached from poller 2; num=19
>>> 091211 04:13:27 15661 Dispatch server.10842:[log in to unmask]:1094
>>> for status dlen=0
>>> 091211 04:13:27 15661 server.10842:[log in to unmask]:1094 do_Status:
>>> suspend
>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service suspended
>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c178.chtc.wisc.edu
>>> FD=15
>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>> 091211 04:13:27 15661 Remove_Node
>>> server.10842:[log in to unmask]:1094 node 9.12
>>> 091211 04:13:27 15661 Protocol: server.11901:[log in to unmask] logged
>>> out.
>>> 091211 04:13:27 15661 server.11901:[log in to unmask] XrdPoll: FD
>>> 15 detached from poller 1; num=20
>>> 091211 04:13:27 15661 Dispatch server.5535:[log in to unmask]:1094
>>> for status dlen=0
>>> 091211 04:13:27 15661 server.5535:[log in to unmask]:1094 do_Status:
>>> suspend
>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service suspended
>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c181.chtc.wisc.edu
>>> FD=17
>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>> 091211 04:13:27 15661 Remove_Node
>>> server.5535:[log in to unmask]:1094 node 5.8
>>> 091211 04:13:27 15661 Protocol: server.13984:[log in to unmask] logged
>>> out.
>>> 091211 04:13:27 15661 server.13984:[log in to unmask] XrdPoll: FD
>>> 17 detached from poller 0; num=21
>>> 091211 04:13:27 15661 Dispatch server.23711:[log in to unmask]:1094
>>> for status dlen=0
>>> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094 do_Status:
>>> suspend
>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended
>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c183.chtc.wisc.edu
>>> FD=22
>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>> 091211 04:13:27 15661 Remove_Node
>>> server.23711:[log in to unmask]:1094 node 8.11
>>> 091211 04:13:27 15661 Protocol: server.27735:[log in to unmask] logged
>>> out.
>>> 091211 04:13:27 15661 server.27735:[log in to unmask] XrdPoll: FD
>>> 22 detached from poller 2; num=18
>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c184.chtc.wisc.edu
>>> FD=20
>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>> 091211 04:13:27 15661 Remove_Node
>>> server.4131:[log in to unmask]:1094 node 3.6
>>> 091211 04:13:27 15661 Protocol: server.26787:[log in to unmask] logged
>>> out.
>>> 091211 04:13:27 15661 server.26787:[log in to unmask] XrdPoll: FD
>>> 20 detached from poller 0; num=20
>>> 091211 04:13:27 15661 Dispatch server.10585:[log in to unmask]:1094
>>> for status dlen=0
>>> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094 do_Status:
>>> suspend
>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended
>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c185.chtc.wisc.edu
>>> FD=23
>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>> 091211 04:13:27 15661 Remove_Node
>>> server.10585:[log in to unmask]:1094 node 6.9
>>> 091211 04:13:27 15661 Protocol: server.8524:[log in to unmask] logged
>>> out.
>>> 091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll: FD 23
>>> detached from poller 0; num=19
>>> 091211 04:13:27 15661 Dispatch server.20264:[log in to unmask]:1094
>>> for status dlen=0
>>> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094 do_Status:
>>> suspend
>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended
>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c180.chtc.wisc.edu
>>> FD=18
>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>> 091211 04:13:27 15661 Remove_Node
>>> server.20264:[log in to unmask]:1094 node 4.7
>>> 091211 04:13:27 15661 Protocol: server.14636:[log in to unmask] logged
>>> out.
>>> 091211 04:13:27 15661 server.14636:[log in to unmask] XrdPoll: FD
>>> 18 detached from poller 1; num=19
>>> 091211 04:13:27 15661 Dispatch server.1656:[log in to unmask]:1094
>>> for status dlen=0
>>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094 do_Status:
>>> suspend
>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c186.chtc.wisc.edu
>>> FD=24
>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>> 091211 04:13:27 15661 Remove_Node
>>> server.1656:[log in to unmask]:1094 node 2.5
>>> 091211 04:13:27 15661 Protocol: server.7849:[log in to unmask] logged
>>> out.
>>> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll: FD 24
>>> detached from poller 1; num=18
>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>
>>> Wen
>>>
>>>
>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky
>>> <[log in to unmask]> wrote:
>>>>
>>>> Hi Wen,
>>>>
>>>> To go past 64 data servers you will need to set up one or more
>>>> supervisors. This does not logically change the current configuration
>>>> you have. You only need to configure one or more *new* servers (or at
>>>> least xrootd processes) whose role is supervisor. We'd like them to
>>>> run on separate machines for reliability purposes, but they could run
>>>> on the manager node as long as you give each one a unique instance
>>>> name (i.e., the -n option).
>>>>
>>>> The front part of the cmsd reference explains how to do this.
>>>>
>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>>
>>>> Andy
>>>>
>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>>   Is there any way to configure xrootd with more than 65
>>>>> machines? I used the configuration below but it doesn't work. Should
>>>>> I configure the manager on some machines to be a supervisor?
>>>>>
>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>
>>>>>
>>>>> Wen
>>>>>
>>>>
>>
>