That's because bkp1 was connecting to a different NTP server, so its clock was off. I have updated it.

Wen

On Sat, Dec 12, 2009 at 3:34 AM, Andrew Hanushevsky
<[log in to unmask]> wrote:
> Hi Wen,
>
> Another thing: the log timestamps do not overlap:
>
> bkp1    cms-manager    091211 15:05:33 to 15:31:37
> bkp1    xrd-manager    091211 15:05:33 to 15:27:40
>
> higgs03 cms-supervisor 091211 17:25:47 to 17:44:17
> higgs03 xrd-supervisor 091211 17:25:47 to 17:43:57
>
> c193    cms-server     091211 04:13:14 to 17:41:23
> c193    xrd-server     091211 04:13:14 to 17:40:53
>
> As you can see, there is no overlap between the supervisor and the manager
> logs making it impossible to see what the supervisor was doing relative to
> the manager. Could you reclip the supervisor log into the same time-frame?
>
> In any case, why did you specify the xrd.timeout directive? In general, we
> prefer to run with the defaults, and the particular values you have chosen
> will cause problems in the long run. I'd strongly suggest you remove it.
>
> Andy
>
>  On Sat, 12 Dec 2009, wen guan wrote:
>
>> Hi Andrew,
>>
>>  The logs can be found here. From them you can see the atlas-bkp1
>> manager dropping, again and again, the nodes that try to connect to it.
>>  http://higgs03.cs.wisc.edu/wguan/
>>
>>
>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
>> <[log in to unmask]> wrote:
>>>
>>> Hi Wen, Could you start everything up and provide me a pointer to the
>>> manager log file, supervisor log file, and one data server logfile all of
>>> which cover the same time-frame (from start to some point where you think
>>> things are working or not). That way I can see what is happening. At the
>>> moment I only see two "bad" things in the config file:
>>>
>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager but you claim,
>>> via the all.manager directive, that there are three (bkp2 and bkp3). While
>>> it should work, the log file will be dense with error messages. Please
>>> correct this to be consistent and make it easier to see real errors.
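>>>
>>> For example, one consistent way to write it (the manager port shown here
>>> is only a placeholder) would be:
>>>
>>>    all.manager atlas-bkp1.cs.wisc.edu:3121
>>>    all.manager atlas-bkp2.cs.wisc.edu:3121
>>>    all.manager atlas-bkp3.cs.wisc.edu:3121
>>>
>>>    all.role manager if atlas-bkp1.cs.wisc.edu
>>>    all.role manager if atlas-bkp2.cs.wisc.edu
>>>    all.role manager if atlas-bkp3.cs.wisc.edu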
>>
>> This is not a problem here: that config is the one used on the data
>> servers. On the managers I change the if atlas-bkp1.cs.wisc.edu line to
>> atlas-bkp2 and so on. It is a historical leftover: at first only
>> atlas-bkp1 was used; atlas-bkp2 and atlas-bkp3 were added later.
>>
>>> 2) Please use cms.space not olb.space (for historical reasons the latter
>>> is still accepted and overrides the former, but that will soon end), and
>>> please use only one (the config file uses both directives).
>>
>> Yes, I should remove that line; in fact cms.space is already in the cfg
>> too.
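>>
>> In other words, keep just the one directive, something like this (the
>> threshold value here is only a placeholder, not a recommendation):
>>
>>    cms.space min 10g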
>>
>>
>> Thanks
>> Wen
>>
>>> The xrootd has an internal mechanism to connect servers with supervisors
>>> to allow for maximum reliability. You cannot change that algorithm and
>>> there is no need to do so. You should *never* tell anyone to directly
>>> connect to a supervisor. If you do, you will likely get unreachable nodes.
>>>
>>> As for dropping data servers, it would appear to me, given the flurry of
>>> such activity, that something either crashed or was restarted. That's why
>>> it would be good to see the complete log of each one of the entities.
>>>
>>> Andy
>>>
>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>>    I read the document and wrote a config
>>>> file (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>    With this config I can see the manager dispatching messages to the
>>>> supervisor, but I cannot see any data server trying to connect to the
>>>> supervisor. At the same time, in the manager's log, I can see some data
>>>> servers being dropped.
>>>>    How does xrootd decide which data servers will connect to a
>>>> supervisor? Should I tell some data servers to connect to the supervisor?
>>>>
>>>>
>>>> (*) supervisor log
>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State:
>>>> /atlas/xrootd/users/wguan/test/test131141
>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find
>>>> failed for state /atlas/xrootd/users/wguan/test/test131141
>>>>
>>>> (*)manager log
>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1
>>>> FSpace=5693644MB MinFR=57218MB Util=0
>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
>>>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094
>>>> do_Space: 5696231MB free; 0% util
>>>> 091211 04:13:24 15661 Protocol:
>>>> server.10585:[log in to unmask]:1094 logged in.
>>>> 091211 04:13:24 001 XrdInet: Accepted connection from
>>>> [log in to unmask]
>>>> 091211 04:13:24 15661 XrdSched: running ?:[log in to unmask] inq=0
>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79 attached
>>>> to poller 2; num=22
>>>> 091211 04:13:24 15661 Add server.21739:[log in to unmask] bumps
>>>> server.15905:[log in to unmask]:1094 #63
>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:24 15661 Drop_Node:
>>>> server.15905:[log in to unmask]:1094 dropped.
>>>> 091211 04:13:24 15661 Add Shoved
>>>> server.21739:[log in to unmask]:1094 to cluster; id=63.78; num=64;
>>>> min=51
>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB NumFS=1
>>>> FSpace=5721854MB MinFR=57218MB Util=0
>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w /atlas
>>>> 091211 04:13:24 15661 server.21739:[log in to unmask]:1094
>>>> do_Space: 5721854MB free; 0% util
>>>> 091211 04:13:24 15661 Protocol:
>>>> server.21739:[log in to unmask]:1094 logged in.
>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from
>>>> c187.chtc.wisc.edu; connection reset by peer
>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60 seconds
>>>> 091211 04:13:24 15661 Remove_Node
>>>> server.21739:[log in to unmask]:1094 node 63.78
>>>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask]
>>>> logged
>>>> out.
>>>> 091211 04:13:24 15661 server.21739:[log in to unmask] XrdPoll: FD
>>>> 79 detached from poller 2; num=21
>>>> 091211 04:13:27 15661 Dispatch server.24718:[log in to unmask]:1094
>>>> for status dlen=0
>>>> 091211 04:13:27 15661 server.24718:[log in to unmask]:1094 do_Status:
>>>> suspend
>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended
>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c177.chtc.wisc.edu
>>>> FD=16
>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>> 091211 04:13:27 15661 Remove_Node
>>>> server.24718:[log in to unmask]:1094 node 0.3
>>>> 091211 04:13:27 15661 Protocol: server.21656:[log in to unmask]
>>>> logged
>>>> out.
>>>> 091211 04:13:27 15661 server.21656:[log in to unmask] XrdPoll: FD
>>>> 16 detached from poller 2; num=20
>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c179.chtc.wisc.edu
>>>> FD=21
>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:27 15661 Remove_Node
>>>> server.17065:[log in to unmask]:1094 node 1.4
>>>> 091211 04:13:27 15661 Protocol: server.7978:[log in to unmask] logged
>>>> out.
>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] XrdPoll: FD 21
>>>> detached from poller 1; num=21
>>>> 091211 04:13:27 15661 State: Status changed to suspended
>>>> 091211 04:13:27 15661 Send status to redirector.15656:14@atlas-bkp2
>>>> 091211 04:13:27 15661 Dispatch server.12937:[log in to unmask]:1094
>>>> for status dlen=0
>>>> 091211 04:13:27 15661 server.12937:[log in to unmask]:1094 do_Status:
>>>> suspend
>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended
>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c182.chtc.wisc.edu
>>>> FD=19
>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>> 091211 04:13:27 15661 Remove_Node
>>>> server.12937:[log in to unmask]:1094 node 7.10
>>>> 091211 04:13:27 15661 Protocol: server.26620:[log in to unmask]
>>>> logged
>>>> out.
>>>> 091211 04:13:27 15661 server.26620:[log in to unmask] XrdPoll: FD
>>>> 19 detached from poller 2; num=19
>>>> 091211 04:13:27 15661 Dispatch server.10842:[log in to unmask]:1094
>>>> for status dlen=0
>>>> 091211 04:13:27 15661 server.10842:[log in to unmask]:1094 do_Status:
>>>> suspend
>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service suspended
>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c178.chtc.wisc.edu
>>>> FD=15
>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>> 091211 04:13:27 15661 Remove_Node
>>>> server.10842:[log in to unmask]:1094 node 9.12
>>>> 091211 04:13:27 15661 Protocol: server.11901:[log in to unmask]
>>>> logged
>>>> out.
>>>> 091211 04:13:27 15661 server.11901:[log in to unmask] XrdPoll: FD
>>>> 15 detached from poller 1; num=20
>>>> 091211 04:13:27 15661 Dispatch server.5535:[log in to unmask]:1094
>>>> for status dlen=0
>>>> 091211 04:13:27 15661 server.5535:[log in to unmask]:1094 do_Status:
>>>> suspend
>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service suspended
>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c181.chtc.wisc.edu
>>>> FD=17
>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>> 091211 04:13:27 15661 Remove_Node
>>>> server.5535:[log in to unmask]:1094 node 5.8
>>>> 091211 04:13:27 15661 Protocol: server.13984:[log in to unmask]
>>>> logged
>>>> out.
>>>> 091211 04:13:27 15661 server.13984:[log in to unmask] XrdPoll: FD
>>>> 17 detached from poller 0; num=21
>>>> 091211 04:13:27 15661 Dispatch server.23711:[log in to unmask]:1094
>>>> for status dlen=0
>>>> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094 do_Status:
>>>> suspend
>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended
>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c183.chtc.wisc.edu
>>>> FD=22
>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>> 091211 04:13:27 15661 Remove_Node
>>>> server.23711:[log in to unmask]:1094 node 8.11
>>>> 091211 04:13:27 15661 Protocol: server.27735:[log in to unmask]
>>>> logged
>>>> out.
>>>> 091211 04:13:27 15661 server.27735:[log in to unmask] XrdPoll: FD
>>>> 22 detached from poller 2; num=18
>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c184.chtc.wisc.edu
>>>> FD=20
>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:27 15661 Remove_Node
>>>> server.4131:[log in to unmask]:1094 node 3.6
>>>> 091211 04:13:27 15661 Protocol: server.26787:[log in to unmask]
>>>> logged
>>>> out.
>>>> 091211 04:13:27 15661 server.26787:[log in to unmask] XrdPoll: FD
>>>> 20 detached from poller 0; num=20
>>>> 091211 04:13:27 15661 Dispatch server.10585:[log in to unmask]:1094
>>>> for status dlen=0
>>>> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094 do_Status:
>>>> suspend
>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended
>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c185.chtc.wisc.edu
>>>> FD=23
>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>> 091211 04:13:27 15661 Remove_Node
>>>> server.10585:[log in to unmask]:1094 node 6.9
>>>> 091211 04:13:27 15661 Protocol: server.8524:[log in to unmask] logged
>>>> out.
>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll: FD 23
>>>> detached from poller 0; num=19
>>>> 091211 04:13:27 15661 Dispatch server.20264:[log in to unmask]:1094
>>>> for status dlen=0
>>>> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094 do_Status:
>>>> suspend
>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended
>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c180.chtc.wisc.edu
>>>> FD=18
>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>> 091211 04:13:27 15661 Remove_Node
>>>> server.20264:[log in to unmask]:1094 node 4.7
>>>> 091211 04:13:27 15661 Protocol: server.14636:[log in to unmask]
>>>> logged
>>>> out.
>>>> 091211 04:13:27 15661 server.14636:[log in to unmask] XrdPoll: FD
>>>> 18 detached from poller 1; num=19
>>>> 091211 04:13:27 15661 Dispatch server.1656:[log in to unmask]:1094
>>>> for status dlen=0
>>>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094 do_Status:
>>>> suspend
>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c186.chtc.wisc.edu
>>>> FD=24
>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>> 091211 04:13:27 15661 Remove_Node
>>>> server.1656:[log in to unmask]:1094 node 2.5
>>>> 091211 04:13:27 15661 Protocol: server.7849:[log in to unmask] logged
>>>> out.
>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll: FD 24
>>>> detached from poller 1; num=18
>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>
>>>> Wen
>>>>
>>>>
>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky
>>>> <[log in to unmask]> wrote:
>>>>>
>>>>> Hi Wen,
>>>>>
>>>>> To go past 64 data servers you will need to set up one or more
>>>>> supervisors. This does not logically change your current configuration.
>>>>> You only need to configure one or more *new* servers (or at least xrootd
>>>>> processes) whose role is supervisor. We'd like them to run on separate
>>>>> machines for reliability purposes, but they could run on the manager
>>>>> node as long as you give each one a unique instance name (i.e., the -n
>>>>> option).
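>>>>>
>>>>> As a rough sketch (the hostname, instance name, and config path below
>>>>> are only placeholders), the config would gain a role line such as
>>>>>
>>>>>    all.role supervisor if super1.cs.wisc.edu
>>>>>
>>>>> and on that host you would start an extra cmsd/xrootd pair under its own
>>>>> instance name, e.g.
>>>>>
>>>>>    cmsd   -n super -c /path/to/xrdcluster.cfg &
>>>>>    xrootd -n super -c /path/to/xrdcluster.cfg &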
>>>>>
>>>>> The front part of the cmsd reference explains how to do this.
>>>>>
>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>>>
>>>>> Andy
>>>>>
>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>>    Is there any way to configure xrootd with more than 65 machines? I
>>>>>> used the configuration below but it doesn't work. Should I configure
>>>>>> the manager on some machines to be a supervisor?
>>>>>>
>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>
>>>>>>
>>>>>> Wen
>>>>>>
>>>>>
>>>
>