Because bkp1 was connecting to a different NTP server. I have updated it.

Wen

On Sat, Dec 12, 2009 at 3:34 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
> Hi Wen,
>
> Another thing is that the log timestamps do not overlap:
>
> bkp1    cms-manager    091211 15:05:33 to 15:31:37
> bkp1    xrd-manager    091211 15:05:33 to 15:27:40
>
> higgs03 cms-supervisor 091211 17:25:47 to 17:44:17
> higgs03 xrd-supervisor 091211 17:25:47 to 17:43:57
>
> c193    cms-server     091211 04:13:14 to 17:41:23
> c193    xrd-server     091211 04:13:14 to 17:40:53
>
> As you can see, there is no overlap between the supervisor and the manager
> logs, making it impossible to see what the supervisor was doing relative to
> the manager. Could you reclip the supervisor log into the same time-frame?
>
> In any case, why did you specify the xrd.timeout directive? In general, we
> prefer to run with the defaults, and the particular values you have chosen
> will cause problems in the long run. I'd strongly suggest you remove it.
>
> Andy
>
> On Sat, 12 Dec 2009, wen guan wrote:
>
>> Hi Andrew,
>>
>> The logs can be found here. From the log you can see the atlas-bkp1
>> manager dropping, again and again, the nodes that try to connect to it.
>> http://higgs03.cs.wisc.edu/wguan/
>>
>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
>> <[log in to unmask]> wrote:
>>>
>>> Hi Wen,
>>>
>>> Could you start everything up and provide me a pointer to the manager
>>> log file, supervisor log file, and one data server log file, all of
>>> which cover the same time-frame (from start to some point where you
>>> think things are working or not)? That way I can see what is happening.
>>> At the moment I only see two "bad" things in the config file:
>>>
>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager, but you
>>> claim, via the all.manager directive, that there are three (bkp2 and
>>> bkp3). While it should work, the log file will be dense with error
>>> messages. Please correct this to be consistent and make it easier to
>>> see real errors.
>>
>> This is not a problem for me, because this config is used on the data
>> servers. On the managers, I changed the "if atlas-bkp1.cs.wisc.edu" to
>> atlas-bkp2 and so on. This is a historical leftover: at first only
>> atlas-bkp1 was used; atlas-bkp2 and atlas-bkp3 were added later.
>>
>>> 2) Please use cms.space, not olb.space (for historical reasons the
>>> latter is still accepted and overrides the former, but that will soon
>>> end), and please use only one (the config file uses both directives).
>>
>> Yes, I should remove this line. In fact, cms.space is in the cfg too.
>>
>> Thanks
>> Wen
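
For reference, a minimal fragment consistent with points 1) and 2) above might look like the sketch below. The hostnames come from this thread; the manager port and the cms.space value are placeholders, not values taken from the actual xrdcluster.cfg.

    # name every manager, and make the role selection match that list
    all.manager atlas-bkp1.cs.wisc.edu:3121
    all.manager atlas-bkp2.cs.wisc.edu:3121
    all.manager atlas-bkp3.cs.wisc.edu:3121

    if atlas-bkp1.cs.wisc.edu atlas-bkp2.cs.wisc.edu atlas-bkp3.cs.wisc.edu
       all.role manager
    else
       all.role server
    fi

    # keep only cms.space (drop the olb.space line) and omit xrd.timeout
    # so the defaults apply; the value here is illustrative only
    cms.space min 10g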
>>> The xrootd has an internal mechanism to connect servers with supervisors
>>> to allow for maximum reliability. You cannot change that algorithm and
>>> there is no need to do so. You should *never* tell anyone to directly
>>> connect to a supervisor. If you do, you will likely get unreachable nodes.
>>>
>>> As for dropping data servers, it would appear to me, given the flurry of
>>> such activity, that something either crashed or was restarted. That's why
>>> it would be good to see the complete log of each one of the entities.
>>>
>>> Andy
>>>
>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>> I read the document and wrote a config file
>>>> (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>> With my config, I can see the manager dispatching messages to the
>>>> supervisor, but I cannot see any data server trying to connect to the
>>>> supervisor. At the same time, in the manager's log, I can see that
>>>> some data servers are dropped.
>>>> How does xrootd decide which data servers will connect to the
>>>> supervisor? Should I configure some data servers to connect to the
>>>> supervisor?
>>>>
>>>> (*) supervisor log
>>>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: /atlas/xrootd/users/wguan/test/test131141
>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find failed for state /atlas/xrootd/users/wguan/test/test131141
>>>>
>>>> (*) manager log
>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5693644MB MinFR=57218MB Util=0
>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
>>>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094 do_Space: 5696231MB free; 0% util
>>>> 091211 04:13:24 15661 Protocol: server.10585:[log in to unmask]:1094 logged in.
>>>> 091211 04:13:24 001 XrdInet: Accepted connection from [log in to unmask]
>>>> 091211 04:13:24 15661 XrdSched: running ?:[log in to unmask] inq=0
>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
>>>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79 attached to poller 2; num=22
>>>> 091211 04:13:24 15661 Add server.21739:[log in to unmask] bumps server.15905:[log in to unmask]:1094 #63
>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:24 15661 Drop_Node: server.15905:[log in to unmask]:1094 dropped.
>>>> 091211 04:13:24 15661 Add Shoved server.21739:[log in to unmask]:1094 to cluster; id=63.78; num=64; min=51
>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5721854MB MinFR=57218MB Util=0
>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w /atlas
>>>> 091211 04:13:24 15661 server.21739:[log in to unmask]:1094 do_Space: 5721854MB free; 0% util
>>>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask]:1094 logged in.
>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve from c187.chtc.wisc.edu; connection reset by peer
>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60 seconds
>>>> 091211 04:13:24 15661 Remove_Node server.21739:[log in to unmask]:1094 node 63.78
>>>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask] logged out.
>>>> 091211 04:13:24 15661 server.21739:[log in to unmask] XrdPoll: FD 79 detached from poller 2; num=21
>>>> 091211 04:13:27 15661 Dispatch server.24718:[log in to unmask]:1094 for status dlen=0
>>>> 091211 04:13:27 15661 server.24718:[log in to unmask]:1094 do_Status: suspend
>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended
>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c177.chtc.wisc.edu FD=16
>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>> 091211 04:13:27 15661 Remove_Node server.24718:[log in to unmask]:1094 node 0.3
>>>> 091211 04:13:27 15661 Protocol: server.21656:[log in to unmask] logged out.
>>>> 091211 04:13:27 15661 server.21656:[log in to unmask] XrdPoll: FD 16 detached from poller 2; num=20
>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c179.chtc.wisc.edu FD=21
>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:27 15661 Remove_Node server.17065:[log in to unmask]:1094 node 1.4
>>>> 091211 04:13:27 15661 Protocol: server.7978:[log in to unmask] logged out.
>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] XrdPoll: FD 21 detached from poller 1; num=21
>>>> 091211 04:13:27 15661 State: Status changed to suspended
>>>> 091211 04:13:27 15661 Send status to redirector.15656:14@atlas-bkp2
>>>> 091211 04:13:27 15661 Dispatch server.12937:[log in to unmask]:1094 for status dlen=0
>>>> 091211 04:13:27 15661 server.12937:[log in to unmask]:1094 do_Status: suspend
>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended
>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c182.chtc.wisc.edu FD=19
>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>> 091211 04:13:27 15661 Remove_Node server.12937:[log in to unmask]:1094 node 7.10
>>>> 091211 04:13:27 15661 Protocol: server.26620:[log in to unmask] logged out.
>>>> 091211 04:13:27 15661 server.26620:[log in to unmask] XrdPoll: FD 19 detached from poller 2; num=19
>>>> 091211 04:13:27 15661 Dispatch server.10842:[log in to unmask]:1094 for status dlen=0
>>>> 091211 04:13:27 15661 server.10842:[log in to unmask]:1094 do_Status: suspend
>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service suspended
>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c178.chtc.wisc.edu FD=15
>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>> 091211 04:13:27 15661 Remove_Node server.10842:[log in to unmask]:1094 node 9.12
>>>> 091211 04:13:27 15661 Protocol: server.11901:[log in to unmask] logged out.
>>>> 091211 04:13:27 15661 server.11901:[log in to unmask] XrdPoll: FD 15 detached from poller 1; num=20
>>>> 091211 04:13:27 15661 Dispatch server.5535:[log in to unmask]:1094 for status dlen=0
>>>> 091211 04:13:27 15661 server.5535:[log in to unmask]:1094 do_Status: suspend
>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service suspended
>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c181.chtc.wisc.edu FD=17
>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>> 091211 04:13:27 15661 Remove_Node server.5535:[log in to unmask]:1094 node 5.8
>>>> 091211 04:13:27 15661 Protocol: server.13984:[log in to unmask] logged out.
>>>> 091211 04:13:27 15661 server.13984:[log in to unmask] XrdPoll: FD 17 detached from poller 0; num=21
>>>> 091211 04:13:27 15661 Dispatch server.23711:[log in to unmask]:1094 for status dlen=0
>>>> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094 do_Status: suspend
>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended
>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c183.chtc.wisc.edu FD=22
>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>> 091211 04:13:27 15661 Remove_Node server.23711:[log in to unmask]:1094 node 8.11
>>>> 091211 04:13:27 15661 Protocol: server.27735:[log in to unmask] logged out.
>>>> 091211 04:13:27 15661 server.27735:[log in to unmask] XrdPoll: FD 22 detached from poller 2; num=18
>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c184.chtc.wisc.edu FD=20
>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:27 15661 Remove_Node server.4131:[log in to unmask]:1094 node 3.6
>>>> 091211 04:13:27 15661 Protocol: server.26787:[log in to unmask] logged out.
>>>> 091211 04:13:27 15661 server.26787:[log in to unmask] XrdPoll: FD 20 detached from poller 0; num=20
>>>> 091211 04:13:27 15661 Dispatch server.10585:[log in to unmask]:1094 for status dlen=0
>>>> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094 do_Status: suspend
>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended
>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c185.chtc.wisc.edu FD=23
>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>> 091211 04:13:27 15661 Remove_Node server.10585:[log in to unmask]:1094 node 6.9
>>>> 091211 04:13:27 15661 Protocol: server.8524:[log in to unmask] logged out.
>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll: FD 23 detached from poller 0; num=19
>>>> 091211 04:13:27 15661 Dispatch server.20264:[log in to unmask]:1094 for status dlen=0
>>>> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094 do_Status: suspend
>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended
>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c180.chtc.wisc.edu FD=18
>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>> 091211 04:13:27 15661 Remove_Node server.20264:[log in to unmask]:1094 node 4.7
>>>> 091211 04:13:27 15661 Protocol: server.14636:[log in to unmask] logged out.
>>>> 091211 04:13:27 15661 server.14636:[log in to unmask] XrdPoll: FD 18 detached from poller 1; num=19
>>>> 091211 04:13:27 15661 Dispatch server.1656:[log in to unmask]:1094 for status dlen=0
>>>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094 do_Status: suspend
>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c186.chtc.wisc.edu FD=24
>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>> 091211 04:13:27 15661 Remove_Node server.1656:[log in to unmask]:1094 node 2.5
>>>> 091211 04:13:27 15661 Protocol: server.7849:[log in to unmask] logged out.
>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll: FD 24 detached from poller 1; num=18
>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>
>>>> Wen
>>>>
>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky
>>>> <[log in to unmask]> wrote:
>>>>>
>>>>> Hi Wen,
>>>>>
>>>>> To go past 64 data servers you will need to set up one or more
>>>>> supervisors. This does not logically change the current configuration
>>>>> you have. You only need to configure one or more *new* servers (or at
>>>>> least xrootd processes) whose role is supervisor.
>>>>> We'd like them to run on separate machines for reliability purposes,
>>>>> but they could run on the manager node as long as you give each one a
>>>>> unique instance name (i.e., the -n option).
>>>>>
>>>>> The front part of the cmsd reference explains how to do this:
>>>>>
>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>>>
>>>>> Andy
>>>>>
>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>> Is there any way to configure xrootd with more than 65 machines?
>>>>>> I used the configuration below but it doesn't work. Should I
>>>>>> configure the manager on some machines to be a supervisor?
>>>>>>
>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>
>>>>>> Wen
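
For reference, the change Andy describes above amounts to adding a supervisor role to the shared cluster config and starting one extra cmsd/xrootd pair with that role. The fragment below is only a sketch: higgs03.cs.wisc.edu is the supervisor host that appears elsewhere in this thread, while the instance name and paths are made-up placeholders, not values from the actual setup.

    # role selection in the shared xrdcluster.cfg (hosts from this thread)
    if atlas-bkp1.cs.wisc.edu atlas-bkp2.cs.wisc.edu atlas-bkp3.cs.wisc.edu
       all.role manager
    else if higgs03.cs.wisc.edu
       all.role supervisor
    else
       all.role server
    fi

If a supervisor shares a machine with a manager, it needs its own instance name so the admin paths and log files do not collide, for example (paths assumed):

    xrootd -n super -c /opt/xrootd/etc/xrdcluster.cfg -l /var/log/xrootd/super.log &
    cmsd   -n super -c /opt/xrootd/etc/xrdcluster.cfg -l /var/log/xrootd/super.log &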