Hi Andrew,

The logs can be found here: http://higgs03.cs.wisc.edu/wguan/

From the logs you can see that the atlas-bkp1 manager is dropping,
again and again, the nodes that try to connect to it. (The supervisor
block I am running is sketched in the P.S. at the bottom of this mail.)

On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
<[log in to unmask]> wrote:
> Hi Wen,
>
> Could you start everything up and provide me a pointer to the manager
> log file, supervisor log file, and one data server log file, all of
> which cover the same time-frame (from start to some point where you
> think things are working or not)? That way I can see what is
> happening. At the moment I only see two "bad" things in the config
> file:
>
> 1) Only atlas-bkp1.cs.wisc.edu is designated as a manager, but you
> claim, via the all.manager directive, that there are three (bkp2 and
> bkp3). While it should work, the log file will be dense with error
> messages. Please correct this to be consistent and make it easier to
> see real errors.

This is not actually a problem: that config file is the one used on
the data servers. On each manager I change the "if
atlas-bkp1.cs.wisc.edu" clause to atlas-bkp2 or atlas-bkp3 as
appropriate. It is a leftover from history: at first only atlas-bkp1
was used; atlas-bkp2 and atlas-bkp3 were added later.

> 2) Please use cms.space not olb.space (for historical reasons the
> latter is still accepted and over-rides the former, but that will
> soon end), and please use only one (the config file uses both
> directives).

Yes, I should remove that line; in fact cms.space is already in the
config too.
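To make the file consistent for every host, as you suggest, I am
thinking of something like the sketch below. This is only a sketch,
not the current contents of xrdcluster.cfg; the cmsd port 1213 and
the cms.space value are placeholders:

    # Name all three managers so every node knows about all of them.
    all.manager atlas-bkp1.cs.wisc.edu:1213
    all.manager atlas-bkp2.cs.wisc.edu:1213
    all.manager atlas-bkp3.cs.wisc.edu:1213

    # One host-based conditional instead of editing the file per host.
    if atlas-bkp1.cs.wisc.edu atlas-bkp2.cs.wisc.edu atlas-bkp3.cs.wisc.edu
       all.role manager
    else
       all.role server
    fi

    # Keep only cms.space; the olb.space line is removed.
    cms.space min 10g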
Thanks,
Wen

> The xrootd has an internal mechanism to connect servers with
> supervisors to allow for maximum reliability. You cannot change that
> algorithm and there is no need to do so. You should *never* tell
> anyone to directly connect to a supervisor. If you do, you will
> likely get unreachable nodes.
>
> As for dropping data servers, it would appear to me, given the flurry
> of such activity, that something either crashed or was restarted.
> That's why it would be good to see the complete log of each one of
> the entities.
>
> Andy
>
> On Fri, 11 Dec 2009, wen guan wrote:
>
>> Hi Andrew,
>>
>> I read the document and wrote a config file
>> (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>> With this config I can see the manager dispatching messages to the
>> supervisor, but I cannot see any data server trying to connect to
>> the supervisor. At the same time, in the manager's log, I can see
>> that some data servers are dropped.
>> How does xrootd decide which data servers will connect to a
>> supervisor? Should I specify some data servers to connect to the
>> supervisor?
>>
>> (*) supervisor log
>> 091211 15:07:00 30028 Dispatch manager.0:20@atlas-bkp2 for state dlen=42
>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_State: /atlas/xrootd/users/wguan/test/test131141
>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2 do_StateFWD: Path find failed for state /atlas/xrootd/users/wguan/test/test131141
>>
>> (*) manager log
>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5693644MB MinFR=57218MB Util=0
>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu adding path: w /atlas
>> 091211 04:13:24 15661 server.10585:[log in to unmask]:1094 do_Space: 5696231MB free; 0% util
>> 091211 04:13:24 15661 Protocol: server.10585:[log in to unmask]:1094 logged in.
>> 091211 04:13:24 001 XrdInet: Accepted connection from [log in to unmask]
>> 091211 04:13:24 15661 XrdSched: running ?:[log in to unmask] inq=0
>> 091211 04:13:24 15661 XrdProtocol: matched protocol cmsd
>> 091211 04:13:24 15661 ?:[log in to unmask] XrdPoll: FD 79 attached to poller 2; num=22
>> 091211 04:13:24 15661 Add server.21739:[log in to unmask] bumps server.15905:[log in to unmask]:1094 #63
>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:24 15661 Drop_Node: server.15905:[log in to unmask]:1094 dropped.
>> 091211 04:13:24 15661 Add Shoved server.21739:[log in to unmask]:1094 to cluster; id=63.78; num=64; min=51
>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu TSpace=5587GB NumFS=1 FSpace=5721854MB MinFR=57218MB Util=0
>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu adding path: w /atlas
>> 091211 04:13:24 15661 server.21739:[log in to unmask]:1094 do_Space: 5721854MB free; 0% util
>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask]:1094 logged in.
>> 091211 04:13:24 15661 XrdLink: Unable to recieve from c187.chtc.wisc.edu; connection reset by peer
>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:24 15661 XrdSched: scheduling drop node in 60 seconds
>> 091211 04:13:24 15661 Remove_Node server.21739:[log in to unmask]:1094 node 63.78
>> 091211 04:13:24 15661 Protocol: server.21739:[log in to unmask] logged out.
>> 091211 04:13:24 15661 server.21739:[log in to unmask] XrdPoll: FD 79 detached from poller 2; num=21
>> 091211 04:13:27 15661 Dispatch server.24718:[log in to unmask]:1094 for status dlen=0
>> 091211 04:13:27 15661 server.24718:[log in to unmask]:1094 do_Status: suspend
>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu service suspended
>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c177.chtc.wisc.edu FD=16
>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>> 091211 04:13:27 15661 Remove_Node server.24718:[log in to unmask]:1094 node 0.3
>> 091211 04:13:27 15661 Protocol: server.21656:[log in to unmask] logged out.
>> 091211 04:13:27 15661 server.21656:[log in to unmask] XrdPoll: FD 16 detached from poller 2; num=20
>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c179.chtc.wisc.edu FD=21
>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:27 15661 Remove_Node server.17065:[log in to unmask]:1094 node 1.4
>> 091211 04:13:27 15661 Protocol: server.7978:[log in to unmask] logged out.
>> 091211 04:13:27 15661 server.7978:[log in to unmask] XrdPoll: FD 21 detached from poller 1; num=21
>> 091211 04:13:27 15661 State: Status changed to suspended
>> 091211 04:13:27 15661 Send status to redirector.15656:14@atlas-bkp2
>> 091211 04:13:27 15661 Dispatch server.12937:[log in to unmask]:1094 for status dlen=0
>> 091211 04:13:27 15661 server.12937:[log in to unmask]:1094 do_Status: suspend
>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended
>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c182.chtc.wisc.edu FD=19
>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>> 091211 04:13:27 15661 Remove_Node server.12937:[log in to unmask]:1094 node 7.10
>> 091211 04:13:27 15661 Protocol: server.26620:[log in to unmask] logged out.
>> 091211 04:13:27 15661 server.26620:[log in to unmask] XrdPoll: FD 19 detached from poller 2; num=19
>> 091211 04:13:27 15661 Dispatch server.10842:[log in to unmask]:1094 for status dlen=0
>> 091211 04:13:27 15661 server.10842:[log in to unmask]:1094 do_Status: suspend
>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service suspended
>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c178.chtc.wisc.edu FD=15
>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>> 091211 04:13:27 15661 Remove_Node server.10842:[log in to unmask]:1094 node 9.12
>> 091211 04:13:27 15661 Protocol: server.11901:[log in to unmask] logged out.
>> 091211 04:13:27 15661 server.11901:[log in to unmask] XrdPoll: FD 15 detached from poller 1; num=20
>> 091211 04:13:27 15661 Dispatch server.5535:[log in to unmask]:1094 for status dlen=0
>> 091211 04:13:27 15661 server.5535:[log in to unmask]:1094 do_Status: suspend
>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service suspended
>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c181.chtc.wisc.edu FD=17
>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>> 091211 04:13:27 15661 Remove_Node server.5535:[log in to unmask]:1094 node 5.8
>> 091211 04:13:27 15661 Protocol: server.13984:[log in to unmask] logged out.
>> 091211 04:13:27 15661 server.13984:[log in to unmask] XrdPoll: FD 17 detached from poller 0; num=21
>> 091211 04:13:27 15661 Dispatch server.23711:[log in to unmask]:1094 for status dlen=0
>> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094 do_Status: suspend
>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended
>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c183.chtc.wisc.edu FD=22
>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>> 091211 04:13:27 15661 Remove_Node server.23711:[log in to unmask]:1094 node 8.11
>> 091211 04:13:27 15661 Protocol: server.27735:[log in to unmask] logged out.
>> 091211 04:13:27 15661 server.27735:[log in to unmask] XrdPoll: FD 22 detached from poller 2; num=18
>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c184.chtc.wisc.edu FD=20
>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:27 15661 Remove_Node server.4131:[log in to unmask]:1094 node 3.6
>> 091211 04:13:27 15661 Protocol: server.26787:[log in to unmask] logged out.
>> 091211 04:13:27 15661 server.26787:[log in to unmask] XrdPoll: FD 20 detached from poller 0; num=20
>> 091211 04:13:27 15661 Dispatch server.10585:[log in to unmask]:1094 for status dlen=0
>> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094 do_Status: suspend
>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended
>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c185.chtc.wisc.edu FD=23
>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>> 091211 04:13:27 15661 Remove_Node server.10585:[log in to unmask]:1094 node 6.9
>> 091211 04:13:27 15661 Protocol: server.8524:[log in to unmask] logged out.
>> 091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll: FD 23 detached from poller 0; num=19
>> 091211 04:13:27 15661 Dispatch server.20264:[log in to unmask]:1094 for status dlen=0
>> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094 do_Status: suspend
>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended
>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c180.chtc.wisc.edu FD=18
>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>> 091211 04:13:27 15661 Remove_Node server.20264:[log in to unmask]:1094 node 4.7
>> 091211 04:13:27 15661 Protocol: server.14636:[log in to unmask] logged out.
>> 091211 04:13:27 15661 server.14636:[log in to unmask] XrdPoll: FD 18 detached from poller 1; num=19
>> 091211 04:13:27 15661 Dispatch server.1656:[log in to unmask]:1094 for status dlen=0
>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094 do_Status: suspend
>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c186.chtc.wisc.edu FD=24
>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>> 091211 04:13:27 15661 Remove_Node server.1656:[log in to unmask]:1094 node 2.5
>> 091211 04:13:27 15661 Protocol: server.7849:[log in to unmask] logged out.
>> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll: FD 24 detached from poller 1; num=18
>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>
>> Wen
>>
>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky
>> <[log in to unmask]> wrote:
>>>
>>> Hi Wen,
>>>
>>> To go past 64 data servers you will need to set up one or more
>>> supervisors. This does not logically change the current
>>> configuration you have. You only need to configure one or more
>>> *new* servers (or at least xrootd processes) whose role is
>>> supervisor. We'd like them to run on separate machines for
>>> reliability purposes, but they could run on the manager node as
>>> long as you give each one a unique instance name (i.e., the -n
>>> option).
>>>
>>> The front part of the cmsd reference explains how to do this:
>>>
>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>
>>> Andy
>>>
>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>> Is there any way to configure xrootd with more than 64
>>>> machines? I used the config below, but it doesn't work. Should I
>>>> configure some of the machines' managers to be supervisors?
>>>>
>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>
>>>> Wen
>>>
>
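P.S. Re setting up supervisors with a unique instance name: so that we
are looking at the same thing, this is roughly the supervisor part of
my setup after reading the cmsd reference. It is only a sketch; the
hostname "atlas-super1" is a placeholder for whatever machine ends up
running the supervisor:

    # In xrdcluster.cfg: give the supervisor role to a dedicated host.
    if atlas-super1.cs.wisc.edu
       all.role supervisor
    fi

The supervisor's xrootd/cmsd pair is then started as its own instance
with the -n option, e.g.:

    xrootd -n super -c /path/to/xrdcluster.cfg &
    cmsd -n super -c /path/to/xrdcluster.cfg &

The supervisor finds the managers through all.manager on its own; I am
not pointing any data server at it directly.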