Hi Andy,

   I am sure I am using the right cmsd code. Today I recompiled and
reinstalled cmsd and xrootd on all machines, but it still doesn't work.
I will create an account for you so you can log in to these machines to
check what happened.

   In fact, today while doing some restarts I saw some machines register
themselves to higgs07. Unfortunately the logs were cleaned during the
reinstall.

   I also found that the supervisor goes into the "suspend" state a while
after it is started. Could that cause the supervisor to fail to get some
information?

Wen


On Fri, Dec 18, 2009 at 3:05 AM, Andrew Hanushevsky <[log in to unmask]> wrote:
> Hi Wen,
>
> Something is really going wrong with your data servers. For instance, c109
> is quite happy from midnight to 7:23am. Then it dropped the connection,
> reconnected at 7:24:03, and was again happy until 12:37:20, at which point it
> reported that its xrootd died and the cmsd promptly killed its connection
> afterward. This looks as if someone restarted the xrootd followed by the cmsd
> on c109. It continued like this until 12:43:00 (i.e., connect, suspend, die,
> repeat). In fact, all of your servers started doing this between 12:36:41 and
> 12:42:51, causing a massive swap of servers. New servers were added and old
> ones reconnecting were redirected to the supervisor. However, it would appear
> that those machines could not connect there, as they kept coming back to
> atlas-bkp1. I can't tell you anything about what was happening on higgs07; as
> far as I can tell it was happily connected to the redirector cmsd. The reason
> is that there is no log for higgs07 on the web site for 12/17 starting at
> midnight. Perhaps you can put one there.
>
> So,
>
> 1) Are you *absolutely* sure that *all* your (data, etc) servers are running
> the corrected cmsd?
> 2) Please provide the higgs07 log for 12/17.
>
> 3) Please provide logs for a sampling of data servers (say c0109, c094,
> higgs15, and higgs13) between 12/17 12:00:00 and 15:44.
>
> I have never seen a situation like yours, so something is very wrong here. In
> the meantime I will add more debugging information to the redirector and
> supervisor and let you know when that is available.
>
> Andy
>
>
> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
> To: "Fabrizio Furano" <[log in to unmask]>
> Cc: "Andrew Hanushevsky" <[log in to unmask]>; <[log in to unmask]>
> Sent: Thursday, December 17, 2009 3:12 PM
> Subject: Re: xrootd with more than 65 machines
>
>
> Hi Fabrizio,
>
>   This is the xrdcp debug message.
>            ClientHeader.header.dlen = 41
> =================== END CLIENT HEADER DUMPING ===================
>
> 091217 16:47:54 15961 Xrd: WriteRaw: Writing 24 bytes to physical connection
> 091217 16:47:54 15961 Xrd: WriteRaw: Writing to substreamid 0
> 091217 16:47:54 15961 Xrd: WriteRaw: Writing 41 bytes to physical connection
> 091217 16:47:54 15961 Xrd: WriteRaw: Writing to substreamid 0
> 091217 16:47:54 15961 Xrd: ReadPartialAnswer: Reading a
> XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
> 091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw:  sid: 1, IsAttn:
> 0, substreamid: 0
> 091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4
> bytes) from substream 0
> 091217 16:47:54 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
> 091217 16:47:54 15961 Xrd: BuildMessage:  posting id 1
> 091217 16:47:54 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8
> bytes).
> 091217 16:47:54 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
>
>
> ======== DUMPING SERVER RESPONSE HEADER ========
>     ServerHeader.streamid = 0x01 0x00
>       ServerHeader.status = kXR_wait (4005)
>         ServerHeader.dlen = 4
> ========== END DUMPING SERVER HEADER ===========
>
> 091217 16:47:54 15961 Xrd: ReadPartialAnswer: Server
> [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
> 091217 16:47:54 15961 Xrd: CheckErrorStatus: Server
> [atlas-bkp1.cs.wisc.edu:1094] requested 10 seconds of wait
> 091217 16:48:04 15961 Xrd: DumpPhyConn: Phyconn entry,
> [log in to unmask]:1094', LogCnt=1 Valid
> 091217 16:48:04 15961 Xrd: SendGenCommand: Sending command Open
>
>
> ================= DUMPING CLIENT REQUEST HEADER =================
>               ClientHeader.streamid = 0x01 0x00
>              ClientHeader.requestid = kXR_open (3010)
>              ClientHeader.open.mode = 0x00 0x00
>           ClientHeader.open.options = 0x40 0x04
>          ClientHeader.open.reserved = 0 repeated 12 times
>            ClientHeader.header.dlen = 41
> =================== END CLIENT HEADER DUMPING ===================
>
> 091217 16:48:04 15961 Xrd: WriteRaw: Writing 24 bytes to physical connection
> 091217 16:48:04 15961 Xrd: WriteRaw: Writing to substreamid 0
> 091217 16:48:04 15961 Xrd: WriteRaw: Writing 41 bytes to physical connection
> 091217 16:48:04 15961 Xrd: WriteRaw: Writing to substreamid 0
> 091217 16:48:04 15961 Xrd: ReadPartialAnswer: Reading a
> XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
> 091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw:  sid: 1, IsAttn:
> 0, substreamid: 0
> 091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4
> bytes) from substream 0
> 091217 16:48:04 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
> 091217 16:48:04 15961 Xrd: BuildMessage:  posting id 1
> 091217 16:48:04 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8
> bytes).
> 091217 16:48:04 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
>
>
> ======== DUMPING SERVER RESPONSE HEADER ========
>     ServerHeader.streamid = 0x01 0x00
>       ServerHeader.status = kXR_wait (4005)
>         ServerHeader.dlen = 4
> ========== END DUMPING SERVER HEADER ===========
>
> 091217 16:48:04 15961 Xrd: ReadPartialAnswer: Server
> [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
> 091217 16:48:04 15961 Xrd: CheckErrorStatus: Server
> [atlas-bkp1.cs.wisc.edu:1094] requested 10 seconds of wait
> 091217 16:48:14 15961 Xrd: SendGenCommand: Sending command Open
>
>
> ================= DUMPING CLIENT REQUEST HEADER =================
>               ClientHeader.streamid = 0x01 0x00
>              ClientHeader.requestid = kXR_open (3010)
>              ClientHeader.open.mode = 0x00 0x00
>           ClientHeader.open.options = 0x40 0x04
>          ClientHeader.open.reserved = 0 repeated 12 times
>            ClientHeader.header.dlen = 41
> =================== END CLIENT HEADER DUMPING ===================
>
> 091217 16:48:14 15961 Xrd: WriteRaw: Writing 24 bytes to physical connection
> 091217 16:48:14 15961 Xrd: WriteRaw: Writing to substreamid 0
> 091217 16:48:14 15961 Xrd: WriteRaw: Writing 41 bytes to physical connection
> 091217 16:48:14 15961 Xrd: WriteRaw: Writing to substreamid 0
> 091217 16:48:14 15961 Xrd: ReadPartialAnswer: Reading a
> XrdClientMessage from the server [atlas-bkp1.cs.wisc.edu:1094]...
> 091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw:  sid: 1, IsAttn:
> 0, substreamid: 0
> 091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: Reading data (4
> bytes) from substream 0
> 091217 16:48:14 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
> 091217 16:48:14 15961 Xrd: BuildMessage:  posting id 1
> 091217 16:48:14 15961 Xrd: XrdClientMessage::ReadRaw: Reading header (8
> bytes).
> 091217 16:48:14 15961 Xrd: ReadRaw: Reading from atlas-bkp1.cs.wisc.edu:1094
>
>
> ======== DUMPING SERVER RESPONSE HEADER ========
>     ServerHeader.streamid = 0x01 0x00
>       ServerHeader.status = kXR_wait (4005)
>         ServerHeader.dlen = 4
> ========== END DUMPING SERVER HEADER ===========
>
> 091217 16:48:14 15961 Xrd: ReadPartialAnswer: Server
> [atlas-bkp1.cs.wisc.edu:1094] answered [kXR_wait] (4005)
> 091217 16:48:14 15961 Xrd: SendGenCommand: Max time limit elapsed for
> request  kXR_open. Aborting command.
> Last server error 10000 ('')
> Error accessing path/file for
> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>
>
> Wen
>
> On Thu, Dec 17, 2009 at 11:27 PM, Fabrizio Furano <[log in to unmask]> wrote:
>>
>> Hi Wen,
>>
>> I see that you are getting error 10000, which means "generic error before
>> any interaction". Could you please run the same command with debug level 3
>> and post the log with the same kind of issue? Something like
>>
>> xrdcp -d 3 ....
>>
>> Most likely this time the problem is different. I may be wrong here, but a
>> possible reason for that error is that the servers require authentication
>> and xrdcp does not find some library in the LD_LIBRARY_PATH.
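>>
>> For instance (just an illustration; the library directory below is only an
>> example, point it at wherever your xrootd build installed its libraries):
>>
>>    export LD_LIBRARY_PATH=/opt/xrootd/lib:$LD_LIBRARY_PATH
>>    xrdcp -d 3 root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131 /tmp/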
>>
>> Fabrizio
>>
>>
>> wen guan ha scritto:
>>>
>>> Hi Andy,
>>>
>>> I put new logs on the web site.
>>>
>>> It still doesn't work. I cannot copy files in or out.
>>>
>>> It seems the xrootd daemon at atlas-bkp1 hasn't talked to the cmsd.
>>> Normally when the xrootd daemon tries to copy a file, I should see
>>> "do_Select: filename" in the cms.log, but in this cms.log there is
>>> nothing from atlas-bkp1.
>>>
>>> (*)
>>> [root@atlas-bkp1 ~]# xrdcp
>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>> /tmp/
>>> Last server error 10000 ('')
>>> Error accessing path/file for
>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>> [root@atlas-bkp1 ~]# xrdcp /bin/mv
>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123
>>> 133
>>>
>>>
>>> Wen
>>>
>>> On Thu, Dec 17, 2009 at 10:54 PM, Andrew Hanushevsky <[log in to unmask]>
>>> wrote:
>>>>
>>>> Hi Wen,
>>>>
>>>> I reviewed the log files. The only oddity is the redirect of c131 at
>>>> 17:47:25, which I can't comment on because its logs on the web site do not
>>>> overlap with the manager or supervisor. Unless all the logs cover the full
>>>> time in question I can't say much of anything. Can you provide me with
>>>> inclusive logs?
>>>>
>>>> atlas-bkp1 cms: 17:20:57 to 17:42:19 xrd: 17:20:57 to 17:40:57
>>>> higgs07 cms & xrd 17:22:33 to 17:42:33
>>>> c131 cms & xrd 17:31:57 to 17:47:28
>>>>
>>>> That said, it certainly looks like things were working and files were
>>>> being accessed and discovered on all the machines. You were even able to
>>>> open /atlas/xrootd/users/wguan/test/test98123313 though not
>>>> /atlas/xrootd/users/wguan/test/test123131. The other issue is that you did
>>>> not specify a stable adminpath, and the adminpath defaults to /tmp. If you
>>>> have a "cleanup" script that runs periodically for /tmp then eventually
>>>> your cluster will go catatonic as important (but not often used) files are
>>>> deleted by that script. Could you please find a stable home for the
>>>> adminpath?
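>>>>
>>>> For example, something along these lines in the config (the directory
>>>> below is just an illustration; any stable local path that is not touched
>>>> by cleanup scripts will do):
>>>>
>>>>    all.adminpath /var/spool/xrootd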
>>>>
>>>> I reran my tests here and things worked as expected. I will ramp up some
>>>> more tests. So, what is your status today?
>>>>
>>>> Andy
>>>>
>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>> Cc: <[log in to unmask]>
>>>> Sent: Thursday, December 17, 2009 5:05 AM
>>>> Subject: Re: xrootd with more than 65 machines
>>>>
>>>>
>>>> Hi Andy,
>>>>
>>>> Yes. I am using the file downloaded from
>>>> http://www.slac.stanford.edu/~abh/cmsd/ which I compiled yesterday. I
>>>> just now compiled it again and compared it with the one I compiled
>>>> yesterday; they are the same (same md5sum).
>>>>
>>>> Wen
>>>>
>>>> On Thu, Dec 17, 2009 at 2:09 AM, Andrew Hanushevsky <[log in to unmask]>
>>>> wrote:
>>>>>
>>>>> Hi Wen,
>>>>>
>>>>> If c131 cannot connect then either c131 does not have the new cmsd or
>>>>> atlas-bkp1 does not have the new cmsd, as that is exactly what would
>>>>> happen if either were true. Looking at the log on c131, it would appear
>>>>> that atlas-bkp1 is still using the old cmsd as the response data length
>>>>> is wrong. Could you please verify?
>>>>>
>>>>> Andy
>>>>>
>>>>> ----- Original Message ----- From: "wen guan" <[log in to unmask]>
>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>> Cc: <[log in to unmask]>
>>>>> Sent: Wednesday, December 16, 2009 3:58 PM
>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>
>>>>>
>>>>> Hi Andy,
>>>>>
>>>>> I tried it, but there are still some problems. I put the logs in
>>>>> higgs03.cs.wisc.edu/wguan/
>>>>>
>>>>> In my test, c131 is the 65th node to be added to the manager. I can
>>>>> copy a file into the pool through the manager, but I cannot copy out a
>>>>> file that is on c131.
>>>>>
>>>>> In c131's cms.log, I see "Manager:
>>>>> manager.0:[log in to unmask] removed; redirected" again and
>>>>> again, and I cannot see anything about c131 in higgs07's (supervisor)
>>>>> log. Does it mean the manager tries to redirect it to higgs07, but c131
>>>>> never tries to connect to higgs07 and only tries to connect to the
>>>>> manager again?
>>>>>
>>>>> (*)
>>>>> [root@c131 ~]# xrdcp /bin/mv
>>>>> root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
>>>>> Last server error 10000 ('')
>>>>> Error accessing path/file for
>>>>> root://atlas-bkp1//atlas/xrootd/users/wguan/test/test9812331
>>>>> [root@c131 ~]# xrdcp /bin/mv
>>>>>
>>>>>
>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123311
>>>>> [xrootd] Total 0.06 MB |====================| 100.00 % [3.1 MB/s]
>>>>> [root@c131 ~]# xrdcp /bin/mv
>>>>>
>>>>>
>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123312
>>>>> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
>>>>> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/
>>>>> test123131
>>>>> [root@c131 ~]# xrdcp
>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>> /tmp/
>>>>> Last server error 3011 ('No servers are available to read the file.')
>>>>> Error accessing path/file for
>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>> [root@c131 ~]# ls /atlas/xrootd/users/wguan/test/test123131
>>>>> /atlas/xrootd/users/wguan/test/test123131
>>>>> [root@c131 ~]# xrdcp
>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>> /tmp/
>>>>> Last server error 3011 ('No servers are available to read the file.')
>>>>> Error accessing path/file for
>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>> [root@c131 ~]# xrdcp /bin/mv
>>>>>
>>>>>
>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test98123313
>>>>> [xrootd] Total 0.06 MB |====================| 100.00 % [inf MB/s]
>>>>> [root@c131 ~]# xrdcp
>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>> /tmp/
>>>>> Last server error 3011 ('No servers are available to read the file.')
>>>>> Error accessing path/file for
>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>> [root@c131 ~]# xrdcp
>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>> /tmp/
>>>>> Last server error 3011 ('No servers are available to read the file.')
>>>>> Error accessing path/file for
>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>> [root@c131 ~]# xrdcp
>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>> /tmp/
>>>>> Last server error 3011 ('No servers are available to read the file.')
>>>>> Error accessing path/file for
>>>>> root://atlas-bkp1.cs.wisc.edu//atlas/xrootd/users/wguan/test/test123131
>>>>> [root@c131 ~]# tail -f /var/log/xrootd/cms.log
>>>>> 091216 17:45:52 3103 manager.0:[log in to unmask] XrdLink:
>>>>> Setting ref to 2+-1 post=0
>>>>> 091216 17:45:55 3103 Pander trying to connect to lvl 0
>>>>> atlas-bkp1.cs.wisc.edu:3121
>>>>> 091216 17:45:55 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
>>>>> 091216 17:45:55 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
>>>>> 091216 17:45:55 3103 ManTree: Now connected to 3 root node(s)
>>>>> 091216 17:45:55 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
>>>>> 091216 17:45:55 3103 Dispatch manager.0:[log in to unmask] for
>>>>> try
>>>>> dlen=3
>>>>> 091216 17:45:55 3103 manager.0:[log in to unmask] do_Try:
>>>>> 091216 17:45:55 3103 Remove completed atlas-bkp1.cs.wisc.edu manager
>>>>> 0.95
>>>>> 091216 17:45:55 3103 Manager: manager.0:[log in to unmask]
>>>>> removed; redirected
>>>>> 091216 17:46:04 3103 Pander trying to connect to lvl 0
>>>>> atlas-bkp1.cs.wisc.edu:3121
>>>>> 091216 17:46:04 3103 XrdInet: Connected to atlas-bkp1.cs.wisc.edu:3121
>>>>> 091216 17:46:04 3103 Add atlas-bkp1.cs.wisc.edu to manager config; id=0
>>>>> 091216 17:46:04 3103 ManTree: Now connected to 3 root node(s)
>>>>> 091216 17:46:04 3103 Protocol: Logged into atlas-bkp1.cs.wisc.edu
>>>>> 091216 17:46:04 3103 Dispatch manager.0:[log in to unmask] for
>>>>> try
>>>>> dlen=3
>>>>> 091216 17:46:04 3103 Protocol: No buffers to serve
>>>>> atlas-bkp1.cs.wisc.edu
>>>>> 091216 17:46:04 3103 Remove completed atlas-bkp1.cs.wisc.edu manager
>>>>> 0.96
>>>>> 091216 17:46:04 3103 Manager: manager.0:[log in to unmask]
>>>>> removed; insufficient buffers
>>>>> 091216 17:46:11 3103 Dispatch manager.0:[log in to unmask] for
>>>>> state dlen=169
>>>>> 091216 17:46:11 3103 manager.0:[log in to unmask] XrdLink:
>>>>> Setting ref to 1+1 post=0
>>>>>
>>>>> Thanks
>>>>> Wen
>>>>>
>>>>> On Thu, Dec 17, 2009 at 12:10 AM, wen guan <[log in to unmask]>
>>>>> wrote:
>>>>>>
>>>>>> Hi Andy,
>>>>>>
>>>>>>> OK, I understand. As for stalling, too many nodes were deemed to be
>>>>>>> in
>>>>>>> trouble for the manager to allow service resumption.
>>>>>>>
>>>>>>> Please make sure that all of the nodes in the cluster receive the new
>>>>>>> cmsd, as they will drop off with the old one and you'll see the same
>>>>>>> kind of activity. Perhaps the best way to know that you succeeded in
>>>>>>> putting everything in sync is to start with 63 data nodes plus one
>>>>>>> supervisor. Once all connections are established, adding an additional
>>>>>>> server should simply send it to the supervisor.
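>>>>>>>
>>>>>>> (For reference, a minimal sketch of the role setup I mean; the host
>>>>>>> names and "if" clauses below are only an illustration taken from your
>>>>>>> mails, adjust them to your cluster:
>>>>>>>
>>>>>>>    all.manager atlas-bkp1.cs.wisc.edu:3121
>>>>>>>    all.role server
>>>>>>>    all.role supervisor if higgs07.cs.wisc.edu
>>>>>>>
>>>>>>> With the supervisor connected, server number 65 and later should be
>>>>>>> redirected to it automatically.)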
>>>>>>
>>>>>> I will do it.
>>>>>> You said to start 63 data servers and one supervisor. Does that mean
>>>>>> the supervisor is managed using the same policy? If there are 64 data
>>>>>> servers connected before the supervisor, will the supervisor be
>>>>>> dropped? Or does the supervisor have higher priority to be added to
>>>>>> the manager? I mean, if there are already 64 data servers and a
>>>>>> supervisor comes in, will the supervisor be accepted and a data server
>>>>>> be redirected to the supervisor?
>>>>>>
>>>>>> Thanks
>>>>>> Wen
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> But when I tried to xrdcp a file to it, it didn't respond. In
>>>>>>> atlas-bkp1-xrd.log.20091213 it keeps printing "stalling client for 10
>>>>>>> sec", but in cms.log I can't find any message about the file.
>>>>>>>
>>>>>>>> I don't see why you say it doesn't work. With the debugging level set
>>>>>>>> so high, the noise may make it look like something is going wrong, but
>>>>>>>> that isn't necessarily the case.
>>>>>>>>
>>>>>>>> 1) The 'too many subscribers' message is correct. The manager was
>>>>>>>> simply redirecting them because there were already 64 servers.
>>>>>>>> However, in your case the supervisor wasn't started until almost 30
>>>>>>>> minutes after everyone else (i.e., 10:42 AM). Why was that? I'm not
>>>>>>>> surprised about the flurry of messages with a critical component
>>>>>>>> missing for 30 minutes.
>>>>>>>
>>>>>>> Because the manager is a 64-bit machine but the supervisor is a 32-bit
>>>>>>> machine, I had to recompile it. At that time I was interrupted by
>>>>>>> something else.
>>>>>>>
>>>>>>>
>>>>>>>> 2) Once the supervisor started, it started accepting the redirected
>>>>>>>> servers.
>>>>>>>>
>>>>>>>> 3) Then 10 seconds (10:42:10) later the supervisor was restarted.
>>>>>>>> So,
>>>>>>>> that
>>>>>>>> would cause a flurry of activity to occur as there is no backup
>>>>>>>> supervisor
>>>>>>>> to take over.
>>>>>>>>
>>>>>>>> 4) This happened again at 10:42:34 AM then again at 10:48:49. Is the
>>>>>>>> supervisor crashing? Is there a core file?
>>>>>>>>
>>>>>>>> 5) At 11:11 AM the manager restarted. Again, is there a core file
>>>>>>>> here
>>>>>>>> or
>>>>>>>> was this a manual action?
>>>>>>>>
>>>>>>>> During the course of all of this, all connected nodes were operating
>>>>>>>> properly and files were being located.
>>>>>>>>
>>>>>>>> So, the two big questions are:
>>>>>>>>
>>>>>>>> a) Why was the supervisor not started until 30 minutes after the
>>>>>>>> system
>>>>>>>> was
>>>>>>>> started?
>>>>>>>>
>>>>>>>> b) Is there an explanation of the restarts? If this was a crash then
>>>>>>>> we
>>>>>>>> need
>>>>>>>> a core file to figure out what happened.
>>>>>>>
>>>>>>> It's not a crash. There are a couple of reasons why I restarted some
>>>>>>> daemons.
>>>>>>> (1) I thought that if a data server tried many times to connect to a
>>>>>>> redirector but failed, it would not try to connect to the redirector
>>>>>>> again. The supervisor was missing for a long time, so maybe some data
>>>>>>> servers would not try to connect to atlas-bkp1 again. To reactivate
>>>>>>> those data servers, I restarted the servers.
>>>>>>> (2) When I tried to xrdcp, it was hanging for a long time. I thought
>>>>>>> maybe the manager was affected by something else, so I restarted the
>>>>>>> manager to see whether a restart could make the xrdcp work.
>>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>> Wen
>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> ----- Original Message ----- From: "wen guan"
>>>>>>>> <[log in to unmask]>
>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>> Sent: Wednesday, December 16, 2009 9:38 AM
>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Andrew,
>>>>>>>>
>>>>>>>> It still doesn't work.
>>>>>>>> The log files are in higgs03.cs.wisc.edu/wguan/. The names are
>>>>>>>> *.20091216.
>>>>>>>> The manager complains there are too many subscribers and then removes
>>>>>>>> nodes.
>>>>>>>>
>>>>>>>> (*)
>>>>>>>> Add server.10040:[log in to unmask] redirected; too many
>>>>>>>> subscribers.
>>>>>>>>
>>>>>>>> Wen
>>>>>>>>
>>>>>>>> On Wed, Dec 16, 2009 at 4:25 AM, Andrew Hanushevsky
>>>>>>>> <[log in to unmask]>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Wen,
>>>>>>>>>
>>>>>>>>> It will be easier for me to retrofit, as the changes were pretty
>>>>>>>>> minor. Please lift the new XrdCmsNode.cc file from
>>>>>>>>>
>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd
>>>>>>>>>
>>>>>>>>> Andy
>>>>>>>>>
>>>>>>>>> ----- Original Message ----- From: "wen guan"
>>>>>>>>> <[log in to unmask]>
>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>> Sent: Tuesday, December 15, 2009 5:12 PM
>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Andy,
>>>>>>>>>
>>>>>>>>> I can switch to 20091104-1102. Then you don't need to patch
>>>>>>>>> another version. How can I download v20091104-1102?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Wen
>>>>>>>>>
>>>>>>>>> On Wed, Dec 16, 2009 at 12:52 AM, Andrew Hanushevsky
>>>>>>>>> <[log in to unmask]>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Wen,
>>>>>>>>>>
>>>>>>>>>> Ah yes, I see that now. The file I gave you is based on
>>>>>>>>>> v20091104-1102.
>>>>>>>>>> Let
>>>>>>>>>> me see if I can retrofit the patch for you.
>>>>>>>>>>
>>>>>>>>>> Andy
>>>>>>>>>>
>>>>>>>>>> ----- Original Message ----- From: "wen guan"
>>>>>>>>>> <[log in to unmask]>
>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>> Sent: Tuesday, December 15, 2009 1:04 PM
>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Andy,
>>>>>>>>>>
>>>>>>>>>> Which xrootd version are you using? My XrdCmsConfig.hh is different;
>>>>>>>>>> it was downloaded from
>>>>>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/.
>>>>>>>>>>
>>>>>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsNode.cc
>>>>>>>>>> 6fb3ae40fe4e10bdd4d372818a341f2c src/XrdCms/XrdCmsNode.cc
>>>>>>>>>> [root@c121 xrootd]# md5sum src/XrdCms/XrdCmsConfig.hh
>>>>>>>>>> 7d57753847d9448186c718f98e963cbe src/XrdCms/XrdCmsConfig.hh
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Wen
>>>>>>>>>>
>>>>>>>>>> On Tue, Dec 15, 2009 at 10:50 PM, Andrew Hanushevsky
>>>>>>>>>> <[log in to unmask]>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>
>>>>>>>>>>> Just compiled on Linux and it was clean. Something is really
>>>>>>>>>>> wrong
>>>>>>>>>>> with
>>>>>>>>>>> your
>>>>>>>>>>> source files, specifically XrdCmsConfig.cc
>>>>>>>>>>>
>>>>>>>>>>> The MD5 checksums on the relevant files are:
>>>>>>>>>>>
>>>>>>>>>>> MD5 (XrdCmsNode.cc) = 6fb3ae40fe4e10bdd4d372818a341f2c
>>>>>>>>>>>
>>>>>>>>>>> MD5 (XrdCmsConfig.hh) = 4a7d655582a7cd43b098947d0676924b
>>>>>>>>>>>
>>>>>>>>>>> Andy
>>>>>>>>>>>
>>>>>>>>>>> ----- Original Message ----- From: "wen guan"
>>>>>>>>>>> <[log in to unmask]>
>>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>>> Sent: Tuesday, December 15, 2009 4:24 AM
>>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Andy,
>>>>>>>>>>>
>>>>>>>>>>> No problem. Thanks for the fix. But it cannot be compiled. The
>>>>>>>>>>> version I am using is
>>>>>>>>>>> http://xrootd.slac.stanford.edu/download/20091028-1003/.
>>>>>>>>>>>
>>>>>>>>>>> Making cms component...
>>>>>>>>>>> Compiling XrdCmsNode.cc
>>>>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>>>>> XrdCmsNode::do_Chmod(XrdCmsRRData&)':
>>>>>>>>>>> XrdCmsNode.cc:268: error: `fsExec' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:268: warning: unused variable 'fsExec'
>>>>>>>>>>> XrdCmsNode.cc:269: error: 'class XrdCmsConfig' has no member
>>>>>>>>>>> named
>>>>>>>>>>> 'ossFS'
>>>>>>>>>>> XrdCmsNode.cc:273: error: `fsFail' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:273: warning: unused variable 'fsFail'
>>>>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>>>>> XrdCmsNode::do_Mkdir(XrdCmsRRData&)':
>>>>>>>>>>> XrdCmsNode.cc:600: error: `fsExec' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:600: warning: unused variable 'fsExec'
>>>>>>>>>>> XrdCmsNode.cc:601: error: 'class XrdCmsConfig' has no member
>>>>>>>>>>> named
>>>>>>>>>>> 'ossFS'
>>>>>>>>>>> XrdCmsNode.cc:605: error: `fsFail' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:605: warning: unused variable 'fsFail'
>>>>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>>>>> XrdCmsNode::do_Mkpath(XrdCmsRRData&)':
>>>>>>>>>>> XrdCmsNode.cc:640: error: `fsExec' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:640: warning: unused variable 'fsExec'
>>>>>>>>>>> XrdCmsNode.cc:641: error: 'class XrdCmsConfig' has no member
>>>>>>>>>>> named
>>>>>>>>>>> 'ossFS'
>>>>>>>>>>> XrdCmsNode.cc:645: error: `fsFail' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:645: warning: unused variable 'fsFail'
>>>>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>>>>> XrdCmsNode::do_Mv(XrdCmsRRData&)':
>>>>>>>>>>> XrdCmsNode.cc:704: error: `fsExec' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:704: warning: unused variable 'fsExec'
>>>>>>>>>>> XrdCmsNode.cc:705: error: 'class XrdCmsConfig' has no member
>>>>>>>>>>> named
>>>>>>>>>>> 'ossFS'
>>>>>>>>>>> XrdCmsNode.cc:709: error: `fsFail' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:709: warning: unused variable 'fsFail'
>>>>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>>>>> XrdCmsNode::do_Rm(XrdCmsRRData&)':
>>>>>>>>>>> XrdCmsNode.cc:831: error: `fsExec' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:831: warning: unused variable 'fsExec'
>>>>>>>>>>> XrdCmsNode.cc:832: error: 'class XrdCmsConfig' has no member
>>>>>>>>>>> named
>>>>>>>>>>> 'ossFS'
>>>>>>>>>>> XrdCmsNode.cc:836: error: `fsFail' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:836: warning: unused variable 'fsFail'
>>>>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>>>>> XrdCmsNode::do_Rmdir(XrdCmsRRData&)':
>>>>>>>>>>> XrdCmsNode.cc:873: error: `fsExec' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:873: warning: unused variable 'fsExec'
>>>>>>>>>>> XrdCmsNode.cc:874: error: 'class XrdCmsConfig' has no member
>>>>>>>>>>> named
>>>>>>>>>>> 'ossFS'
>>>>>>>>>>> XrdCmsNode.cc:878: error: `fsFail' was not declared in this scope
>>>>>>>>>>> XrdCmsNode.cc:878: warning: unused variable 'fsFail'
>>>>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>>>>> XrdCmsNode::do_Trunc(XrdCmsRRData&)':
>>>>>>>>>>> XrdCmsNode.cc:1377: error: `fsExec' was not declared in this
>>>>>>>>>>> scope
>>>>>>>>>>> XrdCmsNode.cc:1377: warning: unused variable 'fsExec'
>>>>>>>>>>> XrdCmsNode.cc:1378: error: 'class XrdCmsConfig' has no member
>>>>>>>>>>> named
>>>>>>>>>>> 'ossFS'
>>>>>>>>>>> XrdCmsNode.cc:1382: error: `fsFail' was not declared in this
>>>>>>>>>>> scope
>>>>>>>>>>> XrdCmsNode.cc:1382: warning: unused variable 'fsFail'
>>>>>>>>>>> XrdCmsNode.cc: At global scope:
>>>>>>>>>>> XrdCmsNode.cc:1524: error: no `int
>>>>>>>>>>> XrdCmsNode::fsExec(XrdOucProg*,
>>>>>>>>>>> char*, char*)' member function declared in class `XrdCmsNode'
>>>>>>>>>>> XrdCmsNode.cc: In member function `int
>>>>>>>>>>> XrdCmsNode::fsExec(XrdOucProg*,
>>>>>>>>>>> char*, char*)':
>>>>>>>>>>> XrdCmsNode.cc:1533: error: `fsL2PFail1' was not declared in this
>>>>>>>>>>> scope
>>>>>>>>>>> XrdCmsNode.cc:1533: warning: unused variable 'fsL2PFail1'
>>>>>>>>>>> XrdCmsNode.cc:1537: error: `fsL2PFail2' was not declared in this
>>>>>>>>>>> scope
>>>>>>>>>>> XrdCmsNode.cc:1537: warning: unused variable 'fsL2PFail2'
>>>>>>>>>>> XrdCmsNode.cc: At global scope:
>>>>>>>>>>> XrdCmsNode.cc:1553: error: no `const char*
>>>>>>>>>>> XrdCmsNode::fsFail(const
>>>>>>>>>>> char*, const char*, const char*, int)' member function declared
>>>>>>>>>>> in
>>>>>>>>>>> class `XrdCmsNode'
>>>>>>>>>>> XrdCmsNode.cc: In member function `const char*
>>>>>>>>>>> XrdCmsNode::fsFail(const char*, const char*, const char*, int)':
>>>>>>>>>>> XrdCmsNode.cc:1559: error: `fsL2PFail1' was not declared in this
>>>>>>>>>>> scope
>>>>>>>>>>> XrdCmsNode.cc:1559: warning: unused variable 'fsL2PFail1'
>>>>>>>>>>> XrdCmsNode.cc:1560: error: `fsL2PFail2' was not declared in this
>>>>>>>>>>> scope
>>>>>>>>>>> XrdCmsNode.cc:1560: warning: unused variable 'fsL2PFail2'
>>>>>>>>>>> XrdCmsNode.cc: In static member function `static int
>>>>>>>>>>> XrdCmsNode::isOnline(char*, int)':
>>>>>>>>>>> XrdCmsNode.cc:1608: error: 'class XrdCmsConfig' has no member
>>>>>>>>>>> named
>>>>>>>>>>> 'ossFS'
>>>>>>>>>>> make[4]: *** [../../obj/XrdCmsNode.o] Error 1
>>>>>>>>>>> make[3]: *** [Linuxall] Error 2
>>>>>>>>>>> make[2]: *** [all] Error 2
>>>>>>>>>>> make[1]: *** [XrdCms] Error 2
>>>>>>>>>>> make: *** [all] Error 2
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Wen
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Dec 15, 2009 at 2:08 AM, Andrew Hanushevsky
>>>>>>>>>>> <[log in to unmask]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>
>>>>>>>>>>>> I have developed a permanent fix. You will find the source files
>>>>>>>>>>>> in
>>>>>>>>>>>>
>>>>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>>>>>
>>>>>>>>>>>> There are three files: XrdCmsCluster.cc XrdCmsNode.cc
>>>>>>>>>>>> XrdCmsProtocol.cc
>>>>>>>>>>>>
>>>>>>>>>>>> Please do a source replacement and recompile. Unfortunately, the
>>>>>>>>>>>> cmsd
>>>>>>>>>>>> will
>>>>>>>>>>>> need to be replaced on each node regardless of role. My
>>>>>>>>>>>> apologies
>>>>>>>>>>>> for
>>>>>>>>>>>> the
>>>>>>>>>>>> disruption. Please let me know how it goes.
>>>>>>>>>>>>
>>>>>>>>>>>> Andy
>>>>>>>>>>>>
>>>>>>>>>>>> ----- Original Message ----- From: "wen guan"
>>>>>>>>>>>> <[log in to unmask]>
>>>>>>>>>>>> To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>>>>>> Cc: <[log in to unmask]>
>>>>>>>>>>>> Sent: Sunday, December 13, 2009 7:04 AM
>>>>>>>>>>>> Subject: Re: xrootd with more than 65 machines
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>> I used the new cmsd on the atlas-bkp1 manager, but it's still
>>>>>>>>>>>> dropping nodes. And in the supervisor's log I cannot find any data
>>>>>>>>>>>> server registering to it.
>>>>>>>>>>>>
>>>>>>>>>>>> The new logs are in http://higgs03.cs.wisc.edu/wguan/*.20091213.
>>>>>>>>>>>> The manager is patched at 091213 08:38:15.
>>>>>>>>>>>>
>>>>>>>>>>>> Wen
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Dec 13, 2009 at 1:52 AM, Andrew Hanushevsky
>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Wen
>>>>>>>>>>>>>
>>>>>>>>>>>>> You will find the source replacement at:
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://www.slac.stanford.edu/~abh/cmsd/
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's XrdCmsCluster.cc and it replaces
>>>>>>>>>>>>> xrootd/src/XrdCms/XrdCmsCluster.cc
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm stepping out for a couple of hours but will be back to see
>>>>>>>>>>>>> how
>>>>>>>>>>>>> things
>>>>>>>>>>>>> went. Sorry for the issues :-(
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I prefer a source replacement. Then I can compile it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I can do one of two things here:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) Supply a source replacement and then you would recompile,
>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2) Give me the uname -a of where the cmsd will run and I'll
>>>>>>>>>>>>>>> supply
>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>> binary
>>>>>>>>>>>>>>> replacement for you.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Your choice.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sun, 13 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Andrew
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The problem is found. Great. Thanks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Where can I find the patched cmsd?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 11:36 PM, Andrew Hanushevsky
>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I found the problem. Looks like a regression from way back
>>>>>>>>>>>>>>>>> when.
>>>>>>>>>>>>>>>>> There
>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>> missing flag on the redirect. This will require a patched
>>>>>>>>>>>>>>>>> cmsd
>>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>>> only to replace the redirector's cmsd as this only affects
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> redirector.
>>>>>>>>>>>>>>>>> How would you like to proceed?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> It doesn't work. The atlas-bkp1 manager is still dropping nodes
>>>>>>>>>>>>>>>>>> again, and in the supervisor I still haven't seen any data server
>>>>>>>>>>>>>>>>>> registered. I said "I updated the ntp" because you said "the log
>>>>>>>>>>>>>>>>>> timestamps do not overlap".
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 9:33 PM, Andrew Hanushevsky
>>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Do you mean that everything is now working? It could be
>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>> removed
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> xrd.timeout directive. That really could cause problems.
>>>>>>>>>>>>>>>>>>> As
>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> delays,
>>>>>>>>>>>>>>>>>>> that is normal when the redirector thinks something is
>>>>>>>>>>>>>>>>>>> going
>>>>>>>>>>>>>>>>>>> wrong.
>>>>>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>>>> strategy is to delay clients until it can get back to a
>>>>>>>>>>>>>>>>>>> stable
>>>>>>>>>>>>>>>>>>> configuration. This usually prevents jobs from crashing
>>>>>>>>>>>>>>>>>>> during
>>>>>>>>>>>>>>>>>>> stressful
>>>>>>>>>>>>>>>>>>> periods.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I restarted it to do the supervisor test, and also because the
>>>>>>>>>>>>>>>>>>>> xrootd manager frequently doesn't respond. (*) is the cms.log; the
>>>>>>>>>>>>>>>>>>>> file select is delayed again and again. When I do a restart, all
>>>>>>>>>>>>>>>>>>>> things are fine. Now I am trying to find a clue about it.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (*)
>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> do_Select:
>>>>>>>>>>>>>>>>>>>> wc
>>>>>>>>>>>>>>>>>>>> /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Select seeking
>>>>>>>>>>>>>>>>>>>> /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 UnkFile rc=1
>>>>>>>>>>>>>>>>>>>> path=/atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 slot3.14949:[log in to unmask]
>>>>>>>>>>>>>>>>>>>> do_Select:
>>>>>>>>>>>>>>>>>>>> delay 5
>>>>>>>>>>>>>>>>>>>> /atlas/xrootd/users/fang/MC8.108004.PythiaPhotonJet4.7TeV.e444_s479_r635_dmp81_tid001090/LOG/dig.001090._000066.log.2
>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 2+-1
>>>>>>>>>>>>>>>>>>>> post=0
>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 Dispatch
>>>>>>>>>>>>>>>>>>>> redirector.21313:14@atlas-bkp2
>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>> select dlen=166
>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdLink: Setting link ref to 1+1
>>>>>>>>>>>>>>>>>>>> post=0
>>>>>>>>>>>>>>>>>>>> 091212 00:00:19 21318 XrdSched: running redirector inq=0
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> There is no core file. I copied a new copies of the logs
>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> link
>>>>>>>>>>>>>>>>>>>> below.
>>>>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Sat, Dec 12, 2009 at 3:16 AM, Andrew Hanushevsky
>>>>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I see in the server log that it is restarting often.
>>>>>>>>>>>>>>>>>>>>> Could
>>>>>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>>>> take
>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>> look
>>>>>>>>>>>>>>>>>>>>> in the c193 to see if you have any core files? Also
>>>>>>>>>>>>>>>>>>>>> please
>>>>>>>>>>>>>>>>>>>>> make
>>>>>>>>>>>>>>>>>>>>> sure
>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>> core files are enabled as Linux defaults the size to 0.
>>>>>>>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>>>>>> first
>>>>>>>>>>>>>>>>>>>>> step
>>>>>>>>>>>>>>>>>>>>> here
>>>>>>>>>>>>>>>>>>>>> is to find out why your servers are restarting.
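>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> (To enable core files, something like the following in the startup
>>>>>>>>>>>>>>>>>>>>> environment before the daemons are launched should do it; this is
>>>>>>>>>>>>>>>>>>>>> just the usual shell way of lifting the limit:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>    ulimit -c unlimited
>>>>>>>>>>>>>>>>>>>>> )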
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Sat, 12 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> the logs can be found here. From the log you can see
>>>>>>>>>>>>>>>>>>>>>> atlas-bkp1
>>>>>>>>>>>>>>>>>>>>>> manager are dropping nodes again and again which tries
>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>> connect
>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>>>>>>>>> http://higgs03.cs.wisc.edu/wguan/
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 11:41 PM, Andrew Hanushevsky
>>>>>>>>>>>>>>>>>>>>>> <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Wen, Could you start everything up and provide me
>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>> pointer
>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> manager log file, supervisor log file, and one data
>>>>>>>>>>>>>>>>>>>>>>> server
>>>>>>>>>>>>>>>>>>>>>>> logfile
>>>>>>>>>>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>> which cover the same time-frame (from start to some
>>>>>>>>>>>>>>>>>>>>>>> point
>>>>>>>>>>>>>>>>>>>>>>> where
>>>>>>>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>>>>>>>>>> things are working or not). That way I can see what
>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>> happening.
>>>>>>>>>>>>>>>>>>>>>>> At
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> moment I only see two "bad" things in the config
>>>>>>>>>>>>>>>>>>>>>>> file:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> 1) Only atlas-bkp1.cs.wisc.edu is designated as a
>>>>>>>>>>>>>>>>>>>>>>> manager
>>>>>>>>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>>>>>>>>>> claim,
>>>>>>>>>>>>>>>>>>>>>>> via
>>>>>>>>>>>>>>>>>>>>>>> the all.manager directive, that there are three (bkp2
>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>> bkp3).
>>>>>>>>>>>>>>>>>>>>>>> While
>>>>>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>>>> should work, the log file will be dense with error
>>>>>>>>>>>>>>>>>>>>>>> messages.
>>>>>>>>>>>>>>>>>>>>>>> Please
>>>>>>>>>>>>>>>>>>>>>>> correct
>>>>>>>>>>>>>>>>>>>>>>> this to be consistent and make it easier to see real
>>>>>>>>>>>>>>>>>>>>>>> errors.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> This is not a problem for me, because this config is used on the
>>>>>>>>>>>>>>>>>>>>>> data servers. On the manager, I changed the "if
>>>>>>>>>>>>>>>>>>>>>> atlas-bkp1.cs.wisc.edu" to atlas-bkp2 or something. This is a
>>>>>>>>>>>>>>>>>>>>>> historical leftover: at first only atlas-bkp1 was used; atlas-bkp2
>>>>>>>>>>>>>>>>>>>>>> and atlas-bkp3 were added later.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> 2) Please use cms.space not olb.space (for historical
>>>>>>>>>>>>>>>>>>>>>>> reasons
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> latter
>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>> still accepted and over-rides the former, but that
>>>>>>>>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>>>>> soon
>>>>>>>>>>>>>>>>>>>>>>> end),
>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>> please use only one (the config file uses both
>>>>>>>>>>>>>>>>>>>>>>> directives).
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Yes, I should remove that line. In fact cms.space is in the cfg too.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> The xrootd has an internal mechanism to connect
>>>>>>>>>>>>>>>>>>>>>>> servers
>>>>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>> supervisors
>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> allow for maximum reliability. You cannot change that
>>>>>>>>>>>>>>>>>>>>>>> algorithm
>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>> no need to do so. You should *never* tell anyone to
>>>>>>>>>>>>>>>>>>>>>>> directly
>>>>>>>>>>>>>>>>>>>>>>> connect
>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>> supervisor. If you do, you will likely get
>>>>>>>>>>>>>>>>>>>>>>> unreachable
>>>>>>>>>>>>>>>>>>>>>>> nodes.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> As for dropping data servers, it would appear to me,
>>>>>>>>>>>>>>>>>>>>>>> given
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> flurry
>>>>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>>> such activity, that something either crashed or was
>>>>>>>>>>>>>>>>>>>>>>> restarted.
>>>>>>>>>>>>>>>>>>>>>>> That's
>>>>>>>>>>>>>>>>>>>>>>> why
>>>>>>>>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>>>>>> would be good to see the complete log of each one of
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> entities.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I read the document and wrote a config
>>>>>>>>>>>>>>>>>>>>>>>> file (http://wisconsin.cern.ch/~wguan/xrdcluster.cfg).
>>>>>>>>>>>>>>>>>>>>>>>> With my conf, I can see the manager dispatching messages to the
>>>>>>>>>>>>>>>>>>>>>>>> supervisor, but I cannot see any data server trying to connect to
>>>>>>>>>>>>>>>>>>>>>>>> the supervisor. At the same time, in the manager's log, I can see
>>>>>>>>>>>>>>>>>>>>>>>> some data servers being dropped.
>>>>>>>>>>>>>>>>>>>>>>>> How does xrootd decide which data servers will connect to the
>>>>>>>>>>>>>>>>>>>>>>>> supervisor? Should I specify some data servers to connect to the
>>>>>>>>>>>>>>>>>>>>>>>> supervisor?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> (*) supervisor log
>>>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 Dispatch
>>>>>>>>>>>>>>>>>>>>>>>> manager.0:20@atlas-bkp2
>>>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>> state
>>>>>>>>>>>>>>>>>>>>>>>> dlen=42
>>>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2
>>>>>>>>>>>>>>>>>>>>>>>> do_State:
>>>>>>>>>>>>>>>>>>>>>>>> /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>>>>>>>>>>>> 091211 15:07:00 30028 manager.0:20@atlas-bkp2
>>>>>>>>>>>>>>>>>>>>>>>> do_StateFWD:
>>>>>>>>>>>>>>>>>>>>>>>> Path
>>>>>>>>>>>>>>>>>>>>>>>> find
>>>>>>>>>>>>>>>>>>>>>>>> failed for state
>>>>>>>>>>>>>>>>>>>>>>>> /atlas/xrootd/users/wguan/test/test131141
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> (*)manager log
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>>>>>> TSpace=5587GB
>>>>>>>>>>>>>>>>>>>>>>>> NumFS=1
>>>>>>>>>>>>>>>>>>>>>>>> FSpace=5693644MB MinFR=57218MB Util=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c185.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>>>>>> adding
>>>>>>>>>>>>>>>>>>>>>>>> path:
>>>>>>>>>>>>>>>>>>>>>>>> w
>>>>>>>>>>>>>>>>>>>>>>>> /atlas
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661
>>>>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>>>>> do_Space: 5696231MB free; 0% util
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>>>>>>>>>>>>>>>>>>> server.10585:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 001 XrdInet: Accepted connection
>>>>>>>>>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>>>>>>> [log in to unmask]
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: running
>>>>>>>>>>>>>>>>>>>>>>>> ?:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>>>>> inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdProtocol: matched protocol
>>>>>>>>>>>>>>>>>>>>>>>> cmsd
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 ?:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>>>>>> 79
>>>>>>>>>>>>>>>>>>>>>>>> attached
>>>>>>>>>>>>>>>>>>>>>>>> to poller 2; num=22
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add
>>>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>>>>> bumps
>>>>>>>>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 #63
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Drop_Node:
>>>>>>>>>>>>>>>>>>>>>>>> server.15905:[log in to unmask]:1094 dropped.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Add Shoved
>>>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 to cluster;
>>>>>>>>>>>>>>>>>>>>>>>> id=63.78;
>>>>>>>>>>>>>>>>>>>>>>>> num=64;
>>>>>>>>>>>>>>>>>>>>>>>> min=51
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>>>>>> TSpace=5587GB
>>>>>>>>>>>>>>>>>>>>>>>> NumFS=1
>>>>>>>>>>>>>>>>>>>>>>>> FSpace=5721854MB MinFR=57218MB Util=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Admit c187.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>>>>>> adding
>>>>>>>>>>>>>>>>>>>>>>>> path:
>>>>>>>>>>>>>>>>>>>>>>>> w
>>>>>>>>>>>>>>>>>>>>>>>> /atlas
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661
>>>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>>>>> do_Space: 5721854MB free; 0% util
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 logged in.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdLink: Unable to recieve
>>>>>>>>>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>>>>>>> c187.chtc.wisc.edu; connection reset by peer
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 XrdSched: scheduling drop node
>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>> 60
>>>>>>>>>>>>>>>>>>>>>>>> seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]:1094 node 63.78
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661 Protocol:
>>>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:24 15661
>>>>>>>>>>>>>>>>>>>>>>>> server.21739:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>>>>>> 79 detached from poller 2; num=21
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch
>>>>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>>>>> for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094
>>>>>>>>>>>>>>>>>>>>>>>> do_Status:
>>>>>>>>>>>>>>>>>>>>>>>> suspend
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c177.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>>>>>> service
>>>>>>>>>>>>>>>>>>>>>>>> suspended
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data
>>>>>>>>>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>>>>>>> c177.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>>>>>> FD=16
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>>>>>> server.24718:[log in to unmask]:1094 node 0.3
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol:
>>>>>>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>>>>> logged
>>>>>>>>>>>>>>>>>>>>>>>> out.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661
>>>>>>>>>>>>>>>>>>>>>>>> server.21656:[log in to unmask]
>>>>>>>>>>>>>>>>>>>>>>>> XrdPoll:
>>>>>>>>>>>>>>>>>>>>>>>> FD
>>>>>>>>>>>>>>>>>>>>>>>> 16 detached from poller 2; num=20
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data
>>>>>>>>>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>>>>>>> c179.chtc.wisc.edu
>>>>>>>>>>>>>>>>>>>>>>>> FD=21
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node
>>>>>>>>>>>>>>>>>>>>>>>> server.17065:[log in to unmask]:1094 node 1.4
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.7978:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7978:[log in to unmask] XrdPoll: FD 21 detached from poller 1; num=21
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 State: Status changed to suspended
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Send status to redirector.15656:14@atlas-bkp2
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.12937:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.12937:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c182.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c182.chtc.wisc.edu FD=19
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.12937:[log in to unmask]:1094 node 7.10
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.26620:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.26620:[log in to unmask] XrdPoll: FD 19 detached from poller 2; num=19
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.10842:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.10842:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c178.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c178.chtc.wisc.edu FD=15
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.10842:[log in to unmask]:1094 node 9.12
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.11901:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.11901:[log in to unmask] XrdPoll: FD 15 detached from poller 1; num=20
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.5535:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.5535:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c181.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c181.chtc.wisc.edu FD=17
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.5535:[log in to unmask]:1094 node 5.8
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.13984:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.13984:[log in to unmask] XrdPoll: FD 17 detached from poller 0; num=21
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.23711:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.23711:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c183.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c183.chtc.wisc.edu FD=22
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.23711:[log in to unmask]:1094 node 8.11
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.27735:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.27735:[log in to unmask] XrdPoll: FD 22 detached from poller 2; num=18
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c184.chtc.wisc.edu FD=20
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.4131:[log in to unmask]:1094 node 3.6
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.26787:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.26787:[log in to unmask] XrdPoll: FD 20 detached from poller 0; num=20
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.10585:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.10585:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c185.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c185.chtc.wisc.edu FD=23
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.10585:[log in to unmask]:1094 node 6.9
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.8524:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.8524:[log in to unmask] XrdPoll: FD 23 detached from poller 0; num=19
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.20264:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.20264:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c180.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c180.chtc.wisc.edu FD=18
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.20264:[log in to unmask]:1094 node 4.7
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.14636:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.14636:[log in to unmask] XrdPoll: FD 18 detached from poller 1; num=19
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Dispatch server.1656:[log in to unmask]:1094 for status dlen=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.1656:[log in to unmask]:1094 do_Status: suspend
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=-1 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Node: c186.chtc.wisc.edu service suspended
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 XrdLink: No RecvAll() data from c186.chtc.wisc.edu FD=24
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Update Counts Parm1=0 Parm2=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Remove_Node server.1656:[log in to unmask]:1094 node 2.5
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 Protocol: server.7849:[log in to unmask] logged out.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:13:27 15661 server.7849:[log in to unmask] XrdPoll: FD 24 detached from poller 1; num=18
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:14 15661 XrdSched: scheduling drop node in 13 seconds
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.66 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.68 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.69 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.67 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.70 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.71 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.72 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.73 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.74 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.75 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=1
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.76 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: Now have 68 workers
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 Drop_Node 63.77 cancelled.
>>>>>>>>>>>>>>>>>>>>>>>> 091211 04:14:24 15661 XrdSched: running drop node inq=0
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 11, 2009 at 9:50 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Wen,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> To go past 64 data servers you will need to set up one or more
>>>>>>>>>>>>>>>>>>>>>>>>> supervisors. This does not logically change the current configuration
>>>>>>>>>>>>>>>>>>>>>>>>> you have. You only need to configure one or more *new* servers (or at
>>>>>>>>>>>>>>>>>>>>>>>>> least xrootd processes) whose role is supervisor. We'd like them to run
>>>>>>>>>>>>>>>>>>>>>>>>> on separate machines for reliability purposes, but they could run on the
>>>>>>>>>>>>>>>>>>>>>>>>> manager node as long as you give each one a unique instance name (i.e.,
>>>>>>>>>>>>>>>>>>>>>>>>> the -n option).
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> The front part of the cmsd reference explains how to do this.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> http://xrootd.slac.stanford.edu/doc/prod/cms_config.htm
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, 11 Dec 2009, wen guan wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Is there any change needed to configure xrootd with more than 65
>>>>>>>>>>>>>>>>>>>>>>>>>> machines? I used the configuration below but it doesn't work. Should
>>>>>>>>>>>>>>>>>>>>>>>>>> I configure some machines' manager to be a supervisor?
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> http://wisconsin.cern.ch/~wguan/xrdcluster.cfg
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Wen
>>>>>>>>>>>>>>>>>>>>>>>>>>
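
As a concrete illustration of the supervisor setup Andy describes above, a
minimal sketch follows. The host name, the port (3121 is the usual cmsd
default), the instance name "super", and the file path are illustrative
assumptions only, not values taken from the actual xrdcluster.cfg; the
cms_config reference linked above is the authoritative description of these
directives.

   # Additions for the supervisor node(s) only; the existing manager and
   # data-server roles stay as they are. Host and paths are hypothetical.
   all.manager  atlas-redirector.example.org:3121   # the redirector's cmsd
   all.role     supervisor

   # Start the supervisor pair under a unique instance name (-n), as noted:
   xrootd -n super -c /opt/xrootd/etc/xrdcluster.cfg &
   cmsd   -n super -c /opt/xrootd/etc/xrdcluster.cfg &

Once the supervisor's cmsd registers with the redirector, data servers beyond
the 64 that the redirector can hold directly should be handed off to the
supervisor automatically, with no change to the existing data-server
configurations.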