Hi Andy,

On 06/25/11 17:26, Andrew Hanushevsky wrote:
> Hi Matevz,
> 
> Please look at the 'cms.delay servers' directive. It would appear that your
> threshold for avoiding the Amazon cloud melt-down effect is set too low. There
> is a warning in the reference that the default value may, in fact, work against
> you if you don't have a sufficiently large cluster.

Thanks, I'll look it up and set it to something more aggressive. How come it
didn't recover automatically?

It is also true that the same machines (uaf-X) are used for interactive logins
and have been loaded pretty badly for the last couple of weeks.

> 
> Andy
> 
> P.S. I hope you realize that the level of tracing is likely reducing performance
> by at least 40%.

You can tell this from those log fragments? I already toned down the
monitoring / reporting rates that Brian was using by default -- but I don't
think the change has been propagated to all the sites yet (we want to get the
user info sorted out first).

Cheers,
Matevz

> On Sat, 25 Jun 2011, Matevz Tadel wrote:
> 
>> Hi,
>>
>> I noticed this in the log of an xrootd manager (xrootd.t2.ucsd.edu):
>>
>> 110625 11:26:59 10971 Receive xrootd 23 bytes on 10550270
>> 110625 11:26:59 10971 Decode xrootd redirects
>> uscmsPoo.11911:[log in to unmask] to uaf-6.t2.ucsd.edu:1094
>> /store/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/8EFCAE37-687B-E011-BAEA-001A92971B8E.root
>> 110625 11:26:59 10971 uscmsPoo.11911:[log in to unmask] XrootdProtocol:
>> redirecting to uaf-6.t2.ucsd.edu:1094
>> 110625 11:27:15 10971 Receive xrootd 0 bytes on 0
>> 110625 11:27:15 10971 setStatus xrootd.t2.ucsd.edu sent suspend event
>> 110625 11:27:15 10971 cms_setStatus: Manager xrootd.t2.ucsd.edu suspended
>> 110625 11:29:12 10971 nagios.8294:[log in to unmask] XrootdProtocol: more auth
>> requested; sz=2070
>> 110625 11:29:12 10971 XrootdXeq: nagios.8294:[log in to unmask] login as
>> 92a2e9e2.0
>> 110625 11:29:12 10971 nagios.8294:[log in to unmask] XrootdProtocol: stalling
>> client for 10 sec
>> 110625 11:29:22 10971 nagios.8294:[log in to unmask] XrootdProtocol: stalling
>> client for 10 sec
>> 110625 11:29:32 10971 nagios.8294:[log in to unmask] XrootdProtocol: stalling
>> client for 10 sec
>> 110625 11:29:42 10971 nagios.8294:[log in to unmask] XrootdProtocol: stalling
>> client for 10 sec
>>
>> and then it keeps stalling the clients. What happened here? The cmsd log
>> fragment from around this time is below.
>>
>> At about this time one of the servers, uaf-3, was shut down because, as far
>> as we can tell so far, its system disk was corrupted.
>>
>> The restart of xrootd/cmsd fixed the problem.
>>
>> Cheers,
>> Matevz
>>
>>
>> 110625 11:27:11 19518 Dispatch manager.0:[log in to unmask] for state dlen=204
>> 110625 11:27:11 19518
>> tore/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/DE6E36B9-A67B-E011-BC19-001A92810AEE.root
>> do_State:
>> /store/data/Run2011A/DoubleMu/AOD/PromptReco-v4/000/165/548/44682DE0-6387-E011-B50E-001617E30F48.root,/store/data/Run2011A/DoubleMu/AOD/PromptReco-v4/000/165/472/22826903-2C86-E011-810E-003048F1C420.root
>>
>> 110625 11:27:15 19518 Update Counts Parm1=-1 Parm2=0
>> 110625 11:27:15 19518 Remove_Node server.2524:22@uaf-3:1094 node 4.5
>> 110625 11:27:15 19518 State: Status changed to suspended
>> 110625 11:27:15 19518 Send status to redirector.10971:26@xrootd
>> 110625 11:27:15 19518 Protocol: server.2524:22@uaf-3 logged out; request read
>> failed
>> 110625 11:27:15 19518 Inform xrootd-itb.unl.edu status
>> 110625 11:27:15 19518 Inform xrootd.unl.edu status
>> 110625 11:27:17 19518 Dispatch manager.0:[log in to unmask] for state dlen=96
>> 110625 11:27:17 19518
>> tore/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/DE6E36B9-A67B-E011-BC19-001A92810AEE.root
>> do_State:
>> /store/data/Run2011A/DoubleMu/AOD/May10ReReco-v1/0000/5A1FB529-D67B-E011-8920-0026189438B5.root
>>
>> 110625 11:27:17 19518 Broadcast server.2524:22@uaf-3:1094 is unreachable
>> 110625 11:27:17 19518 Dispatch server.2925:21@uaf-4:1094 for have dlen=96
>> 110625 11:27:17 19518 server.2925:21@uaf-4:1094 do_Have:
>> /store/data/Run2011A/DoubleMu/AOD/May10ReReco-v1/0000/5A1FB529-D67B-E011-8920-0026189438B5.root
>>
>> 110625 11:27:17 19518 Inform xrootd-itb.unl.edu have
>> 110625 11:27:17 19518 Dispatch server.13721:19@uaf-6:1094 for have dlen=96
>> 110625 11:27:17 19518 Inform xrootd.unl.edu have
>> 110625 11:27:17 19518 server.13721:19@uaf-6:1094 do_Have:
>> /store/data/Run2011A/DoubleMu/AOD/May10ReReco-v1/0000/5A1FB529-D67B-E011-8920-0026189438B5.root
>>
>> 110625 11:27:17 19518 Dispatch manager.0:[log in to unmask] for state dlen=108
>> 110625 11:27:17 19518
>> tore/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/DE6E36B9-A67B-E011-BC19-001A92810AEE.root
>> do_State:
>> /store/data/Run2011A/DoubleElectron/AOD/PromptReco-v4/000/166/346/D857F287-D58E-E011-87B0-001D09F2906A.root
>>
>> 110625 11:27:17 19518 Broadcast server.2524:22@uaf-3:1094 is unreachable
>> 110625 11:27:17 19518 Dispatch manager.0:[log in to unmask] for state dlen=102
>> 110625 11:27:17 19518
>> tore/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/DE6E36B9-A67B-E011-BC19-001A92810AEE.root
>> do_State:
>> /store/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/8A376B9B-7A7B-E011-8EA2-001A92810AE4.root
>>
>> 110625 11:27:17 19518 Broadcast server.2524:22@uaf-3:1094 is unreachable
>> 110625 11:27:17 19518 Dispatch server.13721:19@uaf-6:1094 for have dlen=108
>> 110625 11:27:17 19518 server.13721:19@uaf-6:1094 do_Have:
>> /store/data/Run2011A/DoubleElectron/AOD/PromptReco-v4/000/166/346/D857F287-D58E-E011-87B0-001D09F2906A.root
>>
>> 110625 11:27:17 19518 Inform xrootd-itb.unl.edu have
>> 110625 11:27:17 19518 Inform xrootd.unl.edu have
>> 110625 11:27:17 19518 Dispatch server.2925:21@uaf-4:1094 for have dlen=108
>> 110625 11:27:17 19518 Dispatch server.2925:21@uaf-4:1094 for have dlen=102
>> 110625 11:27:17 19518 server.2925:21@uaf-4:1094 do_Have:
>> /store/data/Run2011A/DoubleElectron/AOD/PromptReco-v4/000/166/346/D857F287-D58E-E011-87B0-001D09F2906A.root
>>
>> 110625 11:27:17 19518 server.2925:21@uaf-4:1094 do_Have:
>> /store/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/8A376B9B-7A7B-E011-8EA2-001A92810AE4.root
>>
>> 110625 11:27:17 19518 Inform xrootd-itb.unl.edu have
>> 110625 11:27:17 19518 Inform xrootd.unl.edu have
>> 110625 11:27:17 19518 Dispatch server.13721:19@uaf-6:1094 for have dlen=102
>> 110625 11:27:17 19518 server.13721:19@uaf-6:1094 do_Have:
>> /store/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/8A376B9B-7A7B-E011-8EA2-001A92810AE4.root
>>
>>