Hi Andy,

On 06/25/11 17:26, Andrew Hanushevsky wrote:
> Hi Matevz,
>
> Please look at the 'cms.delay servers' directive. It would appear that your
> threshold for avoiding the Amazon cloud melt-down effect is set too low. There
> is a warning in the reference that the default value may, in fact, work against
> you if you don't have a sufficiently large cluster.

Thanks, I'll look it up and set it to something more aggressive. How come it didn't recover automatically?

It is also true that the same machines (uaf-X) are used for interactive logon and have been loaded pretty badly for the last couple of weeks.

>
> Andy
>
> P.S. I hope you realize that the level of tracing is likely reducing performance
> by at least 40%.

You can tell this from those log fragments? I already softened down the monitoring / reporting rates that Brian was using by default -- but I don't think it has been propagated to all the sites yet (we want to get the user info sorted out first).

Cheers,
Matevz

> On Sat, 25 Jun 2011, Matevz Tadel wrote:
>
>> Hi,
>>
>> I noticed this in the log of an xrootd manager (xrootd.t2.ucsd.edu):
>>
>> 110625 11:26:59 10971 Receive xrootd 23 bytes on 10550270
>> 110625 11:26:59 10971 Decode xrootd redirects uscmsPoo.11911:[log in to unmask] to uaf-6.t2.ucsd.edu:1094 /store/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/8EFCAE37-687B-E011-BAEA-001A92971B8E.root
>> 110625 11:26:59 10971 uscmsPoo.11911:[log in to unmask] XrootdProtocol: redirecting to uaf-6.t2.ucsd.edu:1094
>> 110625 11:27:15 10971 Receive xrootd 0 bytes on 0
>> 110625 11:27:15 10971 setStatus xrootd.t2.ucsd.edu sent suspend event
>> 110625 11:27:15 10971 cms_setStatus: Manager xrootd.t2.ucsd.edu suspended
>> 110625 11:29:12 10971 nagios.8294:[log in to unmask] XrootdProtocol: more auth requested; sz=2070
>> 110625 11:29:12 10971 XrootdXeq: nagios.8294:[log in to unmask] login as 92a2e9e2.0
>> 110625 11:29:12 10971 nagios.8294:[log in to unmask] XrootdProtocol: stalling client for 10 sec
>> 110625 11:29:22 10971 nagios.8294:[log in to unmask] XrootdProtocol: stalling client for 10 sec
>> 110625 11:29:32 10971 nagios.8294:[log in to unmask] XrootdProtocol: stalling client for 10 sec
>> 110625 11:29:42 10971 nagios.8294:[log in to unmask] XrootdProtocol: stalling client for 10 sec
>>
>> and then it keeps stalling the clients. What happened here? The cmsd log
>> fragment from around this time is below.
>>
>> At about this time one of the servers, uaf-3, was shut down as it had, it seems
>> so far, a corrupted system disk.
>>
>> The restart of xrootd/cmsd fixed the problem.
>>
>> Cheers,
>> Matevz
>>
>>
>> 110625 11:27:11 19518 Dispatch manager.0:[log in to unmask] for state dlen=204
>> 110625 11:27:11 19518
>> tore/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/DE6E36B9-A67B-E011-BC19-001A92810AEE.root
>> do_State:
>> /store/data/Run2011A/DoubleMu/AOD/PromptReco-v4/000/165/548/44682DE0-6387-E011-B50E-001617E30F48.root,/store/data/Run2011A/DoubleMu/AOD/PromptReco-v4/000/165/472/22826903-2C86-E011-810E-003048F1C420.root
>>
>> 110625 11:27:15 19518 Update Counts Parm1=-1 Parm2=0
>> 110625 11:27:15 19518 Remove_Node server.2524:22@uaf-3:1094 node 4.5
>> 110625 11:27:15 19518 State: Status changed to suspended
>> 110625 11:27:15 19518 Send status to redirector.10971:26@xrootd
>> 110625 11:27:15 19518 Protocol: server.2524:22@uaf-3 logged out; request read failed
>> 110625 11:27:15 19518 Inform xrootd-itb.unl.edu status
>> 110625 11:27:15 19518 Inform xrootd.unl.edu status
>> 110625 11:27:17 19518 Dispatch manager.0:[log in to unmask] for state dlen=96
>> 110625 11:27:17 19518
>> tore/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/DE6E36B9-A67B-E011-BC19-001A92810AEE.root
>> do_State:
>> /store/data/Run2011A/DoubleMu/AOD/May10ReReco-v1/0000/5A1FB529-D67B-E011-8920-0026189438B5.root
>>
>> 110625 11:27:17 19518 Broadcast server.2524:22@uaf-3:1094 is unreachable
>> 110625 11:27:17 19518 Dispatch server.2925:21@uaf-4:1094 for have dlen=96
>> 110625 11:27:17 19518 server.2925:21@uaf-4:1094 do_Have:
>> /store/data/Run2011A/DoubleMu/AOD/May10ReReco-v1/0000/5A1FB529-D67B-E011-8920-0026189438B5.root
>>
>> 110625 11:27:17 19518 Inform xrootd-itb.unl.edu have
>> 110625 11:27:17 19518 Dispatch server.13721:19@uaf-6:1094 for have dlen=96
>> 110625 11:27:17 19518 Inform xrootd.unl.edu have
>> 110625 11:27:17 19518 server.13721:19@uaf-6:1094 do_Have:
>> /store/data/Run2011A/DoubleMu/AOD/May10ReReco-v1/0000/5A1FB529-D67B-E011-8920-0026189438B5.root
>>
>> 110625 11:27:17 19518 Dispatch manager.0:[log in to unmask] for state dlen=108
>> 110625 11:27:17 19518
>> tore/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/DE6E36B9-A67B-E011-BC19-001A92810AEE.root
>> do_State:
>> /store/data/Run2011A/DoubleElectron/AOD/PromptReco-v4/000/166/346/D857F287-D58E-E011-87B0-001D09F2906A.root
>>
>> 110625 11:27:17 19518 Broadcast server.2524:22@uaf-3:1094 is unreachable
>> 110625 11:27:17 19518 Dispatch manager.0:[log in to unmask] for state dlen=102
>> 110625 11:27:17 19518
>> tore/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/DE6E36B9-A67B-E011-BC19-001A92810AEE.root
>> do_State:
>> /store/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/8A376B9B-7A7B-E011-8EA2-001A92810AE4.root
>>
>> 110625 11:27:17 19518 Broadcast server.2524:22@uaf-3:1094 is unreachable
>> 110625 11:27:17 19518 Dispatch server.13721:19@uaf-6:1094 for have dlen=108
>> 110625 11:27:17 19518 server.13721:19@uaf-6:1094 do_Have:
>> /store/data/Run2011A/DoubleElectron/AOD/PromptReco-v4/000/166/346/D857F287-D58E-E011-87B0-001D09F2906A.root
>>
>> 110625 11:27:17 19518 Inform xrootd-itb.unl.edu have
>> 110625 11:27:17 19518 Inform xrootd.unl.edu have
>> 110625 11:27:17 19518 Dispatch server.2925:21@uaf-4:1094 for have dlen=108
>> 110625 11:27:17 19518 Dispatch server.2925:21@uaf-4:1094 for have dlen=102
>> 110625 11:27:17 19518 server.2925:21@uaf-4:1094 do_Have:
>> /store/data/Run2011A/DoubleElectron/AOD/PromptReco-v4/000/166/346/D857F287-D58E-E011-87B0-001D09F2906A.root
>>
>> 110625 11:27:17 19518 server.2925:21@uaf-4:1094 do_Have:
>> /store/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/8A376B9B-7A7B-E011-8EA2-001A92810AE4.root
>>
>> 110625 11:27:17 19518 Inform xrootd-itb.unl.edu have
>> 110625 11:27:17 19518 Inform xrootd.unl.edu have
>> 110625 11:27:17 19518 Dispatch server.13721:19@uaf-6:1094 for have dlen=102
>> 110625 11:27:17 19518 server.13721:19@uaf-6:1094 do_Have:
>> /store/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/8A376B9B-7A7B-E011-8EA2-001A92810AE4.root
>>
>>