Hi Matevz,

Please look at the 'cms.delay servers' directive. It would appear that your threshold for avoiding the Amazon cloud melt-down effect is set too low. There is a warning in the reference that the default value may, in fact, work against you if you don't have a sufficiently large cluster.

Andy

P.S. I hope you realize that the level of tracing is likely reducing performance by at least 40%.

On Sat, 25 Jun 2011, Matevz Tadel wrote:

> Hi,
>
> I noticed this in the log of an xrootd manager (xrootd.t2.ucsd.edu):
>
> 110625 11:26:59 10971 Receive xrootd 23 bytes on 10550270
> 110625 11:26:59 10971 Decode xrootd redirects uscmsPoo.11911:[log in to unmask] to uaf-6.t2.ucsd.edu:1094 /store/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/8EFCAE37-687B-E011-BAEA-001A92971B8E.root
> 110625 11:26:59 10971 uscmsPoo.11911:[log in to unmask] XrootdProtocol: redirecting to uaf-6.t2.ucsd.edu:1094
> 110625 11:27:15 10971 Receive xrootd 0 bytes on 0
> 110625 11:27:15 10971 setStatus xrootd.t2.ucsd.edu sent suspend event
> 110625 11:27:15 10971 cms_setStatus: Manager xrootd.t2.ucsd.edu suspended
> 110625 11:29:12 10971 nagios.8294:[log in to unmask] XrootdProtocol: more auth requested; sz=2070
> 110625 11:29:12 10971 XrootdXeq: nagios.8294:[log in to unmask] login as 92a2e9e2.0
> 110625 11:29:12 10971 nagios.8294:[log in to unmask] XrootdProtocol: stalling client for 10 sec
> 110625 11:29:22 10971 nagios.8294:[log in to unmask] XrootdProtocol: stalling client for 10 sec
> 110625 11:29:32 10971 nagios.8294:[log in to unmask] XrootdProtocol: stalling client for 10 sec
> 110625 11:29:42 10971 nagios.8294:[log in to unmask] XrootdProtocol: stalling client for 10 sec
>
> and then it keeps stalling the clients. What happened here? The cmsd log fragment from around this time is below.
>
> At about this time one of the servers, uaf-3, was shut down because it had, as far as we can tell so far, a corrupted system disk.
>
> The restart of xrootd/cmsd fixed the problem.
>
> Cheers,
> Matevz
>
>
> 110625 11:27:11 19518 Dispatch manager.0:[log in to unmask] for state dlen=204
> 110625 11:27:11 19518 tore/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/DE6E36B9-A67B-E011-BC19-001A92810AEE.root do_State: /store/data/Run2011A/DoubleMu/AOD/PromptReco-v4/000/165/548/44682DE0-6387-E011-B50E-001617E30F48.root,/store/data/Run2011A/DoubleMu/AOD/PromptReco-v4/000/165/472/22826903-2C86-E011-810E-003048F1C420.root
> 110625 11:27:15 19518 Update Counts Parm1=-1 Parm2=0
> 110625 11:27:15 19518 Remove_Node server.2524:22@uaf-3:1094 node 4.5
> 110625 11:27:15 19518 State: Status changed to suspended
> 110625 11:27:15 19518 Send status to redirector.10971:26@xrootd
> 110625 11:27:15 19518 Protocol: server.2524:22@uaf-3 logged out; request read failed
> 110625 11:27:15 19518 Inform xrootd-itb.unl.edu status
> 110625 11:27:15 19518 Inform xrootd.unl.edu status
> 110625 11:27:17 19518 Dispatch manager.0:[log in to unmask] for state dlen=96
> 110625 11:27:17 19518 tore/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/DE6E36B9-A67B-E011-BC19-001A92810AEE.root do_State: /store/data/Run2011A/DoubleMu/AOD/May10ReReco-v1/0000/5A1FB529-D67B-E011-8920-0026189438B5.root
> 110625 11:27:17 19518 Broadcast server.2524:22@uaf-3:1094 is unreachable
> 110625 11:27:17 19518 Dispatch server.2925:21@uaf-4:1094 for have dlen=96
> 110625 11:27:17 19518 server.2925:21@uaf-4:1094 do_Have: /store/data/Run2011A/DoubleMu/AOD/May10ReReco-v1/0000/5A1FB529-D67B-E011-8920-0026189438B5.root
> 110625 11:27:17 19518 Inform xrootd-itb.unl.edu have
> 110625 11:27:17 19518 Dispatch server.13721:19@uaf-6:1094 for have dlen=96
> 110625 11:27:17 19518 Inform xrootd.unl.edu have
> 110625 11:27:17 19518 server.13721:19@uaf-6:1094 do_Have: /store/data/Run2011A/DoubleMu/AOD/May10ReReco-v1/0000/5A1FB529-D67B-E011-8920-0026189438B5.root
> 110625 11:27:17 19518 Dispatch manager.0:[log in to unmask] for state dlen=108
> 110625 11:27:17 19518 tore/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/DE6E36B9-A67B-E011-BC19-001A92810AEE.root do_State: /store/data/Run2011A/DoubleElectron/AOD/PromptReco-v4/000/166/346/D857F287-D58E-E011-87B0-001D09F2906A.root
> 110625 11:27:17 19518 Broadcast server.2524:22@uaf-3:1094 is unreachable
> 110625 11:27:17 19518 Dispatch manager.0:[log in to unmask] for state dlen=102
> 110625 11:27:17 19518 tore/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/DE6E36B9-A67B-E011-BC19-001A92810AEE.root do_State: /store/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/8A376B9B-7A7B-E011-8EA2-001A92810AE4.root
> 110625 11:27:17 19518 Broadcast server.2524:22@uaf-3:1094 is unreachable
> 110625 11:27:17 19518 Dispatch server.13721:19@uaf-6:1094 for have dlen=108
> 110625 11:27:17 19518 server.13721:19@uaf-6:1094 do_Have: /store/data/Run2011A/DoubleElectron/AOD/PromptReco-v4/000/166/346/D857F287-D58E-E011-87B0-001D09F2906A.root
> 110625 11:27:17 19518 Inform xrootd-itb.unl.edu have
> 110625 11:27:17 19518 Inform xrootd.unl.edu have
> 110625 11:27:17 19518 Dispatch server.2925:21@uaf-4:1094 for have dlen=108
> 110625 11:27:17 19518 Dispatch server.2925:21@uaf-4:1094 for have dlen=102
> 110625 11:27:17 19518 server.2925:21@uaf-4:1094 do_Have: /store/data/Run2011A/DoubleElectron/AOD/PromptReco-v4/000/166/346/D857F287-D58E-E011-87B0-001D09F2906A.root
> 110625 11:27:17 19518 server.2925:21@uaf-4:1094 do_Have: /store/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/8A376B9B-7A7B-E011-8EA2-001A92810AE4.root
> 110625 11:27:17 19518 Inform xrootd-itb.unl.edu have
> 110625 11:27:17 19518 Inform xrootd.unl.edu have
> 110625 11:27:17 19518 Dispatch server.13721:19@uaf-6:1094 for have dlen=102
> 110625 11:27:17 19518 server.13721:19@uaf-6:1094 do_Have: /store/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/8A376B9B-7A7B-E011-8EA2-001A92810AE4.root
>
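
When triaging a cmsd log like the one quoted above, it can help to pull out each "Status changed to suspended" line together with any Remove_Node entries carrying the same timestamp, since in this excerpt the suspension follows directly after uaf-3 is removed. The following is only a minimal sketch under stated assumptions: it assumes the "YYMMDD HH:MM:SS pid message" line layout seen in the quoted excerpt, the function name find_suspend_events is hypothetical, and it is not part of xrootd or cmsd.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed layout, taken from the quoted excerpt: "YYMMDD HH:MM:SS pid message",
# optionally preceded by one or more "> " email quote markers.
LINE_RE = re.compile(r"^(?:>\s*)*(\d{6}) (\d{2}:\d{2}:\d{2}) (\d+) (.*)$")


def find_suspend_events(lines: Iterable[str]) -> List[Tuple[str, List[str]]]:
    """Return (timestamp, removed_nodes) for each suspend transition.

    removed_nodes holds the Remove_Node targets logged earlier in the
    input with the same timestamp as the suspend line.
    """
    events: List[Tuple[str, List[str]]] = []
    removed: Dict[str, List[str]] = {}  # timestamp -> nodes removed at that time
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip prose, blank quote lines, wrapped fragments
        stamp = f"{match.group(1)} {match.group(2)}"
        message = match.group(4)
        if message.startswith("Remove_Node "):
            # The second token is the node identifier, e.g. server.2524:22@uaf-3:1094
            removed.setdefault(stamp, []).append(message.split()[1])
        elif "Status changed to suspended" in message:
            events.append((stamp, list(removed.get(stamp, []))))
    return events


if __name__ == "__main__":
    sample = [
        "> 110625 11:27:15 19518 Update Counts Parm1=-1 Parm2=0",
        "> 110625 11:27:15 19518 Remove_Node server.2524:22@uaf-3:1094 node 4.5",
        "> 110625 11:27:15 19518 State: Status changed to suspended",
    ]
    for stamp, nodes in find_suspend_events(sample):
        print(stamp, nodes)
```

Matching on the timestamp alone is a deliberately crude correlation. It only shows which nodes dropped in the same second as the suspension, and says nothing about what the 'cms.delay servers' threshold was at the time.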