Hi Matevz,

Please look at the 'cms.delay servers' directive. It would appear that 
your threshold for avoiding the Amazon cloud melt-down effect is set 
too low. There is a warning in the reference that the default value may, in 
fact, work against you if you don't have a sufficiently large cluster.
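
To make the effect concrete, here is a rough sketch of how a server-count threshold could push a small cluster into suspension. It is only a toy model, not the actual cmsd logic: the function names, the way a percentage is compared, and the 80% figure are all assumptions chosen for illustration, as is the cluster size of three (the uaf-3, uaf-4 and uaf-6 nodes seen in the log; the real cluster may be larger).

```python
def meets_threshold(active: int, threshold: str, max_seen: int) -> bool:
    """Toy check: are enough servers subscribed?

    `threshold` is either an absolute count ("1") or a percentage ("80%")
    of `max_seen`, the largest number of servers seen so far.
    Hypothetical model only; the real cmsd rules may differ.
    """
    if threshold.endswith("%"):
        required_pct = float(threshold[:-1])
        return active * 100.0 / max_seen >= required_pct
    return active >= int(threshold)


def manager_suspended(active: int, threshold: str, max_seen: int) -> bool:
    """In this toy model the manager suspends when the threshold is not met."""
    return not meets_threshold(active, threshold, max_seen)


# Assumed three-server cluster, illustrative 80% threshold.
print(manager_suspended(3, "80%", 3))  # all servers up
print(manager_suspended(2, "80%", 3))  # one server lost: 2/3 is about 67%
print(manager_suspended(2, "1", 3))    # absolute threshold of one server
```

In this model, losing one of three servers drops the cluster to roughly 67% and trips a percentage threshold of 80%, while an absolute threshold of one server would not. The same single loss in a large cluster barely moves the percentage.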

Andy

P.S. I hope you realize that the level of tracing is likely reducing 
performance by at least 40%.

On Sat, 25 Jun 2011, Matevz Tadel wrote:

> Hi,
>
> I noticed this in the log of an xrootd manager (xrootd.t2.ucsd.edu):
>
> 110625 11:26:59 10971 Receive xrootd 23 bytes on 10550270
> 110625 11:26:59 10971 Decode xrootd redirects uscmsPoo.11911:[log in to unmask] to uaf-6.t2.ucsd.edu:1094 /store/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/8EFCAE37-687B-E011-BAEA-001A92971B8E.root
> 110625 11:26:59 10971 uscmsPoo.11911:[log in to unmask] XrootdProtocol: redirecting to uaf-6.t2.ucsd.edu:1094
> 110625 11:27:15 10971 Receive xrootd 0 bytes on 0
> 110625 11:27:15 10971 setStatus xrootd.t2.ucsd.edu sent suspend event
> 110625 11:27:15 10971 cms_setStatus: Manager xrootd.t2.ucsd.edu suspended
> 110625 11:29:12 10971 nagios.8294:[log in to unmask] XrootdProtocol: more auth requested; sz=2070
> 110625 11:29:12 10971 XrootdXeq: nagios.8294:[log in to unmask] login as 92a2e9e2.0
> 110625 11:29:12 10971 nagios.8294:[log in to unmask] XrootdProtocol: stalling client for 10 sec
> 110625 11:29:22 10971 nagios.8294:[log in to unmask] XrootdProtocol: stalling client for 10 sec
> 110625 11:29:32 10971 nagios.8294:[log in to unmask] XrootdProtocol: stalling client for 10 sec
> 110625 11:29:42 10971 nagios.8294:[log in to unmask] XrootdProtocol: stalling client for 10 sec
>
> and then it keeps stalling the clients. What happened here? The cmsd log fragment from around this time is below.
>
> At about this time one of the servers, uaf-3, was shut down because, as far as we can tell so far, its system disk was corrupted.
>
> The restart of xrootd/cmsd fixed the problem.
>
> Cheers,
> Matevz
>
>
> 110625 11:27:11 19518 Dispatch manager.0:[log in to unmask] for state dlen=204
> 110625 11:27:11 19518 tore/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/DE6E36B9-A67B-E011-BC19-001A92810AEE.root do_State: /store/data/Run2011A/DoubleMu/AOD/PromptReco-v4/000/165/548/44682DE0-6387-E011-B50E-001617E30F48.root,/store/data/Run2011A/DoubleMu/AOD/PromptReco-v4/000/165/472/22826903-2C86-E011-810E-003048F1C420.root
> 110625 11:27:15 19518 Update Counts Parm1=-1 Parm2=0
> 110625 11:27:15 19518 Remove_Node server.2524:22@uaf-3:1094 node 4.5
> 110625 11:27:15 19518 State: Status changed to suspended
> 110625 11:27:15 19518 Send status to redirector.10971:26@xrootd
> 110625 11:27:15 19518 Protocol: server.2524:22@uaf-3 logged out; request read failed
> 110625 11:27:15 19518 Inform xrootd-itb.unl.edu status
> 110625 11:27:15 19518 Inform xrootd.unl.edu status
> 110625 11:27:17 19518 Dispatch manager.0:[log in to unmask] for state dlen=96
> 110625 11:27:17 19518 tore/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/DE6E36B9-A67B-E011-BC19-001A92810AEE.root do_State: /store/data/Run2011A/DoubleMu/AOD/May10ReReco-v1/0000/5A1FB529-D67B-E011-8920-0026189438B5.root
> 110625 11:27:17 19518 Broadcast server.2524:22@uaf-3:1094 is unreachable
> 110625 11:27:17 19518 Dispatch server.2925:21@uaf-4:1094 for have dlen=96
> 110625 11:27:17 19518 server.2925:21@uaf-4:1094 do_Have: /store/data/Run2011A/DoubleMu/AOD/May10ReReco-v1/0000/5A1FB529-D67B-E011-8920-0026189438B5.root
> 110625 11:27:17 19518 Inform xrootd-itb.unl.edu have
> 110625 11:27:17 19518 Dispatch server.13721:19@uaf-6:1094 for have dlen=96
> 110625 11:27:17 19518 Inform xrootd.unl.edu have
> 110625 11:27:17 19518 server.13721:19@uaf-6:1094 do_Have: /store/data/Run2011A/DoubleMu/AOD/May10ReReco-v1/0000/5A1FB529-D67B-E011-8920-0026189438B5.root
> 110625 11:27:17 19518 Dispatch manager.0:[log in to unmask] for state dlen=108
> 110625 11:27:17 19518 tore/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/DE6E36B9-A67B-E011-BC19-001A92810AEE.root do_State: /store/data/Run2011A/DoubleElectron/AOD/PromptReco-v4/000/166/346/D857F287-D58E-E011-87B0-001D09F2906A.root
> 110625 11:27:17 19518 Broadcast server.2524:22@uaf-3:1094 is unreachable
> 110625 11:27:17 19518 Dispatch manager.0:[log in to unmask] for state dlen=102
> 110625 11:27:17 19518 tore/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/DE6E36B9-A67B-E011-BC19-001A92810AEE.root do_State: /store/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/8A376B9B-7A7B-E011-8EA2-001A92810AE4.root
> 110625 11:27:17 19518 Broadcast server.2524:22@uaf-3:1094 is unreachable
> 110625 11:27:17 19518 Dispatch server.13721:19@uaf-6:1094 for have dlen=108
> 110625 11:27:17 19518 server.13721:19@uaf-6:1094 do_Have: /store/data/Run2011A/DoubleElectron/AOD/PromptReco-v4/000/166/346/D857F287-D58E-E011-87B0-001D09F2906A.root
> 110625 11:27:17 19518 Inform xrootd-itb.unl.edu have
> 110625 11:27:17 19518 Inform xrootd.unl.edu have
> 110625 11:27:17 19518 Dispatch server.2925:21@uaf-4:1094 for have dlen=108
> 110625 11:27:17 19518 Dispatch server.2925:21@uaf-4:1094 for have dlen=102
> 110625 11:27:17 19518 server.2925:21@uaf-4:1094 do_Have: /store/data/Run2011A/DoubleElectron/AOD/PromptReco-v4/000/166/346/D857F287-D58E-E011-87B0-001D09F2906A.root
> 110625 11:27:17 19518 server.2925:21@uaf-4:1094 do_Have: /store/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/8A376B9B-7A7B-E011-8EA2-001A92810AE4.root
> 110625 11:27:17 19518 Inform xrootd-itb.unl.edu have
> 110625 11:27:17 19518 Inform xrootd.unl.edu have
> 110625 11:27:17 19518 Dispatch server.13721:19@uaf-6:1094 for have dlen=102
> 110625 11:27:17 19518 server.13721:19@uaf-6:1094 do_Have: /store/data/Run2011A/DoubleElectron/AOD/May10ReReco-v1/0000/8A376B9B-7A7B-E011-8EA2-001A92810AE4.root
>
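
For spotting this pattern in a longer cmsd log, a small scan along these lines may help. It is a sketch that assumes the line layout seen in the fragment above (date, time, process id, message) and the exact "Remove_Node" and "State: Status changed to suspended" message texts; the function name and regular expression are my own and not part of xrootd.

```python
import re

# Optional "> " quote prefix, then date, time, pid and the message text.
LINE = re.compile(r"^(?:>\s*)?(\d{6}) (\d{2}:\d{2}:\d{2}) (\d+) (.*)$")


def find_suspensions(lines):
    """Return (time, last_removed_node) for each suspend message found."""
    events = []
    last_removed = None
    for raw in lines:
        match = LINE.match(raw.strip())
        if not match:
            continue  # skip lines that do not look like log entries
        _date, time, _pid, msg = match.groups()
        if msg.startswith("Remove_Node "):
            last_removed = msg.split()[1]
        elif msg == "State: Status changed to suspended":
            events.append((time, last_removed))
    return events


sample = [
    "> 110625 11:27:15 19518 Update Counts Parm1=-1 Parm2=0",
    "> 110625 11:27:15 19518 Remove_Node server.2524:22@uaf-3:1094 node 4.5",
    "> 110625 11:27:15 19518 State: Status changed to suspended",
]
print(find_suspensions(sample))
# [('11:27:15', 'server.2524:22@uaf-3:1094')]
```

Each reported pair gives the time of the suspend message and the most recent node removal that preceded it, which is the sequence visible at 11:27:15 in the fragment above.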