Hi Stephen,

It's clear that Nurcan has done quite a bit of work to digest some of this information. Do you know if (a subset of) this information is graphed somewhere? I am curious about long-term trends in the failure rate, the distribution of failure causes, etc. Cheers.

					Charlie
--
Charles C. Young
M.S. 43, Stanford Linear Accelerator Center       
P.O. Box 20450                                         
Stanford, CA 94309                                      
[log in to unmask]                                
voice  (650) 926 2669                         
fax    (650) 926 2923                       
CERN GSM +41 76 487 2069 

> -----Original Message-----
> From: [log in to unmask] 
> [mailto:[log in to unmask]] On 
> Behalf Of Stephen J. Gowdy
> Sent: Thursday, August 17, 2006 12:19 AM
> To: atlas-sccs-planning-l
> Subject: [Usatlas-prodsys-l] Panda shift report August 14-15, 
> 2006 (fwd)
> 
> FYI (to do with discussion of success rate for jobs).
> 
> --
>   /------------------------------------+-------------------------\
> |Stephen J. Gowdy, SLAC               | CERN     Office: 32-2-A22|
> |http://www.slac.stanford.edu/~gowdy/ | CH-1211 Geneva 23        |
> |http://calendar.yahoo.com/gowdy      | Switzerland              |
> |EMail: [log in to unmask]       | Tel: +41 22 767 5840     |
>   \------------------------------------+-------------------------/
> 
> ---------- Forwarded message ----------
> Date: Wed, 16 Aug 2006 16:00:11 -0500 (CDT)
> From: Nurcan Ozturk <[log in to unmask]>
> To: [log in to unmask]
> Subject: [Usatlas-prodsys-l] Panda shift report August 14-15, 2006
> 
> Hi all,
> 
> Here is the Panda production status of the last 2 days:
> 
> Wed Aug 16 10:58:15 2006 Central
> ----------------------------------------------------------------------------------------------------
> All CEs and jobs.   Show production, analysis, test, all jobs/CEs
> ----------------------------------------------------------------------------------------------------
> Job wall time: 30948 hrs  Error losses: trans: 2219 (7.2%)   panda: 832 (2.7%)   ddm: 68 (0.2%)   other: 506 (1.6%)
> ----------------------------------------------------------------------------------------------------
> Error type (type count)	Count	CPU-hrs	Latest	Code:	Description
> ----------------------------------------------------------------------------------------------------
> All	defined:1   assigned:161   waiting:474   activated:1010   running:752   finished:2339   failed:724   (23.6%)
> 
> ddmErrorCode (9)   	8	0.0	08-15 01:24	100:	Input file GUID not found or input prodDBlock not accessible
> ddmErrorCode (9)   	1	11.9	08-16 07:14	200:	Could not add output files to dataset
> jobDispatcherErrorCode (59)   	59	698.9	08-16 01:01	100:	Lost heartbeat
> pilotErrorCode (251)   	174	0.9	08-16 04:28	1099:	DQ2 staging input file failed
> pilotErrorCode (251)   	1	0.1	08-14 17:06	1132:	Saving output files to DDM area returned non-zero code
> pilotErrorCode (251)   	6	55.2	08-16 04:26	1142:	DQ2 put error: failed to register the file on local SE
> pilotErrorCode (251)   	68	505.9	08-16 11:21	1150:	Looping job killed by pilot
> pilotErrorCode (251)   	2	133.3	08-15 13:58	1200:	Job killed by SIGTERM from batch system or Condor (eg walltime limit)
> taskBufferErrorCode (149)   	149	0.0	08-15 19:22	100:	Job expired and killed six days after submission (or killed by user)
> transExitCode (234)   	72	13.0	08-16 04:30	1:	Unspecified error, consult log file
> transExitCode (234)   	11	0.2	08-16 03:51	134:	Athena core dump or timeout, or conddb DB connect exception
> transExitCode (234)   	1	76.8	08-14 12:00	143:	Unknown error code
> transExitCode (234)   	8	6.4	08-16 10:30	2:	Athena core dump
> transExitCode (234)   	35	26.5	08-16 06:23	40:	Athena crash - consult log file
> transExitCode (234)   	14	89.8	08-14 12:08	41:	TRF_OUTFILE - output file not found
> transExitCode (234)   	2	75.7	08-14 19:18	50:	Athena crash - consult log file
> transExitCode (234)   	17	818.9	08-16 11:31	60:	TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
> transExitCode (234)   	74	1111.3	08-15 18:12	99:	TRF_UNKNOWN - unknown transformation error
> ----------------------------------------------------------------------------------------------------
> ANALY_BNL_ATLAS_1	defined:0   assigned:0   waiting:0   activated:0   running:0   finished:98   failed:18   (15.5%)
> 
> jobDispatcherErrorCode (1)   	1	1.1	08-14 15:22	100:	Lost heartbeat
> pilotErrorCode (1)   	1	0.1	08-14 17:06	1132:	Saving output files to DDM area returned non-zero code
> transExitCode (16)   	16	2.5	08-16 06:23	40:	Athena crash - consult log file
> ----------------------------------------------------------------------------------------------------
> ANALY_BNL_ATLAS_2	defined:1   assigned:0   waiting:0   activated:1   running:0   finished:0   failed:0
> ----------------------------------------------------------------------------------------------------
> ANALY_LONG_BNL_ATLAS	defined:0   assigned:0   waiting:0   activated:0   running:0   finished:0   failed:19
> 
> transExitCode (19)   	19	24.0	08-16 04:36	40:	Athena crash - consult log file
> ----------------------------------------------------------------------------------------------------
> ANALY_UTA-DPCC
> ----------------------------------------------------------------------------------------------------
> BNL_ATLAS_1	defined:0   assigned:0   waiting:0   activated:343   running:283   finished:843   failed:162   (16.1%)
> 
> jobDispatcherErrorCode (48)   	48	53.9	08-16 01:01	100:	Lost heartbeat
> pilotErrorCode (68)   	68	505.9	08-16 11:21	1150:	Looping job killed by pilot
> taskBufferErrorCode (19)   	19	0.0	08-15 17:44	100:	Job expired and killed six days after submission (or killed by user)
> transExitCode (27)   	27	389.8	08-15 18:12	99:	TRF_UNKNOWN - unknown transformation error
> ----------------------------------------------------------------------------------------------------
> BNL_ATLAS_2
> ----------------------------------------------------------------------------------------------------
> BU_ATLAS_Tier2	defined:0   assigned:0   waiting:0   activated:92   running:93   finished:124   failed:41   (24.8%)
> 
> pilotErrorCode (1)   	1	66.7	08-15 02:44	1200:	Job killed by SIGTERM from batch system or Condor (eg walltime limit)
> transExitCode (40)   	40	8.1	08-15 20:43	1:	Unspecified error, consult log file
> ----------------------------------------------------------------------------------------------------
> BU_ATLAS_Tier2o	defined:0   assigned:7   waiting:0   activated:10   running:12   finished:20   failed:2   (9.1%)
> 
> transExitCode (2)   	2	0.3	08-14 11:59	1:	Unspecified error, consult log file
> ----------------------------------------------------------------------------------------------------
> IU_ATLAS_Tier2	defined:0   assigned:10   waiting:0   activated:67   running:64   finished:157   failed:28   (15.1%)
> 
> transExitCode (28)   	7	0.7	08-16 06:21	2:	Athena core dump
> transExitCode (28)   	6	289.2	08-16 11:05	60:	TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
> transExitCode (28)   	15	156.6	08-14 22:33	99:	TRF_UNKNOWN - unknown transformation error
> ----------------------------------------------------------------------------------------------------
> Unassigned	defined:0   assigned:0   waiting:474   activated:0   running:0   finished:0   failed:129
> 
> taskBufferErrorCode (129)   	129	0.0	08-15 19:22	100:	Job expired and killed six days after submission (or killed by user)
> ----------------------------------------------------------------------------------------------------
> OU_OCHEP_SWT2	defined:0   assigned:6   waiting:0   activated:116   running:81   finished:279   failed:40   (12.5%)
> 
> jobDispatcherErrorCode (5)   	5	359.0	08-16 00:40	100:	Lost heartbeat
> taskBufferErrorCode (1)   	1	0.0	08-15 17:41	100:	Job expired and killed six days after submission (or killed by user)
> transExitCode (34)   	13	1.7	08-16 04:30	1:	Unspecified error, consult log file
> transExitCode (34)   	1	76.8	08-14 12:00	143:	Unknown error code
> transExitCode (34)   	1	89.5	08-14 11:54	41:	TRF_OUTFILE - output file not found
> transExitCode (34)   	1	64.1	08-14 12:01	50:	Athena crash - consult log file
> transExitCode (34)   	2	96.3	08-16 10:29	60:	TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
> transExitCode (34)   	16	234.9	08-15 13:07	99:	TRF_UNKNOWN - unknown transformation error
> ----------------------------------------------------------------------------------------------------
> PROD_SLAC	defined:0   assigned:41   waiting:0   activated:0   running:0   finished:6   failed:0   (0.0%)
> ----------------------------------------------------------------------------------------------------
> UC_ATLAS_MWT2	defined:0   assigned:22   waiting:0   activated:101   running:56   finished:149   failed:4   (2.6%)
> 
> ddmErrorCode (1)   	1	11.9	08-16 07:14	200:	Could not add output files to dataset
> transExitCode (3)   	1	11.6	08-14 19:18	50:	Athena crash - consult log file
> transExitCode (3)   	2	96.3	08-15 20:38	60:	TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
> ----------------------------------------------------------------------------------------------------
> UC_Teraport	defined:0   assigned:12   waiting:0   activated:183   running:64   finished:321   failed:202   (38.6%)
> 
> jobDispatcherErrorCode (1)   	1	66.7	08-15 08:08	100:	Lost heartbeat
> pilotErrorCode (180)   	174	0.9	08-16 04:28	1099:	DQ2 staging input file failed
> pilotErrorCode (180)   	6	55.2	08-16 04:26	1142:	DQ2 put error: failed to register the file on local SE
> transExitCode (21)   	7	1.4	08-15 03:15	1:	Unspecified error, consult log file
> transExitCode (21)   	1	5.6	08-16 10:30	2:	Athena core dump
> transExitCode (21)   	13	0.3	08-14 12:08	41:	TRF_OUTFILE - output file not found
> ----------------------------------------------------------------------------------------------------
> UTA-DPCC	defined:0   assigned:61   waiting:0   activated:97   running:99   finished:342   failed:57   (14.3%)
> 
> ddmErrorCode (8)   	8	0.0	08-15 01:24	100:	Input file GUID not found or input prodDBlock not accessible
> jobDispatcherErrorCode (4)   	4	218.2	08-16 01:01	100:	Lost heartbeat
> pilotErrorCode (1)   	1	66.7	08-15 13:58	1200:	Job killed by SIGTERM from batch system or Condor (eg walltime limit)
> transExitCode (44)   	10	1.4	08-15 06:22	1:	Unspecified error, consult log file
> transExitCode (44)   	11	0.2	08-16 03:51	134:	Athena core dump or timeout, or conddb DB connect exception
> transExitCode (44)   	7	337.2	08-16 11:31	60:	TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
> transExitCode (44)   	16	330.0	08-15 11:58	99:	TRF_UNKNOWN - unknown transformation error
> ----------------------------------------------------------------------------------------------------
> UTA_SWT2	defined:0   assigned:2   waiting:0   activated:0   running:0   finished:0   failed:0
> ----------------------------------------------------------------------------------------------------
> 
> 
> The pilot job status from the submit host:
> ------------------------------------------------------------------
> [sm@atlas002 jobscheduler]$ ./queue-summary.py 
> ==================== Condor Queue Summary ====================
> condor_q run at Wed Aug 16 10:58:15 2006
> Maximum jobs on a remote host (all but UNKNOWN & UNSUBMITTED): 200
> Maximum jobs being sent to remote host:                        5
> 
> atlas.bu.edu
>          PENDING           59
>          ACTIVE           105
> 
> atlas.dpcc.uta.edu
>          PENDING          100
>          ACTIVE           100
>          UNSUBMITTED        9
> 
> atlas.iu.edu
>          PENDING           52
>          ACTIVE            60
> 
> gk01.swt2.uta.edu
>          PENDING            5
>          ACTIVE             1
>          STAGE_OUT          2
> 
> osgserv01.slac.stanford.edu
>          PENDING           52
>          STAGE_OUT          1
> 
> tier2-01.ochep.ou.edu
>          PENDING           50
>          ACTIVE            80
> 
> tier2-osg.uchicago.edu
>          PENDING           52
>          ACTIVE            41
> 
> tp-osg.uchicago.edu
>          PENDING           54
>          ACTIVE            63
> 
> ------------------------------------------------------------------
> 
> Some notes:
> 
> 1. 51 jobs of the type 
> csc11.005538.AlpgenJimmyToplnlnNp3.evgen.v11004211
> failed with "transExitCode=1:Unspecified error, consult log 
> file" in their second and third attempts, due to:
> 
> --------  Problem report -------
> [Unknown Problem]
> !!! AthenaEventLoo  ERROR Terminating event processing loop due to errors!!!
> ================================
> 
> 43 jobs of the same type failed with "lost heartbeat" due to 
> the same reason. I opened a Savannah bug #19047.
> 
> 2. 6 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.digit.v11004205
> failed at UTA-DPCC with "transExitCode=1:Unspecified error, 
> consult log file" due to:
> --------------------------------------------------------------
> -----------------
>   		 G4AtlasAlg: Event Nr. 1 start processing
> EA2F6442-1926-DB11-9647-00123F20A423    Error Cannot open container,
> invalid Database handle.
>    StorageSvc    Error The requested
> container:POOLContainer_McEventCollection cannot be opened!
> 
>   *** Break *** segmentation violation
>   Generating stack trace...
> /usr/bin/addr2line: python: No such file or directory
> /usr/bin/addr2line: python: No such file or directory
>   0x03d4373e in FadsGeneratorT<AthenaHepMCInterface>::GenerateAnEvent() + 0x1e from /data73/grid3-1.1.11/apps/atlas_app/atlas_rel/11.0.42/dist/11.0.42/InstallArea/i686-slc3-gcc323-opt/
> --------------------------------------------------------------
> --------------------
> 
> I opened a Savannah bug #19104.
> 
> 3. 11 jobs of the type 
> csc11.005250.McAtNloWminenu.evgen.v11004209 failed at 
> UTA-DPCC with "transExitCode=134: Athena core dump or 
> timeout, or conddb DB connect exception" in the third attempt due to:
> --------------------------------------------------------------
> --------------------
> found 258 particles
> AtRndmGenSvc         INFO Initializing AtRndmGenSvc - package version
> AthenaServices-01-07-27
>   INITIALISING RANDOM NUMBER STREAMS.
> 
> 
>            HERWIG 6.507  8th March 2005
> 
>            Please reference:  G. Marchesini, B.R. Webber,
>            G.Abbiendi, I.G.Knowles, M.H.Seymour & L.Stanco
>            Computer Physics Communications 67 (1992) 465
>                               and
>            G.Corcella, I.G.Knowles, G.Marchesini, S.Moretti,
>            K.Odagiri, P.Richardson, M.H.Seymour & B.R.Webber,
>            JHEP 0101 (2001) 010
> fmt: end of file
> apparent state: unit 61 named mcatnlo31.005250.Wminenu._000020.events
> last format: (5(1X,D10.4),1X,A)
> lately reading sequential formatted external IO
> /data73/grid3-1.1.11/apps//atlas_app/atlas_rel/kitval/KitValidation/JobTransforms/JobTransforms-11-00-42-09/share/csc.evgen.mcatnlo.trf: line 224: 20334 Aborted                 athena.py job.py 2>&1
> --------------------------------------------------------------
> --------------------
> 
> I opened a Savannah bug #19105.
> 
> 4. 7 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.recotrig.v11000505
> failed at IU with "transExitCode=2: Athena core dump" due to:
> --------------------------------------------------------------
> -----------
> python: /N/Grid3/apps/atlas_app/atlas_rel/11.0.5/gcc-alt-3.2.3/lib/libgcc_s.so.1: version `GCC_4.2.0' not found (required by /usr/lib/libstdc++.so.5)
> Wed Aug 16 05:20:38 EST 2006
> mv: cannot stat `ntuple.root': No such file or directory
> --------------------------------------------------------------
> -----------
> 
> I created RT ticket #430 at MWTier2.
> 
> 5. 21 jobs failed with "transExitCode=40: Athena 
> crash - consult log file"; they are all user jobs.
> 
> 6. 1 job, csc11.005023.FJ4_pythia_jetjet.digit.v11004206._00653.job,
> failed with "transExitCode=50: Athena crash - consult log 
> file" due to:
> --------------------------------------------------------------
> ---------
> ===> G4QGSMSplitableHadron - Fatal: Cannot sample parton densities under these constraints.
>   G4HadronicProcess failed in ApplyYourself call for
>   - Particle energy[GeV] = 16.702706
>   - Material = Copper
>   - Particle type = neutron
> 
> *** G4Exception : 007
>        issued by : G4HadronicProcess
> GeneralPostStepDoIt failed.
> *** Fatal Exception *** core dump ***
> --------------------------------------------------------------
> ----------
> 
> This is Savannah bug #16730; a fix went into Release 12.
> 
> 7. 18 jobs of the following type failed with "transExitCode=60:
> TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)":
> 
> testIdeal_06.005020.FJ1_pythia_jetjet.digit.v12000101
> testIdeal_06.005015.J6_pythia_jetjet.digit.v12000101
> testIdeal_06.005001.pythia_minbias.digit.v12000101
> 
> Bug #18466 is closed; the 48-hour limit was too short for these jobs.
> 
> 8. 74 jobs failed with "transExitCode=99: TRF_UNKNOWN - 
> unknown transformation error"
> 
> testIdeal_06.005015.J6_pythia_jetjet.digit.v12000101
> testIdeal_06.005145.PythiaZmumu.digit.v12000101
> testIdeal_06.005107.pythia_Wtauhad.digit.v12000101
> 
> Bug #18349 is closed; the fix, in tag LArG4EC-00-00-71, will go 
> into 12.X.0 and 12.0.X.
> 
> 9. 174 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.digit.v11004205
> failed at UC_Teraport with "DQ2 staging input file failed" 
> early morning:
> 
> -------------- Log from /tmp/Panda_Pilot_15987_1155712210/dq2get.out -----
> Getting POOL FileCatalog failed: cound not find the file in LRC!
> Could not get POOL FileCatalog!
> --------------------------------------------------------------
> -------------
> 
> RT ticket #427 was created by Tomasz.
> 
> 10. Some test jobs were sent to the OU_OSCER_ATLAS site to try to 
> utilize a remote DQ2 server as opposed to a local NFS-mounted 
> one, as requested by Karthik. It turned out that the pilots 
> that were sent used the old version of the DQ2ProdClient.py file. 
> Xin was asked how to use the new version of the file, 
> DQ2ProdClient2.py, in this test.
> 
> Regards,
> Nurcan.
>
> _______________________________________________
> Usatlas-prodsys-l mailing list
> [log in to unmask]
> http://lists.bnl.gov/mailman/listinfo/usatlas-prodsys-l
>
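
On the question of trending the failure rate: a minimal sketch, not an existing Panda tool, of how the per-site status lines in a report like this one could be turned into numbers suitable for plotting. The function names `parse_status` and `failure_rate` are hypothetical, and the formula failed / (finished + failed) is an assumption, inferred only because it agrees with the percentages printed in the report.

```python
import re

# Matches "state:count" pairs such as "finished:2339" in a status line.
STATE_RE = re.compile(r"(\w+):(\d+)")


def parse_status(line):
    """Return (site, {state: count}) for one per-site status line."""
    site = line.split()[0]
    counts = {state: int(n) for state, n in STATE_RE.findall(line)}
    return site, counts


def failure_rate(counts):
    """Failed jobs as a percentage of finished + failed, or None if neither.

    Assumption: this is the definition behind the report's percentages.
    """
    failed = counts.get("failed", 0)
    done = counts.get("finished", 0) + failed
    return 100.0 * failed / done if done else None


# Status lines copied from the report above, with the mail wrapping undone.
lines = [
    "All defined:1   assigned:161   waiting:474   activated:1010   running:752   finished:2339   failed:724",
    "UC_Teraport defined:0   assigned:12   waiting:0   activated:183   running:64   finished:321   failed:202",
    "UTA_SWT2 defined:0   assigned:2   waiting:0   activated:0   running:0   finished:0   failed:0",
]

for line in lines:
    site, counts = parse_status(line)
    rate = failure_rate(counts)
    print(site, "n/a" if rate is None else f"{rate:.1f}%")
```

For the lines above this gives 23.6% for all sites combined and 38.6% for UC_Teraport, matching the report. Collecting one (date, site, rate) record per shift report would give the time series needed for a long-term trend plot.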