Hi Stephen,
It's clear that Nurcan has done quite a bit of work to digest this information. Do you know if (a subset of) this information is graphed somewhere? I am curious about long-term trends in the failure rate, the distribution of failure causes, etc. Cheers.
Charlie
--
Charles C. Young
M.S. 43, Stanford Linear Accelerator Center
P.O. Box 20450
Stanford, CA 94309
[log in to unmask]
voice (650) 926 2669
fax (650) 926 2923
CERN GSM +41 76 487 2069
> -----Original Message-----
> From: [log in to unmask]
> [mailto:[log in to unmask]] On
> Behalf Of Stephen J. Gowdy
> Sent: Thursday, August 17, 2006 12:19 AM
> To: atlas-sccs-planning-l
> Subject: [Usatlas-prodsys-l] Panda shift report August 14-15,
> 2006 (fwd)
>
> FYI (to do with discussion of success rate for jobs).
>
> --
> /------------------------------------+-------------------------\
> |Stephen J. Gowdy, SLAC | CERN Office: 32-2-A22|
> |http://www.slac.stanford.edu/~gowdy/ | CH-1211 Geneva 23 |
> |http://calendar.yahoo.com/gowdy | Switzerland |
> |EMail: [log in to unmask] | Tel: +41 22 767 5840 |
> \------------------------------------+-------------------------/
>
> ---------- Forwarded message ----------
> Date: Wed, 16 Aug 2006 16:00:11 -0500 (CDT)
> From: Nurcan Ozturk <[log in to unmask]>
> To: [log in to unmask]
> Subject: [Usatlas-prodsys-l] Panda shift report August 14-15, 2006
>
> Hi all,
>
> Here is the Panda production status of the last 2 days:
>
> Wed Aug 16 10:58:15 2006 Central
> --------------------------------------------------------------
> ----------------------------------------------
> All CEs and jobs. Show production, analysis, test, all jobs/CEs
> --------------------------------------------------------------
> ----------------------------------------------
> Job wall time: 30948 hrs Error losses: trans: 2219 (7.2%)
> panda: 832 (2.7%) ddm: 68 (0.2%) other: 506 (1.6%)
> --------------------------------------------------------------
> -----------------------------------------------
> Error type (type count) Count CPU-hrs Latest Code:
> Description
> --------------------------------------------------------------
> -----------------------------------------------
> All defined:1 assigned:161 waiting:474 activated:1010
> running:752 finished:2339 failed:724 (23.6%)
>
> ddmErrorCode (9) 8 0.0 08-15 01:24 100:
> Input file GUID not found or input prodDBlock not accessible
> ddmErrorCode (9) 1 11.9 08-16 07:14 200:
> Could not add output files to dataset
> jobDispatcherErrorCode (59) 59 698.9 08-16
> 01:01 100: Lost heartbeat
> pilotErrorCode (251) 174 0.9 08-16 04:28
> 1099: DQ2 staging input file failed
> pilotErrorCode (251) 1 0.1 08-14 17:06
> 1132: Saving output files to DDM area returned non-zero code
> pilotErrorCode (251) 6 55.2 08-16 04:26
> 1142: DQ2 put error: failed to register the file on local SE
> pilotErrorCode (251) 68 505.9 08-16 11:21
> 1150: Looping job killed by pilot
> pilotErrorCode (251) 2 133.3 08-15 13:58
> 1200: Job killed by SIGTERM from batch system or Condor (eg
> walltime limit)
> taskBufferErrorCode (149) 149 0.0 08-15 19:22
> 100: Job expired and killed six days after submission (or
> killed by user)
> transExitCode (234) 72 13.0 08-16 04:30
> 1: Unspecified error, consult log file
> transExitCode (234) 11 0.2 08-16 03:51
> 134: Athena core dump or timeout, or conddb DB connect exception
> transExitCode (234) 1 76.8 08-14 12:00
> 143: Unknown error code
> transExitCode (234) 8 6.4 08-16 10:30
> 2: Athena core dump
> transExitCode (234) 35 26.5 08-16 06:23
> 40: Athena crash - consult log file
> transExitCode (234) 14 89.8 08-14 12:08
> 41: TRF_OUTFILE - output file not found
> transExitCode (234) 2 75.7 08-14 19:18
> 50: Athena crash - consult log file
> transExitCode (234) 17 818.9 08-16 11:31
> 60: TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
> transExitCode (234) 74 1111.3 08-15 18:12
> 99: TRF_UNKNOWN - unknown transformation error
> --------------------------------------------------------------
> --------------------------------------------
> ANALY_BNL_ATLAS_1 defined:0 assigned:0 waiting:0
> activated:0 running:0 finished:98 failed:18 (15.5%)
>
> jobDispatcherErrorCode (1) 1 1.1 08-14 15:22
> 100: Lost heartbeat
> pilotErrorCode (1) 1 0.1 08-14 17:06 1132:
> Saving output files to DDM area returned non-zero code
> transExitCode (16) 16 2.5 08-16 06:23 40:
> Athena crash - consult log file
> --------------------------------------------------------------
> --------------------------------------------
> ANALY_BNL_ATLAS_2 defined:1 assigned:0 waiting:0
> activated:1 running:0 finished:0 failed:0
> --------------------------------------------------------------
> --------------------------------------------
> ANALY_LONG_BNL_ATLAS defined:0 assigned:0 waiting:0
> activated:0 running:0 finished:0 failed:19
>
> transExitCode (19) 19 24.0 08-16 04:36 40:
> Athena crash - consult log file
> --------------------------------------------------------------
> -------------------------------------------
> ANALY_UTA-DPCC
> --------------------------------------------------------------
> -------------------------------------------
> BNL_ATLAS_1 defined:0 assigned:0 waiting:0
> activated:343 running:283 finished:843 failed:162 (16.1%)
>
> jobDispatcherErrorCode (48) 48 53.9 08-16
> 01:01 100: Lost heartbeat
> pilotErrorCode (68) 68 505.9 08-16 11:21
> 1150: Looping job killed by pilot
> taskBufferErrorCode (19) 19 0.0 08-15 17:44
> 100: Job expired and killed six days after submission (or
> killed by user)
> transExitCode (27) 27 389.8 08-15 18:12 99:
> TRF_UNKNOWN - unknown transformation error
> --------------------------------------------------------------
> -------------------------------------------
> BNL_ATLAS_2
> --------------------------------------------------------------
> ------------------------------------------
> BU_ATLAS_Tier2 defined:0 assigned:0 waiting:0
> activated:92 running:93 finished:124 failed:41 (24.8%)
>
> pilotErrorCode (1) 1 66.7 08-15 02:44 1200:
> Job killed by SIGTERM from batch system or Condor (eg walltime limit)
> transExitCode (40) 40 8.1 08-15 20:43 1:
> Unspecified error, consult log file
> --------------------------------------------------------------
> ------------------------------------------
> BU_ATLAS_Tier2o defined:0 assigned:7 waiting:0
> activated:10 running:12 finished:20 failed:2 (9.1%)
> transExitCode (2) 2 0.3 08-14 11:59 1:
> Unspecified error, consult log file
> --------------------------------------------------------------
> ------------------------------------------
> IU_ATLAS_Tier2 defined:0 assigned:10 waiting:0
> activated:67 running:64 finished:157 failed:28 (15.1%)
> transExitCode (28) 7 0.7 08-16 06:21 2:
> Athena core dump
> transExitCode (28) 6 289.2 08-16 11:05 60:
> TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
> transExitCode (28) 15 156.6 08-14 22:33 99:
> TRF_UNKNOWN - unknown transformation error
> --------------------------------------------------------------
> -------------------------------------------
> Unassigned defined:0 assigned:0 waiting:474
> activated:0 running:0 finished:0 failed:129
>
> taskBufferErrorCode (129) 129 0.0 08-15 19:22
> 100: Job expired and killed six days after submission (or
> killed by user)
> --------------------------------------------------------------
> -------------------------------------------
> OU_OCHEP_SWT2 defined:0 assigned:6 waiting:0
> activated:116 running:81 finished:279 failed:40 (12.5%)
>
> jobDispatcherErrorCode (5) 5 359.0 08-16 00:40
> 100: Lost heartbeat
> taskBufferErrorCode (1) 1 0.0 08-15 17:41
> 100: Job expired and killed six days after submission (or
> killed by user)
> transExitCode (34) 13 1.7 08-16 04:30 1:
> Unspecified error, consult log file
> transExitCode (34) 1 76.8 08-14 12:00 143:
> Unknown error code
> transExitCode (34) 1 89.5 08-14 11:54 41:
> TRF_OUTFILE - output file not found
> transExitCode (34) 1 64.1 08-14 12:01 50:
> Athena crash - consult log file
> transExitCode (34) 2 96.3 08-16 10:29 60:
> TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
> transExitCode (34) 16 234.9 08-15 13:07 99:
> TRF_UNKNOWN - unknown transformation error
> --------------------------------------------------------------
> --------------------------------------------
> PROD_SLAC defined:0 assigned:41 waiting:0
> activated:0 running:0 finished:6 failed:0 (0.0%)
> --------------------------------------------------------------
> --------------------------------------------
> UC_ATLAS_MWT2 defined:0 assigned:22 waiting:0
> activated:101 running:56 finished:149 failed:4 (2.6%)
>
> ddmErrorCode (1) 1 11.9 08-16 07:14 200:
> Could not add output files to dataset
> transExitCode (3) 1 11.6 08-14 19:18 50:
> Athena crash - consult log file
> transExitCode (3) 2 96.3 08-15 20:38 60:
> TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
> --------------------------------------------------------------
> --------------------------------------------
> UC_Teraport defined:0 assigned:12 waiting:0
> activated:183 running:64 finished:321 failed:202 (38.6%)
>
> jobDispatcherErrorCode (1) 1 66.7 08-15 08:08
> 100: Lost heartbeat
> pilotErrorCode (180) 174 0.9 08-16 04:28
> 1099: DQ2 staging input file failed
> pilotErrorCode (180) 6 55.2 08-16 04:26
> 1142: DQ2 put error: failed to register the file on local SE
> transExitCode (21) 7 1.4 08-15 03:15 1:
> Unspecified error, consult log file
> transExitCode (21) 1 5.6 08-16 10:30 2:
> Athena core dump
> transExitCode (21) 13 0.3 08-14 12:08 41:
> TRF_OUTFILE - output file not found
> --------------------------------------------------------------
> ---------------------------------------------
> UTA-DPCC defined:0 assigned:61 waiting:0
> activated:97 running:99 finished:342 failed:57 (14.3%)
>
> ddmErrorCode (8) 8 0.0 08-15 01:24 100:
> Input file GUID not found or input prodDBlock not accessible
> jobDispatcherErrorCode (4) 4 218.2 08-16 01:01
> 100: Lost heartbeat
> pilotErrorCode (1) 1 66.7 08-15 13:58 1200:
> Job killed by SIGTERM from batch system or Condor (eg walltime limit)
> transExitCode (44) 10 1.4 08-15 06:22 1:
> Unspecified error, consult log file
> transExitCode (44) 11 0.2 08-16 03:51 134:
> Athena core dump or timeout, or conddb DB connect exception
> transExitCode (44) 7 337.2 08-16 11:31 60:
> TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
> transExitCode (44) 16 330.0 08-15 11:58 99:
> TRF_UNKNOWN - unknown transformation error
> --------------------------------------------------------------
> --------------------------------------------
> UTA_SWT2 defined:0 assigned:2 waiting:0
> activated:0 running:0 finished:0 failed:0
> --------------------------------------------------------------
> --------------------------------------------
>
>
> The pilot job status from the submit host:
> ------------------------------------------------------------------
> [sm@atlas002 jobscheduler]$ ./queue-summary.py
> ==================== Condor Queue Summary
> ==================== condor_q run at Wed Aug 16 10:58:15 2006
> Maximum jobs on a remote host (all but UNKNOWN & UNSUBMITTED): 200
> Maximum jobs being sent to remote host: 5
>
> atlas.bu.edu
> PENDING 59
> ACTIVE 105
>
> atlas.dpcc.uta.edu
> PENDING 100
> ACTIVE 100
> UNSUBMITTED 9
>
> atlas.iu.edu
> PENDING 52
> ACTIVE 60
>
> gk01.swt2.uta.edu
> PENDING 5
> ACTIVE 1
> STAGE_OUT 2
>
> osgserv01.slac.stanford.edu
> PENDING 52
> STAGE_OUT 1
>
> tier2-01.ochep.ou.edu
> PENDING 50
> ACTIVE 80
>
> tier2-osg.uchicago.edu
> PENDING 52
> ACTIVE 41
>
> tp-osg.uchicago.edu
> PENDING 54
> ACTIVE 63
>
> ------------------------------------------------------------------
>
> Some notes:
>
> 1. 51 jobs of the type
> csc11.005538.AlpgenJimmyToplnlnNp3.evgen.v11004211
> failed with "transExitCode=1:Unspecified error, consult log
> file" in their second and third attempts, due to:
>
> -------- Problem report -------
> [Unknown Problem]
> !!! AthenaEventLoo ERROR Terminating event processing loop
> due to errors!!!
> ================================
>
> 43 jobs of the same type failed with "lost heartbeat" for
> the same reason. I opened Savannah bug #19047.
>
> 2. 6 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.digit.v11004205
> failed at UTA-DPCC with "transExitCode=1:Unspecified error,
> consult log file" due to:
> --------------------------------------------------------------
> -----------------
> G4AtlasAlg: Event Nr. 1 start processing
> EA2F6442-1926-DB11-9647-00123F20A423 Error Cannot open container,
> invalid Database handle.
> StorageSvc Error The requested
> container:POOLContainer_McEventCollection cannot be opened!
>
> *** Break *** segmentation violation
> Generating stack trace...
> /usr/bin/addr2line: python: No such file or directory
> /usr/bin/addr2line: python: No such file or directory
> 0x03d4373e in
> FadsGeneratorT<AthenaHepMCInterface>::GenerateAnEvent() +
> 0x1e from
> /data73/grid3-1.1.11/apps/atlas_app/atlas_rel/11.0.42/dist/11.
> 0.42/InstallArea/
> i686-slc3-gcc323-opt/
> --------------------------------------------------------------
> --------------------
>
> I opened Savannah bug #19104.
>
> 3. 11 jobs of the type
> csc11.005250.McAtNloWminenu.evgen.v11004209 failed at
> UTA-DPCC with "transExitCode=134: Athena core dump or
> timeout, or conddb DB connect exception" in the third attempt due to:
> --------------------------------------------------------------
> --------------------
> found 258 particles
> AtRndmGenSvc INFO Initializing AtRndmGenSvc - package version
> AthenaServices-01-07-27
> INITIALISING RANDOM NUMBER STREAMS.
>
>
> HERWIG 6.507 8th March 2005
>
> Please reference: G. Marchesini, B.R. Webber,
> G.Abbiendi, I.G.Knowles, M.H.Seymour & L.Stanco
> Computer Physics Communications 67 (1992) 465
> and
> G.Corcella, I.G.Knowles, G.Marchesini, S.Moretti,
> K.Odagiri, P.Richardson, M.H.Seymour & B.R.Webber,
> JHEP 0101 (2001) 010
> fmt: end of file
> apparent state: unit 61 named mcatnlo31.005250.Wminenu._000020.events
> last format: (5(1X,D10.4),1X,A)
> lately reading sequential formatted external IO
> /data73/grid3-1.1.11/apps//atlas_app/atlas_rel/kitval/KitValid
> ation/JobTransforms/
> JobTransforms-11-00-42-09/share/csc.evgen.mcatnlo.trf:
> line 224: 20334 Aborted athena.py job.py 2>&1
> --------------------------------------------------------------
> --------------------
>
> I opened Savannah bug #19105.
>
> 4. 7 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.recotrig.v11000505
> failed at IU with "transExitCode=2: Athena core dump" due to:
> --------------------------------------------------------------
> -----------
> python:
> /N/Grid3/apps/atlas_app/atlas_rel/11.0.5/gcc-alt-3.2.3/lib/lib
> gcc_s.so.1:
> version `GCC_4.2.0' not found (required by
> /usr/lib/libstdc++.so.5) Wed Aug 16 05:20:38 EST 2006
> mv: cannot stat `ntuple.root': No such file or directory
> --------------------------------------------------------------
> -----------
>
> I created RT ticket #430 at MWTier2.
>
> 5. 21 jobs of the type failed with "transExitCode=40: Athena
> crash - consult log file", they are all user jobs.
>
> 6. 1 job, csc11.005023.FJ4_pythia_jetjet.digit.v11004206._00653.job,
> failed with "transExitCode=50: Athena crash - consult log
> file" due to:
> --------------------------------------------------------------
> ---------
> ===> G4QGSMSplitableHadron - Fatal: Cannot sample parton
> densities under these constraints.
> G4HadronicProcess failed in ApplyYourself call for
> - Particle energy[GeV] = 16.702706
> - Material = Copper
> - Particle type = neutron
>
> *** G4Exception : 007
> issued by : G4HadronicProcess
> GeneralPostStepDoIt failed.
> *** Fatal Exception *** core dump ***
> --------------------------------------------------------------
> ----------
>
> This is Savannah bug #16730; a fix went into Release 12.
>
> 7. 18 jobs of the following type failed with "transExitCode=60:
> TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)":
>
> testIdeal_06.005020.FJ1_pythia_jetjet.digit.v12000101
> testIdeal_06.005015.J6_pythia_jetjet.digit.v12000101
> testIdeal_06.005001.pythia_minbias.digit.v12000101
>
> Bug #18466 is closed; the 48-hour limit was too short for these jobs.
>
> 8. 74 jobs failed with "transExitCode=99: TRF_UNKNOWN -
> unknown transformation error"
>
> testIdeal_06.005015.J6_pythia_jetjet.digit.v12000101
> testIdeal_06.005145.PythiaZmumu.digit.v12000101
> testIdeal_06.005107.pythia_Wtauhad.digit.v12000101
>
> Bug #18349 is closed; the fix, in tag LArG4EC-00-00-71, will go
> into 12.X.0 and 12.0.X.
>
> 9. 174 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.digit.v11004205
> failed at UC_Teraport with "DQ2 staging input file failed"
> early in the morning:
>
> -------------- Log from
> /tmp/Panda_Pilot_15987_1155712210/dq2get.out ----- Getting
> POOL FileCatalog failed: cound not find the file in LRC!
> Could not get POOL FileCatalog!
> --------------------------------------------------------------
> -------------
>
> RT ticket #427 was created by Tomasz.
>
> 10. Some test jobs were sent to OU_OSCER_ATLAS site to try to
> utilize a remote DQ2 server as opposed to a local NFS mounted
> one as requested by Karthik. It turned out that the pilots
> that were sent used the old version of DQ2ProdClient.py file.
> Xin was asked about how to use the new version of the file,
> DQ2ProdClient2.py, in this test.
>
> Regards,
> Nurcan.
>
> _______________________________________________
> Usatlas-prodsys-l mailing list
> [log in to unmask]
> http://lists.bnl.gov/mailman/listinfo/usatlas-prodsys-l
>