FYI (to do with discussion of success rate for jobs).

--
 /------------------------------------+-------------------------\
|Stephen J. Gowdy, SLAC               | CERN     Office: 32-2-A22|
|http://www.slac.stanford.edu/~gowdy/ | CH-1211 Geneva 23       |
|http://calendar.yahoo.com/gowdy      | Switzerland             |
|EMail: [log in to unmask]            | Tel: +41 22 767 5840    |
 \------------------------------------+-------------------------/

---------- Forwarded message ----------
Date: Wed, 16 Aug 2006 16:00:11 -0500 (CDT)
From: Nurcan Ozturk <[log in to unmask]>
To: [log in to unmask]
Subject: [Usatlas-prodsys-l] Panda shift report August 14-15, 2006

Hi all,

Here is the Panda production status of the last 2 days:

Wed Aug 16 10:58:15 2006 Central
----------------------------------------------------------------------------------------------------
All CEs and jobs. Show production, analysis, test, all jobs/CEs
----------------------------------------------------------------------------------------------------
Job wall time: 30948 hrs
Error losses:  trans: 2219 (7.2%)  panda: 832 (2.7%)  ddm: 68 (0.2%)  other: 506 (1.6%)
----------------------------------------------------------------------------------------------------
Error type (type count)     Count  CPU-hrs  Latest       Code: Description
----------------------------------------------------------------------------------------------------
All  defined:1 assigned:161 waiting:474 activated:1010 running:752 finished:2339 failed:724 (23.6%)
ddmErrorCode (9)                8      0.0  08-15 01:24  100: Input file GUID not found or input prodDBlock not accessible
ddmErrorCode (9)                1     11.9  08-16 07:14  200: Could not add output files to dataset
jobDispatcherErrorCode (59)    59    698.9  08-16 01:01  100: Lost heartbeat
pilotErrorCode (251)          174      0.9  08-16 04:28  1099: DQ2 staging input file failed
pilotErrorCode (251)            1      0.1  08-14 17:06  1132: Saving output files to DDM area returned non-zero code
pilotErrorCode (251)            6     55.2  08-16 04:26  1142: DQ2 put error: failed to register the file on local SE
pilotErrorCode (251)           68    505.9  08-16 11:21  1150: Looping job killed by pilot
pilotErrorCode (251)            2    133.3  08-15 13:58  1200: Job killed by SIGTERM from batch system or Condor (eg walltime limit)
taskBufferErrorCode (149)     149      0.0  08-15 19:22  100: Job expired and killed six days after submission (or killed by user)
transExitCode (234)            72     13.0  08-16 04:30  1: Unspecified error, consult log file
transExitCode (234)            11      0.2  08-16 03:51  134: Athena core dump or timeout, or conddb DB connect exception
transExitCode (234)             1     76.8  08-14 12:00  143: Unknown error code
transExitCode (234)             8      6.4  08-16 10:30  2: Athena core dump
transExitCode (234)            35     26.5  08-16 06:23  40: Athena crash - consult log file
transExitCode (234)            14     89.8  08-14 12:08  41: TRF_OUTFILE - output file not found
transExitCode (234)             2     75.7  08-14 19:18  50: Athena crash - consult log file
transExitCode (234)            17    818.9  08-16 11:31  60: TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
transExitCode (234)            74   1111.3  08-15 18:12  99: TRF_UNKNOWN - unknown transformation error
----------------------------------------------------------------------------------------------------
ANALY_BNL_ATLAS_1  defined:0 assigned:0 waiting:0 activated:0 running:0 finished:98 failed:18 (15.5%)
jobDispatcherErrorCode (1)      1      1.1  08-14 15:22  100: Lost heartbeat
pilotErrorCode (1)              1      0.1  08-14 17:06  1132: Saving output files to DDM area returned non-zero code
transExitCode (16)             16      2.5  08-16 06:23  40: Athena crash - consult log file
----------------------------------------------------------------------------------------------------
ANALY_BNL_ATLAS_2  defined:1 assigned:0 waiting:0 activated:1 running:0 finished:0 failed:0
----------------------------------------------------------------------------------------------------
ANALY_LONG_BNL_ATLAS  defined:0 assigned:0 waiting:0 activated:0 running:0 finished:0 failed:19
transExitCode (19)             19     24.0  08-16 04:36  40: Athena crash - consult log file
----------------------------------------------------------------------------------------------------
ANALY_UTA-DPCC
----------------------------------------------------------------------------------------------------
BNL_ATLAS_1  defined:0 assigned:0 waiting:0 activated:343 running:283 finished:843 failed:162 (16.1%)
jobDispatcherErrorCode (48)    48     53.9  08-16 01:01  100: Lost heartbeat
pilotErrorCode (68)            68    505.9  08-16 11:21  1150: Looping job killed by pilot
taskBufferErrorCode (19)       19      0.0  08-15 17:44  100: Job expired and killed six days after submission (or killed by user)
transExitCode (27)             27    389.8  08-15 18:12  99: TRF_UNKNOWN - unknown transformation error
----------------------------------------------------------------------------------------------------
BNL_ATLAS_2
----------------------------------------------------------------------------------------------------
BU_ATLAS_Tier2  defined:0 assigned:0 waiting:0 activated:92 running:93 finished:124 failed:41 (24.8%)
pilotErrorCode (1)              1     66.7  08-15 02:44  1200: Job killed by SIGTERM from batch system or Condor (eg walltime limit)
transExitCode (40)             40      8.1  08-15 20:43  1: Unspecified error, consult log file
----------------------------------------------------------------------------------------------------
BU_ATLAS_Tier2o  defined:0 assigned:7 waiting:0 activated:10 running:12 finished:20 failed:2 (9.1%)
transExitCode (2)               2      0.3  08-14 11:59  1: Unspecified error, consult log file
----------------------------------------------------------------------------------------------------
IU_ATLAS_Tier2  defined:0 assigned:10 waiting:0 activated:67 running:64 finished:157 failed:28 (15.1%)
transExitCode (28)              7      0.7  08-16 06:21  2: Athena core dump
transExitCode (28)              6    289.2  08-16 11:05  60: TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
transExitCode (28)             15    156.6  08-14 22:33  99: TRF_UNKNOWN - unknown transformation error
----------------------------------------------------------------------------------------------------
Unassigned  defined:0 assigned:0 waiting:474 activated:0 running:0 finished:0 failed:129
taskBufferErrorCode (129)     129      0.0  08-15 19:22  100: Job expired and killed six days after submission (or killed by user)
----------------------------------------------------------------------------------------------------
OU_OCHEP_SWT2  defined:0 assigned:6 waiting:0 activated:116 running:81 finished:279 failed:40 (12.5%)
jobDispatcherErrorCode (5)      5    359.0  08-16 00:40  100: Lost heartbeat
taskBufferErrorCode (1)         1      0.0  08-15 17:41  100: Job expired and killed six days after submission (or killed by user)
transExitCode (34)             13      1.7  08-16 04:30  1: Unspecified error, consult log file
transExitCode (34)              1     76.8  08-14 12:00  143: Unknown error code
transExitCode (34)              1     89.5  08-14 11:54  41: TRF_OUTFILE - output file not found
transExitCode (34)              1     64.1  08-14 12:01  50: Athena crash - consult log file
transExitCode (34)              2     96.3  08-16 10:29  60: TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
transExitCode (34)             16    234.9  08-15 13:07  99: TRF_UNKNOWN - unknown transformation error
----------------------------------------------------------------------------------------------------
PROD_SLAC  defined:0 assigned:41 waiting:0 activated:0 running:0 finished:6 failed:0 (0.0%)
----------------------------------------------------------------------------------------------------
UC_ATLAS_MWT2  defined:0 assigned:22 waiting:0 activated:101 running:56 finished:149 failed:4 (2.6%)
ddmErrorCode (1)                1     11.9  08-16 07:14  200: Could not add output files to dataset
transExitCode (3)               1     11.6  08-14 19:18  50: Athena crash - consult log file
transExitCode (3)               2     96.3  08-15 20:38  60: TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
----------------------------------------------------------------------------------------------------
UC_Teraport  defined:0 assigned:12 waiting:0 activated:183 running:64 finished:321 failed:202 (38.6%)
jobDispatcherErrorCode (1)      1     66.7  08-15 08:08  100: Lost heartbeat
pilotErrorCode (180)          174      0.9  08-16 04:28  1099: DQ2 staging input file failed
pilotErrorCode (180)            6     55.2  08-16 04:26  1142: DQ2 put error: failed to register the file on local SE
transExitCode (21)              7      1.4  08-15 03:15  1: Unspecified error, consult log file
transExitCode (21)              1      5.6  08-16 10:30  2: Athena core dump
transExitCode (21)             13      0.3  08-14 12:08  41: TRF_OUTFILE - output file not found
----------------------------------------------------------------------------------------------------
UTA-DPCC  defined:0 assigned:61 waiting:0 activated:97 running:99 finished:342 failed:57 (14.3%)
ddmErrorCode (8)                8      0.0  08-15 01:24  100: Input file GUID not found or input prodDBlock not accessible
jobDispatcherErrorCode (4)      4    218.2  08-16 01:01  100: Lost heartbeat
pilotErrorCode (1)              1     66.7  08-15 13:58  1200: Job killed by SIGTERM from batch system or Condor (eg walltime limit)
transExitCode (44)             10      1.4  08-15 06:22  1: Unspecified error, consult log file
transExitCode (44)             11      0.2  08-16 03:51  134: Athena core dump or timeout, or conddb DB connect exception
transExitCode (44)              7    337.2  08-16 11:31  60: TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
transExitCode (44)             16    330.0  08-15 11:58  99: TRF_UNKNOWN - unknown transformation error
----------------------------------------------------------------------------------------------------
UTA_SWT2  defined:0 assigned:2 waiting:0 activated:0 running:0 finished:0 failed:0
----------------------------------------------------------------------------------------------------

The pilot job status from the submit host:

------------------------------------------------------------------
[sm@atlas002 jobscheduler]$ ./queue-summary.py
==================== Condor Queue Summary ====================
condor_q run at Wed Aug 16 10:58:15 2006
Maximum jobs on a remote host (all but UNKNOWN & UNSUBMITTED): 200
Maximum jobs being sent to remote host: 5

atlas.bu.edu                  PENDING 59   ACTIVE 105
atlas.dpcc.uta.edu            PENDING 100  ACTIVE 100  UNSUBMITTED 9
atlas.iu.edu                  PENDING 52   ACTIVE 60
gk01.swt2.uta.edu             PENDING 5    ACTIVE 1    STAGE_OUT 2
osgserv01.slac.stanford.edu   PENDING 52   STAGE_OUT 1
tier2-01.ochep.ou.edu         PENDING 50   ACTIVE 80
tier2-osg.uchicago.edu        PENDING 52   ACTIVE 41
tp-osg.uchicago.edu           PENDING 54   ACTIVE 63
------------------------------------------------------------------

Some notes:

1. 51 jobs of the type csc11.005538.AlpgenJimmyToplnlnNp3.evgen.v11004211 failed with "transExitCode=1: Unspecified error, consult log file" in their second and third attempts, due to:

-------- Problem report -------
[Unknown Problem]
!!! AthenaEventLoo ERROR Terminating event processing loop due to errors!!!
================================

43 jobs of the same type failed with "lost heartbeat" for the same reason. I opened Savannah bug #19047.

2. 6 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.digit.v11004205 failed at UTA-DPCC with "transExitCode=1: Unspecified error, consult log file" due to:

--------------------------------------------------------------------------
G4AtlasAlg: Event Nr. 1 start processing
EA2F6442-1926-DB11-9647-00123F20A423 Error Cannot open container, invalid Database handle.
StorageSvc Error The requested container:POOLContainer_McEventCollection cannot be opened!
*** Break *** segmentation violation
Generating stack trace...
/usr/bin/addr2line: python: No such file or directory
/usr/bin/addr2line: python: No such file or directory
0x03d4373e in FadsGeneratorT<AthenaHepMCInterface>::GenerateAnEvent() + 0x1e
from /data73/grid3-1.1.11/apps/atlas_app/atlas_rel/11.0.42/dist/11.0.42/InstallArea/
i686-slc3-gcc323-opt/
--------------------------------------------------------------------------

I opened Savannah bug #19104.

3. 11 jobs of the type csc11.005250.McAtNloWminenu.evgen.v11004209 failed at UTA-DPCC with "transExitCode=134: Athena core dump or timeout, or conddb DB connect exception" in the third attempt due to:

--------------------------------------------------------------------------
found 258 particles
AtRndmGenSvc INFO Initializing AtRndmGenSvc - package version AthenaServices-01-07-27
INITIALISING RANDOM NUMBER STREAMS.
HERWIG 6.507 8th March 2005
Please reference: G. Marchesini, B.R. Webber, G.Abbiendi, I.G.Knowles, M.H.Seymour & L.Stanco
Computer Physics Communications 67 (1992) 465
and
G.Corcella, I.G.Knowles, G.Marchesini, S.Moretti, K.Odagiri, P.Richardson, M.H.Seymour & B.R.Webber,
JHEP 0101 (2001) 010
fmt: end of file
apparent state: unit 61 named mcatnlo31.005250.Wminenu._000020.events
last format: (5(1X,D10.4),1X,A)
lately reading sequential formatted external IO
/data73/grid3-1.1.11/apps//atlas_app/atlas_rel/kitval/KitValidation/JobTransforms/
JobTransforms-11-00-42-09/share/csc.evgen.mcatnlo.trf: line 224: 20334 Aborted athena.py job.py 2>&1
--------------------------------------------------------------------------

I opened Savannah bug #19105.

4. 7 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.recotrig.v11000505 failed at IU with "transExitCode=2: Athena core dump" due to:

--------------------------------------------------------------------------
python: /N/Grid3/apps/atlas_app/atlas_rel/11.0.5/gcc-alt-3.2.3/lib/libgcc_s.so.1: version `GCC_4.2.0' not found (required by /usr/lib/libstdc++.so.5)
Wed Aug 16 05:20:38 EST 2006
mv: cannot stat `ntuple.root': No such file or directory
--------------------------------------------------------------------------

I created RT ticket #430 at MWTier2.

5. 21 jobs failed with "transExitCode=40: Athena crash - consult log file"; they are all user jobs.

6. 1 job, csc11.005023.FJ4_pythia_jetjet.digit.v11004206._00653.job, failed with "transExitCode=50: Athena crash - consult log file" due to:

--------------------------------------------------------------------------
===> G4QGSMSplitableHadron - Fatal: Cannot sample parton densities under these constraints.
G4HadronicProcess failed in ApplyYourself call for
 - Particle energy[GeV] = 16.702706
 - Material = Copper
 - Particle type = neutron
*** G4Exception : 007 issued by : G4HadronicProcess
GeneralPostStepDoIt failed.
*** Fatal Exception *** core dump ***
--------------------------------------------------------------------------

This is Savannah bug #16730; a fix went into Release 12.

7. 18 jobs of the following types failed with "transExitCode=60: TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)":

testIdeal_06.005020.FJ1_pythia_jetjet.digit.v12000101
testIdeal_06.005015.J6_pythia_jetjet.digit.v12000101
testIdeal_06.005001.pythia_minbias.digit.v12000101

Bug #18466 is closed; the 48-hour limit was too short for these jobs.

8. 74 jobs failed with "transExitCode=99: TRF_UNKNOWN - unknown transformation error":

testIdeal_06.005015.J6_pythia_jetjet.digit.v12000101
testIdeal_06.005145.PythiaZmumu.digit.v12000101
testIdeal_06.005107.pythia_Wtauhad.digit.v12000101

Bug #18349 is closed; the fix, in tag LArG4EC-00-00-71, will go into 12.X.0 and 12.0.X.

9. 174 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.digit.v11004205 failed at UC_Teraport with "DQ2 staging input file failed" early in the morning:

-------------- Log from /tmp/Panda_Pilot_15987_1155712210/dq2get.out -----
Getting POOL FileCatalog failed: cound not find the file in LRC!
Could not get POOL FileCatalog!
--------------------------------------------------------------------------

RT ticket #427 was created by Tomasz.

10. Some test jobs were sent to the OU_OSCER_ATLAS site to try using a remote DQ2 server instead of a local NFS-mounted one, as requested by Karthik.
It turned out that the pilots that were sent used the old version of the DQ2ProdClient.py file. Xin was asked how to use the new version of the file, DQ2ProdClient2.py, in this test.

Regards,
Nurcan.

_______________________________________________
Usatlas-prodsys-l mailing list
[log in to unmask]
http://lists.bnl.gov/mailman/listinfo/usatlas-prodsys-l
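
A note on the percentages in the report: the figure printed after each failed: count appears to be failed / (finished + failed), and each "Error losses" figure appears to be the lost hours divided by the job wall time. The report does not state either formula; both are inferred from the numbers, so treat the sketch below as an assumption. The function names (parse_counts, failure_rate, loss_percent) are hypothetical helpers for illustration and are not part of Panda.

```python
import re


def parse_counts(status_line):
    """Turn 'defined:0 assigned:7 ... failed:2 (9.1%)' into a dict of ints."""
    return {name: int(value)
            for name, value in re.findall(r"(\w+):(\d+)", status_line)}


def failure_rate(counts):
    """Percentage of completed jobs that failed (assumed formula).

    Returns None when no job has finished or failed yet.
    """
    failed = counts.get("failed", 0)
    done = counts.get("finished", 0) + failed
    if done == 0:
        return None
    return 100.0 * failed / done


def loss_percent(lost_hours, wall_hours):
    """Share of the total job wall time lost to one error class (assumed formula)."""
    return 100.0 * lost_hours / wall_hours


if __name__ == "__main__":
    line = ("UC_Teraport defined:0 assigned:12 waiting:0 activated:183 "
            "running:64 finished:321 failed:202 (38.6%)")
    counts = parse_counts(line)
    print(counts["finished"], counts["failed"], failure_rate(counts))
    print(loss_percent(2219, 30948))
```

The report prints no percentage on rows with finished:0 (for example Unassigned); the sketch does not try to mirror that and only returns None when nothing has finished or failed.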