Hi Stephen,

It's clear that Nurcan has done quite a bit of work to digest some of this information. Do you know if (a subset of) this information is graphed somewhere? I am curious about long-term trends of failure rate, distribution of failure causes, etc.

Cheers.
Charlie
--
Charles C. Young
M.S. 43, Stanford Linear Accelerator Center
P.O. Box 20450
Stanford, CA 94309
[log in to unmask]
voice (650) 926 2669
fax (650) 926 2923
CERN GSM +41 76 487 2069

> -----Original Message-----
> From: [log in to unmask] [mailto:[log in to unmask]] On Behalf Of Stephen J. Gowdy
> Sent: Thursday, August 17, 2006 12:19 AM
> To: atlas-sccs-planning-l
> Subject: [Usatlas-prodsys-l] Panda shift report August 14-15, 2006 (fwd)
>
> FYI (to do with discussion of success rate for jobs).
>
> --
> /------------------------------------+-------------------------\
> |Stephen J. Gowdy, SLAC              | CERN Office:    32-2-A22|
> |http://www.slac.stanford.edu/~gowdy/| CH-1211 Geneva 23       |
> |http://calendar.yahoo.com/gowdy     | Switzerland             |
> |EMail: [log in to unmask]           | Tel: +41 22 767 5840    |
> \------------------------------------+-------------------------/
>
> ---------- Forwarded message ----------
> Date: Wed, 16 Aug 2006 16:00:11 -0500 (CDT)
> From: Nurcan Ozturk <[log in to unmask]>
> To: [log in to unmask]
> Subject: [Usatlas-prodsys-l] Panda shift report August 14-15, 2006
>
> Hi all,
>
> Here is the Panda production status of the last 2 days:
>
> Wed Aug 16 10:58:15 2006 Central
> --------------------------------------------------------------
> All CEs and jobs.
> Show production, analysis, test, all jobs/CEs
> --------------------------------------------------------------
> Job wall time: 30948 hrs   Error losses: trans: 2219 (7.2%)  panda: 832 (2.7%)  ddm: 68 (0.2%)  other: 506 (1.6%)
> --------------------------------------------------------------
> Error type (type count)      Count CPU-hrs  Latest       Code: Description
> --------------------------------------------------------------
> All defined:1 assigned:161 waiting:474 activated:1010 running:752 finished:2339 failed:724 (23.6%)
>
> ddmErrorCode (9)                 8     0.0  08-15 01:24  100: Input file GUID not found or input prodDBlock not accessible
> ddmErrorCode (9)                 1    11.9  08-16 07:14  200: Could not add output files to dataset
> jobDispatcherErrorCode (59)     59   698.9  08-16 01:01  100: Lost heartbeat
> pilotErrorCode (251)           174     0.9  08-16 04:28  1099: DQ2 staging input file failed
> pilotErrorCode (251)             1     0.1  08-14 17:06  1132: Saving output files to DDM area returned non-zero code
> pilotErrorCode (251)             6    55.2  08-16 04:26  1142: DQ2 put error: failed to register the file on local SE
> pilotErrorCode (251)            68   505.9  08-16 11:21  1150: Looping job killed by pilot
> pilotErrorCode (251)             2   133.3  08-15 13:58  1200: Job killed by SIGTERM from batch system or Condor (eg walltime limit)
> taskBufferErrorCode (149)      149     0.0  08-15 19:22  100: Job expired and killed six days after submission (or killed by user)
> transExitCode (234)             72    13.0  08-16 04:30  1: Unspecified error, consult log file
> transExitCode (234)             11     0.2  08-16 03:51  134: Athena core dump or timeout, or conddb DB connect exception
> transExitCode (234)              1    76.8  08-14 12:00  143: Unknown error code
> transExitCode (234)              8     6.4  08-16 10:30  2: Athena core dump
> transExitCode (234)             35    26.5  08-16 06:23  40: Athena crash - consult log file
> transExitCode (234)             14    89.8  08-14 12:08
>     41: TRF_OUTFILE - output file not found
> transExitCode (234)              2    75.7  08-14 19:18  50: Athena crash - consult log file
> transExitCode (234)             17   818.9  08-16 11:31  60: TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
> transExitCode (234)             74  1111.3  08-15 18:12  99: TRF_UNKNOWN - unknown transformation error
> --------------------------------------------------------------
> ANALY_BNL_ATLAS_1 defined:0 assigned:0 waiting:0 activated:0 running:0 finished:98 failed:18 (15.5%)
>
> jobDispatcherErrorCode (1)       1     1.1  08-14 15:22  100: Lost heartbeat
> pilotErrorCode (1)               1     0.1  08-14 17:06  1132: Saving output files to DDM area returned non-zero code
> transExitCode (16)              16     2.5  08-16 06:23  40: Athena crash - consult log file
> --------------------------------------------------------------
> ANALY_BNL_ATLAS_2 defined:1 assigned:0 waiting:0 activated:1 running:0 finished:0 failed:0
> --------------------------------------------------------------
> ANALY_LONG_BNL_ATLAS defined:0 assigned:0 waiting:0 activated:0 running:0 finished:0 failed:19
>
> transExitCode (19)              19    24.0  08-16 04:36  40: Athena crash - consult log file
> --------------------------------------------------------------
> ANALY_UTA-DPCC
> --------------------------------------------------------------
> BNL_ATLAS_1 defined:0 assigned:0 waiting:0 activated:343 running:283 finished:843 failed:162 (16.1%)
>
> jobDispatcherErrorCode (48)     48    53.9  08-16 01:01  100: Lost heartbeat
> pilotErrorCode (68)             68   505.9  08-16 11:21  1150: Looping job killed by pilot
> taskBufferErrorCode (19)        19     0.0  08-15 17:44  100: Job expired and killed six days after submission (or killed by user)
> transExitCode (27)              27   389.8  08-15 18:12  99: TRF_UNKNOWN - unknown transformation
>     error
> --------------------------------------------------------------
> BNL_ATLAS_2
> --------------------------------------------------------------
> BU_ATLAS_Tier2 defined:0 assigned:0 waiting:0 activated:92 running:93 finished:124 failed:41 (24.8%)
>
> pilotErrorCode (1)               1    66.7  08-15 02:44  1200: Job killed by SIGTERM from batch system or Condor (eg walltime limit)
> transExitCode (40)              40     8.1  08-15 20:43  1: Unspecified error, consult log file
> --------------------------------------------------------------
> BU_ATLAS_Tier2o defined:0 assigned:7 waiting:0 activated:10 running:12 finished:20 failed:2 (9.1%)
> transExitCode (2)                2     0.3  08-14 11:59  1: Unspecified error, consult log file
> --------------------------------------------------------------
> IU_ATLAS_Tier2 defined:0 assigned:10 waiting:0 activated:67 running:64 finished:157 failed:28 (15.1%)
> transExitCode (28)               7     0.7  08-16 06:21  2: Athena core dump
> transExitCode (28)               6   289.2  08-16 11:05  60: TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
> transExitCode (28)              15   156.6  08-14 22:33  99: TRF_UNKNOWN - unknown transformation error
> --------------------------------------------------------------
> Unassigned defined:0 assigned:0 waiting:474 activated:0 running:0 finished:0 failed:129
>
> taskBufferErrorCode (129)      129     0.0  08-15 19:22  100: Job expired and killed six days after submission (or killed by user)
> --------------------------------------------------------------
> OU_OCHEP_SWT2 defined:0 assigned:6 waiting:0 activated:116 running:81 finished:279 failed:40 (12.5%)
>
> jobDispatcherErrorCode (5)       5   359.0  08-16 00:40  100: Lost heartbeat
> taskBufferErrorCode (1)          1     0.0  08-15 17:41  100: Job
>     expired and killed six days after submission (or killed by user)
> transExitCode (34)              13     1.7  08-16 04:30  1: Unspecified error, consult log file
> transExitCode (34)               1    76.8  08-14 12:00  143: Unknown error code
> transExitCode (34)               1    89.5  08-14 11:54  41: TRF_OUTFILE - output file not found
> transExitCode (34)               1    64.1  08-14 12:01  50: Athena crash - consult log file
> transExitCode (34)               2    96.3  08-16 10:29  60: TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
> transExitCode (34)              16   234.9  08-15 13:07  99: TRF_UNKNOWN - unknown transformation error
> --------------------------------------------------------------
> PROD_SLAC defined:0 assigned:41 waiting:0 activated:0 running:0 finished:6 failed:0 (0.0%)
> --------------------------------------------------------------
> UC_ATLAS_MWT2 defined:0 assigned:22 waiting:0 activated:101 running:56 finished:149 failed:4 (2.6%)
>
> ddmErrorCode (1)                 1    11.9  08-16 07:14  200: Could not add output files to dataset
> transExitCode (3)                1    11.6  08-14 19:18  50: Athena crash - consult log file
> transExitCode (3)                2    96.3  08-15 20:38  60: TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
> --------------------------------------------------------------
> UC_Teraport defined:0 assigned:12 waiting:0 activated:183 running:64 finished:321 failed:202 (38.6%)
>
> jobDispatcherErrorCode (1)       1    66.7  08-15 08:08  100: Lost heartbeat
> pilotErrorCode (180)           174     0.9  08-16 04:28  1099: DQ2 staging input file failed
> pilotErrorCode (180)             6    55.2  08-16 04:26  1142: DQ2 put error: failed to register the file on local SE
> transExitCode (21)               7     1.4  08-15 03:15  1: Unspecified error, consult log file
> transExitCode (21)               1     5.6  08-16 10:30  2: Athena core dump
> transExitCode (21)              13     0.3  08-14 12:08  41: TRF_OUTFILE - output file not found
> --------------------------------------------------------------
> UTA-DPCC defined:0 assigned:61 waiting:0 activated:97 running:99 finished:342 failed:57 (14.3%)
>
> ddmErrorCode (8)                 8     0.0  08-15 01:24  100: Input file GUID not found or input prodDBlock not accessible
> jobDispatcherErrorCode (4)       4   218.2  08-16 01:01  100: Lost heartbeat
> pilotErrorCode (1)               1    66.7  08-15 13:58  1200: Job killed by SIGTERM from batch system or Condor (eg walltime limit)
> transExitCode (44)              10     1.4  08-15 06:22  1: Unspecified error, consult log file
> transExitCode (44)              11     0.2  08-16 03:51  134: Athena core dump or timeout, or conddb DB connect exception
> transExitCode (44)               7   337.2  08-16 11:31  60: TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
> transExitCode (44)              16   330.0  08-15 11:58  99: TRF_UNKNOWN - unknown transformation error
> --------------------------------------------------------------
> UTA_SWT2 defined:0 assigned:2 waiting:0 activated:0 running:0 finished:0 failed:0
> --------------------------------------------------------------
>
>
> The pilot job status from the submit host:
> ------------------------------------------------------------------
> [sm@atlas002 jobscheduler]$ ./queue-summary.py
> ==================== Condor Queue Summary ====================
> condor_q run at Wed Aug 16 10:58:15 2006
> Maximum jobs on a remote host (all but UNKNOWN & UNSUBMITTED): 200
> Maximum jobs being sent to remote host: 5
>
> atlas.bu.edu
>     PENDING 59
>     ACTIVE 105
>
> atlas.dpcc.uta.edu
>     PENDING 100
>     ACTIVE 100
>     UNSUBMITTED 9
>
> atlas.iu.edu
>     PENDING 52
>     ACTIVE 60
>
> gk01.swt2.uta.edu
>     PENDING 5
>     ACTIVE 1
>     STAGE_OUT 2
>
> osgserv01.slac.stanford.edu
>     PENDING 52
>     STAGE_OUT 1
>
> tier2-01.ochep.ou.edu
>     PENDING 50
>     ACTIVE 80
>
> tier2-osg.uchicago.edu
>     PENDING 52
>     ACTIVE 41
>
> tp-osg.uchicago.edu
>     PENDING 54
>     ACTIVE 63
>
> ------------------------------------------------------------------
>
> Some notes:
>
> 1. 51 jobs of the type csc11.005538.AlpgenJimmyToplnlnNp3.evgen.v11004211 failed with "transExitCode=1:Unspecified error, consult log file" in their second and third attempts, due to:
>
> -------- Problem report -------
> [Unknown Problem]
> !!! AthenaEventLoo ERROR Terminating event processing loop due to errors!!!
> ================================
>
> 43 jobs of the same type failed with "lost heartbeat" for the same reason. I opened a Savannah bug #19047.
>
> 2. 6 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.digit.v11004205 failed at UTA-DPCC with "transExitCode=1:Unspecified error, consult log file" due to:
> --------------------------------------------------------------
> G4AtlasAlg: Event Nr. 1 start processing
> EA2F6442-1926-DB11-9647-00123F20A423 Error Cannot open container, invalid Database handle.
> StorageSvc Error The requested container:POOLContainer_McEventCollection cannot be opened!
>
> *** Break *** segmentation violation
> Generating stack trace...
> /usr/bin/addr2line: python: No such file or directory
> /usr/bin/addr2line: python: No such file or directory
> 0x03d4373e in FadsGeneratorT<AthenaHepMCInterface>::GenerateAnEvent() + 0x1e from /data73/grid3-1.1.11/apps/atlas_app/atlas_rel/11.0.42/dist/11.0.42/InstallArea/i686-slc3-gcc323-opt/
> --------------------------------------------------------------
>
> I opened a Savannah bug #19104.
>
> 3.
> 11 jobs of the type csc11.005250.McAtNloWminenu.evgen.v11004209 failed at UTA-DPCC with "transExitCode=134: Athena core dump or timeout, or conddb DB connect exception" in the third attempt due to:
> --------------------------------------------------------------
> found 258 particles
> AtRndmGenSvc INFO Initializing AtRndmGenSvc - package version AthenaServices-01-07-27
> INITIALISING RANDOM NUMBER STREAMS.
>
>
> HERWIG 6.507 8th March 2005
>
> Please reference: G. Marchesini, B.R. Webber,
> G.Abbiendi, I.G.Knowles, M.H.Seymour & L.Stanco
> Computer Physics Communications 67 (1992) 465
> and
> G.Corcella, I.G.Knowles, G.Marchesini, S.Moretti,
> K.Odagiri, P.Richardson, M.H.Seymour & B.R.Webber,
> JHEP 0101 (2001) 010
> fmt: end of file
> apparent state: unit 61 named mcatnlo31.005250.Wminenu._000020.events
> last format: (5(1X,D10.4),1X,A)
> lately reading sequential formatted external IO
> /data73/grid3-1.1.11/apps//atlas_app/atlas_rel/kitval/KitValidation/JobTransforms/JobTransforms-11-00-42-09/share/csc.evgen.mcatnlo.trf: line 224: 20334 Aborted athena.py job.py 2>&1
> --------------------------------------------------------------
>
> I opened a Savannah bug #19105.
>
> 4. 7 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.recotrig.v11000505 failed at IU with "transExitCode=2: Athena core dump" due to:
> --------------------------------------------------------------
> python: /N/Grid3/apps/atlas_app/atlas_rel/11.0.5/gcc-alt-3.2.3/lib/libgcc_s.so.1: version `GCC_4.2.0' not found (required by /usr/lib/libstdc++.so.5)
> Wed Aug 16 05:20:38 EST 2006
> mv: cannot stat `ntuple.root': No such file or directory
> --------------------------------------------------------------
>
> I created an RT ticket #430 at MWTier2.
>
> 5. 21 jobs of the type failed with "transExitCode=40: Athena crash - consult log file"; they are all user jobs.
>
> 6.
> 1 job, csc11.005023.FJ4_pythia_jetjet.digit.v11004206._00653.job, failed with "transExitCode=50: Athena crash - consult log file" due to:
> --------------------------------------------------------------
> ===> G4QGSMSplitableHadron - Fatal: Cannot sample parton densities under these constraints.
> G4HadronicProcess failed in ApplyYourself call for
> - Particle energy[GeV] = 16.702706
> - Material = Copper
> - Particle type = neutron
>
> *** G4Exception : 007
> issued by : G4HadronicProcess
> GeneralPostStepDoIt failed.
> *** Fatal Exception *** core dump ***
> --------------------------------------------------------------
>
> This is Savannah bug #16730; a fix went into Release 12.
>
> 7. 18 jobs of the following types failed with "transExitCode=60: TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)":
>
> testIdeal_06.005020.FJ1_pythia_jetjet.digit.v12000101
> testIdeal_06.005015.J6_pythia_jetjet.digit.v12000101
> testIdeal_06.005001.pythia_minbias.digit.v12000101
>
> Bug #18466 is closed; the 48-hour limit was too short for these jobs.
>
> 8. 74 jobs failed with "transExitCode=99: TRF_UNKNOWN - unknown transformation error":
>
> testIdeal_06.005015.J6_pythia_jetjet.digit.v12000101
> testIdeal_06.005145.PythiaZmumu.digit.v12000101
> testIdeal_06.005107.pythia_Wtauhad.digit.v12000101
>
> Bug #18349 is closed; the fix, in tag LArG4EC-00-00-71, will go into 12.X.0 and 12.0.X.
>
> 9. 174 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.digit.v11004205 failed at UC_Teraport with "DQ2 staging input file failed" early in the morning:
>
> -------------- Log from /tmp/Panda_Pilot_15987_1155712210/dq2get.out -----
> Getting POOL FileCatalog failed: cound not find the file in LRC!
> Could not get POOL FileCatalog!
> --------------------------------------------------------------
>
> An RT ticket #427 was created by Tomasz.
>
> 10.
> Some test jobs were sent to the OU_OSCER_ATLAS site to try to utilize a remote DQ2 server, as opposed to a local NFS-mounted one, as requested by Karthik. It turned out that the pilots that were sent used the old version of the DQ2ProdClient.py file. Xin was asked how to use the new version of the file, DQ2ProdClient2.py, in this test.
>
> Regards,
> Nurcan.
>
> _______________________________________________
> Usatlas-prodsys-l mailing list
> [log in to unmask]
> http://lists.bnl.gov/mailman/listinfo/usatlas-prodsys-l
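
On the question of graphing failure-rate trends: the percentages in the site summary lines of the report appear to be failed / (finished + failed). That is an inference from the numbers quoted above (for example 162 / (843 + 162) = 16.1% for BNL_ATLAS_1), not something the report states. A minimal Python sketch under that assumption, with hypothetical helper names parse_counts and failure_rate, that pulls the counts out of one summary line so they could be collected per report and plotted over time:

```python
import re


def parse_counts(status_line):
    """Extract 'state:count' pairs (defined, assigned, ..., failed) from one
    site summary line of the shift report. Helper name is hypothetical."""
    return {state: int(n) for state, n in re.findall(r"(\w+):(\d+)", status_line)}


def failure_rate(counts):
    """Failure rate in percent, assuming rate = failed / (finished + failed).

    Returns None when no job has finished or failed yet. Note the report
    itself prints no percentage for sites with zero finished jobs.
    """
    failed = counts.get("failed", 0)
    done = counts.get("finished", 0) + failed
    return None if done == 0 else 100.0 * failed / done


# Summary line copied from the report above.
line = ("BNL_ATLAS_1 defined:0 assigned:0 waiting:0 activated:343 "
        "running:283 finished:843 failed:162 (16.1%)")
counts = parse_counts(line)
print(round(failure_rate(counts), 1))  # 16.1
```

Storing one (date, site, rate) tuple per shift report would be enough to start on the long-term trend plot; the per-error-code rows could be handled the same way for the distribution of failure causes.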