Hi Charlie,
I couldn't see anything. I thought there might be something on the
Panda monitor page, but it doesn't look like it:
http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query
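
If no graph exists, a long-term trend could be assembled from the archived shift reports themselves. The following is only a minimal sketch, assuming the per-site status lines keep the "state:count" layout seen in the report quoted below. The formula failed / (finished + failed) is inferred from the percentages printed in that report, not taken from Panda documentation, and the function names are hypothetical.

```python
import re

# Job states as they appear in the status lines of the shift report.
STATE_RE = re.compile(
    r"\b(defined|assigned|waiting|activated|running|finished|failed):(\d+)"
)

def parse_states(entry):
    """Return {state: count} for one status entry; mail-wrapped lines are fine."""
    return {name: int(count) for name, count in STATE_RE.findall(entry)}

def failure_rate(states):
    """Percentage failed / (finished + failed); None if no job has completed.

    Assumption: this is how the report's percentages are computed. It
    reproduces the printed figures but is not a documented definition.
    """
    failed = states.get("failed", 0)
    done = states.get("finished", 0) + failed
    if done == 0:
        return None
    return 100.0 * failed / done

# The "All" entry from the August 14-15 report, wrapped as in the mail.
entry = """All defined:1 assigned:161 waiting:474 activated:1010
running:752 finished:2339 failed:724 (23.6%)"""

states = parse_states(entry)
rate = failure_rate(states)
print(f"{rate:.1f}%")
```

Running this over each archived report and recording the rate per site and date would give the points for a trend plot.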
regards,
Stephen.
On Thu, 17 Aug 2006, Young, Charles C. wrote:
> Hi Stephen,
>
> It's clear that Nurcan has done quite a bit of work to digest some of this information. Do you know if (a subset of) this information is graphed somewhere? I am curious about long-term trends of failure rate, distribution of failure causes, etc. Cheers.
>
> Charlie
> --
> Charles C. Young
> M.S. 43, Stanford Linear Accelerator Center
> P.O. Box 20450
> Stanford, CA 94309
> [log in to unmask]
> voice (650) 926 2669
> fax (650) 926 2923
> CERN GSM +41 76 487 2069
>
>> -----Original Message-----
>> From: [log in to unmask]
>> [mailto:[log in to unmask]] On
>> Behalf Of Stephen J. Gowdy
>> Sent: Thursday, August 17, 2006 12:19 AM
>> To: atlas-sccs-planning-l
>> Subject: [Usatlas-prodsys-l] Panda shift report August 14-15,
>> 2006 (fwd)
>>
>> FYI (to do with discussion of success rate for jobs).
>>
>> --
>> /------------------------------------+-------------------------\
>> |Stephen J. Gowdy, SLAC | CERN Office: 32-2-A22|
>> |http://www.slac.stanford.edu/~gowdy/ | CH-1211 Geneva 23 |
>> |http://calendar.yahoo.com/gowdy | Switzerland |
>> |EMail: [log in to unmask] | Tel: +41 22 767 5840 |
>> \------------------------------------+-------------------------/
>>
>> ---------- Forwarded message ----------
>> Date: Wed, 16 Aug 2006 16:00:11 -0500 (CDT)
>> From: Nurcan Ozturk <[log in to unmask]>
>> To: [log in to unmask]
>> Subject: [Usatlas-prodsys-l] Panda shift report August 14-15, 2006
>>
>> Hi all,
>>
>> Here is the Panda production status of the last 2 days:
>>
>> Wed Aug 16 10:58:15 2006 Central
>> --------------------------------------------------------------
>> ----------------------------------------------
>> All CEs and jobs. Show production, analysis, test, all jobs/CEs
>> --------------------------------------------------------------
>> ----------------------------------------------
>> Job wall time: 30948 hrs Error losses: trans: 2219 (7.2%)
>> panda: 832 (2.7%) ddm: 68 (0.2%) other: 506 (1.6%)
>> --------------------------------------------------------------
>> -----------------------------------------------
>> Error type (type count) Count CPU-hrs Latest Code:
>> Description
>> --------------------------------------------------------------
>> -----------------------------------------------
>> All defined:1 assigned:161 waiting:474 activated:1010
>> running:752 finished:2339 failed:724 (23.6%)
>>
>> ddmErrorCode (9) 8 0.0 08-15 01:24 100:
>> Input file GUID not found or input prodDBlock not accessible
>> ddmErrorCode (9) 1 11.9 08-16 07:14 200:
>> Could not add output files to dataset
>> jobDispatcherErrorCode (59) 59 698.9 08-16
>> 01:01 100: Lost heartbeat
>> pilotErrorCode (251) 174 0.9 08-16 04:28
>> 1099: DQ2 staging input file failed
>> pilotErrorCode (251) 1 0.1 08-14 17:06
>> 1132: Saving output files to DDM area returned non-zero code
>> pilotErrorCode (251) 6 55.2 08-16 04:26
>> 1142: DQ2 put error: failed to register the file on local SE
>> pilotErrorCode (251) 68 505.9 08-16 11:21
>> 1150: Looping job killed by pilot
>> pilotErrorCode (251) 2 133.3 08-15 13:58
>> 1200: Job killed by SIGTERM from batch system or Condor (eg
>> walltime limit)
>> taskBufferErrorCode (149) 149 0.0 08-15 19:22
>> 100: Job expired and killed six days after submission (or
>> killed by user)
>> transExitCode (234) 72 13.0 08-16 04:30
>> 1: Unspecified error, consult log file
>> transExitCode (234) 11 0.2 08-16 03:51
>> 134: Athena core dump or timeout, or conddb DB connect exception
>> transExitCode (234) 1 76.8 08-14 12:00
>> 143: Unknown error code
>> transExitCode (234) 8 6.4 08-16 10:30
>> 2: Athena core dump
>> transExitCode (234) 35 26.5 08-16 06:23
>> 40: Athena crash - consult log file
>> transExitCode (234) 14 89.8 08-14 12:08
>> 41: TRF_OUTFILE - output file not found
>> transExitCode (234) 2 75.7 08-14 19:18
>> 50: Athena crash - consult log file
>> transExitCode (234) 17 818.9 08-16 11:31
>> 60: TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
>> transExitCode (234) 74 1111.3 08-15 18:12
>> 99: TRF_UNKNOWN - unknown transformation error
>> --------------------------------------------------------------
>> --------------------------------------------
>> ANALY_BNL_ATLAS_1 defined:0 assigned:0 waiting:0
>> activated:0 running:0 finished:98 failed:18 (15.5%)
>>
>> jobDispatcherErrorCode (1) 1 1.1 08-14 15:22
>> 100: Lost heartbeat
>> pilotErrorCode (1) 1 0.1 08-14 17:06 1132:
>> Saving output files to DDM area returned non-zero code
>> transExitCode (16) 16 2.5 08-16 06:23 40:
>> Athena crash - consult log file
>> --------------------------------------------------------------
>> --------------------------------------------
>> ANALY_BNL_ATLAS_2 defined:1 assigned:0 waiting:0
>> activated:1 running:0 finished:0 failed:0
>> --------------------------------------------------------------
>> --------------------------------------------
>> ANALY_LONG_BNL_ATLAS defined:0 assigned:0 waiting:0
>> activated:0 running:0 finished:0 failed:19
>>
>> transExitCode (19) 19 24.0 08-16 04:36 40:
>> Athena crash - consult log file
>> --------------------------------------------------------------
>> -------------------------------------------
>> ANALY_UTA-DPCC
>> --------------------------------------------------------------
>> -------------------------------------------
>> BNL_ATLAS_1 defined:0 assigned:0 waiting:0
>> activated:343 running:283 finished:843 failed:162 (16.1%)
>>
>> jobDispatcherErrorCode (48) 48 53.9 08-16
>> 01:01 100: Lost heartbeat
>> pilotErrorCode (68) 68 505.9 08-16 11:21
>> 1150: Looping job killed by pilot
>> taskBufferErrorCode (19) 19 0.0 08-15 17:44
>> 100: Job expired and killed six days after submission (or
>> killed by user)
>> transExitCode (27) 27 389.8 08-15 18:12 99:
>> TRF_UNKNOWN - unknown transformation error
>> --------------------------------------------------------------
>> -------------------------------------------
>> BNL_ATLAS_2
>> --------------------------------------------------------------
>> ------------------------------------------
>> BU_ATLAS_Tier2 defined:0 assigned:0 waiting:0
>> activated:92 running:93 finished:124 failed:41 (24.8%)
>>
>> pilotErrorCode (1) 1 66.7 08-15 02:44 1200:
>> Job killed by SIGTERM from batch system or Condor (eg walltime limit)
>> transExitCode (40) 40 8.1 08-15 20:43 1:
>> Unspecified error, consult log file
>> --------------------------------------------------------------
>> ------------------------------------------
>> BU_ATLAS_Tier2o defined:0 assigned:7 waiting:0
>> activated:10 running:12 finished:20 failed:2 (9.1%)
>> transExitCode (2) 2 0.3 08-14 11:59 1:
>> Unspecified error, consult log file
>> --------------------------------------------------------------
>> ------------------------------------------
>> IU_ATLAS_Tier2 defined:0 assigned:10 waiting:0
>> activated:67 running:64 finished:157 failed:28 (15.1%)
>> transExitCode (28) 7 0.7 08-16 06:21 2:
>> Athena core dump
>> transExitCode (28) 6 289.2 08-16 11:05 60:
>> TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
>> transExitCode (28) 15 156.6 08-14 22:33 99:
>> TRF_UNKNOWN - unknown transformation error
>> --------------------------------------------------------------
>> -------------------------------------------
>> Unassigned defined:0 assigned:0 waiting:474
>> activated:0 running:0 finished:0 failed:129
>>
>> taskBufferErrorCode (129) 129 0.0 08-15 19:22
>> 100: Job expired and killed six days after submission (or
>> killed by user)
>> --------------------------------------------------------------
>> -------------------------------------------
>> OU_OCHEP_SWT2 defined:0 assigned:6 waiting:0
>> activated:116 running:81 finished:279 failed:40 (12.5%)
>>
>> jobDispatcherErrorCode (5) 5 359.0 08-16 00:40
>> 100: Lost heartbeat
>> taskBufferErrorCode (1) 1 0.0 08-15 17:41
>> 100: Job expired and killed six days after submission (or
>> killed by user)
>> transExitCode (34) 13 1.7 08-16 04:30 1:
>> Unspecified error, consult log file
>> transExitCode (34) 1 76.8 08-14 12:00 143:
>> Unknown error code
>> transExitCode (34) 1 89.5 08-14 11:54 41:
>> TRF_OUTFILE - output file not found
>> transExitCode (34) 1 64.1 08-14 12:01 50:
>> Athena crash - consult log file
>> transExitCode (34) 2 96.3 08-16 10:29 60:
>> TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
>> transExitCode (34) 16 234.9 08-15 13:07 99:
>> TRF_UNKNOWN - unknown transformation error
>> --------------------------------------------------------------
>> --------------------------------------------
>> PROD_SLAC defined:0 assigned:41 waiting:0
>> activated:0 running:0 finished:6 failed:0 (0.0%)
>> --------------------------------------------------------------
>> --------------------------------------------
>> UC_ATLAS_MWT2 defined:0 assigned:22 waiting:0
>> activated:101 running:56 finished:149 failed:4 (2.6%)
>>
>> ddmErrorCode (1) 1 11.9 08-16 07:14 200:
>> Could not add output files to dataset
>> transExitCode (3) 1 11.6 08-14 19:18 50:
>> Athena crash - consult log file
>> transExitCode (3) 2 96.3 08-15 20:38 60:
>> TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
>> --------------------------------------------------------------
>> --------------------------------------------
>> UC_Teraport defined:0 assigned:12 waiting:0
>> activated:183 running:64 finished:321 failed:202 (38.6%)
>>
>> jobDispatcherErrorCode (1) 1 66.7 08-15 08:08
>> 100: Lost heartbeat
>> pilotErrorCode (180) 174 0.9 08-16 04:28
>> 1099: DQ2 staging input file failed
>> pilotErrorCode (180) 6 55.2 08-16 04:26
>> 1142: DQ2 put error: failed to register the file on local SE
>> transExitCode (21) 7 1.4 08-15 03:15 1:
>> Unspecified error, consult log file
>> transExitCode (21) 1 5.6 08-16 10:30 2:
>> Athena core dump
>> transExitCode (21) 13 0.3 08-14 12:08 41:
>> TRF_OUTFILE - output file not found
>> --------------------------------------------------------------
>> ---------------------------------------------
>> UTA-DPCC defined:0 assigned:61 waiting:0
>> activated:97 running:99 finished:342 failed:57 (14.3%)
>>
>> ddmErrorCode (8) 8 0.0 08-15 01:24 100:
>> Input file GUID not found or input prodDBlock not accessible
>> jobDispatcherErrorCode (4) 4 218.2 08-16 01:01
>> 100: Lost heartbeat
>> pilotErrorCode (1) 1 66.7 08-15 13:58 1200:
>> Job killed by SIGTERM from batch system or Condor (eg walltime limit)
>> transExitCode (44) 10 1.4 08-15 06:22 1:
>> Unspecified error, consult log file
>> transExitCode (44) 11 0.2 08-16 03:51 134:
>> Athena core dump or timeout, or conddb DB connect exception
>> transExitCode (44) 7 337.2 08-16 11:31 60:
>> TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
>> transExitCode (44) 16 330.0 08-15 11:58 99:
>> TRF_UNKNOWN - unknown transformation error
>> --------------------------------------------------------------
>> --------------------------------------------
>> UTA_SWT2 defined:0 assigned:2 waiting:0
>> activated:0 running:0 finished:0 failed:0
>> --------------------------------------------------------------
>> --------------------------------------------
>>
>>
>> The pilot job status from the submit host:
>> ------------------------------------------------------------------
>> [sm@atlas002 jobscheduler]$ ./queue-summary.py
>> ==================== Condor Queue Summary
>> ==================== condor_q run at Wed Aug 16 10:58:15 2006
>> Maximum jobs on a remote host (all but UNKNOWN & UNSUBMITTED): 200
>> Maximum jobs being sent to remote host: 5
>>
>> atlas.bu.edu
>> PENDING 59
>> ACTIVE 105
>>
>> atlas.dpcc.uta.edu
>> PENDING 100
>> ACTIVE 100
>> UNSUBMITTED 9
>>
>> atlas.iu.edu
>> PENDING 52
>> ACTIVE 60
>>
>> gk01.swt2.uta.edu
>> PENDING 5
>> ACTIVE 1
>> STAGE_OUT 2
>>
>> osgserv01.slac.stanford.edu
>> PENDING 52
>> STAGE_OUT 1
>>
>> tier2-01.ochep.ou.edu
>> PENDING 50
>> ACTIVE 80
>>
>> tier2-osg.uchicago.edu
>> PENDING 52
>> ACTIVE 41
>>
>> tp-osg.uchicago.edu
>> PENDING 54
>> ACTIVE 63
>>
>> ------------------------------------------------------------------
>>
>> Some notes:
>>
>> 1. 51 jobs of the type
>> csc11.005538.AlpgenJimmyToplnlnNp3.evgen.v11004211
>> failed with "transExitCode=1:Unspecified error, consult log
>> file" in their second and third attempts, due to:
>>
>> -------- Problem report -------
>> [Unknown Problem]
>> !!! AthenaEventLoo ERROR Terminating event processing loop
>> due to errors!!!
>> ================================
>>
>> 43 jobs of the same type failed with "lost heartbeat" due to
>> the same reason. I opened a Savannah bug #19047.
>>
>> 2. 6 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.digit.v11004205
>> failed at UTA-DPCC with "transExitCode=1:Unspecified error,
>> consult log file" due to:
>> --------------------------------------------------------------
>> -----------------
>> G4AtlasAlg: Event Nr. 1 start processing
>> EA2F6442-1926-DB11-9647-00123F20A423 Error Cannot open container,
>> invalid Database handle.
>> StorageSvc Error The requested
>> container:POOLContainer_McEventCollection cannot be opened!
>>
>> *** Break *** segmentation violation
>> Generating stack trace...
>> /usr/bin/addr2line: python: No such file or directory
>> /usr/bin/addr2line: python: No such file or directory
>> 0x03d4373e in
>> FadsGeneratorT<AthenaHepMCInterface>::GenerateAnEvent() +
>> 0x1e from
>> /data73/grid3-1.1.11/apps/atlas_app/atlas_rel/11.0.42/dist/11.
>> 0.42/InstallArea/
>> i686-slc3-gcc323-opt/
>> --------------------------------------------------------------
>> --------------------
>>
>> I opened a Savannah bug #19104.
>>
>> 3. 11 jobs of the type
>> csc11.005250.McAtNloWminenu.evgen.v11004209 failed at
>> UTA-DPCC with "transExitCode=134: Athena core dump or
>> timeout, or conddb DB connect exception" in the third attempt due to:
>> --------------------------------------------------------------
>> --------------------
>> found 258 particles
>> AtRndmGenSvc INFO Initializing AtRndmGenSvc - package version
>> AthenaServices-01-07-27
>> INITIALISING RANDOM NUMBER STREAMS.
>>
>>
>> HERWIG 6.507 8th March 2005
>>
>> Please reference: G. Marchesini, B.R. Webber,
>> G.Abbiendi, I.G.Knowles, M.H.Seymour & L.Stanco
>> Computer Physics Communications 67 (1992) 465
>> and
>> G.Corcella, I.G.Knowles, G.Marchesini, S.Moretti,
>> K.Odagiri, P.Richardson, M.H.Seymour & B.R.Webber,
>> JHEP 0101 (2001) 010
>> fmt: end of file
>> apparent state: unit 61 named mcatnlo31.005250.Wminenu._000020.events
>> last format: (5(1X,D10.4),1X,A)
>> lately reading sequential formatted external IO
>> /data73/grid3-1.1.11/apps//atlas_app/atlas_rel/kitval/KitValid
>> ation/JobTransforms/
>> JobTransforms-11-00-42-09/share/csc.evgen.mcatnlo.trf:
>> line 224: 20334 Aborted athena.py job.py 2>&1
>> --------------------------------------------------------------
>> --------------------
>>
>> I opened a Savannah bug #19105.
>>
>> 4. 7 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.recotrig.v11000505
>> failed at IU with "transExitCode=2: Athena core dump" due to:
>> --------------------------------------------------------------
>> -----------
>> python:
>> /N/Grid3/apps/atlas_app/atlas_rel/11.0.5/gcc-alt-3.2.3/lib/lib
>> gcc_s.so.1:
>> version `GCC_4.2.0' not found (required by
>> /usr/lib/libstdc++.so.5) Wed Aug 16 05:20:38 EST 2006
>> mv: cannot stat `ntuple.root': No such file or directory
>> --------------------------------------------------------------
>> -----------
>>
>> I created an RT ticket #430 at MWTier2.
>>
>> 5. 21 jobs of the type failed with "transExitCode=40: Athena
>> crash - consult log file", they are all user jobs.
>>
>> 6. 1 job, csc11.005023.FJ4_pythia_jetjet.digit.v11004206._00653.job,
>> failed with "transExitCode=50: Athena crash - consult log
>> file" due to:
>> --------------------------------------------------------------
>> ---------
>> ===> G4QGSMSplitableHadron - Fatal: Cannot sample parton
>> densities under these constraints.
>> G4HadronicProcess failed in ApplyYourself call for
>> - Particle energy[GeV] = 16.702706
>> - Material = Copper
>> - Particle type = neutron
>>
>> *** G4Exception : 007
>> issued by : G4HadronicProcess
>> GeneralPostStepDoIt failed.
>> *** Fatal Exception *** core dump ***
>> --------------------------------------------------------------
>> ----------
>>
>> Savannah bug #16730, a fix went into Release 12.
>>
>> 7. 18 jobs of the following type failed with "transExitCode=60:
>> TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)":
>>
>> testIdeal_06.005020.FJ1_pythia_jetjet.digit.v12000101
>> testIdeal_06.005015.J6_pythia_jetjet.digit.v12000101
>> testIdeal_06.005001.pythia_minbias.digit.v12000101
>>
>> Bug #18466 is closed; the 48-hour limit was too short for these jobs.
>>
>> 8. 74 jobs failed with "transExitCode=99: TRF_UNKNOWN -
>> unknown transformation error"
>>
>> testIdeal_06.005015.J6_pythia_jetjet.digit.v12000101
>> testIdeal_06.005145.PythiaZmumu.digit.v12000101
>> testIdeal_06.005107.pythia_Wtauhad.digit.v12000101
>>
>> Bug #18349 is closed; the fix, in tag LArG4EC-00-00-71, will go
>> into 12.X.0 and 12.0.X.
>>
>> 9. 174 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.digit.v11004205
>> failed at UC_Teraport with "DQ2 staging input file failed"
>> early morning:
>>
>> -------------- Log from
>> /tmp/Panda_Pilot_15987_1155712210/dq2get.out ----- Getting
>> POOL FileCatalog failed: cound not find the file in LRC!
>> Could not get POOL FileCatalog!
>> --------------------------------------------------------------
>> -------------
>>
>> An RT ticket #427 was created by Tomasz.
>>
>> 10. Some test jobs were sent to OU_OSCER_ATLAS site to try to
>> utilize a remote DQ2 server as opposed to a local NFS mounted
>> one as requested by Karthik. It turned out that the pilots
>> that were sent used the old version of DQ2ProdClient.py file.
>> Xin was asked about how to use the new version of the file,
>> DQ2ProdClient2.py, in this test.
>>
>> Regards,
>> Nurcan.
>>
>> _______________________________________________
>> Usatlas-prodsys-l mailing list
>> [log in to unmask]
>> http://lists.bnl.gov/mailman/listinfo/usatlas-prodsys-l
>>
>
--
/------------------------------------+-------------------------\
|Stephen J. Gowdy, SLAC | CERN Office: 32-2-A22|
|http://www.slac.stanford.edu/~gowdy/ | CH-1211 Geneva 23 |
|http://calendar.yahoo.com/gowdy | Switzerland |
|EMail: [log in to unmask] | Tel: +41 22 767 5840 |
\------------------------------------+-------------------------/