FYI (relevant to the discussion of job success rates).
--
/------------------------------------+-------------------------\
|Stephen J. Gowdy, SLAC | CERN Office: 32-2-A22|
|http://www.slac.stanford.edu/~gowdy/ | CH-1211 Geneva 23 |
|http://calendar.yahoo.com/gowdy | Switzerland |
|EMail: [log in to unmask] | Tel: +41 22 767 5840 |
\------------------------------------+-------------------------/
---------- Forwarded message ----------
Date: Wed, 16 Aug 2006 16:00:11 -0500 (CDT)
From: Nurcan Ozturk <[log in to unmask]>
To: [log in to unmask]
Subject: [Usatlas-prodsys-l] Panda shift report August 14-15, 2006
Hi all,
Here is the Panda production status of the last 2 days:
Wed Aug 16 10:58:15 2006 Central
------------------------------------------------------------------------------------------------------------
All CEs and jobs. Show production, analysis, test, all jobs/CEs
------------------------------------------------------------------------------------------------------------
Job wall time: 30948 hrs Error losses: trans: 2219 (7.2%) panda: 832 (2.7%) ddm: 68 (0.2%) other: 506 (1.6%)
-------------------------------------------------------------------------------------------------------------
Error type (type count) Count CPU-hrs Latest Code: Description
-------------------------------------------------------------------------------------------------------------
All defined:1 assigned:161 waiting:474 activated:1010 running:752 finished:2339 failed:724 (23.6%)
ddmErrorCode (9) 8 0.0 08-15 01:24 100: Input file GUID not found or input prodDBlock not accessible
ddmErrorCode (9) 1 11.9 08-16 07:14 200: Could not add output files to dataset
jobDispatcherErrorCode (59) 59 698.9 08-16 01:01 100: Lost heartbeat
pilotErrorCode (251) 174 0.9 08-16 04:28 1099: DQ2 staging input file failed
pilotErrorCode (251) 1 0.1 08-14 17:06 1132: Saving output files to DDM area returned non-zero code
pilotErrorCode (251) 6 55.2 08-16 04:26 1142: DQ2 put error: failed to register the file on local SE
pilotErrorCode (251) 68 505.9 08-16 11:21 1150: Looping job killed by pilot
pilotErrorCode (251) 2 133.3 08-15 13:58 1200: Job killed by SIGTERM from batch system or Condor (eg walltime limit)
taskBufferErrorCode (149) 149 0.0 08-15 19:22 100: Job expired and killed six days after submission (or killed by user)
transExitCode (234) 72 13.0 08-16 04:30 1: Unspecified error, consult log file
transExitCode (234) 11 0.2 08-16 03:51 134: Athena core dump or timeout, or conddb DB connect exception
transExitCode (234) 1 76.8 08-14 12:00 143: Unknown error code
transExitCode (234) 8 6.4 08-16 10:30 2: Athena core dump
transExitCode (234) 35 26.5 08-16 06:23 40: Athena crash - consult log file
transExitCode (234) 14 89.8 08-14 12:08 41: TRF_OUTFILE - output file not found
transExitCode (234) 2 75.7 08-14 19:18 50: Athena crash - consult log file
transExitCode (234) 17 818.9 08-16 11:31 60: TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
transExitCode (234) 74 1111.3 08-15 18:12 99: TRF_UNKNOWN - unknown transformation error
----------------------------------------------------------------------------------------------------------
ANALY_BNL_ATLAS_1 defined:0 assigned:0 waiting:0 activated:0 running:0 finished:98 failed:18 (15.5%)
jobDispatcherErrorCode (1) 1 1.1 08-14 15:22 100: Lost heartbeat
pilotErrorCode (1) 1 0.1 08-14 17:06 1132: Saving output files to DDM area returned non-zero code
transExitCode (16) 16 2.5 08-16 06:23 40: Athena crash - consult log file
----------------------------------------------------------------------------------------------------------
ANALY_BNL_ATLAS_2 defined:1 assigned:0 waiting:0 activated:1 running:0 finished:0 failed:0
----------------------------------------------------------------------------------------------------------
ANALY_LONG_BNL_ATLAS defined:0 assigned:0 waiting:0 activated:0 running:0 finished:0 failed:19
transExitCode (19) 19 24.0 08-16 04:36 40: Athena crash - consult log file
---------------------------------------------------------------------------------------------------------
ANALY_UTA-DPCC
---------------------------------------------------------------------------------------------------------
BNL_ATLAS_1 defined:0 assigned:0 waiting:0 activated:343 running:283 finished:843 failed:162 (16.1%)
jobDispatcherErrorCode (48) 48 53.9 08-16 01:01 100: Lost heartbeat
pilotErrorCode (68) 68 505.9 08-16 11:21 1150: Looping job killed by pilot
taskBufferErrorCode (19) 19 0.0 08-15 17:44 100: Job expired and killed six days after submission (or killed by user)
transExitCode (27) 27 389.8 08-15 18:12 99: TRF_UNKNOWN - unknown transformation error
---------------------------------------------------------------------------------------------------------
BNL_ATLAS_2
--------------------------------------------------------------------------------------------------------
BU_ATLAS_Tier2 defined:0 assigned:0 waiting:0 activated:92 running:93 finished:124 failed:41 (24.8%)
pilotErrorCode (1) 1 66.7 08-15 02:44 1200: Job killed by SIGTERM from batch system or Condor (eg walltime limit)
transExitCode (40) 40 8.1 08-15 20:43 1: Unspecified error, consult log file
--------------------------------------------------------------------------------------------------------
BU_ATLAS_Tier2o defined:0 assigned:7 waiting:0 activated:10 running:12 finished:20 failed:2 (9.1%)
transExitCode (2) 2 0.3 08-14 11:59 1: Unspecified error, consult log file
--------------------------------------------------------------------------------------------------------
IU_ATLAS_Tier2 defined:0 assigned:10 waiting:0 activated:67 running:64 finished:157 failed:28 (15.1%)
transExitCode (28) 7 0.7 08-16 06:21 2: Athena core dump
transExitCode (28) 6 289.2 08-16 11:05 60: TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
transExitCode (28) 15 156.6 08-14 22:33 99: TRF_UNKNOWN - unknown transformation error
---------------------------------------------------------------------------------------------------------
Unassigned defined:0 assigned:0 waiting:474 activated:0 running:0 finished:0 failed:129
taskBufferErrorCode (129) 129 0.0 08-15 19:22 100: Job expired and killed six days after submission (or killed by user)
---------------------------------------------------------------------------------------------------------
OU_OCHEP_SWT2 defined:0 assigned:6 waiting:0 activated:116 running:81 finished:279 failed:40 (12.5%)
jobDispatcherErrorCode (5) 5 359.0 08-16 00:40 100: Lost heartbeat
taskBufferErrorCode (1) 1 0.0 08-15 17:41 100: Job expired and killed six days after submission (or killed by user)
transExitCode (34) 13 1.7 08-16 04:30 1: Unspecified error, consult log file
transExitCode (34) 1 76.8 08-14 12:00 143: Unknown error code
transExitCode (34) 1 89.5 08-14 11:54 41: TRF_OUTFILE - output file not found
transExitCode (34) 1 64.1 08-14 12:01 50: Athena crash - consult log file
transExitCode (34) 2 96.3 08-16 10:29 60: TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
transExitCode (34) 16 234.9 08-15 13:07 99: TRF_UNKNOWN - unknown transformation error
----------------------------------------------------------------------------------------------------------
PROD_SLAC defined:0 assigned:41 waiting:0 activated:0 running:0 finished:6 failed:0 (0.0%)
----------------------------------------------------------------------------------------------------------
UC_ATLAS_MWT2 defined:0 assigned:22 waiting:0 activated:101 running:56 finished:149 failed:4 (2.6%)
ddmErrorCode (1) 1 11.9 08-16 07:14 200: Could not add output files to dataset
transExitCode (3) 1 11.6 08-14 19:18 50: Athena crash - consult log file
transExitCode (3) 2 96.3 08-15 20:38 60: TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
----------------------------------------------------------------------------------------------------------
UC_Teraport defined:0 assigned:12 waiting:0 activated:183 running:64 finished:321 failed:202 (38.6%)
jobDispatcherErrorCode (1) 1 66.7 08-15 08:08 100: Lost heartbeat
pilotErrorCode (180) 174 0.9 08-16 04:28 1099: DQ2 staging input file failed
pilotErrorCode (180) 6 55.2 08-16 04:26 1142: DQ2 put error: failed to register the file on local SE
transExitCode (21) 7 1.4 08-15 03:15 1: Unspecified error, consult log file
transExitCode (21) 1 5.6 08-16 10:30 2: Athena core dump
transExitCode (21) 13 0.3 08-14 12:08 41: TRF_OUTFILE - output file not found
-----------------------------------------------------------------------------------------------------------
UTA-DPCC defined:0 assigned:61 waiting:0 activated:97 running:99 finished:342 failed:57 (14.3%)
ddmErrorCode (8) 8 0.0 08-15 01:24 100: Input file GUID not found or input prodDBlock not accessible
jobDispatcherErrorCode (4) 4 218.2 08-16 01:01 100: Lost heartbeat
pilotErrorCode (1) 1 66.7 08-15 13:58 1200: Job killed by SIGTERM from batch system or Condor (eg walltime limit)
transExitCode (44) 10 1.4 08-15 06:22 1: Unspecified error, consult log file
transExitCode (44) 11 0.2 08-16 03:51 134: Athena core dump or timeout, or conddb DB connect exception
transExitCode (44) 7 337.2 08-16 11:31 60: TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
transExitCode (44) 16 330.0 08-15 11:58 99: TRF_UNKNOWN - unknown transformation error
----------------------------------------------------------------------------------------------------------
UTA_SWT2 defined:0 assigned:2 waiting:0 activated:0 running:0 finished:0 failed:0
----------------------------------------------------------------------------------------------------------
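The percentages in the table above can be re-derived from the raw counts. The sketch below is not part of the Panda monitor. It assumes that the failure percentage is failed / (finished + failed) and that the "Error losses" figures are hours measured against the job wall time; both assumptions reproduce the printed numbers, but neither is confirmed by the report. The function names are hypothetical. Queues with no finished jobs carry no percentage in the report, and the sketch does not try to mimic that.

```python
import re

# Matches "state:count" pairs such as "finished:2339" in a status line.
STATUS_RE = re.compile(r"(\w+):(\d+)")


def parse_counts(line):
    """Return {state: count} for one status line of the monitor dump."""
    return {state: int(count) for state, count in STATUS_RE.findall(line)}


def failure_rate(counts):
    """Failed jobs as a percentage of completed (finished + failed) jobs.

    Assumption: this is how the monitor defines the percentage.
    Returns None when no job has completed.
    """
    done = counts.get("finished", 0) + counts.get("failed", 0)
    if done == 0:
        return None
    return 100.0 * counts.get("failed", 0) / done


def loss_percentages(wall_hours, losses):
    """Each loss category as a percentage of the total job wall time."""
    return {name: 100.0 * hours / wall_hours for name, hours in losses.items()}


# Lines copied from the report above.
all_line = ("All defined:1 assigned:161 waiting:474 activated:1010 "
            "running:752 finished:2339 failed:724 (23.6%)")
teraport_line = ("UC_Teraport defined:0 assigned:12 waiting:0 activated:183 "
                 "running:64 finished:321 failed:202 (38.6%)")

all_rate = failure_rate(parse_counts(all_line))
teraport_rate = failure_rate(parse_counts(teraport_line))
losses = loss_percentages(
    30948, {"trans": 2219, "panda": 832, "ddm": 68, "other": 506}
)

print("All: %.1f%% failed" % all_rate)
print("UC_Teraport: %.1f%% failed" % teraport_rate)
for name, pct in losses.items():
    print("%s: %.1f%% of wall time" % (name, pct))
```

Under these assumptions the success rate for a queue is simply 100 minus the failure rate; jobs still defined, assigned, waiting, activated or running are not counted either way.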
The pilot job status from the submit host:
------------------------------------------------------------------
[sm@atlas002 jobscheduler]$ ./queue-summary.py
====================
Condor Queue Summary
====================
condor_q run at Wed Aug 16 10:58:15 2006
Maximum jobs on a remote host (all but UNKNOWN & UNSUBMITTED): 200
Maximum jobs being sent to remote host: 5
atlas.bu.edu
PENDING 59
ACTIVE 105
atlas.dpcc.uta.edu
PENDING 100
ACTIVE 100
UNSUBMITTED 9
atlas.iu.edu
PENDING 52
ACTIVE 60
gk01.swt2.uta.edu
PENDING 5
ACTIVE 1
STAGE_OUT 2
osgserv01.slac.stanford.edu
PENDING 52
STAGE_OUT 1
tier2-01.ochep.ou.edu
PENDING 50
ACTIVE 80
tier2-osg.uchicago.edu
PENDING 52
ACTIVE 41
tp-osg.uchicago.edu
PENDING 54
ACTIVE 63
------------------------------------------------------------------
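To compare the pilot counts across gatekeepers, the summary above can be totalled per state. This is a minimal sketch that is independent of queue-summary.py, whose internals are not shown here. It assumes the layout printed above: a host name on its own line, followed by "STATE count" lines. The function names are hypothetical.

```python
from collections import Counter

# Host and state lines copied from the queue summary above.
SUMMARY = """\
atlas.bu.edu
PENDING 59
ACTIVE 105
atlas.dpcc.uta.edu
PENDING 100
ACTIVE 100
UNSUBMITTED 9
atlas.iu.edu
PENDING 52
ACTIVE 60
gk01.swt2.uta.edu
PENDING 5
ACTIVE 1
STAGE_OUT 2
osgserv01.slac.stanford.edu
PENDING 52
STAGE_OUT 1
tier2-01.ochep.ou.edu
PENDING 50
ACTIVE 80
tier2-osg.uchicago.edu
PENDING 52
ACTIVE 41
tp-osg.uchicago.edu
PENDING 54
ACTIVE 63
"""


def parse_queue_summary(text):
    """Return {host: {state: count}} from the host/state listing."""
    per_host = {}
    host = None
    for raw in text.splitlines():
        parts = raw.split()
        if not parts:
            continue
        if len(parts) == 2 and parts[1].isdigit():
            # A "STATE count" line belongs to the most recent host.
            if host is None:
                raise ValueError("state line before any host: %r" % raw)
            per_host[host][parts[0]] = int(parts[1])
        elif len(parts) == 1:
            # A single token on its own line is taken to be a host name.
            host = parts[0]
            per_host[host] = {}
    return per_host


def state_totals(per_host):
    """Sum the counts of each state over all hosts."""
    totals = Counter()
    for states in per_host.values():
        totals.update(states)
    return totals


per_host = parse_queue_summary(SUMMARY)
totals = state_totals(per_host)
for state, count in sorted(totals.items()):
    print("%-12s %d" % (state, count))
```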
Some notes:
1. 51 jobs of the type csc11.005538.AlpgenJimmyToplnlnNp3.evgen.v11004211
failed with "transExitCode=1: Unspecified error, consult log file" in
their second and third attempts, due to:
-------- Problem report -------
[Unknown Problem]
!!! AthenaEventLoo ERROR Terminating event processing loop due to errors!!!
================================
43 jobs of the same type failed with "lost heartbeat" for the same
reason. I opened Savannah bug #19047.
2. 6 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.digit.v11004205
failed at UTA-DPCC with "transExitCode=1:Unspecified error, consult log
file" due to:
-------------------------------------------------------------------------------
G4AtlasAlg: Event Nr. 1 start processing
EA2F6442-1926-DB11-9647-00123F20A423 Error Cannot open container,
invalid Database handle.
StorageSvc Error The requested
container:POOLContainer_McEventCollection cannot be opened!
*** Break *** segmentation violation
Generating stack trace...
/usr/bin/addr2line: python: No such file or directory
/usr/bin/addr2line: python: No such file or directory
0x03d4373e in FadsGeneratorT<AthenaHepMCInterface>::GenerateAnEvent() + 0x1e from
/data73/grid3-1.1.11/apps/atlas_app/atlas_rel/11.0.42/dist/11.0.42/InstallArea/
i686-slc3-gcc323-opt/
----------------------------------------------------------------------------------
I opened Savannah bug #19104.
3. 11 jobs of the type csc11.005250.McAtNloWminenu.evgen.v11004209 failed
at UTA-DPCC with "transExitCode=134: Athena core dump or timeout, or
conddb DB connect exception" in the third attempt due to:
----------------------------------------------------------------------------------
found 258 particles
AtRndmGenSvc INFO Initializing AtRndmGenSvc - package version
AthenaServices-01-07-27
INITIALISING RANDOM NUMBER STREAMS.
HERWIG 6.507 8th March 2005
Please reference: G. Marchesini, B.R. Webber,
G.Abbiendi, I.G.Knowles, M.H.Seymour & L.Stanco
Computer Physics Communications 67 (1992) 465
and
G.Corcella, I.G.Knowles, G.Marchesini, S.Moretti,
K.Odagiri, P.Richardson, M.H.Seymour & B.R.Webber,
JHEP 0101 (2001) 010
fmt: end of file
apparent state: unit 61 named mcatnlo31.005250.Wminenu._000020.events
last format: (5(1X,D10.4),1X,A)
lately reading sequential formatted external IO
/data73/grid3-1.1.11/apps//atlas_app/atlas_rel/kitval/KitValidation/JobTransforms/
JobTransforms-11-00-42-09/share/csc.evgen.mcatnlo.trf:
line 224: 20334 Aborted athena.py job.py 2>&1
----------------------------------------------------------------------------------
I opened Savannah bug #19105.
4. 7 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.recotrig.v11000505
failed at IU with "transExitCode=2: Athena core dump" due to:
-------------------------------------------------------------------------
python:
/N/Grid3/apps/atlas_app/atlas_rel/11.0.5/gcc-alt-3.2.3/lib/libgcc_s.so.1:
version `GCC_4.2.0' not found (required by /usr/lib/libstdc++.so.5)
Wed Aug 16 05:20:38 EST 2006
mv: cannot stat `ntuple.root': No such file or directory
-------------------------------------------------------------------------
I created RT ticket #430 at MWTier2.
5. 21 jobs failed with "transExitCode=40: Athena crash - consult log
file"; they are all user jobs.
6. 1 job, csc11.005023.FJ4_pythia_jetjet.digit.v11004206._00653.job,
failed with "transExitCode=50: Athena crash - consult log file" due to:
-----------------------------------------------------------------------
===> G4QGSMSplitableHadron - Fatal: Cannot sample
parton densities under these constraints.
G4HadronicProcess failed in ApplyYourself call for
- Particle energy[GeV] = 16.702706
- Material = Copper
- Particle type = neutron
*** G4Exception : 007
issued by : G4HadronicProcess
GeneralPostStepDoIt failed.
*** Fatal Exception *** core dump ***
------------------------------------------------------------------------
This is Savannah bug #16730; a fix went into Release 12.
7. 18 jobs of the following types failed with "transExitCode=60:
TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)":
testIdeal_06.005020.FJ1_pythia_jetjet.digit.v12000101
testIdeal_06.005015.J6_pythia_jetjet.digit.v12000101
testIdeal_06.005001.pythia_minbias.digit.v12000101
Bug #18466 is closed; the 48-hour limit was too short for these jobs.
8. 74 jobs failed with "transExitCode=99: TRF_UNKNOWN - unknown
transformation error":
testIdeal_06.005015.J6_pythia_jetjet.digit.v12000101
testIdeal_06.005145.PythiaZmumu.digit.v12000101
testIdeal_06.005107.pythia_Wtauhad.digit.v12000101
Bug #18349 is closed; the fix, in tag LArG4EC-00-00-71, will go into
12.X.0 and 12.0.X.
9. 174 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.digit.v11004205
failed at UC_Teraport with "DQ2 staging input file failed" early in the morning:
-------------- Log from /tmp/Panda_Pilot_15987_1155712210/dq2get.out -----
Getting POOL FileCatalog failed: cound not find the file in LRC!
Could not get POOL FileCatalog!
---------------------------------------------------------------------------
RT ticket #427 was created by Tomasz.
10. Some test jobs were sent to the OU_OSCER_ATLAS site, as requested by
Karthik, to try using a remote DQ2 server instead of a local
NFS-mounted one. It turned out that the pilots that were sent used
the old version of the DQ2ProdClient.py file. Xin was asked how to
use the new version of the file, DQ2ProdClient2.py, in this test.
Regards,
Nurcan.
_______________________________________________
Usatlas-prodsys-l mailing list
[log in to unmask]
http://lists.bnl.gov/mailman/listinfo/usatlas-prodsys-l