FYI (to do with discussion of success rate for jobs).

--
  /------------------------------------+-------------------------\
|Stephen J. Gowdy, SLAC               | CERN     Office: 32-2-A22|
|http://www.slac.stanford.edu/~gowdy/ | CH-1211 Geneva 23        |
|http://calendar.yahoo.com/gowdy      | Switzerland              |
|EMail: [log in to unmask]       | Tel: +41 22 767 5840     |
  \------------------------------------+-------------------------/

---------- Forwarded message ----------
Date: Wed, 16 Aug 2006 16:00:11 -0500 (CDT)
From: Nurcan Ozturk <[log in to unmask]>
To: [log in to unmask]
Subject: [Usatlas-prodsys-l] Panda shift report August 14-15, 2006

Hi all,

Here is the Panda production status of the last 2 days:

Wed Aug 16 10:58:15 2006 Central
------------------------------------------------------------------------------------------------------------
All CEs and jobs.   Show production, analysis, test, all jobs/CEs
------------------------------------------------------------------------------------------------------------
Job wall time: 30948 hrs  Error losses: trans: 2219 (7.2%)   panda: 832 (2.7%)   ddm: 68 (0.2%)   other: 506 (1.6%)
-------------------------------------------------------------------------------------------------------------
Error type (type count)	Count	CPU-hrs	Latest	Code:	Description
-------------------------------------------------------------------------------------------------------------
All	defined:1   assigned:161   waiting:474   activated:1010   running:752   finished:2339   failed:724   (23.6%)

ddmErrorCode (9)   	8	0.0	08-15 01:24	100:	Input file GUID not found or input prodDBlock not accessible
ddmErrorCode (9)   	1	11.9	08-16 07:14	200:	Could not add output files to dataset
jobDispatcherErrorCode (59)   	59	698.9	08-16 01:01	100:	Lost heartbeat
pilotErrorCode (251)   	174	0.9	08-16 04:28	1099:	DQ2 staging input file failed
pilotErrorCode (251)   	1	0.1	08-14 17:06	1132:	Saving output files to DDM area returned non-zero code
pilotErrorCode (251)   	6	55.2	08-16 04:26	1142:	DQ2 put error: failed to register the file on local SE
pilotErrorCode (251)   	68	505.9	08-16 11:21	1150:	Looping job killed by pilot
pilotErrorCode (251)   	2	133.3	08-15 13:58	1200:	Job killed by SIGTERM from batch system or Condor (eg walltime limit)
taskBufferErrorCode (149)   	149	0.0	08-15 19:22	100:	Job expired and killed six days after submission (or killed by user)
transExitCode (234)   	72	13.0	08-16 04:30	1:	Unspecified error, consult log file
transExitCode (234)   	11	0.2	08-16 03:51	134:	Athena core dump or timeout, or conddb DB connect exception
transExitCode (234)   	1	76.8	08-14 12:00	143:	Unknown error code
transExitCode (234)   	8	6.4	08-16 10:30	2:	Athena core dump
transExitCode (234)   	35	26.5	08-16 06:23	40:	Athena crash - consult log file
transExitCode (234)   	14	89.8	08-14 12:08	41:	TRF_OUTFILE - output file not found
transExitCode (234)   	2	75.7	08-14 19:18	50:	Athena crash - consult log file
transExitCode (234)   	17	818.9	08-16 11:31	60:	TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
transExitCode (234)   	74	1111.3	08-15 18:12	99:	TRF_UNKNOWN - unknown transformation error
----------------------------------------------------------------------------------------------------------
ANALY_BNL_ATLAS_1	defined:0   assigned:0   waiting:0   activated:0   running:0   finished:98   failed:18   (15.5%)

jobDispatcherErrorCode (1)   	1	1.1	08-14 15:22	100:	Lost heartbeat
pilotErrorCode (1)   	1	0.1	08-14 17:06	1132:	Saving output files to DDM area returned non-zero code
transExitCode (16)   	16	2.5	08-16 06:23	40:	Athena crash - consult log file
----------------------------------------------------------------------------------------------------------
ANALY_BNL_ATLAS_2	defined:1   assigned:0   waiting:0   activated:1   running:0   finished:0   failed:0
----------------------------------------------------------------------------------------------------------
ANALY_LONG_BNL_ATLAS	defined:0   assigned:0   waiting:0   activated:0   running:0   finished:0   failed:19

transExitCode (19)   	19	24.0	08-16 04:36	40:	Athena crash - consult log file
---------------------------------------------------------------------------------------------------------
ANALY_UTA-DPCC
---------------------------------------------------------------------------------------------------------
BNL_ATLAS_1	defined:0   assigned:0   waiting:0   activated:343   running:283   finished:843   failed:162   (16.1%)

jobDispatcherErrorCode (48)   	48	53.9	08-16 01:01	100:	Lost heartbeat
pilotErrorCode (68)   	68	505.9	08-16 11:21	1150:	Looping job killed by pilot
taskBufferErrorCode (19)   	19	0.0	08-15 17:44	100:	Job expired and killed six days after submission (or killed by user)
transExitCode (27)   	27	389.8	08-15 18:12	99:	TRF_UNKNOWN - unknown transformation error
---------------------------------------------------------------------------------------------------------
BNL_ATLAS_2
--------------------------------------------------------------------------------------------------------
BU_ATLAS_Tier2	defined:0   assigned:0   waiting:0   activated:92   running:93   finished:124   failed:41   (24.8%)

pilotErrorCode (1)   	1	66.7	08-15 02:44	1200:	Job killed by SIGTERM from batch system or Condor (eg walltime limit)
transExitCode (40)   	40	8.1	08-15 20:43	1:	Unspecified error, consult log file
--------------------------------------------------------------------------------------------------------
BU_ATLAS_Tier2o	defined:0   assigned:7   waiting:0   activated:10   running:12   finished:20   failed:2   (9.1%)
transExitCode (2)   	2	0.3	08-14 11:59	1:	Unspecified error, consult log file
--------------------------------------------------------------------------------------------------------
IU_ATLAS_Tier2	defined:0   assigned:10   waiting:0   activated:67   running:64   finished:157   failed:28   (15.1%)
transExitCode (28)   	7	0.7	08-16 06:21	2:	Athena core dump
transExitCode (28)   	6	289.2	08-16 11:05	60:	TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
transExitCode (28)   	15	156.6	08-14 22:33	99:	TRF_UNKNOWN - unknown transformation error
---------------------------------------------------------------------------------------------------------
Unassigned	defined:0   assigned:0   waiting:474   activated:0   running:0   finished:0   failed:129

taskBufferErrorCode (129)   	129	0.0	08-15 19:22	100:	Job expired and killed six days after submission (or killed by user)
---------------------------------------------------------------------------------------------------------
OU_OCHEP_SWT2	defined:0   assigned:6   waiting:0   activated:116   running:81   finished:279   failed:40   (12.5%)

jobDispatcherErrorCode (5)   	5	359.0	08-16 00:40	100:	Lost heartbeat
taskBufferErrorCode (1)   	1	0.0	08-15 17:41	100:	Job expired and killed six days after submission (or killed by user)
transExitCode (34)   	13	1.7	08-16 04:30	1:	Unspecified error, consult log file
transExitCode (34)   	1	76.8	08-14 12:00	143:	Unknown error code
transExitCode (34)   	1	89.5	08-14 11:54	41:	TRF_OUTFILE - output file not found
transExitCode (34)   	1	64.1	08-14 12:01	50:	Athena crash - consult log file
transExitCode (34)   	2	96.3	08-16 10:29	60:	TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
transExitCode (34)   	16	234.9	08-15 13:07	99:	TRF_UNKNOWN - unknown transformation error
----------------------------------------------------------------------------------------------------------
PROD_SLAC	defined:0   assigned:41   waiting:0   activated:0   running:0   finished:6   failed:0   (0.0%)
----------------------------------------------------------------------------------------------------------
UC_ATLAS_MWT2	defined:0   assigned:22   waiting:0   activated:101   running:56   finished:149   failed:4   (2.6%)

ddmErrorCode (1)   	1	11.9	08-16 07:14	200:	Could not add output files to dataset
transExitCode (3)   	1	11.6	08-14 19:18	50:	Athena crash - consult log file
transExitCode (3)   	2	96.3	08-15 20:38	60:	TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
----------------------------------------------------------------------------------------------------------
UC_Teraport	defined:0   assigned:12   waiting:0   activated:183   running:64   finished:321   failed:202   (38.6%)

jobDispatcherErrorCode (1)   	1	66.7	08-15 08:08	100:	Lost heartbeat
pilotErrorCode (180)   	174	0.9	08-16 04:28	1099:	DQ2 staging input file failed
pilotErrorCode (180)   	6	55.2	08-16 04:26	1142:	DQ2 put error: failed to register the file on local SE
transExitCode (21)   	7	1.4	08-15 03:15	1:	Unspecified error, consult log file
transExitCode (21)   	1	5.6	08-16 10:30	2:	Athena core dump
transExitCode (21)   	13	0.3	08-14 12:08	41:	TRF_OUTFILE - output file not found
-----------------------------------------------------------------------------------------------------------
UTA-DPCC	defined:0   assigned:61   waiting:0   activated:97   running:99   finished:342   failed:57   (14.3%)

ddmErrorCode (8)   	8	0.0	08-15 01:24	100:	Input file GUID not found or input prodDBlock not accessible
jobDispatcherErrorCode (4)   	4	218.2	08-16 01:01	100:	Lost heartbeat
pilotErrorCode (1)   	1	66.7	08-15 13:58	1200:	Job killed by SIGTERM from batch system or Condor (eg walltime limit)
transExitCode (44)   	10	1.4	08-15 06:22	1:	Unspecified error, consult log file
transExitCode (44)   	11	0.2	08-16 03:51	134:	Athena core dump or timeout, or conddb DB connect exception
transExitCode (44)   	7	337.2	08-16 11:31	60:	TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
transExitCode (44)   	16	330.0	08-15 11:58	99:	TRF_UNKNOWN - unknown transformation error
----------------------------------------------------------------------------------------------------------
UTA_SWT2	defined:0   assigned:2   waiting:0   activated:0   running:0   finished:0   failed:0
----------------------------------------------------------------------------------------------------------
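A note on the percentages in the table: they are consistent with failed / (finished + failed) for each site, although the report does not state the formula. The following is a minimal Python sketch under that assumption. The function names parse_status and failure_rate are hypothetical and are not part of Panda or its monitor.

```python
import re

def parse_status(line):
    """Split a status line into (site name, {state: count}).

    Assumes the layout used in the table above: a site name followed by
    whitespace-separated 'state:count' pairs.
    """
    name = line.split(None, 1)[0]
    counts = {state: int(n) for state, n in re.findall(r"(\w+):(\d+)", line)}
    return name, counts

def failure_rate(counts):
    """Percent of completed jobs that failed (assumed formula, see above).

    Returns None when no job has finished or failed yet.
    """
    failed = counts.get("failed", 0)
    done = counts.get("finished", 0) + failed
    if done == 0:
        return None
    return round(100.0 * failed / done, 1)

if __name__ == "__main__":
    line = ("UC_Teraport\tdefined:0   assigned:12   waiting:0   activated:183   "
            "running:64   finished:321   failed:202   (38.6%)")
    site, counts = parse_status(line)
    print(site, failure_rate(counts))
```

Sites with no finished or failed jobs (for example UTA_SWT2) yield None, which matches the table showing no percentage for them.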


The pilot job status from the submit host:
------------------------------------------------------------------
[sm@atlas002 jobscheduler]$ ./queue-summary.py
====================
Condor Queue Summary
====================
condor_q run at Wed Aug 16 10:58:15 2006
Maximum jobs on a remote host (all but UNKNOWN & UNSUBMITTED): 200
Maximum jobs being sent to remote host:                        5

atlas.bu.edu
         PENDING           59
         ACTIVE           105

atlas.dpcc.uta.edu
         PENDING          100
         ACTIVE           100
         UNSUBMITTED        9

atlas.iu.edu
         PENDING           52
         ACTIVE            60

gk01.swt2.uta.edu
         PENDING            5
         ACTIVE             1
         STAGE_OUT          2

osgserv01.slac.stanford.edu
         PENDING           52
         STAGE_OUT          1

tier2-01.ochep.ou.edu
         PENDING           50
         ACTIVE            80

tier2-osg.uchicago.edu
         PENDING           52
         ACTIVE            41

tp-osg.uchicago.edu
         PENDING           54
         ACTIVE            63

------------------------------------------------------------------
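The per-host counts above can be totalled per state with a short script. This is a sketch that assumes the layout shown (an unindented host name followed by indented "STATE count" lines). The function total_by_state is hypothetical and is not part of queue-summary.py.

```python
from collections import Counter

def total_by_state(summary_text):
    """Sum pilot counts per state across all hosts in a queue summary."""
    totals = Counter()
    for line in summary_text.splitlines():
        # State lines are indented; host names, headers and blank lines are not.
        if not line[:1].isspace():
            continue
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            totals[parts[0]] += int(parts[1])
    return totals

if __name__ == "__main__":
    sample = (
        "atlas.bu.edu\n"
        "         PENDING           59\n"
        "         ACTIVE           105\n"
        "\n"
        "gk01.swt2.uta.edu\n"
        "         PENDING            5\n"
        "         ACTIVE             1\n"
        "         STAGE_OUT          2\n"
    )
    for state, count in sorted(total_by_state(sample).items()):
        print(state, count)
```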

Some notes:

1. 51 jobs of the type csc11.005538.AlpgenJimmyToplnlnNp3.evgen.v11004211
failed with "transExitCode=1:Unspecified error, consult log file" in
their second and third attempts, due to:

--------  Problem report -------
[Unknown Problem]
!!! AthenaEventLoo  ERROR Terminating event processing loop due to errors!!!
================================

43 jobs of the same type failed with "lost heartbeat" for the same
reason. I opened Savannah bug #19047.

2. 6 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.digit.v11004205
failed at UTA-DPCC with "transExitCode=1:Unspecified error, consult log
file" due to:
-------------------------------------------------------------------------------
  		 G4AtlasAlg: Event Nr. 1 start processing
EA2F6442-1926-DB11-9647-00123F20A423    Error Cannot open container,
invalid Database handle.
   StorageSvc    Error The requested
container:POOLContainer_McEventCollection cannot be opened!

  *** Break *** segmentation violation
  Generating stack trace...
/usr/bin/addr2line: python: No such file or directory
/usr/bin/addr2line: python: No such file or directory
  0x03d4373e in FadsGeneratorT<AthenaHepMCInterface>::GenerateAnEvent() + 0x1e from
/data73/grid3-1.1.11/apps/atlas_app/atlas_rel/11.0.42/dist/11.0.42/InstallArea/
i686-slc3-gcc323-opt/
----------------------------------------------------------------------------------

I opened Savannah bug #19104.

3. 11 jobs of the type csc11.005250.McAtNloWminenu.evgen.v11004209 failed
at UTA-DPCC with "transExitCode=134: Athena core dump or timeout, or
conddb DB connect exception" in the third attempt due to:
----------------------------------------------------------------------------------
found 258 particles
AtRndmGenSvc         INFO Initializing AtRndmGenSvc - package version
AthenaServices-01-07-27
  INITIALISING RANDOM NUMBER STREAMS.


           HERWIG 6.507  8th March 2005

           Please reference:  G. Marchesini, B.R. Webber,
           G.Abbiendi, I.G.Knowles, M.H.Seymour & L.Stanco
           Computer Physics Communications 67 (1992) 465
                              and
           G.Corcella, I.G.Knowles, G.Marchesini, S.Moretti,
           K.Odagiri, P.Richardson, M.H.Seymour & B.R.Webber,
           JHEP 0101 (2001) 010
fmt: end of file
apparent state: unit 61 named mcatnlo31.005250.Wminenu._000020.events
last format: (5(1X,D10.4),1X,A)
lately reading sequential formatted external IO
/data73/grid3-1.1.11/apps//atlas_app/atlas_rel/kitval/KitValidation/JobTransforms/
JobTransforms-11-00-42-09/share/csc.evgen.mcatnlo.trf:
line 224: 20334 Aborted                 athena.py job.py 2>&1
----------------------------------------------------------------------------------

I opened Savannah bug #19105.

4. 7 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.recotrig.v11000505
failed at IU with "transExitCode=2: Athena core dump" due to:
-------------------------------------------------------------------------
python:
/N/Grid3/apps/atlas_app/atlas_rel/11.0.5/gcc-alt-3.2.3/lib/libgcc_s.so.1:
version `GCC_4.2.0' not found (required by /usr/lib/libstdc++.so.5)
Wed Aug 16 05:20:38 EST 2006
mv: cannot stat `ntuple.root': No such file or directory
-------------------------------------------------------------------------

I created RT ticket #430 at MWTier2.

5. 21 jobs failed with "transExitCode=40: Athena crash -
consult log file"; they are all user jobs.

6. 1 job, csc11.005023.FJ4_pythia_jetjet.digit.v11004206._00653.job,
failed with "transExitCode=50: Athena crash - consult log file" due to:
-----------------------------------------------------------------------
===> G4QGSMSplitableHadron - Fatal: Cannot sample
parton densities under these constraints.
  G4HadronicProcess failed in ApplyYourself call for
  - Particle energy[GeV] = 16.702706
  - Material = Copper
  - Particle type = neutron

*** G4Exception : 007
       issued by : G4HadronicProcess
GeneralPostStepDoIt failed.
*** Fatal Exception *** core dump ***
------------------------------------------------------------------------

This is Savannah bug #16730; a fix went into Release 12.

7. 18 jobs of the following type failed with "transExitCode=60:
TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)":

testIdeal_06.005020.FJ1_pythia_jetjet.digit.v12000101
testIdeal_06.005015.J6_pythia_jetjet.digit.v12000101
testIdeal_06.005001.pythia_minbias.digit.v12000101

Bug #18466 is closed; the 48-hour limit was too short for these jobs.

8. 74 jobs failed with "transExitCode=99: TRF_UNKNOWN - unknown
transformation error"

testIdeal_06.005015.J6_pythia_jetjet.digit.v12000101
testIdeal_06.005145.PythiaZmumu.digit.v12000101
testIdeal_06.005107.pythia_Wtauhad.digit.v12000101

Bug #18349 is closed; the fix, in tag LArG4EC-00-00-71, will go into
12.X.0 and 12.0.X.

9. 174 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.digit.v11004205
failed at UC_Teraport with "DQ2 staging input file failed" early in the morning:

-------------- Log from /tmp/Panda_Pilot_15987_1155712210/dq2get.out -----
Getting POOL FileCatalog failed: cound not find the file in LRC!
Could not get POOL FileCatalog!
---------------------------------------------------------------------------

RT ticket #427 was created by Tomasz.

10. Some test jobs were sent to the OU_OSCER_ATLAS site to try using
a remote DQ2 server instead of a local NFS-mounted one, as requested
by Karthik. It turned out that the pilots that were sent used the old
version of the DQ2ProdClient.py file. Xin was asked how to use the
new version of the file, DQ2ProdClient2.py, in this test.

Regards,
Nurcan.

_______________________________________________
Usatlas-prodsys-l mailing list
[log in to unmask]
http://lists.bnl.gov/mailman/listinfo/usatlas-prodsys-l