ATLAS-SCCS-PLANNING-L Archives

ATLAS-SCCS-PLANNING-L@LISTSERV.SLAC.STANFORD.EDU


ATLAS-SCCS-PLANNING-L  August 2006


Subject: RE: [Usatlas-prodsys-l] Panda shift report August 14-15, 2006 (fwd)
From: "Young, Charles C." <[log in to unmask]>
Date: Thu, 17 Aug 2006 09:19:26 -0700
Content-Type: text/plain

Hi Stephen,

It's clear that Nurcan has done quite a bit of work to digest some of this information. Do you know if (a subset of) this information is graphed somewhere? I am curious about long-term trends in the failure rate, the distribution of failure causes, etc. Cheers.

					Charlie
--
Charles C. Young
M.S. 43, Stanford Linear Accelerator Center       
P.O. Box 20450                                         
Stanford, CA 94309                                      
[log in to unmask]                                
voice  (650) 926 2669                         
fax    (650) 926 2923                       
CERN GSM +41 76 487 2069 

> -----Original Message-----
> From: [log in to unmask] [mailto:[log in to unmask]] On Behalf Of Stephen J. Gowdy
> Sent: Thursday, August 17, 2006 12:19 AM
> To: atlas-sccs-planning-l
> Subject: [Usatlas-prodsys-l] Panda shift report August 14-15, 2006 (fwd)
> 
> FYI (to do with discussion of success rate for jobs).
> 
> --
>   /------------------------------------+-------------------------\
> |Stephen J. Gowdy, SLAC               | CERN     Office: 32-2-A22|
> |http://www.slac.stanford.edu/~gowdy/ | CH-1211 Geneva 23        |
> |http://calendar.yahoo.com/gowdy      | Switzerland              |
> |EMail: [log in to unmask]       | Tel: +41 22 767 5840     |
>   \------------------------------------+-------------------------/
> 
> ---------- Forwarded message ----------
> Date: Wed, 16 Aug 2006 16:00:11 -0500 (CDT)
> From: Nurcan Ozturk <[log in to unmask]>
> To: [log in to unmask]
> Subject: [Usatlas-prodsys-l] Panda shift report August 14-15, 2006
> 
> Hi all,
> 
> Here is the Panda production status of the last 2 days:
> 
> Wed Aug 16 10:58:15 2006 Central
> ----------------------------------------------------------------------------
> All CEs and jobs.   Show production, analysis, test, all jobs/CEs
> ----------------------------------------------------------------------------
> Job wall time: 30948 hrs  Error losses: trans: 2219 (7.2%)   panda: 832 (2.7%)   ddm: 68 (0.2%)   other: 506 (1.6%)
> ----------------------------------------------------------------------------
> Error type (type count)	Count	CPU-hrs	Latest	Code:	Description
> ----------------------------------------------------------------------------
> All	defined:1   assigned:161   waiting:474   activated:1010   running:752   finished:2339   failed:724   (23.6%)
> 
> ddmErrorCode (9)   	8	0.0	08-15 01:24	100:	Input file GUID not found or input prodDBlock not accessible
> ddmErrorCode (9)   	1	11.9	08-16 07:14	200:	Could not add output files to dataset
> jobDispatcherErrorCode (59)   	59	698.9	08-16 01:01	100:	Lost heartbeat
> pilotErrorCode (251)   	174	0.9	08-16 04:28	1099:	DQ2 staging input file failed
> pilotErrorCode (251)   	1	0.1	08-14 17:06	1132:	Saving output files to DDM area returned non-zero code
> pilotErrorCode (251)   	6	55.2	08-16 04:26	1142:	DQ2 put error: failed to register the file on local SE
> pilotErrorCode (251)   	68	505.9	08-16 11:21	1150:	Looping job killed by pilot
> pilotErrorCode (251)   	2	133.3	08-15 13:58	1200:	Job killed by SIGTERM from batch system or Condor (eg walltime limit)
> taskBufferErrorCode (149)   	149	0.0	08-15 19:22	100:	Job expired and killed six days after submission (or killed by user)
> transExitCode (234)   	72	13.0	08-16 04:30	1:	Unspecified error, consult log file
> transExitCode (234)   	11	0.2	08-16 03:51	134:	Athena core dump or timeout, or conddb DB connect exception
> transExitCode (234)   	1	76.8	08-14 12:00	143:	Unknown error code
> transExitCode (234)   	8	6.4	08-16 10:30	2:	Athena core dump
> transExitCode (234)   	35	26.5	08-16 06:23	40:	Athena crash - consult log file
> transExitCode (234)   	14	89.8	08-14 12:08	41:	TRF_OUTFILE - output file not found
> transExitCode (234)   	2	75.7	08-14 19:18	50:	Athena crash - consult log file
> transExitCode (234)   	17	818.9	08-16 11:31	60:	TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
> transExitCode (234)   	74	1111.3	08-15 18:12	99:	TRF_UNKNOWN - unknown transformation error
> ----------------------------------------------------------------------------
> ANALY_BNL_ATLAS_1	defined:0   assigned:0   waiting:0   activated:0   running:0   finished:98   failed:18   (15.5%)
> 
> jobDispatcherErrorCode (1)   	1	1.1	08-14 15:22	100:	Lost heartbeat
> pilotErrorCode (1)   	1	0.1	08-14 17:06	1132:	Saving output files to DDM area returned non-zero code
> transExitCode (16)   	16	2.5	08-16 06:23	40:	Athena crash - consult log file
> ----------------------------------------------------------------------------
> ANALY_BNL_ATLAS_2	defined:1   assigned:0   waiting:0   activated:1   running:0   finished:0   failed:0
> ----------------------------------------------------------------------------
> ANALY_LONG_BNL_ATLAS	defined:0   assigned:0   waiting:0   activated:0   running:0   finished:0   failed:19
> 
> transExitCode (19)   	19	24.0	08-16 04:36	40:	Athena crash - consult log file
> ----------------------------------------------------------------------------
> ANALY_UTA-DPCC
> ----------------------------------------------------------------------------
> BNL_ATLAS_1	defined:0   assigned:0   waiting:0   activated:343   running:283   finished:843   failed:162   (16.1%)
> 
> jobDispatcherErrorCode (48)   	48	53.9	08-16 01:01	100:	Lost heartbeat
> pilotErrorCode (68)   	68	505.9	08-16 11:21	1150:	Looping job killed by pilot
> taskBufferErrorCode (19)   	19	0.0	08-15 17:44	100:	Job expired and killed six days after submission (or killed by user)
> transExitCode (27)   	27	389.8	08-15 18:12	99:	TRF_UNKNOWN - unknown transformation error
> ----------------------------------------------------------------------------
> BNL_ATLAS_2
> ----------------------------------------------------------------------------
> BU_ATLAS_Tier2	defined:0   assigned:0   waiting:0   activated:92   running:93   finished:124   failed:41   (24.8%)
> 
> pilotErrorCode (1)   	1	66.7	08-15 02:44	1200:	Job killed by SIGTERM from batch system or Condor (eg walltime limit)
> transExitCode (40)   	40	8.1	08-15 20:43	1:	Unspecified error, consult log file
> ----------------------------------------------------------------------------
> BU_ATLAS_Tier2o	defined:0   assigned:7   waiting:0   activated:10   running:12   finished:20   failed:2   (9.1%)
> 
> transExitCode (2)   	2	0.3	08-14 11:59	1:	Unspecified error, consult log file
> ----------------------------------------------------------------------------
> IU_ATLAS_Tier2	defined:0   assigned:10   waiting:0   activated:67   running:64   finished:157   failed:28   (15.1%)
> 
> transExitCode (28)   	7	0.7	08-16 06:21	2:	Athena core dump
> transExitCode (28)   	6	289.2	08-16 11:05	60:	TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
> transExitCode (28)   	15	156.6	08-14 22:33	99:	TRF_UNKNOWN - unknown transformation error
> ----------------------------------------------------------------------------
> Unassigned	defined:0   assigned:0   waiting:474   activated:0   running:0   finished:0   failed:129
> 
> taskBufferErrorCode (129)   	129	0.0	08-15 19:22	100:	Job expired and killed six days after submission (or killed by user)
> ----------------------------------------------------------------------------
> OU_OCHEP_SWT2	defined:0   assigned:6   waiting:0   activated:116   running:81   finished:279   failed:40   (12.5%)
> 
> jobDispatcherErrorCode (5)   	5	359.0	08-16 00:40	100:	Lost heartbeat
> taskBufferErrorCode (1)   	1	0.0	08-15 17:41	100:	Job expired and killed six days after submission (or killed by user)
> transExitCode (34)   	13	1.7	08-16 04:30	1:	Unspecified error, consult log file
> transExitCode (34)   	1	76.8	08-14 12:00	143:	Unknown error code
> transExitCode (34)   	1	89.5	08-14 11:54	41:	TRF_OUTFILE - output file not found
> transExitCode (34)   	1	64.1	08-14 12:01	50:	Athena crash - consult log file
> transExitCode (34)   	2	96.3	08-16 10:29	60:	TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
> transExitCode (34)   	16	234.9	08-15 13:07	99:	TRF_UNKNOWN - unknown transformation error
> ----------------------------------------------------------------------------
> PROD_SLAC	defined:0   assigned:41   waiting:0   activated:0   running:0   finished:6   failed:0   (0.0%)
> ----------------------------------------------------------------------------
> UC_ATLAS_MWT2	defined:0   assigned:22   waiting:0   activated:101   running:56   finished:149   failed:4   (2.6%)
> 
> ddmErrorCode (1)   	1	11.9	08-16 07:14	200:	Could not add output files to dataset
> transExitCode (3)   	1	11.6	08-14 19:18	50:	Athena crash - consult log file
> transExitCode (3)   	2	96.3	08-15 20:38	60:	TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
> ----------------------------------------------------------------------------
> UC_Teraport	defined:0   assigned:12   waiting:0   activated:183   running:64   finished:321   failed:202   (38.6%)
> 
> jobDispatcherErrorCode (1)   	1	66.7	08-15 08:08	100:	Lost heartbeat
> pilotErrorCode (180)   	174	0.9	08-16 04:28	1099:	DQ2 staging input file failed
> pilotErrorCode (180)   	6	55.2	08-16 04:26	1142:	DQ2 put error: failed to register the file on local SE
> transExitCode (21)   	7	1.4	08-15 03:15	1:	Unspecified error, consult log file
> transExitCode (21)   	1	5.6	08-16 10:30	2:	Athena core dump
> transExitCode (21)   	13	0.3	08-14 12:08	41:	TRF_OUTFILE - output file not found
> ----------------------------------------------------------------------------
> UTA-DPCC	defined:0   assigned:61   waiting:0   activated:97   running:99   finished:342   failed:57   (14.3%)
> 
> ddmErrorCode (8)   	8	0.0	08-15 01:24	100:	Input file GUID not found or input prodDBlock not accessible
> jobDispatcherErrorCode (4)   	4	218.2	08-16 01:01	100:	Lost heartbeat
> pilotErrorCode (1)   	1	66.7	08-15 13:58	1200:	Job killed by SIGTERM from batch system or Condor (eg walltime limit)
> transExitCode (44)   	10	1.4	08-15 06:22	1:	Unspecified error, consult log file
> transExitCode (44)   	11	0.2	08-16 03:51	134:	Athena core dump or timeout, or conddb DB connect exception
> transExitCode (44)   	7	337.2	08-16 11:31	60:	TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)
> transExitCode (44)   	16	330.0	08-15 11:58	99:	TRF_UNKNOWN - unknown transformation error
> ----------------------------------------------------------------------------
> UTA_SWT2	defined:0   assigned:2   waiting:0   activated:0   running:0   finished:0   failed:0
> ----------------------------------------------------------------------------
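The percentage at the end of each status line in the table above appears to be failed / (finished + failed); for BNL_ATLAS_1, for example, 162 / (843 + 162) = 16.1%. The report also seems to print no percentage when no jobs finished. Below is a minimal Python sketch under those two assumptions. The function names `parse_status` and `failure_rate` are illustrative and are not part of Panda.

```python
import re

def parse_status(line):
    """Split a status line into (site, {state: count}).

    Accepts the line with or without the leading '> ' quote marker.
    """
    line = line.lstrip("> ")
    site = line.split()[0]
    counts = {k: int(v) for k, v in re.findall(r"(\w+):(\d+)", line)}
    return site, counts

def failure_rate(counts):
    """Failed jobs as a percentage of finished + failed jobs.

    Returns None when no jobs finished, mirroring the report, which
    shows no percentage in that case (an inferred rule, not a documented one).
    """
    finished = counts.get("finished", 0)
    failed = counts.get("failed", 0)
    if finished == 0:
        return None
    return 100.0 * failed / (finished + failed)

line = ("BNL_ATLAS_1 defined:0 assigned:0 waiting:0 "
        "activated:343 running:283 finished:843 failed:162 (16.1%)")
site, counts = parse_status(line)
print(site, round(failure_rate(counts), 1))
```

Applied to successive shift reports, the same two functions would give one failure-rate point per site per report, which is the kind of series a long-term trend plot needs.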
> 
> 
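To look at the distribution of failure causes, the rows of the "All" block can be grouped by error type. The Python sketch below copies the counts and CPU-hours from that block and sums them per type. Expressing CPU-hours as a share of the 30948 hrs job wall time is an assumption about how the "Error losses" line was derived; it does reproduce the "trans: 2219 (7.2%)" figure from the transExitCode rows. The names `ROWS` and `summarize` are illustrative.

```python
from collections import defaultdict

WALL_HOURS = 30948  # "Job wall time" from the report header

# (error type, job count, CPU-hrs), copied from the "All" block of the report
ROWS = [
    ("ddmErrorCode", 8, 0.0), ("ddmErrorCode", 1, 11.9),
    ("jobDispatcherErrorCode", 59, 698.9),
    ("pilotErrorCode", 174, 0.9), ("pilotErrorCode", 1, 0.1),
    ("pilotErrorCode", 6, 55.2), ("pilotErrorCode", 68, 505.9),
    ("pilotErrorCode", 2, 133.3),
    ("taskBufferErrorCode", 149, 0.0),
    ("transExitCode", 72, 13.0), ("transExitCode", 11, 0.2),
    ("transExitCode", 1, 76.8), ("transExitCode", 8, 6.4),
    ("transExitCode", 35, 26.5), ("transExitCode", 14, 89.8),
    ("transExitCode", 2, 75.7), ("transExitCode", 17, 818.9),
    ("transExitCode", 74, 1111.3),
]

def summarize(rows):
    """Return ({type: total jobs}, {type: total CPU-hrs})."""
    jobs = defaultdict(int)
    hours = defaultdict(float)
    for etype, count, cpu in rows:
        jobs[etype] += count
        hours[etype] += cpu
    return dict(jobs), dict(hours)

jobs, hours = summarize(ROWS)
# List error types from most to fewest failed jobs.
for etype in sorted(jobs, key=jobs.get, reverse=True):
    share = 100.0 * hours[etype] / WALL_HOURS
    print(f"{etype:24s} {jobs[etype]:4d} jobs {hours[etype]:8.1f} h {share:4.1f}%")
```

The per-type job totals match the "(type count)" values printed in the report (251 for pilotErrorCode, 234 for transExitCode), which gives a quick consistency check on the transcription.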
> The pilot job status from the submit host:
> ------------------------------------------------------------------
> [sm@atlas002 jobscheduler]$ ./queue-summary.py
> ==================== Condor Queue Summary ====================
> condor_q run at Wed Aug 16 10:58:15 2006
> Maximum jobs on a remote host (all but UNKNOWN & UNSUBMITTED): 200
> Maximum jobs being sent to remote host:                        5
> 
> atlas.bu.edu
>          PENDING           59
>          ACTIVE           105
> 
> atlas.dpcc.uta.edu
>          PENDING          100
>          ACTIVE           100
>          UNSUBMITTED        9
> 
> atlas.iu.edu
>          PENDING           52
>          ACTIVE            60
> 
> gk01.swt2.uta.edu
>          PENDING            5
>          ACTIVE             1
>          STAGE_OUT          2
> 
> osgserv01.slac.stanford.edu
>          PENDING           52
>          STAGE_OUT          1
> 
> tier2-01.ochep.ou.edu
>          PENDING           50
>          ACTIVE            80
> 
> tier2-osg.uchicago.edu
>          PENDING           52
>          ACTIVE            41
> 
> tp-osg.uchicago.edu
>          PENDING           54
>          ACTIVE            63
> 
> ------------------------------------------------------------------
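The summary header gives a cap of 200 jobs per remote host, counting every state except UNKNOWN and UNSUBMITTED. By that rule atlas.dpcc.uta.edu, with 100 PENDING and 100 ACTIVE, sits exactly at the cap. Below is a minimal Python sketch of that per-host sum. The parsing rules (a host name alone on a line, followed by "STATE count" lines) are assumptions based only on the layout shown above, and `counted_totals` is an illustrative name, not part of queue-summary.py.

```python
def counted_totals(summary, excluded=("UNKNOWN", "UNSUBMITTED")):
    """Sum job counts per host, skipping the excluded states."""
    totals = {}
    host = None
    for raw in summary.splitlines():
        line = raw.lstrip("> ").strip()
        if not line or set(line) == {"-"}:
            continue  # blank line or dashed separator
        parts = line.split()
        if len(parts) == 1:
            # A single token is taken to be a host name.
            host = parts[0]
            totals[host] = 0
        elif len(parts) == 2 and parts[1].isdigit() and host is not None:
            state, count = parts[0], int(parts[1])
            if state not in excluded:
                totals[host] += count
    return totals

summary = """
atlas.bu.edu
         PENDING           59
         ACTIVE           105

atlas.dpcc.uta.edu
         PENDING          100
         ACTIVE           100
         UNSUBMITTED        9
"""
totals = counted_totals(summary)
for host, n in totals.items():
    print(host, n)
```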
> 
> Some notes:
> 
> 1. 51 jobs of the type csc11.005538.AlpgenJimmyToplnlnNp3.evgen.v11004211
> failed with "transExitCode=1: Unspecified error, consult log file" in
> their second and third attempts, due to:
> 
> --------  Problem report -------
> [Unknown Problem]
> !!! AthenaEventLoo  ERROR Terminating event processing loop due to errors!!!
> ================================
> 
> 43 jobs of the same type failed with "lost heartbeat" for the same
> reason. I opened Savannah bug #19047.
> 
> 2. 6 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.digit.v11004205
> failed at UTA-DPCC with "transExitCode=1: Unspecified error, consult log
> file" due to:
> ----------------------------------------------------------------------------
>   		 G4AtlasAlg: Event Nr. 1 start processing
> EA2F6442-1926-DB11-9647-00123F20A423    Error Cannot open container, invalid Database handle.
>    StorageSvc    Error The requested container:POOLContainer_McEventCollection cannot be opened!
> 
>   *** Break *** segmentation violation
>   Generating stack trace...
> /usr/bin/addr2line: python: No such file or directory
> /usr/bin/addr2line: python: No such file or directory
>   0x03d4373e in FadsGeneratorT<AthenaHepMCInterface>::GenerateAnEvent() + 0x1e from /data73/grid3-1.1.11/apps/atlas_app/atlas_rel/11.0.42/dist/11.0.42/InstallArea/i686-slc3-gcc323-opt/
> ----------------------------------------------------------------------------
> 
> I opened Savannah bug #19104.
> 
> 3. 11 jobs of the type csc11.005250.McAtNloWminenu.evgen.v11004209
> failed at UTA-DPCC with "transExitCode=134: Athena core dump or timeout,
> or conddb DB connect exception" in the third attempt due to:
> ----------------------------------------------------------------------------
> found 258 particles
> AtRndmGenSvc         INFO Initializing AtRndmGenSvc - package version AthenaServices-01-07-27
>   INITIALISING RANDOM NUMBER STREAMS.
> 
> 
>            HERWIG 6.507  8th March 2005
> 
>            Please reference:  G. Marchesini, B.R. Webber,
>            G.Abbiendi, I.G.Knowles, M.H.Seymour & L.Stanco
>            Computer Physics Communications 67 (1992) 465
>                               and
>            G.Corcella, I.G.Knowles, G.Marchesini, S.Moretti,
>            K.Odagiri, P.Richardson, M.H.Seymour & B.R.Webber,
>            JHEP 0101 (2001) 010
> fmt: end of file
> apparent state: unit 61 named mcatnlo31.005250.Wminenu._000020.events
> last format: (5(1X,D10.4),1X,A)
> lately reading sequential formatted external IO
> /data73/grid3-1.1.11/apps//atlas_app/atlas_rel/kitval/KitValidation/JobTransforms/JobTransforms-11-00-42-09/share/csc.evgen.mcatnlo.trf: line 224: 20334 Aborted                 athena.py job.py 2>&1
> ----------------------------------------------------------------------------
> 
> I opened Savannah bug #19105.
> 
> 4. 7 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.recotrig.v11000505
> failed at IU with "transExitCode=2: Athena core dump" due to:
> ----------------------------------------------------------------------------
> python: /N/Grid3/apps/atlas_app/atlas_rel/11.0.5/gcc-alt-3.2.3/lib/libgcc_s.so.1: version `GCC_4.2.0' not found (required by /usr/lib/libstdc++.so.5)
> Wed Aug 16 05:20:38 EST 2006
> mv: cannot stat `ntuple.root': No such file or directory
> ----------------------------------------------------------------------------
> 
> I created an RT ticket, #430, at MWTier2.
> 
> 5. 21 jobs failed with "transExitCode=40: Athena crash - consult log
> file"; they are all user jobs.
> 
> 6. 1 job, csc11.005023.FJ4_pythia_jetjet.digit.v11004206._00653.job,
> failed with "transExitCode=50: Athena crash - consult log file" due to:
> ----------------------------------------------------------------------------
> ===> G4QGSMSplitableHadron - Fatal: Cannot sample parton densities under these constraints.
>   G4HadronicProcess failed in ApplyYourself call for
>   - Particle energy[GeV] = 16.702706
>   - Material = Copper
>   - Particle type = neutron
> 
> *** G4Exception : 007
>        issued by : G4HadronicProcess
> GeneralPostStepDoIt failed.
> *** Fatal Exception *** core dump ***
> ----------------------------------------------------------------------------
> 
> This is Savannah bug #16730; a fix went into Release 12.
> 
> 7. 18 jobs of the following types failed with "transExitCode=60:
> TRF_GBB_TIME - GriBB - output limit exceeded (time, memory, CPU)":
> 
> testIdeal_06.005020.FJ1_pythia_jetjet.digit.v12000101
> testIdeal_06.005015.J6_pythia_jetjet.digit.v12000101
> testIdeal_06.005001.pythia_minbias.digit.v12000101
> 
> Bug #18466 is closed; the 48-hour limit was too short for these jobs.
> 
> 8. 74 jobs of the following types failed with "transExitCode=99:
> TRF_UNKNOWN - unknown transformation error":
> 
> testIdeal_06.005015.J6_pythia_jetjet.digit.v12000101
> testIdeal_06.005145.PythiaZmumu.digit.v12000101
> testIdeal_06.005107.pythia_Wtauhad.digit.v12000101
> 
> Bug #18349 is closed; the fix, in tag LArG4EC-00-00-71, will go into
> 12.X.0 and 12.0.X.
> 
> 9. 174 jobs of the type csc11.005200.T1_McAtNlo_Jimmy.digit.v11004205
> failed at UC_Teraport with "DQ2 staging input file failed" in the
> early morning:
> 
> -------------- Log from /tmp/Panda_Pilot_15987_1155712210/dq2get.out -----
> Getting POOL FileCatalog failed: cound not find the file in LRC!
> Could not get POOL FileCatalog!
> ----------------------------------------------------------------------------
> 
> RT ticket #427 was created by Tomasz.
> 
> 10. Some test jobs were sent to the OU_OSCER_ATLAS site, as requested
> by Karthik, to try using a remote DQ2 server instead of a local
> NFS-mounted one. It turned out that the pilots that were sent used
> the old version of the DQ2ProdClient.py file. Xin was asked how to
> use the new version of the file, DQ2ProdClient2.py, in this test.
> 
> Regards,
> Nurcan.
> 
> _______________________________________________
> Usatlas-prodsys-l mailing list
> [log in to unmask]
> http://lists.bnl.gov/mailman/listinfo/usatlas-prodsys-l
> 


