HPS-SOFTWARE Archives
HPS-SOFTWARE@LISTSERV.SLAC.STANFORD.EDU
HPS-SOFTWARE, November 2015

Subject: Re: beam-tri singles MC problems
From: "McCormick, Jeremy I." <[log in to unmask]>
Reply-To: Software for the Heavy Photon Search Experiment <[log in to unmask]>
Date: Fri, 6 Nov 2015 20:39:13 +0000
Content-Type: text/plain
Parts/Attachments: text/plain (1 lines)

Can we rerun a limited subset of the data, including all the MC event types, for pass3 with these fixes?

Then I think we should do QA before relaunching all the jobs to fix the scoring plane issue.

-----Original Message-----
From: Holly Vance [mailto:[log in to unmask]] 
Sent: Friday, November 06, 2015 12:33 PM
To: Uemura, Sho
Cc: Bradley T Yale; McCormick, Jeremy I.; hps-software
Subject: Re: beam-tri singles MC problems

Hi all,

The ECal scoring plane needs to be working. It's a good check on the track projection to the ECal face (which was not working in Pass3), and it's the only way to verify certain cluster property corrections in the ECal.

-Holly

On Fri, Nov 6, 2015 at 3:27 PM, Sho Uemura <[log in to unmask]> wrote:


	Looks good to me.
	
	For people who care: This has nothing to do with SLIC, since the error was coming from one of the standalone stdhep utilities (/u/group/hps/hps_soft/stdhep/bin/beam_coords) that reads in a stdhep file from tape. But neither the utility nor the input stdhep files have changed since pass2, and this error did not happen in pass2. My theory is that the cache copy of the stdhep file was corrupt.
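	
	A checksum comparison between the copy the job saw and a freshly
	staged copy would test the corrupt-cache theory directly. A minimal
	sketch; the /cache and /volatile paths and the file name are only
	illustrative, and staging the reference copy from tape is left out:
	
	import hashlib
	
	def md5sum(path, chunk_size=1 << 20):
	    """MD5 of a file, read in 1 MB chunks."""
	    h = hashlib.md5()
	    with open(path, "rb") as f:
	        for chunk in iter(lambda: f.read(chunk_size), b""):
	            h.update(chunk)
	    return h.hexdigest()
	
	# Hypothetical paths: the cached copy the job saw vs. a freshly
	# staged reference copy of the same stdhep file.
	cached = "/cache/hallb/hps/production/stdhep/beam/1pt05/egsv3_10.stdhep"
	reference = "/volatile/hallb/hps/tmp/egsv3_10.stdhep"
	
	if md5sum(cached) != md5sum(reference):
	    print("cache copy differs from reference -- likely corrupt")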
	
	Anyway, I think this problem is fixed now. The affected beam-tri and tritrig-beam-tri must be rerun, but we should decide if the ECal scoring plane fix merits rerunning all of the pass3 MC.


	On Fri, 6 Nov 2015, Bradley T Yale wrote:
	
	

		Re-running the problem files in quarantine, it looks like the same stdhep files are being read now:
		/work/hallb/hps/mc_production/pass3/test/logs/slic/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_952.out
		
		Maybe the latest SLIC update worked. I can redo the (tritrig)-beam-tri and see if it's fixed.
		As mentioned, I don't see this problem in the other MC components, only beam-tri.
		If you REALLY want to be safe, I can re-run everything, but would at least like to do it with a post-release jar (3.4.2-SNAPSHOT or 3.4.2-20151014.013425-5) so we can test current things.
		
		________________________________________
		From: Sho Uemura <[log in to unmask]>
		Sent: Thursday, November 5, 2015 9:39 PM
		To: Bradley T Yale
		Subject: Re: beam-tri singles MC problems
		
		I tried running beam_coords on the farm
		(/work/hallb/hps/uemura/bradtest/beam-tri_100.xml, logfiles and output in
		same directory) and it works fine.
		
		I looked at beam-tri logs for pass2 for the same files, and they are fine.
		
		So this stuff worked in pass2, broke in pass3, works again now, but
		nothing has changed - same stdhep file, and the beam_coords binary hasn't
		changed.
		
		Can you try rerunning the slic beam-tri job? It could be something weird
		like jcache screwing up and not copying the file correctly from tape -
		that would affect every job in that run but not runs before or after.
		
		On Thu, 5 Nov 2015, Sho Uemura wrote:
		
		

			It looks like the problem is that beam_coords is having trouble reading the
			beam.stdhep file and crashes, and so the beam-tri.stdhep file that goes into
			SLIC is missing all the beam background, and the trigger rate ends up being
			ridiculously low. Of course this affects every SLIC run that uses that
			beam.stdhep file.
			
			I get that from looking at
			/work/hallb/hps/mc_production/pass3/logs/slic/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_952.err
			and
			/work/hallb/hps/mc_production/pass3/logs/slic/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_952.out
			and comparing to other runs - you'll see that the .out file is missing some
			printouts after "Rotating Beam" and rot_beam.stdhep is missing from the file
			list. For example, one of the first things beam_coords should print is the
			number of events in the input file.
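			
			One way to flag every affected log automatically is to look for
			.out files where nothing useful follows the "Rotating Beam"
			line. A rough sketch; it assumes a healthy log mentions
			rot_beam.stdhep somewhere after that printout, which is an
			inference from the comparison above rather than a documented
			log format:
			
			import glob
			
			LOGDIR = "/work/hallb/hps/mc_production/pass3/logs/slic/beam-tri/1pt05"
			
			for path in sorted(glob.glob(LOGDIR + "/*.out")):
			    with open(path, errors="replace") as f:
			        text = f.read()
			    # Healthy logs keep printing after "Rotating Beam" and list
			    # rot_beam.stdhep; if that text is missing, beam_coords
			    # probably died while reading beam.stdhep.
			    marker = text.find("Rotating Beam")
			    if marker >= 0 and "rot_beam.stdhep" not in text[marker:]:
			        print("suspect:", path)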
			
			So there must be something wrong with that stdhep file, but it has nothing to
			do with SLIC. Is it possible that this has always been happening, in pass2
			and earlier? I'll look at log files.
			
			Weirdly I have no difficulty running beam_coords on egsv3_10.stdhep on ifarm.
			Maybe there's something different about the batch farm environment?
			
			The bad news is that this must affect every MC that has beam background or
			beam-tri mixed in.
			
			On Thu, 5 Nov 2015, Bradley T Yale wrote:
			
			

				First, I submitted a report about those otherwise successful jobs not being
				written to tape, and it turned out to be a system glitch. It appears fixed
				now and unrelated to the following,
				which only affects ~15% of Pass3 beam-tri and tritrig-beam-tri files but no
				other Pass3 MC components.
				
				The beam-tri files that were read out 10-to-1 have the same problem with an
				inconsistent # of events, so it wasn't a problem with time/space allotment
				for the jobs.
				A few recon files with no time limit set for the jobs (100-to-1, labelled
				'NOTIMELIMIT') made it through before the tape-writing glitch as well, and
				have the same problem.
				
				Digging a little further, it appears that this issue with readout event
				inconsistency is likely related to the stdhep file-reading problem that
				Jeremy found while fixing SLIC for v3-fieldmap, so I brought him into this.
				Let me motivate that conclusion...
				
				About 85% of Pass3 beam-tri readout files look fine, and then:
				cat
				/work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_10to1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_*.txt
				| grep "^Read "
				..........
				Read 41911 events
				Read 42775 events
				Read 41551 events
				Read 42055 events
				Read 42556 events
				Read 9 events
				Read 7 events
				Read 7 events
				Read 3 events
				Read 9 events
				Read 10 events
				Read 2 events
				Read 13 events
				Read 7 events
				Read 41529 events
				Read 8 events
				Read 42149 events
				Read 42141 events
				Read 41933 events
				Read 41856 events
				Read 41711 events
				Read 42038 events
				Read 42004 events
				Read 41997 events
				Read 42029 events
				Read 41764 events
				Read 42156 events
				Read 42245 events
				Read 41732 events
				Read 42060 events
				Read 42070 events
				Read 42060 events
				Read 41962 events
				Read 41967 events
				Read 42071 events
				Read 42067 events
				Read 42017 events
				Read 42046 events
				Read 42614 events
				Read 42655 events
				Read 42337 events
				Read 42342 events
				Read 42503 events
				Read 42454 events
				Read 42237 events
				Read 42338 events
				Read 42607 events
				Read 41791 events
				Read 42309 events
				Read 3 events
				Read 4 events
				Read 7 events
				Read 7 events
				Read 4 events
				Read 6 events
				Read 7 events
				Read 7 events
				Read 4 events
				Read 41993 events
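				
				A small script can pull the short files out of these
				data-quality logs instead of eyeballing the list. A rough
				sketch; the glob pattern follows the file names above, and
				the cut at half the median count is arbitrary:
				
				import glob, re, statistics
				
				QA_DIR = ("/work/hallb/hps/mc_production/pass3/"
				          "data_quality/readout/beam-tri/1pt05")
				
				counts = {}
				for path in sorted(glob.glob(QA_DIR + "/*10to1*singles1_*.txt")):
				    for line in open(path, errors="replace"):
				        m = re.match(r"Read (\d+) events", line)
				        if m:
				            counts[path] = int(m.group(1))
				
				median = statistics.median(counts.values())
				for path, n in sorted(counts.items()):
				    if n < 0.5 * median:
				        print(f"short file: {n:7d} events  {path}")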
				
				The affected 10-to-1 readout files are #51-60 and #91-100, which were made
				from SLIC files #501-600, and #901-1000.
				For example:
				cat
				/work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_10to1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_96.txt
				| grep "^Read "
				/work/hallb/hps/mc_production/pass3/logs/readout/beam-tri/1pt05/egsv3-triv2-g4v1_10to1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_96.out
				
				Looking at the SLIC files that were used for readout (e.g. #951-960):
				/work/hallb/hps/mc_production/pass3/logs/slic/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_952.out
				
				This shows that stdhep is not reading the events from 1 out of every 25
				beam.stdhep files using the Pass3 setup.
				The actual beam.stdhep files from this problem (#51-60 and #91-100 in
				/mss/hallb/hps/production/stdhep/beam/1pt05/) look fine.
				
				Also, Pass3 tritrig-beam-tri, which are read out 1-to-1, occasionally have
				files which contain no events at all. This means that when the beam-tri
				files are read out in larger batches, these empty inputs shave off ~4000
				events for each affected SLIC file used. This is probably why some of the
				original 100-to-1 beam-tri files appear light on events, and the 10-to-1
				ones are a lot worse.
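				
				As a sanity check on those numbers: a healthy 10-to-1
				readout file has roughly 42k events, so each SLIC input
				contributes about 4.2k, consistent with the ~4000 lost per
				bad input:
				
				# Quick consistency check on the numbers above.
				good_10to1_events = 42_000   # typical healthy 10-to-1 file
				slic_files_per_job = 10      # 10-to-1 readout
				# ~4200 events per SLIC input, so dropping one input costs
				# about the ~4000 events mentioned above.
				print(good_10to1_events / slic_files_per_job)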
				
				The corresponding Pass2 readout/recon, which used the same seed and files
				as the problem ones, seem correct though:
				cat
				/work/hallb/hps/mc_production/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_s2d6_HPS-EngRun2015-Nominal-v1_3.4.0-20150710_singles1_9*.txt
				| grep "^Read "
				cat
				/work/hallb/hps/mc_production/pass2/logs/readout/beam-tri/1pt05/egsv3-triv2-g4v1_s2d6_HPS-EngRun2015-Nominal-v3_3.4.0_singles1_*.out
				| grep "events "
				
				In summary, this inconsistency at readout is due to beam.stdhep files
				occasionally failing to be read during Pass3 SLIC jobs.
				It only affects beam-tri made using the updated SLIC and v3-fieldmap
				detector.
				I'll make a Jira item about it.
				
				________________________________________
				From: Nathan Baltzell <[log in to unmask]>
				Sent: Thursday, November 5, 2015 4:26 AM
				To: Bradley T Yale
				Cc: Sho Uemura; Omar Moreno; Matthew Solt; Mathew Thomas Graham
				Subject: Re: beam-tri singles MC problems
				
				Probably should submit a CCPR on the failure to write to tape (including
				an example failed job ID/URL). I don't see any related CCPRs in the
				system, and there are no corresponding errors in the farm_outs.
				
				
				On Nov 5, 2015, at 9:00 AM, Bradley T Yale <[log in to unmask]> wrote:
				
				

					Ok, I'll do those 10to1 as well to match everything else.
					
					By the way, the "failed" job status you see is because the trigger plots
					fail for some reason and so the entire job gets classified that way.
					All other output is fine though, and just can't be written to tape. That
					has never been an issue before, but I disabled the trigger plots for the
					latest batch just in case.
					It could just be something with the system. I'll see if it's resolved
					tomorrow.
					
					________________________________________
					From: Sho Uemura <[log in to unmask]>
					Sent: Thursday, November 5, 2015 1:49 AM
					To: Bradley T Yale
					Cc: Omar Moreno; Matthew Solt; Mathew Thomas Graham; Nathan Baltzell
					Subject: Re: beam-tri singles MC problems
					
					pairs1 seems better - there are still quite a few files that come up short,
					but maybe 75% have the right number of events (1 ms/file * 100 files *
					20 kHz = 2000).
					
					cat
					/work/hallb/hps/mc_production/pass3/data_quality/recon/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_pairs1_*.txt
					| grep "^Read "
					Read 111 events
					Read 1987 events
					Read 2014 events
					Read 2013 events
					Read 2094 events
					Read 2094 events
					Read 1989 events
					Read 2083 events
					Read 2070 events
					Read 1887 events
					Read 2007 events
					Read 1955 events
					Read 2037 events
					Read 2013 events
					Read 1991 events
					Read 1900 events
					Read 2002 events
					Read 1996 events
					Read 1835 events
					Read 85 events
					Read 1914 events
					Read 111 events
					Read 98 events
					Read 202 events
					Read 114 events
					Read 155 events
					Read 2007 events
					Read 59 events
					Read 1800 events
					Read 2052 events
					
					
					On Thu, 5 Nov 2015, Bradley T Yale wrote:
					
					

						Everything is failing to write to tape.
						
						Maybe this is also the cause of the badly cached dst files you were
						seeing.
						
						I have no idea what is causing this. That's why I included Nathan in
						this.
						
						
						On a side note, are you seeing the same inconsistency in pairs1 beam-tri,
						or just singles?
						
						
						________________________________
						From: Bradley T Yale
						Sent: Thursday, November 5, 2015 1:13 AM
						To: Omar Moreno; Sho Uemura
						Cc: Matthew Solt; Mathew Thomas Graham; Nathan Baltzell
						Subject: Re: beam-tri singles MC problems
						
						
						So, the 10to1 readout jobs successfully completed, but failed to write to
						tape:
						
						http://scicomp.jlab.org/scicomp/#/jasmine/jobs?requested=details&id=115214062
						
						
						I'm trying again after setting 'Memory space' back to "1024 MB", which is
						what it had been before.
						
						Is there anything else that could be causing this?
						
						
						________________________________
						From: Bradley T Yale
						Sent: Wednesday, November 4, 2015 7:41 PM
						To: Omar Moreno; Sho Uemura
						Cc: Matthew Solt; Mathew Thomas Graham
						Subject: Re: beam-tri singles MC problems
						
						
						Sorry. The latest ones are being reconstructed now and labelled
						'NOTIMELIMIT'. They shouldn't take long once active. Their readout did
						not have a time limit to try to fix the problem, but just in case, I'm
						also reading out others 10-to-1 (labelled '10to1') and will probably
						start doing it that way so readout doesn't take forever.
						
						
						
						________________________________
						From: [log in to unmask] <[log in to unmask]> on behalf of Omar
						Moreno <[log in to unmask]>
						Sent: Wednesday, November 4, 2015 4:24 PM
						To: Sho Uemura
						Cc: Bradley T Yale; Omar Moreno; Matthew Solt; Mathew Thomas Graham
						Subject: Re: beam-tri singles MC problems
						
						Any news on this?  I'm transferring all of the beam-tri files over to
						SLAC and I'm noticing that they are still all random sizes.
						
						On Fri, Oct 23, 2015 at 3:33 PM, Sho Uemura
						<[log in to unmask]<mailto:[log in to unmask]>> wrote:
						Hi Brad,
						
						1. readout files seem to be really random lengths:
						
						cat
						/work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_*|grep
						"^Read "|less
						
						Read 52 events
						Read 16814 events
						Read 17062 events
						Read 12543 events
						Read 328300 events
						Read 355896 events
						Read 12912 events
						Read 309460 events
						Read 306093 events
						Read 313868 events
						Read 325727 events
						Read 298129 events
						Read 417300 events
						Read 423734 events
						Read 308954 events
						Read 365261 events
						Read 301648 events
						Read 316249 events
						Read 340949 events
						Read 319316 events
						Read 424033 events
						Read 308746 events
						Read 317204 events
						Read 12363 events
						Read 355813 events
						Read 329739 events
						Read 298601 events
						Read 29700 events
						Read 12675 events
						Read 287237 events
						Read 311071 events
						Read 12406 events
						Read 12719 events
						Read 30428 events
						Read 324795 events
						Read 345850 events
						Read 25765 events
						Read 29806 events
						Read 77 events
						Read 12544 events
						Read 372642 events
						Read 12779 events
						
						which makes it seem like jobs are failing randomly or something - I think
						normally we see most files have the same length, and a minority of files
						(missing some input files, or whatever) are shorter. In this case I think
						the expected number of events (number of triggers from 100 SLIC output
						files) is roughly 420k, and as you can see only a few files get there.
						
						I looked at log files and I don't see any obvious error messages, but
						maybe you have ideas? I'll keep digging.
						
						2. Looks like the singles recon jobs are running into the job disk space
						limit, so that while readout files can have as many as 420k events, recon
						files never have more than 240k. Looks like the disk limit is set to 5 GB
						(and a 240k-event LCIO recon file is 5.5 GB), but it needs to be at least
						doubled - or the number of SLIC files per readout job needs to be
						reduced?
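						
						Back-of-envelope, using the sizes quoted here, the limit
						would need to be roughly 10 GB for a full 420k-event job
						(the per-event size below is just 5.5 GB / 240k):
						
						# Rough sizing from the numbers above.
						bytes_per_event = 5.5e9 / 240_000   # ~23 kB per recon event
						full_output = 420_000 * bytes_per_event
						print(f"{full_output / 1e9:.1f} GB")  # ~9.6 GB, about twice the 5 GB limit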
						
						cat
						/work/hallb/hps/mc_production/pass3/data_quality/recon/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_*|grep
						"^Read "|less
						Read 1 events
						Read 16814 events
						Read 17062 events
						Read 242359 events
						Read 243949 events
						Read 242153 events
						Read 12776 events
						Read 242666 events
						Read 244165 events
						Read 243592 events
						Read 243433 events
						Read 242878 events
						Read 241861 events
						Read 242055 events
						Read 30428 events
						Read 243156 events
						Read 241638 events
						Read 4 events
						Read 241882 events
						
						

							From
							/work/hallb/hps/mc_production/pass3/logs/recon/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_22.err:
							


						java.lang.RuntimeException: Error writing LCIO file
						      at org.lcsim.util.loop.LCIODriver.process(LCIODriver.java:116)
						      at org.lcsim.util.Driver.doProcess(Driver.java:261)
						      at org.lcsim.util.Driver.processChildren(Driver.java:271)
						      at org.lcsim.util.Driver.process(Driver.java:187)
						      at org.lcsim.util.DriverAdapter.recordSupplied(DriverAdapter.java:74)
						      at org.freehep.record.loop.DefaultRecordLoop.consumeRecord(DefaultRecordLoop.java:832)
						      at org.freehep.record.loop.DefaultRecordLoop.loop(DefaultRecordLoop.java:668)
						      at org.freehep.record.loop.DefaultRecordLoop.execute(DefaultRecordLoop.java:566)
						      at org.lcsim.util.loop.LCSimLoop.loop(LCSimLoop.java:151)
						      at org.lcsim.job.JobControlManager.run(JobControlManager.java:431)
						      at org.hps.job.JobManager.run(JobManager.java:71)
						      at org.lcsim.job.JobControlManager.run(JobControlManager.java:189)
						      at org.hps.job.JobManager.main(JobManager.java:26)
						Caused by: java.io.IOException: File too large
						      at java.io.FileOutputStream.writeBytes(Native Method)
						      at java.io.FileOutputStream.write(FileOutputStream.java:345)
						      at hep.io.xdr.XDROutputStream$CountedOutputStream.write(XDROutputStream.java:103)
						      at java.io.DataOutputStream.write(DataOutputStream.java:107)
						      at hep.io.sio.SIOWriter$SIOByteArrayOutputStream.writeTo(SIOWriter.java:286)
						      at hep.io.sio.SIOWriter.flushRecord(SIOWriter.java:208)
						      at hep.io.sio.SIOWriter.createRecord(SIOWriter.java:83)
						      at org.lcsim.lcio.LCIOWriter.write(LCIOWriter.java:251)
						      at org.lcsim.util.loop.LCIODriver.process(LCIODriver.java:114)
						      ... 12 more
						
						
						Thanks. No rush on these, I imagine that even if the problems were fixed
						before/during the collaboration meeting we would not have time to use the
						files.
						
						
						










########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the HPS-SOFTWARE list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=HPS-SOFTWARE&A=1
