HPS-SOFTWARE Archives

HPS-SOFTWARE@LISTSERV.SLAC.STANFORD.EDU



Subject:

Re: beam-tri singles MC problems

From:

Bradley T Yale <[log in to unmask]>

Reply-To:

Software for the Heavy Photon Search Experiment <[log in to unmask]>

Date:

Thu, 5 Nov 2015 23:26:18 +0000

Content-Type:

text/plain

Parts/Attachments:

text/plain (326 lines)

First, I submitted a report about those otherwise successful jobs not being written to tape, and it turned out to be a system glitch. It appears fixed now and unrelated to the following, 
which only affects ~15% of Pass3 beam-tri and tritrig-beam-tri files but no other Pass3 MC components.

The beam-tri files that were read out 10-to-1 have the same problem with an inconsistent number of events, so it wasn't a problem with the time/space allotment for the jobs.
A few recon files with no time limit set for the jobs (100-to-1, labelled 'NOTIMELIMIT') made it through before the tape-writing glitch as well, and they have the same problem.

Digging a little further, this readout event inconsistency looks likely to be related to the stdhep file-reading problem that Jeremy found while fixing SLIC for v3-fieldmap, so I brought him into this.
Let me motivate that conclusion...

About 85% of Pass3 beam-tri readout files look fine, and then:
cat /work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_10to1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_*.txt | grep "^Read "
..........
Read 41911 events
Read 42775 events
Read 41551 events
Read 42055 events
Read 42556 events
Read 9 events
Read 7 events
Read 7 events
Read 3 events
Read 9 events
Read 10 events
Read 2 events
Read 13 events
Read 7 events
Read 41529 events
Read 8 events
Read 42149 events
Read 42141 events
Read 41933 events
Read 41856 events
Read 41711 events
Read 42038 events
Read 42004 events
Read 41997 events
Read 42029 events
Read 41764 events
Read 42156 events
Read 42245 events
Read 41732 events
Read 42060 events
Read 42070 events
Read 42060 events
Read 41962 events
Read 41967 events
Read 42071 events
Read 42067 events
Read 42017 events
Read 42046 events
Read 42614 events
Read 42655 events
Read 42337 events
Read 42342 events
Read 42503 events
Read 42454 events
Read 42237 events
Read 42338 events
Read 42607 events
Read 41791 events
Read 42309 events
Read 3 events
Read 4 events
Read 7 events
Read 7 events
Read 4 events
Read 6 events
Read 7 events
Read 7 events
Read 4 events
Read 41993 events
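
As a quick tally of how many of these summary files are affected (just a sketch; the 1000-event cut is an arbitrary line between the ~42k-event files and the near-empty ones, and it assumes each *.txt summary has exactly one "Read N events" line):

grep -h "^Read " /work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_10to1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_*.txt | awk '{ if ($2 < 1000) bad++; else good++ } END { printf "%d good, %d affected\n", good, bad }'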

The affected 10-to-1 readout files are #51-60 and #91-100, which were made from SLIC files #501-600 and #901-1000.
For example:
cat /work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_10to1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_96.txt | grep "^Read "
/work/hallb/hps/mc_production/pass3/logs/readout/beam-tri/1pt05/egsv3-triv2-g4v1_10to1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_96.out
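
To spell out the 10-to-1 bookkeeping I'm assuming here: readout file N is built from SLIC files (N-1)*10+1 through N*10, so readout #96 corresponds to SLIC #951-960. A quick one-liner to check a given index:

N=96; echo "readout #$N <- SLIC #$(( (N-1)*10 + 1 ))-$(( N*10 ))"
(prints: readout #96 <- SLIC #951-960)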

Looking at the SLIC files that were used for readout (e.g. #951-960):
/work/hallb/hps/mc_production/pass3/logs/slic/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_952.out

This shows that stdhep is not reading the events from 1 out of every 25 beam.stdhep files using the Pass3 setup. 
The actual beam.stdhep files from this problem (#51-60 and #91-100 in /mss/hallb/hps/production/stdhep/beam/1pt05/) look fine.

Also, the Pass3 tritrig-beam-tri files, which are read out 1-to-1, occasionally contain no events. This means that when the beam-tri files are read out in larger quantities, these files without events shave off ~4000 events for each affected SLIC file used. This is probably why some of the original 100-to-1 beam-tri files appear light on events, and why it looks a lot worse at 10-to-1.
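
To make the arithmetic explicit (a rough sketch; the ~4200-per-input figure is just the ~420k that Sho quotes below for 100 SLIC inputs, divided by 100):

for k in 0 1 5 10; do echo "a 10-to-1 file with $k unreadable SLIC inputs -> ~$(( (10 - k) * 4200 )) events"; done
(k=0 gives ~42000, matching the good files above; k=10 gives ~0, matching the near-empty ones)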

The corresponding Pass2 readout/recon, which used the same seed and files as the problem ones, seem correct though: 
cat /work/hallb/hps/mc_production/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_s2d6_HPS-EngRun2015-Nominal-v1_3.4.0-20150710_singles1_9*.txt | grep "^Read "
cat /work/hallb/hps/mc_production/pass2/logs/readout/beam-tri/1pt05/egsv3-triv2-g4v1_s2d6_HPS-EngRun2015-Nominal-v3_3.4.0_singles1_*.out | grep "events "

In summary, this inconsistency at readout is due to beam.stdhep files occasionally being unreadable during Pass3 SLIC jobs.
It only affects beam-tri made using the updated SLIC and v3-fieldmap detector. 
I'll make a Jira item about it.

________________________________________
From: Nathan Baltzell <[log in to unmask]>
Sent: Thursday, November 5, 2015 4:26 AM
To: Bradley T Yale
Cc: Sho Uemura; Omar Moreno; Matthew Solt; Mathew Thomas Graham
Subject: Re: beam-tri singles MC problems

Probably should submit a CCPR on the failure to write to tape (including
an example failed jobid/url). I don't notice any related CCPRs in the system,
and no corresponding errors in the farm_outs.


On Nov 5, 2015, at 9:00 AM, Bradley T Yale <[log in to unmask]> wrote:

> Ok, I'll do those 10to1 as well to match everything else.
>
> By the way, the "failed" job status you see is because the trigger plots fail for some reason and so the entire job gets classified that way.
> All other output is fine though, and just can't be written to tape. That has never been an issue before, but I disabled the trigger plots for the latest batch just in case.
> It could just be something with the system. I'll see if it's resolved tomorrow.
>
> ________________________________________
> From: Sho Uemura <[log in to unmask]>
> Sent: Thursday, November 5, 2015 1:49 AM
> To: Bradley T Yale
> Cc: Omar Moreno; Matthew Solt; Mathew Thomas Graham; Nathan Baltzell
> Subject: Re: beam-tri singles MC problems
>
> pairs1 seems better - there are still quite a few files that run under,
> but maybe 75% have the right number (1 ms/file * 100 files * 20 kHz =
> 2000) of events.
>
> cat
> /work/hallb/hps/mc_production/pass3/data_quality/recon/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_pairs1_*.txt
> | grep "^Read "
> Read 111 events
> Read 1987 events
> Read 2014 events
> Read 2013 events
> Read 2094 events
> Read 2094 events
> Read 1989 events
> Read 2083 events
> Read 2070 events
> Read 1887 events
> Read 2007 events
> Read 1955 events
> Read 2037 events
> Read 2013 events
> Read 1991 events
> Read 1900 events
> Read 2002 events
> Read 1996 events
> Read 1835 events
> Read 85 events
> Read 1914 events
> Read 111 events
> Read 98 events
> Read 202 events
> Read 114 events
> Read 155 events
> Read 2007 events
> Read 59 events
> Read 1800 events
> Read 2052 events
>
>
> On Thu, 5 Nov 2015, Bradley T Yale wrote:
>
>> Everything is failing to write to tape.
>>
>> Maybe this is also the cause of the badly cached dst files you were seeing.
>>
>> I have no idea what is causing this. That's why I included Nathan in this.
>>
>>
>> On a side note, are you seeing the same inconsistency in pairs1 beam-tri, or just singles?
>>
>>
>> ________________________________
>> From: Bradley T Yale
>> Sent: Thursday, November 5, 2015 1:13 AM
>> To: Omar Moreno; Sho Uemura
>> Cc: Matthew Solt; Mathew Thomas Graham; Nathan Baltzell
>> Subject: Re: beam-tri singles MC problems
>>
>>
>> So, the 10to1 readout jobs successfully completed, but failed to write to tape:
>>
>> http://scicomp.jlab.org/scicomp/#/jasmine/jobs?requested=details&id=115214062
>>
>>
>> I'm trying again after setting 'Memory space' back to "1024 MB", which is what it had been before.
>>
>> Is there anything else that could be causing this?
>>
>>
>> ________________________________
>> From: Bradley T Yale
>> Sent: Wednesday, November 4, 2015 7:41 PM
>> To: Omar Moreno; Sho Uemura
>> Cc: Matthew Solt; Mathew Thomas Graham
>> Subject: Re: beam-tri singles MC problems
>>
>>
>> Sorry. The latest ones are being reconstructed now and labelled 'NOTIMELIMIT'. They shouldn't take long once active. Their readout was run without a time limit as an attempt to fix the problem, but just in case, I'm also reading out others 10-to-1 (labelled '10to1'), and will probably start doing it that way so readout doesn't take forever.
>>
>>
>>
>> ________________________________
>> From: [log in to unmask] <[log in to unmask]> on behalf of Omar Moreno <[log in to unmask]>
>> Sent: Wednesday, November 4, 2015 4:24 PM
>> To: Sho Uemura
>> Cc: Bradley T Yale; Omar Moreno; Matthew Solt; Mathew Thomas Graham
>> Subject: Re: beam-tri singles MC problems
>>
>> Any news on this?  I'm transferring all of the beam-tri files over to SLAC and I'm noticing that they are still all random sizes.
>>
>> On Fri, Oct 23, 2015 at 3:33 PM, Sho Uemura <[log in to unmask]> wrote:
>> Hi Brad,
>>
>> 1. readout files seem to be really random lengths:
>>
>> cat /work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_*|grep "^Read "|less
>>
>> Read 52 events
>> Read 16814 events
>> Read 17062 events
>> Read 12543 events
>> Read 328300 events
>> Read 355896 events
>> Read 12912 events
>> Read 309460 events
>> Read 306093 events
>> Read 313868 events
>> Read 325727 events
>> Read 298129 events
>> Read 417300 events
>> Read 423734 events
>> Read 308954 events
>> Read 365261 events
>> Read 301648 events
>> Read 316249 events
>> Read 340949 events
>> Read 319316 events
>> Read 424033 events
>> Read 308746 events
>> Read 317204 events
>> Read 12363 events
>> Read 355813 events
>> Read 329739 events
>> Read 298601 events
>> Read 29700 events
>> Read 12675 events
>> Read 287237 events
>> Read 311071 events
>> Read 12406 events
>> Read 12719 events
>> Read 30428 events
>> Read 324795 events
>> Read 345850 events
>> Read 25765 events
>> Read 29806 events
>> Read 77 events
>> Read 12544 events
>> Read 372642 events
>> Read 12779 events
>>
>> which makes it seem like jobs are failing randomly or something - I think normally we see most files have the same length, and a minority of files (missing some input files, or whatever) are shorter. In this case I think the expected number of events (number of triggers from 100 SLIC output files) is roughly 420k, and as you can see only a few files get there.
>>
>> I looked at log files and I don't see any obvious error messages, but maybe you have ideas? I'll keep digging.
>>
>> 2. Looks like the singles recon jobs are running into the job disk space limit, so that while readout files can have as many as 420k events, recon files never have more than 240k. Looks like the disk limit is set to 5 GB (and a 240k-event LCIO recon file is 5.5 GB), but it needs to be at least doubled - or the number of SLIC files per readout job needs to be reduced?
>>
>> cat /work/hallb/hps/mc_production/pass3/data_quality/recon/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_*|grep "^Read "|less
>> Read 1 events
>> Read 16814 events
>> Read 17062 events
>> Read 242359 events
>> Read 243949 events
>> Read 242153 events
>> Read 12776 events
>> Read 242666 events
>> Read 244165 events
>> Read 243592 events
>> Read 243433 events
>> Read 242878 events
>> Read 241861 events
>> Read 242055 events
>> Read 30428 events
>> Read 243156 events
>> Read 241638 events
>> Read 4 events
>> Read 241882 events
>>
>> From /work/hallb/hps/mc_production/pass3/logs/recon/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_22.err:
>>
>> java.lang.RuntimeException: Error writing LCIO file
>>       at org.lcsim.util.loop.LCIODriver.process(LCIODriver.java:116)
>>       at org.lcsim.util.Driver.doProcess(Driver.java:261)
>>       at org.lcsim.util.Driver.processChildren(Driver.java:271)
>>       at org.lcsim.util.Driver.process(Driver.java:187)
>>       at org.lcsim.util.DriverAdapter.recordSupplied(DriverAdapter.java:74)
>>       at org.freehep.record.loop.DefaultRecordLoop.consumeRecord(DefaultRecordLoop.java:832)
>>       at org.freehep.record.loop.DefaultRecordLoop.loop(DefaultRecordLoop.java:668)
>>       at org.freehep.record.loop.DefaultRecordLoop.execute(DefaultRecordLoop.java:566)
>>       at org.lcsim.util.loop.LCSimLoop.loop(LCSimLoop.java:151)
>>       at org.lcsim.job.JobControlManager.run(JobControlManager.java:431)
>>       at org.hps.job.JobManager.run(JobManager.java:71)
>>       at org.lcsim.job.JobControlManager.run(JobControlManager.java:189)
>>       at org.hps.job.JobManager.main(JobManager.java:26)
>> Caused by: java.io.IOException: File too large
>>       at java.io.FileOutputStream.writeBytes(Native Method)
>>       at java.io.FileOutputStream.write(FileOutputStream.java:345)
>>       at hep.io.xdr.XDROutputStream$CountedOutputStream.write(XDROutputStream.java:103)
>>       at java.io.DataOutputStream.write(DataOutputStream.java:107)
>>       at hep.io.sio.SIOWriter$SIOByteArrayOutputStream.writeTo(SIOWriter.java:286)
>>       at hep.io.sio.SIOWriter.flushRecord(SIOWriter.java:208)
>>       at hep.io.sio.SIOWriter.createRecord(SIOWriter.java:83)
>>       at org.lcsim.lcio.LCIOWriter.write(LCIOWriter.java:251)
>>       at org.lcsim.util.loop.LCIODriver.process(LCIODriver.java:114)
>>       ... 12 more
>>
>>
>> Thanks. No rush on these, I imagine that even if the problems were fixed before/during the collaboration meeting we would not have time to use the files.
>>
>>
