I re-ran the problem files in quarantine, and it looks like the same stdhep files are being read correctly now:
/work/hallb/hps/mc_production/pass3/test/logs/slic/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_952.out

Maybe the latest SLIC update worked. I can redo the (tritrig)-beam-tri and see if it's fixed.
As mentioned, I don't see this problem in the other MC components, only beam-tri.
If you REALLY want to be safe, I can re-run everything, but I would at least like to do it with a post-release jar (3.4.2-SNAPSHOT or 3.4.2-20151014.013425-5) so we can test current things.
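
To check a rerun quickly, something like the following could flag the short files from the same `grep "^Read "` output used further down this thread. This is only a sketch: the function names and the 50%-of-median cutoff are my own assumptions, not part of any HPS tool.

```python
import re
from statistics import median

# Matches the data-quality lines quoted in this thread, e.g. "Read 41911 events".
READ_LINE = re.compile(r"^Read (\d+) events")


def read_counts(lines):
    """Extract N from every 'Read N events' line, in input order."""
    counts = []
    for line in lines:
        match = READ_LINE.match(line.strip())
        if match:
            counts.append(int(match.group(1)))
    return counts


def flag_short(counts, fraction=0.5):
    """Return (position, count) pairs whose count is below fraction * median.

    The cutoff fraction is an arbitrary assumption; tune it as needed.
    """
    if not counts:
        return []
    cutoff = fraction * median(counts)
    return [(i, n) for i, n in enumerate(counts) if n < cutoff]


if __name__ == "__main__":
    sample = [
        "Read 41911 events",
        "Read 9 events",
        "Read 42775 events",
        "Read 7 events",
        "Read 41551 events",
    ]
    for position, count in flag_short(read_counts(sample)):
        print(f"line {position}: only {count} events")
```

Note that the position is the line's place in the grep output, not the file number, since the shell expands `*` in lexicographic order (1, 10, 100, 11, ...).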

________________________________________
From: Sho Uemura <[log in to unmask]>
Sent: Thursday, November 5, 2015 9:39 PM
To: Bradley T Yale
Subject: Re: beam-tri singles MC problems

I tried running beam_coords on the farm
(/work/hallb/hps/uemura/bradtest/beam-tri_100.xml, logfiles and output in
same directory) and it works fine.

I looked at beam-tri logs for pass2 for the same files, and they are fine.

So this stuff worked in pass2, broke in pass3, works again now, but
nothing has changed - same stdhep file, and the beam_coords binary hasn't
changed.

Can you try rerunning the slic beam-tri job? It could be something weird
like jcache screwing up and not copying the file correctly from tape -
that would affect every job in that run but not runs before or after.
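
When rerunning, the symptom from my earlier message (quoted below) could be checked automatically: a good .out file should mention "Rotating Beam" and list rot_beam.stdhep. A rough sketch, assuming both strings appear verbatim somewhere in the log; the exact log layout is not shown in this thread, and the function name is made up:

```python
from pathlib import Path


def beam_coords_looks_ok(log_text, marker="Rotating Beam", product="rot_beam.stdhep"):
    """Return True if the log contains both the step marker and the output file name.

    Assumption: both strings appear verbatim in a healthy SLIC .out log.
    """
    return marker in log_text and product in log_text


def check_log_file(path):
    """Read one .out file (tolerating odd bytes) and apply the check above."""
    text = Path(path).read_text(errors="replace")
    return beam_coords_looks_ok(text)
```

Running `check_log_file` over the .out files of a run would list the SLIC jobs to resubmit; it says nothing about why beam_coords failed.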

On Thu, 5 Nov 2015, Sho Uemura wrote:

> It looks like the problem is that beam_coords is having trouble reading the
> beam.stdhep file and crashes, and so the beam-tri.stdhep file that goes into
> SLIC is missing all the beam background, and the trigger rate ends up being
> ridiculously low. Of course this affects every SLIC run that uses that
> beam.stdhep file.
>
> I get that from looking at
> /work/hallb/hps/mc_production/pass3/logs/slic/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_952.err
> and
> /work/hallb/hps/mc_production/pass3/logs/slic/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_952.out
> and comparing to other runs - you'll see that the .out file is missing some
> printouts after "Rotating Beam" and rot_beam.stdhep is missing from the file
> list. For example, one of the first things beam_coords should print is the
> number of events in the input file.
>
> So there must be something wrong with that stdhep file, but it has nothing to
> do with SLIC. Is it possible that this has always been happening, in pass2
> and earlier? I'll look at log files.
>
> Weirdly I have no difficulty running beam_coords on egsv3_10.stdhep on ifarm.
> Maybe there's something different about the batch farm environment?
>
> The bad news is that this must affect every MC that has beam background or
> beam-tri mixed in.
>
> On Thu, 5 Nov 2015, Bradley T Yale wrote:
>
>> First, I submitted a report about those otherwise successful jobs not being
>> written to tape, and it turned out to be a system glitch. It appears fixed
>> now and unrelated to the following,
>> which only affects ~15% of Pass3 beam-tri and tritrig-beam-tri files but no
>> other Pass3 MC components.
>>
>> The beam-tri files that were read out 10-to-1 have the same problem with an
>> inconsistent # of events, so it wasn't a problem with time/space allotment
>> for the jobs.
>> A few recon files with no time limit set for the jobs (100-to-1, labelled
>> 'NOTIMELIMIT') made it through before the tape-writing glitch as well, and
>> have the same problem.
>>
>> Digging a little further, it appears that this issue with readout event
>> inconsistency is likely related to the stdhep file-reading problem that
>> Jeremy found while fixing SLIC for v3-fieldmap, so I brought him into this.
>> Let me motivate that conclusion...
>>
>> About 85% of Pass3 beam-tri readout files look fine, and then:
>> cat
>> /work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_10to1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_*.txt
>> | grep "^Read "
>> ..........
>> Read 41911 events
>> Read 42775 events
>> Read 41551 events
>> Read 42055 events
>> Read 42556 events
>> Read 9 events
>> Read 7 events
>> Read 7 events
>> Read 3 events
>> Read 9 events
>> Read 10 events
>> Read 2 events
>> Read 13 events
>> Read 7 events
>> Read 41529 events
>> Read 8 events
>> Read 42149 events
>> Read 42141 events
>> Read 41933 events
>> Read 41856 events
>> Read 41711 events
>> Read 42038 events
>> Read 42004 events
>> Read 41997 events
>> Read 42029 events
>> Read 41764 events
>> Read 42156 events
>> Read 42245 events
>> Read 41732 events
>> Read 42060 events
>> Read 42070 events
>> Read 42060 events
>> Read 41962 events
>> Read 41967 events
>> Read 42071 events
>> Read 42067 events
>> Read 42017 events
>> Read 42046 events
>> Read 42614 events
>> Read 42655 events
>> Read 42337 events
>> Read 42342 events
>> Read 42503 events
>> Read 42454 events
>> Read 42237 events
>> Read 42338 events
>> Read 42607 events
>> Read 41791 events
>> Read 42309 events
>> Read 3 events
>> Read 4 events
>> Read 7 events
>> Read 7 events
>> Read 4 events
>> Read 6 events
>> Read 7 events
>> Read 7 events
>> Read 4 events
>> Read 41993 events
>>
>> The affected 10-to-1 readout files are #51-60 and #91-100, which were made
>> from SLIC files #501-600, and #901-1000.
>> For example:
>> cat
>> /work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_10to1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_96.txt
>> | grep "^Read "
>> /work/hallb/hps/mc_production/pass3/logs/readout/beam-tri/1pt05/egsv3-triv2-g4v1_10to1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_96.out
>>
>> Looking at the SLIC files that were used for readout (e.g. #951-960):
>> /work/hallb/hps/mc_production/pass3/logs/slic/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_952.out
>>
>> This shows that stdhep is not reading the events from 1 out of every 25
>> beam.stdhep files using the Pass3 setup.
>> The actual beam.stdhep files from this problem (#51-60 and #91-100 in
>> /mss/hallb/hps/production/stdhep/beam/1pt05/) look fine.
>>
>> Also, Pass3 tritrig-beam-tri files, which are read out 1-to-1, occasionally
>> contain no events. This means that when the beam-tri files are
>> read out in larger quantities, these files without events shave off ~4000
>> events for each affected SLIC file used. This is probably why some of the
>> original 100-to-1 beam-tri files appear light on events, and why the problem
>> looks a lot worse with 10-to-1.
>>
>> The corresponding Pass2 readout/recon, which used the same seed and files
>> as the problem ones, seems correct though:
>> cat
>> /work/hallb/hps/mc_production/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_s2d6_HPS-EngRun2015-Nominal-v1_3.4.0-20150710_singles1_9*.txt
>> | grep "^Read "
>> cat
>> /work/hallb/hps/mc_production/pass2/logs/readout/beam-tri/1pt05/egsv3-triv2-g4v1_s2d6_HPS-EngRun2015-Nominal-v3_3.4.0_singles1_*.out
>> | grep "events "
>>
>> In summary, this inconsistency at readout is due to beam.stdhep files
>> occasionally failing to be read during Pass3 SLIC jobs.
>> It only affects beam-tri made using the updated SLIC and v3-fieldmap
>> detector.
>> I'll make a Jira item about it.
>>
>> ________________________________________
>> From: Nathan Baltzell <[log in to unmask]>
>> Sent: Thursday, November 5, 2015 4:26 AM
>> To: Bradley T Yale
>> Cc: Sho Uemura; Omar Moreno; Matthew Solt; Mathew Thomas Graham
>> Subject: Re: beam-tri singles MC problems
>>
>> Probably should submit a CCPR on the failing to write to tape (including
>> an example failed jobid/url).  I don't notice any related CCPRs in the
>> system,
>> and no corresponding errors in the farm_outs.
>>
>>
>> On Nov 5, 2015, at 9:00 AM, Bradley T Yale <[log in to unmask]> wrote:
>>
>>> Ok, I'll do those 10to1 as well to match everything else.
>>>
>>> By the way, the "failed" job status you see is because the trigger plots
>>> fail for some reason and so the entire job gets classified that way.
>>> All other output is fine though, and just can't be written to tape. That
>>> has never been an issue before, but I disabled the trigger plots for the
>>> latest batch just in case.
>>> It could just be something with the system. I'll see if it's resolved
>>> tomorrow.
>>>
>>> ________________________________________
>>> From: Sho Uemura <[log in to unmask]>
>>> Sent: Thursday, November 5, 2015 1:49 AM
>>> To: Bradley T Yale
>>> Cc: Omar Moreno; Matthew Solt; Mathew Thomas Graham; Nathan Baltzell
>>> Subject: Re: beam-tri singles MC problems
>>>
>>> pairs1 seems better - there are still quite a few files that run under,
>>> but maybe 75% have the right number (1 ms/file * 100 files * 20 kHz =
>>> 2000) of events.
>>>
>>> cat
>>> /work/hallb/hps/mc_production/pass3/data_quality/recon/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_pairs1_*.txt
>>> | grep "^Read "
>>> Read 111 events
>>> Read 1987 events
>>> Read 2014 events
>>> Read 2013 events
>>> Read 2094 events
>>> Read 2094 events
>>> Read 1989 events
>>> Read 2083 events
>>> Read 2070 events
>>> Read 1887 events
>>> Read 2007 events
>>> Read 1955 events
>>> Read 2037 events
>>> Read 2013 events
>>> Read 1991 events
>>> Read 1900 events
>>> Read 2002 events
>>> Read 1996 events
>>> Read 1835 events
>>> Read 85 events
>>> Read 1914 events
>>> Read 111 events
>>> Read 98 events
>>> Read 202 events
>>> Read 114 events
>>> Read 155 events
>>> Read 2007 events
>>> Read 59 events
>>> Read 1800 events
>>> Read 2052 events
>>>
>>>
>>> On Thu, 5 Nov 2015, Bradley T Yale wrote:
>>>
>>>> Everything is failing to write to tape.
>>>>
>>>> Maybe this is also the cause of the badly cached dst files you were
>>>> seeing.
>>>>
>>>> I have no idea what is causing this. That's why I included Nathan in
>>>> this.
>>>>
>>>>
>>>> On a side note, are you seeing the same inconsistency in pairs1 beam-tri,
>>>> or just singles?
>>>>
>>>>
>>>> ________________________________
>>>> From: Bradley T Yale
>>>> Sent: Thursday, November 5, 2015 1:13 AM
>>>> To: Omar Moreno; Sho Uemura
>>>> Cc: Matthew Solt; Mathew Thomas Graham; Nathan Baltzell
>>>> Subject: Re: beam-tri singles MC problems
>>>>
>>>>
>>>> So, the 10to1 readout jobs successfully completed, but failed to write to
>>>> tape:
>>>>
>>>> http://scicomp.jlab.org/scicomp/#/jasmine/jobs?requested=details&id=115214062
>>>>
>>>>
>>>> I'm trying again after setting 'Memory space' back to "1024 MB", which is
>>>> what it had been before.
>>>>
>>>> Is there anything else that could be causing this?
>>>>
>>>>
>>>> ________________________________
>>>> From: Bradley T Yale
>>>> Sent: Wednesday, November 4, 2015 7:41 PM
>>>> To: Omar Moreno; Sho Uemura
>>>> Cc: Matthew Solt; Mathew Thomas Graham
>>>> Subject: Re: beam-tri singles MC problems
>>>>
>>>>
>>>> Sorry. The latest ones are being reconstructed now and labelled
>>>> 'NOTIMELIMIT'. They shouldn't take long once active. Their readout did
>>>> not have a time limit to try to fix the problem, but just in case, I'm
>>>> also reading out others 10-to-1 (labelled '10to1') and will probably
>>>> start doing it that way so readout doesn't take forever.
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: [log in to unmask] <[log in to unmask]> on behalf of Omar
>>>> Moreno <[log in to unmask]>
>>>> Sent: Wednesday, November 4, 2015 4:24 PM
>>>> To: Sho Uemura
>>>> Cc: Bradley T Yale; Omar Moreno; Matthew Solt; Mathew Thomas Graham
>>>> Subject: Re: beam-tri singles MC problems
>>>>
>>>> Any news on this?  I'm transferring all of the beam-tri files over to
>>>> SLAC and I'm noticing that they are still all random sizes.
>>>>
>>>> On Fri, Oct 23, 2015 at 3:33 PM, Sho Uemura <[log in to unmask]> wrote:
>>>> Hi Brad,
>>>>
>>>> 1. readout files seem to be really random lengths:
>>>>
>>>> cat
>>>> /work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_*|grep
>>>> "^Read "|less
>>>>
>>>> Read 52 events
>>>> Read 16814 events
>>>> Read 17062 events
>>>> Read 12543 events
>>>> Read 328300 events
>>>> Read 355896 events
>>>> Read 12912 events
>>>> Read 309460 events
>>>> Read 306093 events
>>>> Read 313868 events
>>>> Read 325727 events
>>>> Read 298129 events
>>>> Read 417300 events
>>>> Read 423734 events
>>>> Read 308954 events
>>>> Read 365261 events
>>>> Read 301648 events
>>>> Read 316249 events
>>>> Read 340949 events
>>>> Read 319316 events
>>>> Read 424033 events
>>>> Read 308746 events
>>>> Read 317204 events
>>>> Read 12363 events
>>>> Read 355813 events
>>>> Read 329739 events
>>>> Read 298601 events
>>>> Read 29700 events
>>>> Read 12675 events
>>>> Read 287237 events
>>>> Read 311071 events
>>>> Read 12406 events
>>>> Read 12719 events
>>>> Read 30428 events
>>>> Read 324795 events
>>>> Read 345850 events
>>>> Read 25765 events
>>>> Read 29806 events
>>>> Read 77 events
>>>> Read 12544 events
>>>> Read 372642 events
>>>> Read 12779 events
>>>>
>>>> which makes it seem like jobs are failing randomly or something - I think
>>>> normally we see most files have the same length, and a minority of files
>>>> (missing some input files, or whatever) are shorter. In this case I think
>>>> the expected number of events (number of triggers from 100 SLIC output
>>>> files) is roughly 420k, and as you can see only a few files get there.
>>>>
>>>> I looked at log files and I don't see any obvious error messages, but
>>>> maybe you have ideas? I'll keep digging.
>>>>
>>>> 2. Looks like the singles recon jobs are running into the job disk space
>>>> limit, so that while readout files can have as many as 420k events, recon
>>>> files never have more than 240k. Looks like the disk limit is set to 5 GB
>>>> (and a 240k-event LCIO recon file is 5.5 GB), but it needs to be at least
>>>> doubled - or the number of SLIC files per readout job needs to be
>>>> reduced?
>>>>
>>>> cat
>>>> /work/hallb/hps/mc_production/pass3/data_quality/recon/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_*|grep
>>>> "^Read "|less
>>>> Read 1 events
>>>> Read 16814 events
>>>> Read 17062 events
>>>> Read 242359 events
>>>> Read 243949 events
>>>> Read 242153 events
>>>> Read 12776 events
>>>> Read 242666 events
>>>> Read 244165 events
>>>> Read 243592 events
>>>> Read 243433 events
>>>> Read 242878 events
>>>> Read 241861 events
>>>> Read 242055 events
>>>> Read 30428 events
>>>> Read 243156 events
>>>> Read 241638 events
>>>> Read 4 events
>>>> Read 241882 events
>>>>
>>>> From
>>>> /work/hallb/hps/mc_production/pass3/logs/recon/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_22.err:
>>>>
>>>> java.lang.RuntimeException: Error writing LCIO file
>>>>       at org.lcsim.util.loop.LCIODriver.process(LCIODriver.java:116)
>>>>       at org.lcsim.util.Driver.doProcess(Driver.java:261)
>>>>       at org.lcsim.util.Driver.processChildren(Driver.java:271)
>>>>       at org.lcsim.util.Driver.process(Driver.java:187)
>>>>       at
>>>> org.lcsim.util.DriverAdapter.recordSupplied(DriverAdapter.java:74)
>>>>       at
>>>> org.freehep.record.loop.DefaultRecordLoop.consumeRecord(DefaultRecordLoop.java:832)
>>>>       at
>>>> org.freehep.record.loop.DefaultRecordLoop.loop(DefaultRecordLoop.java:668)
>>>>       at
>>>> org.freehep.record.loop.DefaultRecordLoop.execute(DefaultRecordLoop.java:566)
>>>>       at org.lcsim.util.loop.LCSimLoop.loop(LCSimLoop.java:151)
>>>>       at org.lcsim.job.JobControlManager.run(JobControlManager.java:431)
>>>>       at org.hps.job.JobManager.run(JobManager.java:71)
>>>>       at org.lcsim.job.JobControlManager.run(JobControlManager.java:189)
>>>>       at org.hps.job.JobManager.main(JobManager.java:26)
>>>> Caused by: java.io.IOException: File too large
>>>>       at java.io.FileOutputStream.writeBytes(Native Method)
>>>>       at java.io.FileOutputStream.write(FileOutputStream.java:345)
>>>>       at
>>>> hep.io.xdr.XDROutputStream$CountedOutputStream.write(XDROutputStream.java:103)
>>>>       at java.io.DataOutputStream.write(DataOutputStream.java:107)
>>>>       at
>>>> hep.io.sio.SIOWriter$SIOByteArrayOutputStream.writeTo(SIOWriter.java:286)
>>>>       at hep.io.sio.SIOWriter.flushRecord(SIOWriter.java:208)
>>>>       at hep.io.sio.SIOWriter.createRecord(SIOWriter.java:83)
>>>>       at org.lcsim.lcio.LCIOWriter.write(LCIOWriter.java:251)
>>>>       at org.lcsim.util.loop.LCIODriver.process(LCIODriver.java:114)
>>>>       ... 12 more
>>>>
>>>>
>>>> Thanks. No rush on these, I imagine that even if the problems were fixed
>>>> before/during the collaboration meeting we would not have time to use the
>>>> files.
>>>>
>>>>
>>
>> ########################################################################
>> Use REPLY-ALL to reply to list
>>
>> To unsubscribe from the HPS-SOFTWARE list, click the following link:
>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=HPS-SOFTWARE&A=1
>>
>
