HPS-SOFTWARE Archives

HPS-SOFTWARE@LISTSERV.SLAC.STANFORD.EDU



Subject:

Re: beam-tri singles MC problems

From:

Bradley T Yale <[log in to unmask]>

Reply-To:

Software for the Heavy Photon Search Experiment <[log in to unmask]>

Date:

Thu, 5 Nov 2015 23:26:18 +0000

Content-Type:

text/plain

Parts/Attachments:

text/plain (326 lines)

First, I submitted a report about those otherwise successful jobs not being written to tape, and it turned out to be a system glitch. It appears fixed now and unrelated to the following, 
which only affects ~15% of Pass3 beam-tri and tritrig-beam-tri files but no other Pass3 MC components.

The beam-tri files that were read out 10-to-1 have the same problem with an inconsistent number of events, so it wasn't a problem with the time/space allotment for the jobs.
A few recon files with no time limit set for the jobs (100-to-1, labelled 'NOTIMELIMIT') made it through before the tape-writing glitch as well, and they have the same problem.

Digging a little further, this readout event inconsistency looks likely to be related to the stdhep file-reading problem that Jeremy found while fixing SLIC for v3-fieldmap, so I brought him into this.
Let me motivate that conclusion...

About 85% of Pass3 beam-tri readout files look fine, and then:
cat /work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_10to1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_*.txt | grep "^Read "
..........
Read 41911 events
Read 42775 events
Read 41551 events
Read 42055 events
Read 42556 events
Read 9 events
Read 7 events
Read 7 events
Read 3 events
Read 9 events
Read 10 events
Read 2 events
Read 13 events
Read 7 events
Read 41529 events
Read 8 events
Read 42149 events
Read 42141 events
Read 41933 events
Read 41856 events
Read 41711 events
Read 42038 events
Read 42004 events
Read 41997 events
Read 42029 events
Read 41764 events
Read 42156 events
Read 42245 events
Read 41732 events
Read 42060 events
Read 42070 events
Read 42060 events
Read 41962 events
Read 41967 events
Read 42071 events
Read 42067 events
Read 42017 events
Read 42046 events
Read 42614 events
Read 42655 events
Read 42337 events
Read 42342 events
Read 42503 events
Read 42454 events
Read 42237 events
Read 42338 events
Read 42607 events
Read 41791 events
Read 42309 events
Read 3 events
Read 4 events
Read 7 events
Read 7 events
Read 4 events
Read 6 events
Read 7 events
Read 7 events
Read 4 events
Read 41993 events
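
As a quick tally of how many of these summary files are affected (just a sketch; the 1000-event cut is an arbitrary line between the ~42k-event files and the near-empty ones, and it assumes each *.txt summary has exactly one "Read N events" line):

grep -h "^Read " /work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_10to1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_*.txt | awk '{ if ($2 < 1000) bad++; else good++ } END { printf "%d good, %d affected\n", good, bad }'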

The affected 10-to-1 readout files are #51-60 and #91-100, which were made from SLIC files #501-600 and #901-1000.
For example:
cat /work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_10to1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_96.txt | grep "^Read "
/work/hallb/hps/mc_production/pass3/logs/readout/beam-tri/1pt05/egsv3-triv2-g4v1_10to1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_96.out
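
To spell out the 10-to-1 bookkeeping I'm assuming here: readout file N is built from SLIC files (N-1)*10+1 through N*10, so readout #96 corresponds to SLIC #951-960. A quick one-liner to check a given index:

N=96; echo "readout #$N <- SLIC #$(( (N-1)*10 + 1 ))-$(( N*10 ))"
(prints: readout #96 <- SLIC #951-960)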

Looking at the SLIC files that were used for readout (e.g. #951-960):
/work/hallb/hps/mc_production/pass3/logs/slic/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_952.out

This shows that stdhep is not reading the events from 1 out of every 25 beam.stdhep files using the Pass3 setup. 
The actual beam.stdhep files from this problem (#51-60 and #91-100 in /mss/hallb/hps/production/stdhep/beam/1pt05/) look fine.

Also, the Pass3 tritrig-beam-tri files, which are read out 1-to-1, occasionally contain no events. This means that when the beam-tri files are read out in larger quantities, these files without events shave off ~4000 events for each affected SLIC file used. This is probably why some of the original 100-to-1 beam-tri files appear light on events, and why it looks a lot worse at 10-to-1.
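
To make the arithmetic explicit (a rough sketch; the ~4200-per-input figure is just the ~420k that Sho quotes below for 100 SLIC inputs, divided by 100):

for k in 0 1 5 10; do echo "a 10-to-1 file with $k unreadable SLIC inputs -> ~$(( (10 - k) * 4200 )) events"; done
(k=0 gives ~42000, matching the good files above; k=10 gives ~0, matching the near-empty ones)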

The corresponding Pass2 readout/recon, which used the same seed and files as the problem ones, seem correct though: 
cat /work/hallb/hps/mc_production/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_s2d6_HPS-EngRun2015-Nominal-v1_3.4.0-20150710_singles1_9*.txt | grep "^Read "
cat /work/hallb/hps/mc_production/pass2/logs/readout/beam-tri/1pt05/egsv3-triv2-g4v1_s2d6_HPS-EngRun2015-Nominal-v3_3.4.0_singles1_*.out | grep "events "

In summary, this inconsistency at readout is due to beam.stdhep files occasionally being unreadable during Pass3 SLIC jobs.
It only affects beam-tri made using the updated SLIC and v3-fieldmap detector. 
I'll make a Jira item about it.

________________________________________
From: Nathan Baltzell <[log in to unmask]>
Sent: Thursday, November 5, 2015 4:26 AM
To: Bradley T Yale
Cc: Sho Uemura; Omar Moreno; Matthew Solt; Mathew Thomas Graham
Subject: Re: beam-tri singles MC problems

Probably should submit a CCPR on the failure to write to tape (including
an example failed jobid/url). I don't notice any related CCPRs in the system,
and no corresponding errors in the farm_outs.


On Nov 5, 2015, at 9:00 AM, Bradley T Yale <[log in to unmask]> wrote:

> Ok, I'll do those 10to1 as well to match everything else.
>
> By the way, the "failed" job status you see is because the trigger plots fail for some reason and so the entire job gets classified that way.
> All other output is fine though, and just can't be written to tape. That has never been an issue before, but I disabled the trigger plots for the latest batch just in case.
> It could just be something with the system. I'll see if it's resolved tomorrow.
>
> ________________________________________
> From: Sho Uemura <[log in to unmask]>
> Sent: Thursday, November 5, 2015 1:49 AM
> To: Bradley T Yale
> Cc: Omar Moreno; Matthew Solt; Mathew Thomas Graham; Nathan Baltzell
> Subject: Re: beam-tri singles MC problems
>
> pairs1 seems better - there are still quite a few files that run under,
> but maybe 75% have the right number (1 ms/file * 100 files * 20 kHz =
> 2000) of events.
>
> cat
> /work/hallb/hps/mc_production/pass3/data_quality/recon/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_pairs1_*.txt
> | grep "^Read "
> Read 111 events
> Read 1987 events
> Read 2014 events
> Read 2013 events
> Read 2094 events
> Read 2094 events
> Read 1989 events
> Read 2083 events
> Read 2070 events
> Read 1887 events
> Read 2007 events
> Read 1955 events
> Read 2037 events
> Read 2013 events
> Read 1991 events
> Read 1900 events
> Read 2002 events
> Read 1996 events
> Read 1835 events
> Read 85 events
> Read 1914 events
> Read 111 events
> Read 98 events
> Read 202 events
> Read 114 events
> Read 155 events
> Read 2007 events
> Read 59 events
> Read 1800 events
> Read 2052 events
>
>
> On Thu, 5 Nov 2015, Bradley T Yale wrote:
>
>> Everything is failing to write to tape.
>>
>> Maybe this is also the cause of the badly cached dst files you were seeing.
>>
>> I have no idea what is causing this. That's why I included Nathan in this.
>>
>>
>> On a side note, are you seeing the same inconsistency in pairs1 beam-tri, or just singles?
>>
>>
>> ________________________________
>> From: Bradley T Yale
>> Sent: Thursday, November 5, 2015 1:13 AM
>> To: Omar Moreno; Sho Uemura
>> Cc: Matthew Solt; Mathew Thomas Graham; Nathan Baltzell
>> Subject: Re: beam-tri singles MC problems
>>
>>
>> So, the 10to1 readout jobs successfully completed, but failed to write to tape:
>>
>> http://scicomp.jlab.org/scicomp/#/jasmine/jobs?requested=details&id=115214062
>>
>>
>> I'm trying again after setting 'Memory space' back to "1024 MB", which is what it had been before.
>>
>> Is there anything else that could be causing this?
>>
>>
>> ________________________________
>> From: Bradley T Yale
>> Sent: Wednesday, November 4, 2015 7:41 PM
>> To: Omar Moreno; Sho Uemura
>> Cc: Matthew Solt; Mathew Thomas Graham
>> Subject: Re: beam-tri singles MC problems
>>
>>
>> Sorry. The latest ones are being reconstructed now and labelled 'NOTIMELIMIT'. They shouldn't take long once active. Their readout was run without a time limit as an attempt to fix the problem, but just in case, I'm also reading out others 10-to-1 (labelled '10to1'), and will probably start doing it that way so readout doesn't take forever.
>>
>>
>>
>> ________________________________
>> From: [log in to unmask] <[log in to unmask]> on behalf of Omar Moreno <[log in to unmask]>
>> Sent: Wednesday, November 4, 2015 4:24 PM
>> To: Sho Uemura
>> Cc: Bradley T Yale; Omar Moreno; Matthew Solt; Mathew Thomas Graham
>> Subject: Re: beam-tri singles MC problems
>>
>> Any news on this?  I'm transferring all of the beam-tri files over to SLAC and I'm noticing that they are still all random sizes.
>>
>> On Fri, Oct 23, 2015 at 3:33 PM, Sho Uemura <[log in to unmask]> wrote:
>> Hi Brad,
>>
>> 1. readout files seem to be really random lengths:
>>
>> cat /work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_*|grep "^Read "|less
>>
>> Read 52 events
>> Read 16814 events
>> Read 17062 events
>> Read 12543 events
>> Read 328300 events
>> Read 355896 events
>> Read 12912 events
>> Read 309460 events
>> Read 306093 events
>> Read 313868 events
>> Read 325727 events
>> Read 298129 events
>> Read 417300 events
>> Read 423734 events
>> Read 308954 events
>> Read 365261 events
>> Read 301648 events
>> Read 316249 events
>> Read 340949 events
>> Read 319316 events
>> Read 424033 events
>> Read 308746 events
>> Read 317204 events
>> Read 12363 events
>> Read 355813 events
>> Read 329739 events
>> Read 298601 events
>> Read 29700 events
>> Read 12675 events
>> Read 287237 events
>> Read 311071 events
>> Read 12406 events
>> Read 12719 events
>> Read 30428 events
>> Read 324795 events
>> Read 345850 events
>> Read 25765 events
>> Read 29806 events
>> Read 77 events
>> Read 12544 events
>> Read 372642 events
>> Read 12779 events
>>
>> which makes it seem like jobs are failing randomly or something - I think normally we see most files have the same length, and a minority of files (missing some input files, or whatever) are shorter. In this case I think the expected number of events (number of triggers from 100 SLIC output files) is roughly 420k, and as you can see only a few files get there.
>>
>> I looked at log files and I don't see any obvious error messages, but maybe you have ideas? I'll keep digging.
>>
>> 2. Looks like the singles recon jobs are running into the job disk space limit, so that while readout files can have as many as 420k events, recon files never have more than 240k. Looks like the disk limit is set to 5 GB (and a 240k-event LCIO recon file is 5.5 GB), but it needs to be at least doubled - or the number of SLIC files per readout job needs to be reduced?
>>
>> cat /work/hallb/hps/mc_production/pass3/data_quality/recon/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_*|grep "^Read "|less
>> Read 1 events
>> Read 16814 events
>> Read 17062 events
>> Read 242359 events
>> Read 243949 events
>> Read 242153 events
>> Read 12776 events
>> Read 242666 events
>> Read 244165 events
>> Read 243592 events
>> Read 243433 events
>> Read 242878 events
>> Read 241861 events
>> Read 242055 events
>> Read 30428 events
>> Read 243156 events
>> Read 241638 events
>> Read 4 events
>> Read 241882 events
>>
>> From /work/hallb/hps/mc_production/pass3/logs/recon/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_22.err:
>>
>> java.lang.RuntimeException: Error writing LCIO file
>>       at org.lcsim.util.loop.LCIODriver.process(LCIODriver.java:116)
>>       at org.lcsim.util.Driver.doProcess(Driver.java:261)
>>       at org.lcsim.util.Driver.processChildren(Driver.java:271)
>>       at org.lcsim.util.Driver.process(Driver.java:187)
>>       at org.lcsim.util.DriverAdapter.recordSupplied(DriverAdapter.java:74)
>>       at org.freehep.record.loop.DefaultRecordLoop.consumeRecord(DefaultRecordLoop.java:832)
>>       at org.freehep.record.loop.DefaultRecordLoop.loop(DefaultRecordLoop.java:668)
>>       at org.freehep.record.loop.DefaultRecordLoop.execute(DefaultRecordLoop.java:566)
>>       at org.lcsim.util.loop.LCSimLoop.loop(LCSimLoop.java:151)
>>       at org.lcsim.job.JobControlManager.run(JobControlManager.java:431)
>>       at org.hps.job.JobManager.run(JobManager.java:71)
>>       at org.lcsim.job.JobControlManager.run(JobControlManager.java:189)
>>       at org.hps.job.JobManager.main(JobManager.java:26)
>> Caused by: java.io.IOException: File too large
>>       at java.io.FileOutputStream.writeBytes(Native Method)
>>       at java.io.FileOutputStream.write(FileOutputStream.java:345)
>>       at hep.io.xdr.XDROutputStream$CountedOutputStream.write(XDROutputStream.java:103)
>>       at java.io.DataOutputStream.write(DataOutputStream.java:107)
>>       at hep.io.sio.SIOWriter$SIOByteArrayOutputStream.writeTo(SIOWriter.java:286)
>>       at hep.io.sio.SIOWriter.flushRecord(SIOWriter.java:208)
>>       at hep.io.sio.SIOWriter.createRecord(SIOWriter.java:83)
>>       at org.lcsim.lcio.LCIOWriter.write(LCIOWriter.java:251)
>>       at org.lcsim.util.loop.LCIODriver.process(LCIODriver.java:114)
>>       ... 12 more
>>
>>
>> Thanks. No rush on these, I imagine that even if the problems were fixed before/during the collaboration meeting we would not have time to use the files.
>>
>>
