LISTSERV mailing list manager LISTSERV 16.5

HPS-SOFTWARE Archives

HPS-SOFTWARE@LISTSERV.SLAC.STANFORD.EDU

HPS-SOFTWARE November 2015
Subject: Re: beam-tri singles MC problems
From: Bradley T Yale <[log in to unmask]>
Reply-To: Software for the Heavy Photon Search Experiment <[log in to unmask]>
Date: Fri, 6 Nov 2015 15:29:43 +0000
Content-Type: text/plain (458 lines)

Re-running the problem files in quarantine, it looks like the same stdhep files are being read now:
/work/hallb/hps/mc_production/pass3/test/logs/slic/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_952.out

Maybe the latest SLIC update worked. I can redo the (tritrig)-beam-tri and see if it's fixed.
As mentioned, I don't see this problem in the other MC components, only beam-tri.
If you REALLY want to be safe, I can re-run everything, but I would at least like to do it with a post-release jar (3.4.2-SNAPSHOT or 3.4.2-20151014.013425-5) so we can test current things.
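For reference, the per-file event-count check used throughout this thread (`cat ... | grep "^Read "`) can be wrapped in a small helper that makes the short, broken files stand out. This is an illustrative sketch, not part of the mc_production scripts; `count_events` is a hypothetical name:

```shell
# Print "file: N events" for each data-quality summary file, so anomalously
# short files (e.g. "Read 9 events") are easy to spot.
# count_events is a hypothetical helper, not part of the production scripts.
count_events() {
    for f in "$@"; do
        n=$(grep -m1 '^Read ' "$f" | awk '{print $2}')
        printf '%s: %s events\n' "$f" "${n:-0}"
    done
}

# Usage against the path pattern quoted later in this thread:
# count_events /work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/*singles1_*.txt
```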

________________________________________
From: Sho Uemura <[log in to unmask]>
Sent: Thursday, November 5, 2015 9:39 PM
To: Bradley T Yale
Subject: Re: beam-tri singles MC problems

I tried running beam_coords on the farm
(/work/hallb/hps/uemura/bradtest/beam-tri_100.xml, logfiles and output in
same directory) and it works fine.

I looked at beam-tri logs for pass2 for the same files, and they are fine.

So this stuff worked in pass2, broke in pass3, works again now, but
nothing has changed - same stdhep file, and the beam_coords binary hasn't
changed.

Can you try rerunning the slic beam-tri job? It could be something weird
like jcache screwing up and not copying the file correctly from tape -
that would affect every job in that run but not runs before or after.

On Thu, 5 Nov 2015, Sho Uemura wrote:

> It looks like the problem is that beam_coords is having trouble reading the
> beam.stdhep file and crashes, and so the beam-tri.stdhep file that goes into
> SLIC is missing all the beam background, and the trigger rate ends up being
> ridiculously low. Of course this affects every SLIC run that uses that
> beam.stdhep file.
>
> I get that from looking at
> /work/hallb/hps/mc_production/pass3/logs/slic/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_952.err
> and
> /work/hallb/hps/mc_production/pass3/logs/slic/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_952.out
> and comparing to other runs - you'll see that the .out file is missing some
> printouts after "Rotating Beam" and rot_beam.stdhep is missing from the file
> list. For example, one of the first things beam_coords should print is the
> number of events in the input file.
>
> So there must be something wrong with that stdhep file, but it has nothing to
> do with SLIC. Is it possible that this has always been happening, in pass2
> and earlier? I'll look at log files.
>
> Weirdly I have no difficulty running beam_coords on egsv3_10.stdhep on ifarm.
> Maybe there's something different about the batch farm environment?
>
> The bad news is that this must affect every MC that has beam background or
> beam-tri mixed in.
>
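The failure signature Sho describes above (a .out file that reaches "Rotating Beam" but never prints the input event count) lends itself to a simple triage scan over the SLIC logs. A sketch only: the exact wording of the healthy beam_coords printout is an assumption here, so the `/events/` pattern may need adjusting:

```shell
# Exit 0 if the log looks healthy: some line mentioning "events" appears
# after the "Rotating Beam" line. check_log is a hypothetical helper.
check_log() {
    awk '/Rotating Beam/ { seen = 1 }
         seen && /events/ { ok = 1 }
         END { exit (ok ? 0 : 1) }' "$1"
}

# Usage (log path from this thread):
# for f in /work/hallb/hps/mc_production/pass3/logs/slic/beam-tri/1pt05/*.out; do
#     check_log "$f" || echo "suspect: $f"
# done
```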
> On Thu, 5 Nov 2015, Bradley T Yale wrote:
>
>> First, I submitted a report about those otherwise successful jobs not being
>> written to tape, and it turned out to be a system glitch. It appears fixed
>> now and unrelated to the following,
>> which only affects ~15% of Pass3 beam-tri and tritrig-beam-tri files but no
>> other Pass3 MC components.
>>
>> The beam-tri files that were readout 10-to-1 have the same problem with an
>> inconsistent # of events, so it wasn't a problem with time/space allotment
>> for the jobs.
>> A few recon files with no time limit set for the jobs (100-to-1, labelled
>> 'NOTIMELIMIT') made it through before the tape-writing glitch as well, and
>> have the same problem.
>>
>> Digging a little further, it appears that this issue with readout event
>> inconsistency is likely related to the stdhep file-reading problem that
>> Jeremy found while fixing SLIC for v3-fieldmap, so I brought him into this.
>> Let me motivate that conclusion...
>>
>> About 85% of Pass3 beam-tri readout files look fine, and then:
>> cat
>> /work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_10to1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_*.txt
>> | grep "^Read "
>> ..........
>> Read 41911 events
>> Read 42775 events
>> Read 41551 events
>> Read 42055 events
>> Read 42556 events
>> Read 9 events
>> Read 7 events
>> Read 7 events
>> Read 3 events
>> Read 9 events
>> Read 10 events
>> Read 2 events
>> Read 13 events
>> Read 7 events
>> Read 41529 events
>> Read 8 events
>> Read 42149 events
>> Read 42141 events
>> Read 41933 events
>> Read 41856 events
>> Read 41711 events
>> Read 42038 events
>> Read 42004 events
>> Read 41997 events
>> Read 42029 events
>> Read 41764 events
>> Read 42156 events
>> Read 42245 events
>> Read 41732 events
>> Read 42060 events
>> Read 42070 events
>> Read 42060 events
>> Read 41962 events
>> Read 41967 events
>> Read 42071 events
>> Read 42067 events
>> Read 42017 events
>> Read 42046 events
>> Read 42614 events
>> Read 42655 events
>> Read 42337 events
>> Read 42342 events
>> Read 42503 events
>> Read 42454 events
>> Read 42237 events
>> Read 42338 events
>> Read 42607 events
>> Read 41791 events
>> Read 42309 events
>> Read 3 events
>> Read 4 events
>> Read 7 events
>> Read 7 events
>> Read 4 events
>> Read 6 events
>> Read 7 events
>> Read 7 events
>> Read 4 events
>> Read 41993 events
>>
>> The affected 10-to-1 readout files are #51-60 and #91-100, which were made
>> from SLIC files #501-600, and #901-1000.
>> For example:
>> cat
>> /work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_10to1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_96.txt
>> | grep "^Read "
>> /work/hallb/hps/mc_production/pass3/logs/readout/beam-tri/1pt05/egsv3-triv2-g4v1_10to1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_96.out
>>
>> Looking at the SLIC files that were used for readout (e.g. #951-960):
>> /work/hallb/hps/mc_production/pass3/logs/slic/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_952.out
>>
>> This shows that stdhep is not reading the events from 1 out of every 25
>> beam.stdhep files using the Pass3 setup.
>> The actual beam.stdhep files from this problem (#51-60 and #91-100 in
>> /mss/hallb/hps/production/stdhep/beam/1pt05/) look fine.
>>
>> Also, Pass3 tritrig-beam-tri, which are readout 1-to-1, have occasional
>> files which contain no events. This means that when the beam-tri files are
>> readout in larger quantities, these files without events shave off ~4000
>> events for each affected SLIC file used. This is probably why some of the
>> original 100-to-1 beam-tri files appear light on events, and are a lot
>> worse with 10-to-1.
>>
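The "~4000 events" figure above is just the healthy per-file count spread across the ten SLIC inputs of a 10-to-1 readout file:

```shell
# A healthy 10-to-1 readout file carries ~42000 events, so each SLIC input
# whose beam.stdhep went unread costs about a tenth of that.
awk 'BEGIN { printf "%d\n", 42000 / 10 }'   # prints 4200
```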
>> The corresponding Pass2 readout/recon, which used the same seed and files
>> as the problem ones, seem correct though:
>> cat
>> /work/hallb/hps/mc_production/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_s2d6_HPS-EngRun2015-Nominal-v1_3.4.0-20150710_singles1_9*.txt
>> | grep "^Read "
>> cat
>> /work/hallb/hps/mc_production/pass2/logs/readout/beam-tri/1pt05/egsv3-triv2-g4v1_s2d6_HPS-EngRun2015-Nominal-v3_3.4.0_singles1_*.out
>> | grep "events "
>>
>> In summary, this inconsistency at readout is due to beam.stdhep files
>> occasionally failing to be read during Pass3 SLIC jobs.
>> It only affects beam-tri made using the updated SLIC and v3-fieldmap
>> detector.
>> I'll make a Jira item about it.
>>
>> ________________________________________
>> From: Nathan Baltzell <[log in to unmask]>
>> Sent: Thursday, November 5, 2015 4:26 AM
>> To: Bradley T Yale
>> Cc: Sho Uemura; Omar Moreno; Matthew Solt; Mathew Thomas Graham
>> Subject: Re: beam-tri singles MC problems
>>
>> Probably should submit a CCPR on the failure to write to tape (including
>> an example failed jobid/url). I don't notice any related CCPRs in the
>> system,
>> and no corresponding errors in the farm_outs.
>>
>>
>> On Nov 5, 2015, at 9:00 AM, Bradley T Yale <[log in to unmask]> wrote:
>>
>>> Ok, I'll do those 10to1 as well to match everything else.
>>>
>>> By the way, the "failed" job status you see is because the trigger plots
>>> fail for some reason and so the entire job gets classified that way.
>>> All other output is fine though, and just can't be written to tape. That
>>> has never been an issue before, but I disabled the trigger plots for the
>>> latest batch just in case.
>>> It could just be something with the system. I'll see if it's resolved
>>> tomorrow.
>>>
>>> ________________________________________
>>> From: Sho Uemura <[log in to unmask]>
>>> Sent: Thursday, November 5, 2015 1:49 AM
>>> To: Bradley T Yale
>>> Cc: Omar Moreno; Matthew Solt; Mathew Thomas Graham; Nathan Baltzell
>>> Subject: Re: beam-tri singles MC problems
>>>
>>> pairs1 seems better - there are still quite a few files that run under,
>>> but maybe 75% have the right number (1 ms/file * 100 files * 20 kHz =
>>> 2000) of events.
>>>
>>> cat
>>> /work/hallb/hps/mc_production/pass3/data_quality/recon/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_pairs1_*.txt
>>> | grep "^Read "
>>> Read 111 events
>>> Read 1987 events
>>> Read 2014 events
>>> Read 2013 events
>>> Read 2094 events
>>> Read 2094 events
>>> Read 1989 events
>>> Read 2083 events
>>> Read 2070 events
>>> Read 1887 events
>>> Read 2007 events
>>> Read 1955 events
>>> Read 2037 events
>>> Read 2013 events
>>> Read 1991 events
>>> Read 1900 events
>>> Read 2002 events
>>> Read 1996 events
>>> Read 1835 events
>>> Read 85 events
>>> Read 1914 events
>>> Read 111 events
>>> Read 98 events
>>> Read 202 events
>>> Read 114 events
>>> Read 155 events
>>> Read 2007 events
>>> Read 59 events
>>> Read 1800 events
>>> Read 2052 events
>>>
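The expected count quoted above works out as stated: 1 ms of simulated livetime per input file, 100 files per job, and a 20 kHz trigger rate give 2000 events:

```shell
# 0.001 s/file * 100 files * 20000 Hz = 2000 expected events per recon file
awk 'BEGIN { printf "%.0f\n", 0.001 * 100 * 20000 }'   # prints 2000
```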
>>>
>>> On Thu, 5 Nov 2015, Bradley T Yale wrote:
>>>
>>>> Everything is failing to write to tape.
>>>>
>>>> Maybe this is also the cause of the badly cached dst files you were
>>>> seeing as well.
>>>>
>>>> I have no idea what is causing this. That's why I included Nathan in
>>>> this.
>>>>
>>>>
>>>> On a side note, are you seeing the same inconsistency in pairs1 beam-tri,
>>>> or just singles?
>>>>
>>>>
>>>> ________________________________
>>>> From: Bradley T Yale
>>>> Sent: Thursday, November 5, 2015 1:13 AM
>>>> To: Omar Moreno; Sho Uemura
>>>> Cc: Matthew Solt; Mathew Thomas Graham; Nathan Baltzell
>>>> Subject: Re: beam-tri singles MC problems
>>>>
>>>>
>>>> So, the 10to1 readout jobs successfully completed, but failed to write to
>>>> tape:
>>>>
>>>> http://scicomp.jlab.org/scicomp/#/jasmine/jobs?requested=details&id=115214062
>>>>
>>>>
>>>> I'm trying again after setting 'Memory space' back to "1024 MB", which is
>>>> what it had been before.
>>>>
>>>> Is there anything else that could be causing this?
>>>>
>>>>
>>>> ________________________________
>>>> From: Bradley T Yale
>>>> Sent: Wednesday, November 4, 2015 7:41 PM
>>>> To: Omar Moreno; Sho Uemura
>>>> Cc: Matthew Solt; Mathew Thomas Graham
>>>> Subject: Re: beam-tri singles MC problems
>>>>
>>>>
>>>> Sorry. The latest ones are being reconstructed now and labelled
>>>> 'NOTIMELIMIT'. They shouldn't take long once active. Their readout did
>>>> not have a time limit to try to fix the problem, but just in case, I'm
>>>> also reading out others 10-to-1 (labelled '10to1') and will probably
>>>> start doing it that way so readout doesn't take forever.
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: [log in to unmask] <[log in to unmask]> on behalf of Omar
>>>> Moreno <[log in to unmask]>
>>>> Sent: Wednesday, November 4, 2015 4:24 PM
>>>> To: Sho Uemura
>>>> Cc: Bradley T Yale; Omar Moreno; Matthew Solt; Mathew Thomas Graham
>>>> Subject: Re: beam-tri singles MC problems
>>>>
>>>> Any news on this? I'm transferring all of the beam-tri files over to
>>>> SLAC and I'm noticing that they are still all random sizes.
>>>>
>>>> On Fri, Oct 23, 2015 at 3:33 PM, Sho Uemura <[log in to unmask]> wrote:
>>>> Hi Brad,
>>>>
>>>> 1. readout files seem to be really random lengths:
>>>>
>>>> cat
>>>> /work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_*|grep
>>>> "^Read "|less
>>>>
>>>> Read 52 events
>>>> Read 16814 events
>>>> Read 17062 events
>>>> Read 12543 events
>>>> Read 328300 events
>>>> Read 355896 events
>>>> Read 12912 events
>>>> Read 309460 events
>>>> Read 306093 events
>>>> Read 313868 events
>>>> Read 325727 events
>>>> Read 298129 events
>>>> Read 417300 events
>>>> Read 423734 events
>>>> Read 308954 events
>>>> Read 365261 events
>>>> Read 301648 events
>>>> Read 316249 events
>>>> Read 340949 events
>>>> Read 319316 events
>>>> Read 424033 events
>>>> Read 308746 events
>>>> Read 317204 events
>>>> Read 12363 events
>>>> Read 355813 events
>>>> Read 329739 events
>>>> Read 298601 events
>>>> Read 29700 events
>>>> Read 12675 events
>>>> Read 287237 events
>>>> Read 311071 events
>>>> Read 12406 events
>>>> Read 12719 events
>>>> Read 30428 events
>>>> Read 324795 events
>>>> Read 345850 events
>>>> Read 25765 events
>>>> Read 29806 events
>>>> Read 77 events
>>>> Read 12544 events
>>>> Read 372642 events
>>>> Read 12779 events
>>>>
>>>> which makes it seem like jobs are failing randomly or something - I think
>>>> normally we see most files have the same length, and a minority of files
>>>> (missing some input files, or whatever) are shorter. In this case I think
>>>> the expected number of events (number of triggers from 100 SLIC output
>>>> files) is roughly 420k, and as you can see only a few files get there.
>>>>
>>>> I looked at log files and I don't see any obvious error messages, but
>>>> maybe you have ideas? I'll keep digging.
>>>>
>>>> 2. Looks like the singles recon jobs are running into the job disk space
>>>> limit, so that while readout files can have as many as 420k events, recon
>>>> files never have more than 240k. Looks like the disk limit is set to 5 GB
>>>> (and a 240k-event LCIO recon file is 5.5 GB), but it needs to be at least
>>>> doubled - or the number of SLIC files per readout job needs to be
>>>> reduced?
>>>>
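Projecting from the numbers above (240k events ≈ 5.5 GB), a full 420k-event recon file would indeed blow well past the 5 GB job disk limit:

```shell
# 5.5 GB / 240000 events ~ 24 kB/event; scale to a full 420000-event file.
awk 'BEGIN { printf "%.1f GB\n", 5.5 / 240000 * 420000 }'   # prints 9.6 GB
```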
>>>> cat
>>>> /work/hallb/hps/mc_production/pass3/data_quality/recon/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_*|grep
>>>> "^Read "|less
>>>> Read 1 events
>>>> Read 16814 events
>>>> Read 17062 events
>>>> Read 242359 events
>>>> Read 243949 events
>>>> Read 242153 events
>>>> Read 12776 events
>>>> Read 242666 events
>>>> Read 244165 events
>>>> Read 243592 events
>>>> Read 243433 events
>>>> Read 242878 events
>>>> Read 241861 events
>>>> Read 242055 events
>>>> Read 30428 events
>>>> Read 243156 events
>>>> Read 241638 events
>>>> Read 4 events
>>>> Read 241882 events
>>>>
>>>> From
>>>> /work/hallb/hps/mc_production/pass3/logs/recon/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_22.err:
>>>>
>>>> java.lang.RuntimeException: Error writing LCIO file
>>>>     at org.lcsim.util.loop.LCIODriver.process(LCIODriver.java:116)
>>>>     at org.lcsim.util.Driver.doProcess(Driver.java:261)
>>>>     at org.lcsim.util.Driver.processChildren(Driver.java:271)
>>>>     at org.lcsim.util.Driver.process(Driver.java:187)
>>>>     at org.lcsim.util.DriverAdapter.recordSupplied(DriverAdapter.java:74)
>>>>     at org.freehep.record.loop.DefaultRecordLoop.consumeRecord(DefaultRecordLoop.java:832)
>>>>     at org.freehep.record.loop.DefaultRecordLoop.loop(DefaultRecordLoop.java:668)
>>>>     at org.freehep.record.loop.DefaultRecordLoop.execute(DefaultRecordLoop.java:566)
>>>>     at org.lcsim.util.loop.LCSimLoop.loop(LCSimLoop.java:151)
>>>>     at org.lcsim.job.JobControlManager.run(JobControlManager.java:431)
>>>>     at org.hps.job.JobManager.run(JobManager.java:71)
>>>>     at org.lcsim.job.JobControlManager.run(JobControlManager.java:189)
>>>>     at org.hps.job.JobManager.main(JobManager.java:26)
>>>> Caused by: java.io.IOException: File too large
>>>>     at java.io.FileOutputStream.writeBytes(Native Method)
>>>>     at java.io.FileOutputStream.write(FileOutputStream.java:345)
>>>>     at hep.io.xdr.XDROutputStream$CountedOutputStream.write(XDROutputStream.java:103)
>>>>     at java.io.DataOutputStream.write(DataOutputStream.java:107)
>>>>     at hep.io.sio.SIOWriter$SIOByteArrayOutputStream.writeTo(SIOWriter.java:286)
>>>>     at hep.io.sio.SIOWriter.flushRecord(SIOWriter.java:208)
>>>>     at hep.io.sio.SIOWriter.createRecord(SIOWriter.java:83)
>>>>     at org.lcsim.lcio.LCIOWriter.write(LCIOWriter.java:251)
>>>>     at org.lcsim.util.loop.LCIODriver.process(LCIODriver.java:114)
>>>>     ... 12 more
>>>>
>>>>
>>>> Thanks. No rush on these, I imagine that even if the problems were fixed
>>>> before/during the collaboration meeting we would not have time to use the
>>>> files.
>>>>
>>>>

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the HPS-SOFTWARE list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=HPS-SOFTWARE&A=1
