Hi all,

The ECal scoring plane needs to be working. It's a good check on the track projection to the ECal face (which was not working in Pass3), and it's the only way to verify certain cluster property corrections in the ECal.

-Holly

On Fri, Nov 6, 2015 at 3:27 PM, Sho Uemura <[log in to unmask]> wrote:
Looks good to me.

For people who care: this has nothing to do with SLIC, since the error was coming from one of the standalone stdhep utilities (/u/group/hps/hps_soft/stdhep/bin/beam_coords) that reads in a stdhep file from tape. But neither the utility nor the input stdhep files have changed since pass2, and this error did not happen in pass2. My theory is that the cache copy of the stdhep file was corrupt.

Anyway, I think this problem is fixed now. The affected beam-tri and tritrig-beam-tri must be rerun, but we should decide if the ECal scoring plane fix merits rerunning all of the pass3 MC.


On Fri, 6 Nov 2015, Bradley T Yale wrote:

After re-running the problem files in quarantine, it looks like those same stdhep files are now being read successfully:
/work/hallb/hps/mc_production/pass3/test/logs/slic/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_952.out

Maybe the latest SLIC update worked. I can redo the (tritrig-)beam-tri and see if it's fixed.
As mentioned, I don't see this problem in the other MC components, only beam-tri.
If you REALLY want to be safe, I can re-run everything, but I would at least like to do it with a post-release jar (3.4.2-SNAPSHOT or 3.4.2-20151014.013425-5) so we can test the current code.

________________________________________
From: Sho Uemura <[log in to unmask]>
Sent: Thursday, November 5, 2015 9:39 PM
To: Bradley T Yale
Subject: Re: beam-tri singles MC problems

I tried running beam_coords on the farm
(/work/hallb/hps/uemura/bradtest/beam-tri_100.xml, logfiles and output in
same directory) and it works fine.

I looked at beam-tri logs for pass2 for the same files, and they are fine.

So this stuff worked in pass2, broke in pass3, works again now, but
nothing has changed - same stdhep file, and the beam_coords binary hasn't
changed.

Can you try rerunning the SLIC beam-tri job? It could be something weird
like jcache screwing up and not copying the file correctly from tape -
that would affect every job in that run, but not runs before or after.

On Thu, 5 Nov 2015, Sho Uemura wrote:

It looks like the problem is that beam_coords is having trouble reading the
beam.stdhep file and crashes, and so the beam-tri.stdhep file that goes into
SLIC is missing all the beam background, and the trigger rate ends up being
ridiculously low. Of course this affects every SLIC run that uses that
beam.stdhep file.

I get that from looking at
/work/hallb/hps/mc_production/pass3/logs/slic/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_952.err
and
/work/hallb/hps/mc_production/pass3/logs/slic/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_952.out
and comparing to other runs - you'll see that the .out file is missing some
printouts after "Rotating Beam" and rot_beam.stdhep is missing from the file
list. For example, one of the first things beam_coords should print is the
number of events in the input file.
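
If it helps, here is a minimal sketch of a log scan for that failure mode (Python; the marker strings "Rotating Beam" and "rot_beam.stdhep" come from the description above, while the check for an "events" printout after "Rotating Beam" is an assumption about what a healthy beam_coords log contains - adjust to the real printouts):

# Sketch: flag SLIC .out logs where the beam_coords step apparently died.
# The markers below are assumptions based on the description above, not a
# verified parse of the real log format.
import glob
import sys

pattern = sys.argv[1] if len(sys.argv) > 1 else \
    "/work/hallb/hps/mc_production/pass3/logs/slic/beam-tri/1pt05/*.out"

for path in sorted(glob.glob(pattern)):
    with open(path, errors="replace") as f:
        text = f.read()
    rotated = "Rotating Beam" in text
    # anything after "Rotating Beam" that mentions an event count
    tail = text.split("Rotating Beam", 1)[-1] if rotated else ""
    has_count = "events" in tail
    has_rot_file = "rot_beam.stdhep" in text
    if not (rotated and has_count and has_rot_file):
        print("suspect:", path)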

So there must be something wrong with that stdhep file, but it has nothing to
do with SLIC. Is it possible that this has always been happening, in pass2
and earlier? I'll look at log files.

Weirdly, I have no difficulty running beam_coords on egsv3_10.stdhep on ifarm.
Maybe there's something different about the batch farm environment?

The bad news is that this must affect every MC that has beam background or
beam-tri mixed in.

On Thu, 5 Nov 2015, Bradley T Yale wrote:

First, I submitted a report about those otherwise successful jobs not being
written to tape, and it turned out to be a system glitch. It appears to be
fixed now, and it is unrelated to the following issue,
which only affects ~15% of Pass3 beam-tri and tritrig-beam-tri files and no
other Pass3 MC components.

The beam-tri files that were read out 10-to-1 have the same problem with an
inconsistent # of events, so it wasn't a problem with time/space allotment
for the jobs.
A few recon files with no time limit set for the jobs (100-to-1, labelled
'NOTIMELIMIT') made it through before the tape-writing glitch as well, and
they have the same problem.

Digging a little further, it appears that this issue with readout event
inconsistency is likely related to the stdhep file-reading problem that
Jeremy found while fixing SLIC for v3-fieldmap, so I brought him into this.
Let me motivate that conclusion...

About 85% of the Pass3 beam-tri readout files look fine, and then (a quick way to flag the bad ones automatically is sketched after the listing below):
cat /work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_10to1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_*.txt | grep "^Read "
..........
Read 41911 events
Read 42775 events
Read 41551 events
Read 42055 events
Read 42556 events
Read 9 events
Read 7 events
Read 7 events
Read 3 events
Read 9 events
Read 10 events
Read 2 events
Read 13 events
Read 7 events
Read 41529 events
Read 8 events
Read 42149 events
Read 42141 events
Read 41933 events
Read 41856 events
Read 41711 events
Read 42038 events
Read 42004 events
Read 41997 events
Read 42029 events
Read 41764 events
Read 42156 events
Read 42245 events
Read 41732 events
Read 42060 events
Read 42070 events
Read 42060 events
Read 41962 events
Read 41967 events
Read 42071 events
Read 42067 events
Read 42017 events
Read 42046 events
Read 42614 events
Read 42655 events
Read 42337 events
Read 42342 events
Read 42503 events
Read 42454 events
Read 42237 events
Read 42338 events
Read 42607 events
Read 41791 events
Read 42309 events
Read 3 events
Read 4 events
Read 7 events
Read 7 events
Read 4 events
Read 6 events
Read 7 events
Read 7 events
Read 4 events
Read 41993 events
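
(As promised above, a minimal sketch of how the short files could be flagged automatically - it assumes each data_quality .txt file contains a single "Read N events" line like the ones listed, and the glob pattern and the 10%-of-median cutoff are purely illustrative:)

# Sketch: flag data_quality files whose "Read N events" count is far below
# the batch median. Pattern and cutoff are illustrative only.
import glob
import re
import sys

pattern = sys.argv[1] if len(sys.argv) > 1 else \
    "/work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/*_10to1_*_singles1_*.txt"

counts = {}
for path in sorted(glob.glob(pattern)):
    with open(path, errors="replace") as f:
        m = re.search(r"^Read (\d+) events", f.read(), re.M)
    if m:
        counts[path] = int(m.group(1))

if counts:
    median = sorted(counts.values())[len(counts) // 2]
    for path, n in sorted(counts.items()):
        if n < 0.1 * median:
            print("low: %8d events  %s" % (n, path))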

The affected 10-to-1 readout files are #51-60 and #91-100, which were made
from SLIC files #501-600 and #901-1000.
For example:
cat /work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_10to1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_96.txt | grep "^Read "
/work/hallb/hps/mc_production/pass3/logs/readout/beam-tri/1pt05/egsv3-triv2-g4v1_10to1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_96.out
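
(For reference, the readout-to-SLIC file mapping assumed here is just the obvious 10-to-1 blocking, e.g. readout #96 comes from SLIC #951-960:)

# Assumed 10-to-1 blocking: readout file N is built from SLIC files
# (N-1)*10 + 1 through N*10.
def slic_range(readout_index, files_per_readout=10):
    first = (readout_index - 1) * files_per_readout + 1
    return first, readout_index * files_per_readout

print(slic_range(96))   # (951, 960)
print(slic_range(51))   # (501, 510)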

Looking at the SLIC files that were used for readout (e.g. #951-960):
/work/hallb/hps/mc_production/pass3/logs/slic/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_952.out

This shows that, with the Pass3 setup, the stdhep reader fails to read the
events from 1 out of every 25 beam.stdhep files.
The actual beam.stdhep files involved in this problem (#51-60 and #91-100 in
/mss/hallb/hps/production/stdhep/beam/1pt05/) look fine.

Also, the Pass3 tritrig-beam-tri files, which are read out 1-to-1, occasionally
contain no events. This means that when the beam-tri files are read out in
larger batches, these empty files shave off ~4000 events for each affected
SLIC file used. This is probably why some of the original 100-to-1 beam-tri
files appear light on events, and why it is a lot worse with 10-to-1.
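
(For what it's worth, that ~4000 figure is consistent with the healthy 10-to-1 files listed above, which read roughly 42k events from 10 SLIC files each:)

# ~42k events per healthy 10-to-1 readout file, 10 SLIC files per readout file
print(42000 / 10)   # ~4200 events contributed by each SLIC file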

The corresponding Pass2 readout/recon, which used the same seed and files
as the problem ones, looks correct though:
cat /work/hallb/hps/mc_production/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_s2d6_HPS-EngRun2015-Nominal-v1_3.4.0-20150710_singles1_9*.txt | grep "^Read "
cat /work/hallb/hps/mc_production/pass2/logs/readout/beam-tri/1pt05/egsv3-triv2-g4v1_s2d6_HPS-EngRun2015-Nominal-v3_3.4.0_singles1_*.out | grep "events "

In summary, this inconsistency at readout is due to beam.stdhep files
occasionally failing to be read during Pass3 SLIC jobs.
It only affects beam-tri made with the updated SLIC and the v3-fieldmap
detector.
I'll make a Jira item about it.

________________________________________
From: Nathan Baltzell <[log in to unmask]>
Sent: Thursday, November 5, 2015 4:26 AM
To: Bradley T Yale
Cc: Sho Uemura; Omar Moreno; Matthew Solt; Mathew Thomas Graham
Subject: Re: beam-tri singles MC problems

You should probably submit a CCPR on the failure to write to tape (including
an example failed job id/URL). I don't see any related CCPRs in the system,
and no corresponding errors in the farm_outs.


On Nov 5, 2015, at 9:00 AM, Bradley T Yale <[log in to unmask]> wrote:

Ok, I'll do those 10to1 as well to match everything else.

By the way, the "failed" job status you see is because the trigger plots
fail for some reason and so the entire job gets classified that way.
All of the other output is fine, though; it just can't be written to tape.
That has never been an issue before, but I disabled the trigger plots for the
latest batch just in case.
It could just be something with the system. I'll see if it's resolved
tomorrow.

________________________________________
From: Sho Uemura <[log in to unmask]>
Sent: Thursday, November 5, 2015 1:49 AM
To: Bradley T Yale
Cc: Omar Moreno; Matthew Solt; Mathew Thomas Graham; Nathan Baltzell
Subject: Re: beam-tri singles MC problems

pairs1 seems better - there are still quite a few files that run short, but
maybe 75% have the right number of events
(1 ms/file * 100 files * 20 kHz = 2000).
cat /work/hallb/hps/mc_production/pass3/data_quality/recon/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_pairs1_*.txt | grep "^Read "
Read 111 events
Read 1987 events
Read 2014 events
Read 2013 events
Read 2094 events
Read 2094 events
Read 1989 events
Read 2083 events
Read 2070 events
Read 1887 events
Read 2007 events
Read 1955 events
Read 2037 events
Read 2013 events
Read 1991 events
Read 1900 events
Read 2002 events
Read 1996 events
Read 1835 events
Read 85 events
Read 1914 events
Read 111 events
Read 98 events
Read 202 events
Read 114 events
Read 155 events
Read 2007 events
Read 59 events
Read 1800 events
Read 2052 events


On Thu, 5 Nov 2015, Bradley T Yale wrote:

Everything is failing to write to tape.

Maybe this is also the cause of the badly cached dst files you were seeing.

I have no idea what is causing this. That's why I included Nathan in
this.


On a side note, are you seeing the same inconsistency in pairs1 beam-tri,
or just singles?


________________________________
From: Bradley T Yale
Sent: Thursday, November 5, 2015 1:13 AM
To: Omar Moreno; Sho Uemura
Cc: Matthew Solt; Mathew Thomas Graham; Nathan Baltzell
Subject: Re: beam-tri singles MC problems


So, the 10to1 readout jobs successfully completed, but failed to write to
tape:

http://scicomp.jlab.org/scicomp/#/jasmine/jobs?requested=details&id=115214062


I'm trying again after setting 'Memory space' back to "1024 MB", which is
what it had been before.

Is there anything else that could be causing this?


________________________________
From: Bradley T Yale
Sent: Wednesday, November 4, 2015 7:41 PM
To: Omar Moreno; Sho Uemura
Cc: Matthew Solt; Mathew Thomas Graham
Subject: Re: beam-tri singles MC problems


Sorry. The latest ones are being reconstructed now and labelled
'NOTIMELIMIT'. They shouldn't take long once active. Their readout was run
with no time limit in order to try to fix the problem, but just in case, I'm
also reading out others 10-to-1 (labelled '10to1'), and I will probably
start doing it that way so readout doesn't take forever.



________________________________
From: [log in to unmask] <[log in to unmask]> on behalf of Omar
Moreno <[log in to unmask]>
Sent: Wednesday, November 4, 2015 4:24 PM
To: Sho Uemura
Cc: Bradley T Yale; Omar Moreno; Matthew Solt; Mathew Thomas Graham
Subject: Re: beam-tri singles MC problems

Any news on this?  I'm transferring all of the beam-tri files over to
SLAC and I'm noticing that they are still all random sizes.

On Fri, Oct 23, 2015 at 3:33 PM, Sho Uemura
<[log in to unmask]<mailto:[log in to unmask]>> wrote:
Hi Brad,

1. The readout files seem to have really random lengths:

cat /work/hallb/hps/mc_production/pass3/data_quality/readout/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_* | grep "^Read " | less

Read 52 events
Read 16814 events
Read 17062 events
Read 12543 events
Read 328300 events
Read 355896 events
Read 12912 events
Read 309460 events
Read 306093 events
Read 313868 events
Read 325727 events
Read 298129 events
Read 417300 events
Read 423734 events
Read 308954 events
Read 365261 events
Read 301648 events
Read 316249 events
Read 340949 events
Read 319316 events
Read 424033 events
Read 308746 events
Read 317204 events
Read 12363 events
Read 355813 events
Read 329739 events
Read 298601 events
Read 29700 events
Read 12675 events
Read 287237 events
Read 311071 events
Read 12406 events
Read 12719 events
Read 30428 events
Read 324795 events
Read 345850 events
Read 25765 events
Read 29806 events
Read 77 events
Read 12544 events
Read 372642 events
Read 12779 events

which makes it seem like jobs are failing randomly or something - I think
normally we see most files have the same length, and a minority of files
(missing some input files, or whatever) are shorter. In this case I think
the expected number of events (number of triggers from 100 SLIC output
files) is roughly 420k, and as you can see only a few files get there.

I looked at log files and I don't see any obvious error messages, but
maybe you have ideas? I'll keep digging.

2. It looks like the singles recon jobs are running into the job disk space
limit: while readout files can have as many as 420k events, recon files
never have more than 240k. The disk limit seems to be set to 5 GB (and a
240k-event LCIO recon file is 5.5 GB), but it needs to be at least doubled -
or the number of SLIC files per readout job needs to be reduced?
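
(Rough numbers behind that, using only the figures above - the per-event size is just the observed 5.5 GB divided by 240k events:)

# Back-of-the-envelope disk estimate for a full-size singles1 recon file.
bytes_per_event = 5.5e9 / 240e3    # ~23 kB/event from the observed 5.5 GB file
max_events      = 420e3            # largest readout files seen above
need_gb         = max_events * bytes_per_event / 1e9
print("%.1f GB" % need_gb)         # ~9.6 GB, so the 5 GB limit is roughly half of what is needed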

cat /work/hallb/hps/mc_production/pass3/data_quality/recon/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_* | grep "^Read " | less
Read 1 events
Read 16814 events
Read 17062 events
Read 242359 events
Read 243949 events
Read 242153 events
Read 12776 events
Read 242666 events
Read 244165 events
Read 243592 events
Read 243433 events
Read 242878 events
Read 241861 events
Read 242055 events
Read 30428 events
Read 243156 events
Read 241638 events
Read 4 events
Read 241882 events

From
/work/hallb/hps/mc_production/pass3/logs/recon/beam-tri/1pt05/egsv3-triv2-g4v1_HPS-EngRun2015-Nominal-v3-fieldmap_3.4.1_singles1_22.err:

java.lang.RuntimeException: Error writing LCIO file
      at org.lcsim.util.loop.LCIODriver.process(LCIODriver.java:116)
      at org.lcsim.util.Driver.doProcess(Driver.java:261)
      at org.lcsim.util.Driver.processChildren(Driver.java:271)
      at org.lcsim.util.Driver.process(Driver.java:187)
      at org.lcsim.util.DriverAdapter.recordSupplied(DriverAdapter.java:74)
      at org.freehep.record.loop.DefaultRecordLoop.consumeRecord(DefaultRecordLoop.java:832)
      at org.freehep.record.loop.DefaultRecordLoop.loop(DefaultRecordLoop.java:668)
      at org.freehep.record.loop.DefaultRecordLoop.execute(DefaultRecordLoop.java:566)
      at org.lcsim.util.loop.LCSimLoop.loop(LCSimLoop.java:151)
      at org.lcsim.job.JobControlManager.run(JobControlManager.java:431)
      at org.hps.job.JobManager.run(JobManager.java:71)
      at org.lcsim.job.JobControlManager.run(JobControlManager.java:189)
      at org.hps.job.JobManager.main(JobManager.java:26)
Caused by: java.io.IOException: File too large
      at java.io.FileOutputStream.writeBytes(Native Method)
      at java.io.FileOutputStream.write(FileOutputStream.java:345)
      at hep.io.xdr.XDROutputStream$CountedOutputStream.write(XDROutputStream.java:103)
      at java.io.DataOutputStream.write(DataOutputStream.java:107)
      at hep.io.sio.SIOWriter$SIOByteArrayOutputStream.writeTo(SIOWriter.java:286)
      at hep.io.sio.SIOWriter.flushRecord(SIOWriter.java:208)
      at hep.io.sio.SIOWriter.createRecord(SIOWriter.java:83)
      at org.lcsim.lcio.LCIOWriter.write(LCIOWriter.java:251)
      at org.lcsim.util.loop.LCIODriver.process(LCIODriver.java:114)
      ... 12 more


Thanks. No rush on these; I imagine that even if the problems were fixed
before or during the collaboration meeting, we would not have time to use the
files.


