Omar,

Removing noisy channels saves us only about 10%, doesn't it?
So, if that is the case, then noisy channels are not the main
problem behind the huge event size.

Thanks, Stepan

On 3/6/13 11:47 PM, Omar Moreno wrote:
> I don't think the issue is with the LCIO overhead; I think we are just 
> storing a lot of collections we can do without.  For example, at the 
> moment the final reconstructed LCIO file contains a collection of all 
> raw tracker hits, a collection of all fitted raw tracker hits, and a 
> collection containing all of the fit parameters from the fit to the 
> six samples of each of the hits.  This is a lot of extra information 
> that we probably don't want to save in the final recon LCIO.
>
> I took a quick look at the contents of the reconstructed LCIO file and 
> found that the noisy channels weren't being filtered, which 
> explains the huge file size.  I went ahead and reran the 
> reconstruction, but this time filtering noisy channels, and the file 
> size dropped to 5 GB.  It's still a bit large, but I'm sure we can drop 
> the size down quite a bit once we remove all these 
> extra unneeded collections.
>
> --Omar Moreno
>
>
> On Wed, Mar 6, 2013 at 7:54 PM, Stepan Stepanyan <[log in to unmask] 
> <mailto:[log in to unmask]>> wrote:
>
>     Hi Omar,
>
>     Thanks for a quick response. It will be very important to really know
>     what the final size of the reconstructed event is. The number you have
>     is x10 larger than the original event size. In the proposal we have x5
>     inflation of the event size after reconstruction. At the meeting today
>     Matt explained that the number in the proposal was not well motivated,
>     but what you have seems like a good motivation. Even with removal of
>     the FPGA data I am not sure the size will go down by x10, or by the
>     x50 that we probably want. Is this large size due to the overhead of
>     the LCIO format?
>
>     We are having these discussions about formats and analysis, and I
>     think event size will play an important role in them. I do not think
>     analysis of the data that HPS will get can be done on events that
>     are x10 or even x5 larger than the original events.
>
>     Regards, Stepan
>
>
>     On 3/6/13 10:18 PM, Omar Moreno wrote:
>>     Stepan,
>>
>>     The original EVIO file is 1.5 GB, but I only ran reconstruction on
>>     half the file.  There is a lot of extra information being
>>     stored in the final reconstructed LCIO file, such as the FPGA data,
>>     that should be removed, so I'm sure the file size is a bit
>>     inflated.  Once we filter out junk events and remove
>>     some unnecessary collections, the file size should decrease
>>     significantly.
>>
>>     --Omar Moreno
>>
>>
>>     On Wed, Mar 6, 2013 at 7:05 PM, Stepan Stepanyan
>>     <[log in to unmask] <mailto:[log in to unmask]>> wrote:
>>
>>         Omar,
>>
>>         How big is the original file, before reconstruction?
>>
>>         Thanks, Stepan
>>
>>
>>         On 3/6/13 9:03 PM, Omar Moreno wrote:
>>>         Hello Everyone,
>>>
>>>         Just to give everyone an idea, a micro DST with basic track
>>>         information, hit information, and Ecal cluster info is
>>>         approx. 29 MB per 500,000 test run events.  The reconstructed
>>>         LCIO file used to generate the ROOT file was approx. 5.4 GB,
>>>         and it took about 4 minutes.  I expect the size to
>>>         increase for data from an electron run, but it shouldn't be
>>>         by much.  I'll go ahead and study this using MC data and see
>>>         how much bigger the file gets.
>>>
>>>         --Omar Moreno
>>>
>>>         On Wed, Mar 6, 2013 at 4:29 PM, Nelson, Timothy Knight
>>>         <[log in to unmask]
>>>         <mailto:[log in to unmask]>> wrote:
>>>
>>>             Hi Stepan,
>>>
>>>             I agree 100%.  I think we want exactly what you proposed
>>>             a year ago: a format with physics objects suitable for
>>>             physics analysis (the proposed "micro-DST").  This kind
>>>             of thing is relatively easy to provide and will be a
>>>             very useful thing to have.  In fact, the kind of "flat
>>>             ntuple" format that Omar began with can, I believe, be
>>>             read in and operated on with PAW, since the .rz format
>>>             is the same.  However, if he goes the next step as has
>>>             been recommended in the software group, and writes
>>>             classes to the ROOT file that require a dictionary to
>>>             read back, the data format will be ROOT only.
>>>
>>>             A couple of points that are important to understand...
>>>
>>>             1) Homer brings up an important point, which is the fact
>>>             that the only way we have to write these ROOT files is
>>>             to use the LCIO C++ API.  That is to say, one does the
>>>             java reconstruction in lcsim that creates LCIO objects
>>>             and writes out an LCIO file.  Then one runs a separate
>>>             C++ program that reads in the LCIO objects with the LCIO
>>>             C++ API and outputs this NTuple using ROOT classes.
>>>             Therefore, no information that is currently not
>>>             persisted in the LCIO EDM by our reconstruction will
>>>             ever be available in the ROOT Ntuple.  So, this business
>>>             of writing out text files for vertexing and other
>>>             information not currently being written to LCIO does not
>>>             go away by creating ROOT Ntuples.  The only way to
>>>             eliminate that issue is to improve the completeness of
>>>             our LCIO-based EDM.  For example, Matt has been writing
>>>             out vertexing information to text files and reading it
>>>             back into ROOT.  However, LCIO DOES include vertex
>>>             objects and if we created these during reconstruction,
>>>             we would get that information in the LCIO file
>>>             automatically, and it would then easily be accessible
>>>             later on via LCIO.  There are a few examples of data
>>>             types we might want to persist that don't have an LCIO
>>>             class, but LCIO includes a "Generic Object" class that
>>>             can be used to encapsulate anything we might want to
>>>             add.  Again, only by getting the data we want in LCIO
>>>             will it ever be accessible in ROOT.  So, in my opinion,
>>>             this is where we should be focusing our attention.
>>>
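>>>             For illustration, a minimal sketch of that separate C++
>>>             step (read LCIO with the C++ API, write a flat ROOT
>>>             ntuple); the file and collection names here are only
>>>             placeholders, not the actual names in our reconstruction:
>>>
>>>                 // Sketch: read an LCIO file with the LCIO C++ API
>>>                 // and write a flat ROOT ntuple of track parameters.
>>>                 #include "lcio.h"
>>>                 #include "IO/LCReader.h"
>>>                 #include "EVENT/LCEvent.h"
>>>                 #include "EVENT/LCCollection.h"
>>>                 #include "EVENT/Track.h"
>>>                 #include "TFile.h"
>>>                 #include "TNtuple.h"
>>>                 using namespace lcio;
>>>
>>>                 int main() {
>>>                   LCReader* reader =
>>>                       LCFactory::getInstance()->createLCReader();
>>>                   reader->open("recon.slcio");   // placeholder name
>>>
>>>                   TFile out("tracks.root", "RECREATE");
>>>                   TNtuple nt("tracks", "track parameters",
>>>                              "d0:z0:phi:omega:tanLambda");
>>>
>>>                   LCEvent* evt = 0;
>>>                   while ((evt = reader->readNextEvent()) != 0) {
>>>                     // "MatchedTracks" is a placeholder collection
>>>                     // name; getCollection() throws if it is missing.
>>>                     LCCollection* col =
>>>                         evt->getCollection("MatchedTracks");
>>>                     for (int i = 0; i < col->getNumberOfElements(); ++i) {
>>>                       Track* trk =
>>>                           dynamic_cast<Track*>(col->getElementAt(i));
>>>                       nt.Fill(trk->getD0(), trk->getZ0(), trk->getPhi(),
>>>                               trk->getOmega(), trk->getTanLambda());
>>>                     }
>>>                   }
>>>                   out.Write();
>>>                   reader->close();
>>>                   return 0;
>>>                 }
>>>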
>>>             2) As far as how to do ROOT-based analysis, Homer again
>>>             touched on the heart of the matter.  One can create a
>>>             ROOT Ntuple and perform analysis on that.  In practice,
>>>             this rarely means using ROOT on the command line, or
>>>             even CINT macros since ROOT's C interpreter is so badly
>>>             broken that it is not really usable for anything other
>>>             than making final plots from already-analyzed data.  In
>>>             practice, one usually runs some standalone compiled C++
>>>             that uses the ROOT libraries to do the analysis on a
>>>             ROOT DST.  For this reason, it is just as easy to have
>>>             that compiled C++ use the LCIO C++ API to access the
>>>             LCIO objects directly from the LCIO DST, and then use
>>>             all of the familiar ROOT tools in that code to do the
>>>             analysis, writing out whatever final histograms or
>>>             post-analysis ntuples one might want into a ROOT file
>>>             for later plotting.  The only difference is that in the
>>>             former scenario, one learns the ROOT EDM that we invent
>>>             for the DST, and for the latter, one learns the LCIO
>>>             EDM.  To the extent that one is a mirror reflection of
>>>             the other, one has to do just as much work writing the
>>>             C++ analysis code either way.  That is why it doesn't
>>>             make any sense to duplicate the entire LCIO EDM in ROOT
>>>             (one file for the price of two!) and why we should
>>>             really only be considering creation of a new ROOT-based
>>>             "micro-DST" format aimed at physics analysis that will
>>>             be much slimmer than the LCIO.  Those that need more
>>>             than is in the "micro-DST" can very easily run their
>>>             C++/ROOT analysis code accessing the data directly from
>>>             LCIO using the LCIO C++ API.
>>>
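>>>             For comparison, a sketch of that direct approach: the
>>>             same reader loop as in the sketch above, but filling a
>>>             ROOT histogram straight from the LCIO DST, with no
>>>             intermediate ntuple. Again, the collection name is only
>>>             a placeholder:
>>>
>>>                 // Sketch: analysis directly on the LCIO DST with the
>>>                 // LCIO C++ API, writing a ROOT histogram at the end.
>>>                 #include "lcio.h"
>>>                 #include "IO/LCReader.h"
>>>                 #include "EVENT/LCEvent.h"
>>>                 #include "EVENT/LCCollection.h"
>>>                 #include "EVENT/Cluster.h"
>>>                 #include "TFile.h"
>>>                 #include "TH1D.h"
>>>                 using namespace lcio;
>>>
>>>                 int main() {
>>>                   LCReader* reader =
>>>                       LCFactory::getInstance()->createLCReader();
>>>                   reader->open("recon.slcio");       // placeholder
>>>                   TH1D hE("hE", "Ecal cluster energy;E [GeV]",
>>>                           100, 0., 3.);
>>>
>>>                   LCEvent* evt = 0;
>>>                   while ((evt = reader->readNextEvent()) != 0) {
>>>                     // "EcalClusters" is a placeholder collection name
>>>                     LCCollection* col =
>>>                         evt->getCollection("EcalClusters");
>>>                     for (int i = 0; i < col->getNumberOfElements(); ++i) {
>>>                       Cluster* c =
>>>                           dynamic_cast<Cluster*>(col->getElementAt(i));
>>>                       hE.Fill(c->getEnergy());
>>>                     }
>>>                   }
>>>                   reader->close();
>>>
>>>                   TFile out("analysis.root", "RECREATE");
>>>                   hE.Write();
>>>                   out.Close();
>>>                   return 0;
>>>                 }
>>>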
>>>             Cheers,
>>>             Tim
>>>
>>>             On Mar 6, 2013, at 3:49 PM, Stepan Stepanyan
>>>             <[log in to unmask] <mailto:[log in to unmask]>> wrote:
>>>
>>>             > Hello Homer and Jeremy,
>>>             >
>>>             > It seems we all have the right ideas, and very similar
>>>             > ideas, on how the analysis of the data must be done.
>>>             > The confusion, it seems to me, comes from the definitions
>>>             > of "analysis" and "DSTs". When, about a year ago, I
>>>             > brought up the question of DSTs, and even sent out a
>>>             > possible format (attached document), I basically wanted
>>>             > what Jeremy said in the second sentence after (3):
>>>             > physics objects only. What Omar showed today was very
>>>             > different from what I would describe as DSTs. I
>>>             > understand Matt's point that in some cases you will need
>>>             > fine details, but I am not sure everyone will need that
>>>             > level of detail. So I still think that if we are talking
>>>             > about DSTs, the format should be "physics objects only".
>>>             > And if Omar can make use of what I proposed a year ago,
>>>             > that will be great.
>>>             >
>>>             > As for general analysis, if we stick with (1), then we
>>>             > will make a large number of collaborators who are used to
>>>             > doing analysis in ROOT quite unhappy. I understand that
>>>             > duplicating processed data in many formats is also not a
>>>             > reasonable approach. So, if (2) means (sorry for my
>>>             > ignorance) that we can have some kind of "portal" that
>>>             > connects the LCIO recon file to ROOT, then it is probably
>>>             > the best way to go.
>>>             >
>>>             > Again, sorry if I am misinterpreting the issue and/or
>>>             > repeating what was already clear from your emails.
>>>             >
>>>             > Regards, Stepan
>>>             >
>>>             > On 3/6/13 6:10 PM, McCormick, Jeremy I. wrote:
>>>             >> Hi, Homer.
>>>             >>
>>>             >> Thanks for the thoughts.
>>>             >>
>>>             >> My view is that user analysis has three possible
>>>             pathways which make sense to consider:
>>>             >>
>>>             >> 1) Pure Java analysis using lcsim and outputting
>>>             histograms to AIDA files, viewable in JAS.
>>>             >>
>>>             >> 2) LCIO/ROOT analysis, reading in the LCIO recon
>>>             files, looping over these events, and making histograms
>>>             from a ROOT script.
>>>             >>
>>>             >> 3) Pure ROOT analysis, operating on a ROOT DST file.
>>>             >>
>>>             >> I don't really think that we need a DST containing
>>>             all of the information which is already present in the
>>>             final LCIO recon file.  This level of duplication is not
>>>             desirable.  Rather, the ROOT DST should contain physics
>>>             objects only, e.g. the equivalent of LCIO
>>>             ReconstructedParticles, Tracks, and Clusters, along with
>>>             event information.  This should be sufficient for doing
>>>             a pure physics analysis, e.g. good enough for most
>>>             users.  It is also likely that it could be represented
>>>             using simple arrays rather than classes, which to me is
>>>             desirable for this kind of format.
>>>             >>
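>>>             >> For illustration, a sketch of what such an array-based
>>>             >> micro-DST tree could look like; the branch names and the
>>>             >> array cap are placeholders, not a proposal for the
>>>             >> actual layout:
>>>             >>
>>>             >>     // Sketch: plain-array micro-DST branches; no custom
>>>             >>     // classes, so no dictionary is needed to read back.
>>>             >>     #include "TFile.h"
>>>             >>     #include "TTree.h"
>>>             >>
>>>             >>     int main() {
>>>             >>       const int MAX = 50;        // placeholder array cap
>>>             >>       TFile f("hps_dst.root", "RECREATE");
>>>             >>       TTree t("HPS", "micro-DST sketch");
>>>             >>
>>>             >>       Int_t ntrk = 0, nclus = 0;
>>>             >>       Float_t trk_px[MAX], trk_py[MAX], trk_pz[MAX];
>>>             >>       Float_t clus_E[MAX];
>>>             >>
>>>             >>       t.Branch("ntrk", &ntrk, "ntrk/I");
>>>             >>       t.Branch("trk_px", trk_px, "trk_px[ntrk]/F");
>>>             >>       t.Branch("trk_py", trk_py, "trk_py[ntrk]/F");
>>>             >>       t.Branch("trk_pz", trk_pz, "trk_pz[ntrk]/F");
>>>             >>       t.Branch("nclus", &nclus, "nclus/I");
>>>             >>       t.Branch("clus_E", clus_E, "clus_E[nclus]/F");
>>>             >>
>>>             >>       // ... fill the arrays from the reconstruction
>>>             >>       // output and call t.Fill() once per event ...
>>>             >>
>>>             >>       t.Write();
>>>             >>       f.Close();
>>>             >>       return 0;
>>>             >>     }
>>>             >>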
>>>             >> If one wants to look at the associated hits of the
>>>             tracks, or something similarly detailed, then it seems
>>>             to me that it would be better to use the #1 and #2
>>>             approaches, as we can then avoid "reinventing the wheel"
>>>             by making ROOT files that mimic the structure of the
>>>             existing LCIO output.  This approach would require
>>>             working from the LCIO output, but I really don't see a
>>>             problem there.  It is not onerous at all.  The API is
>>>             straightforward and well-documented, and examples can be
>>>             provided.  There is already a simple analysis script in
>>>             my examples that you linked which plots information from
>>>             Tracks in an LCIO file using ROOT histogramming.
>>>              Similar plots could easily be made for the hits, etc.
>>>             >>
>>>             >> I suppose one could demand that all this data be put
>>>             into ROOT including the hits, but you're left with the
>>>             same problem.  Someone still has to learn the API of
>>>             whatever classes are used to store the data, and the
>>>             class headers also need to be loaded to interpret the
>>>             data.  Whether that format is LCIO or ROOT, it is
>>>             essentially the same level of knowledge that would be
>>>             required.  My feeling is actually that this will be more
>>>             difficult/cumbersome to work with in ROOT rather than
>>>             LCIO.  I wonder why we can't just go with what we
>>>             already have, e.g. the LCIO API, rather than invent
>>>             something analogous which does not seem to serve a very
>>>             clear purpose.  One can already use what's there in the
>>>             linked example to look at the full events, so can we
>>>             start there and see how far we get?
>>>             >>
>>>             >> If someone has a clear use case where pure ROOT data
>>>             is needed at the lowest level of detail, I would
>>>             consider this request, but I have seen nothing concrete
>>>             so far along these lines.
>>>             >>
>>>             >> --Jeremy
>>>             >>
>>>             >> -----Original Message-----
>>>             >> From: Homer [mailto:[log in to unmask]
>>>             <mailto:[log in to unmask]>]
>>>             >> Sent: Wednesday, March 06, 2013 2:51 PM
>>>             >> To: Jaros, John A.; Graham, Mathew Thomas; McCormick,
>>>             Jeremy I.; Graf, Norman A.; Moreno, Omar; Nelson,
>>>             Timothy Knight
>>>             >> Subject: DSTs and work on slcio files using C++
>>>             >>
>>>             >> Hi,
>>>             >>
>>>             >> I decided not to comment during the meeting because
>>>             it might have created more contention and I also wanted
>>>             to hear Jeremy's, Norman's and Omar's responses first
>>>             before throwing this out there. That said, from the
>>>             point of view of someone who has been doing lcsim SiD
>>>             analysis on slcio files I find the problems with using
>>>             the two formats in HPS a little strange. For SiD we take
>>>             slcio files and then run jet clustering and flavor
>>>             tagging using C++ code in the lcfi and
>>>             >> lcfi+ packages. For the flavor tagging we write out
>>>             root files for
>>>             >> lcfi+ running the
>>>             >> TMVA training and then for both the jet clustering
>>>             and the flavor tagging we write out slcio files. I
>>>             believe Malachi has done his whole analysis in C++ as a
>>>             Marlin processor. I had also successfully tested reading
>>>             slcio files in ROOT using a recipe provided by Jeremy. I
>>>             dropped using it when I realized that it was quite
>>>             simple to write the analysis in java. Perhaps one
>>>             solution is to stick to doing all development, even for
>>>             the DST, in java/lcsim and to just provide examples of
>>>             how to access the data from C++/ROOT reading slcio
>>>             files. Jeremy had documented much of this long ago at:
>>>             >>
>>>             >>
>>>             https://confluence.slac.stanford.edu/display/hpsg/Loading+LCIO+Files+into+ROOT
>>>             >>
>>>             >> If we just provide some examples, wouldn't that help
>>>             to at least put out the current fires? This would also
>>>             avoid having to support numerous extra sets of data
>>>             (DSTs and microDSTs in both formats with multiple passes
>>>             and subsets)??
>>>             >> Maybe I'm wrong but I think one can provide simple
>>>             recipes or modules for accessing any of the slcio file
>>>             contents in ROOT.
>>>             >>
>>>             >>     Homer
>>>             >>
>>>             >
>>>             > <dst.pdf>
>>>
>>
>>
>


########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the HPS-SOFTWARE list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=HPS-SOFTWARE&A=1