
I don't think the issue is with the LCIO overhead; I think we are just
storing a lot of collections we can do without.  For example, at the moment
the final reconstructed LCIO file contains a collection of all raw tracker
hits, a collection of all fitted raw tracker hits and a collection containing
all of the fit parameters from the fit to the six samples of each of the
hits.  This is a lot of extra information that we probably don't want to
save in the final recon LCIO.
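
As a rough illustration of what dropping collections amounts to, here is a
minimal sketch.  It is not the lcsim or LCIO API: the event is modelled as a
plain Python dict, and every collection name and the helper slim_event are
hypothetical placeholders chosen for the example.

```python
# Sketch only: an event is modelled as a dict mapping collection name -> list.
# All names below are hypothetical, not the real HPS collection names.

def slim_event(event, keep):
    """Return a new event holding only the collections named in `keep`."""
    return {name: coll for name, coll in event.items() if name in keep}

# Collections we would want to persist in the final recon output (assumed).
KEEP = {"Tracks", "EcalClusters"}

event = {
    "RawTrackerHits": [0] * 600,          # all raw hits
    "FittedRawTrackerHits": [0] * 600,    # all fitted raw hits
    "ShapeFitParameters": [0.0] * 600,    # per-hit fit parameters
    "Tracks": ["trk0", "trk1"],
    "EcalClusters": ["clu0"],
}

slimmed = slim_event(event, KEEP)
print(sorted(slimmed))
```

An explicit keep-list, rather than a drop-list, means any collection added
to the reconstruction later stays out of the output until someone opts it in.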

I took a quick look at the contents of the reconstructed LCIO file and
found that the noisy channels weren't being filtered, which explains
the huge file size.  I went ahead and reran the reconstruction, this
time filtering noisy channels, and the file size dropped to 5GB.  It is still a
bit large, but I'm sure we can bring the size down quite a bit once we
remove all the extra unneeded collections.

--Omar Moreno
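
For scale, the figures quoted further down the thread can be turned into
per-event sizes and an inflation factor.  This is a back-of-the-envelope
sketch under assumptions the emails do not state: "Mb", "Gb" and "Gigs" are
read as decimal megabytes and gigabytes, and the 5.4 GB recon file is taken
to hold the same 500,000 events as the micro DST.  The function names are
made up for the example.

```python
# Rough event-size arithmetic from the numbers quoted in this thread.
# Assumptions: decimal units, and the recon file covers the same 500,000
# events as the micro DST.

MB = 1e6
GB = 1e9

def bytes_per_event(file_bytes, n_events):
    """Average on-disk size of one event."""
    return file_bytes / n_events

def inflation(output_bytes, input_bytes):
    """Ratio of an output file's size to the input it was produced from."""
    return output_bytes / input_bytes

n_events = 500_000
micro_dst = 29 * MB
recon_lcio = 5.4 * GB
evio_processed = 0.5 * 1.5 * GB   # reconstruction ran on half of a 1.5 GB file

print(f"micro DST  : {bytes_per_event(micro_dst, n_events):.0f} bytes/event")
print(f"recon LCIO : {bytes_per_event(recon_lcio, n_events):.0f} bytes/event")
print(f"recon / EVIO processed : {inflation(recon_lcio, evio_processed):.1f}x")
print(f"recon / micro DST      : {inflation(recon_lcio, micro_dst):.0f}x")
```

Substituting 5 GB for recon_lcio gives the corresponding numbers for the
rerun with noisy channels filtered.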


On Wed, Mar 6, 2013 at 7:54 PM, Stepan Stepanyan <[log in to unmask]> wrote:

>  Hi Omar,
>
> Thanks for a quick response. It will be very important to really know
> what the final size of the reconstructed event is. The number you have is
> x10 larger than the original event size. In the proposal we have x5
> inflation of the event size after reconstruction. At the meeting today Matt
> explained that the number in the proposal was not well motivated, but what
> you have seems like a good motivation. Even with removal of the FPGA data
> I am not sure the size will go down by x10, or by the x50 we probably want.
> Is this large size due to the overhead of the LCIO format?
>
> We are having these discussions about formats and analysis, and I think
> event size will play an important role in them. I do not think analysis
> of the data that HPS will get can be done on events that are x10 or even
> x5 larger than the original events.
>
> Regards, Stepan
>
>
> On 3/6/13 10:18 PM, Omar Moreno wrote:
>
>  Stepan,
>
>  The original EVIO file is 1.5 Gb but I only ran reconstruction on half
> the file.  There is a lot of extra information that is being stored in the
> final reconstructed LCIO file, such as FPGA Data, that should be removed
> so I'm sure that the file size is a bit inflated.  I'm sure once we filter
> out junk events and remove some unnecessary collections the file size will
> decrease significantly.
>
>  --Omar Moreno
>
>
> On Wed, Mar 6, 2013 at 7:05 PM, Stepan Stepanyan <[log in to unmask]> wrote:
>
>>  Omar,
>>
>> How big is the original file, before reconstruction?
>>
>> Thanks, Stepan
>>
>>
>> On 3/6/13 9:03 PM, Omar Moreno wrote:
>>
>>  Hello Everyone,
>>
>>  Just to give everyone an idea, a micro DST with basic track
>> information, hit information and Ecal cluster info is approx. 29 Mb/500,000
>> test run events.  The reconstructed LCIO file used to generate the root
>> file was approx. 5.4 Gigs and it took about 4 minutes.  I expect the size
>> to increase for data from an electron run but it shouldn't be by much.
>>  I'll go ahead and study this using MC data and see how much bigger the
>> file gets.
>>
>>  --Omar Moreno
>>
>>
>>
>> On Wed, Mar 6, 2013 at 4:29 PM, Nelson, Timothy Knight <
>> [log in to unmask]> wrote:
>>
>>> Hi Stepan,
>>>
>>> I agree 100%.  I think we want exactly what you proposed a year ago: a
>>> format with physics objects suitable for physics analysis (the proposed
>>> "micro-DST").  This kind of thing is relatively easy to provide and will be
>>> a very useful thing to have.  In fact, the kind of "flat ntuple" format
>>> that Omar began with can, I believe, be read in and operated on with PAW,
>>> since the .rz format is the same.  However, if he goes the next step as has
>>> been recommended in the software group, and writes classes to the ROOT file
>>> that require a dictionary to read back, the data format will be ROOT only.
>>>
>>> A couple of points that are important to understand...
>>>
>>> 1) Homer brings up an important point, which is the fact that the only
>>> way we have to write these ROOT files is to use the LCIO C++ API.  That is
>>> to say, one does the java reconstruction in lcsim that creates LCIO objects
>>> and writes out an LCIO file.  Then one runs a separate C++ program that
>>> reads in the LCIO objects with the LCIO C++ API and outputs this Ntuple
>>> using ROOT classes. Therefore, no information that is currently not
>>> persisted in the LCIO EDM by our reconstruction will ever be available in
>>> the ROOT Ntuple.  So, this business of writing out text files for vertexing
>>> and other information not currently being written to LCIO does not go away
>>> by creating ROOT Ntuples.  The only way to eliminate that issue is to
>>> improve the completeness of our LCIO-based EDM.  For example, Matt has been
>>> writing out vertexing information to text files and reading it back into
>>> ROOT.  However, LCIO DOES include vertex objects and if we created these
>>> during reconstruction, we would get that information in the LCIO file
>>> automatically, and it would then easily be accessible later on via LCIO.
>>> There are a few examples of data types we might want to persist that don't
>>> have an LCIO class, but LCIO includes a "Generic Object" class that can be
>>> used to encapsulate anything we might want to add.  Again, only by getting
>>> the data we want in LCIO will it ever be accessible in ROOT.  So, in my
>>> opinion, this is where we should be focusing our attention.
>>>
>>> 2) As far as how to do ROOT-based analysis, Homer again touched on the
>>> heart of the matter.  One can create a ROOT Ntuple and perform analysis on
>>> that.  In practice, this rarely means using ROOT on the command line, or
>>> even CINT macros, since ROOT's C interpreter is so badly broken that it is
>>> not really usable for anything other than making final plots from
>>> already-analyzed data.  In practice, one usually runs some standalone
>>> compiled C++ that uses the ROOT libraries to do the analysis on a ROOT DST.
>>> For this reason, it is just as easy to have that compiled C++ use the LCIO
>>> C++ API to access the LCIO objects directly from the LCIO DST, and then use
>>> all of the familiar ROOT tools in that code to do the analysis, writing out
>>> whatever final histograms or post-analysis ntuples one might want into a
>>> ROOT file for later plotting.  The only difference is that in the former
>>> scenario, one learns the ROOT EDM that we invent for the DST, and for the
>>> latter, one learns the LCIO EDM.  To the extent that one is a mirror
>>> reflection of the other, one has to do just as much work writing the C++
>>> analysis code either way.  That is why it doesn't make any sense to
>>> duplicate the entire LCIO EDM in ROOT (one file for the price of two!) and
>>> why we should really only be considering creation of a new ROOT-based
>>> "micro-DST" format aimed at physics analysis that will be much slimmer than
>>> the LCIO.  Those that need more than is in the "micro-DST" can very easily
>>> run their C++/ROOT analysis code accessing the data directly from LCIO
>>> using the LCIO C++ API.
>>>
>>> Cheers,
>>> Tim
>>>
>>> On Mar 6, 2013, at 3:49 PM, Stepan Stepanyan <[log in to unmask]> wrote:
>>>
>>> > Hello Homer and Jeremy,
>>> >
>>> > It seems we all have the right ideas, and very similar ideas, on how
>>> > analysis of the data must be done. The confusion, it seems to me,
>>> > comes from the definitions of "analysis" and "DSTs". When I brought up
>>> > the question of DSTs about a year ago, and even sent out a possible
>>> > format (attached document), I basically wanted what Jeremy said in the
>>> > second sentence after (3): physics objects only. What Omar showed
>>> > today was very different from what I would describe as DSTs. I
>>> > understand Matt's point that in some cases you will need fine details,
>>> > but I am not sure that everyone will need that level of detail.
>>> > So I still think that if we are talking about DSTs, the format should
>>> > be "physics objects only". And if Omar can make use of what I proposed
>>> > a year ago, that will be great.
>>> >
>>> > As for general analysis, if we stick with (1), then we will make a
>>> > large number of collaborators who are used to doing analysis in ROOT
>>> > quite unhappy. I understand that duplicating processed data in many
>>> > formats is also not a reasonable approach. So, if (2) means (sorry for
>>> > my ignorance) that we can have some kind of "portal" that can connect
>>> > the LCIO recon file to ROOT, then it is probably the best way to go.
>>> >
>>> > Again, sorry if I am misinterpreting the issue and/or repeating what
>>> > was already clear from your emails.
>>> >
>>> > Regards, Stepan
>>> >
>>> > On 3/6/13 6:10 PM, McCormick, Jeremy I. wrote:
>>> >> Hi, Homer.
>>> >>
>>> >> Thanks for the thoughts.
>>> >>
>>> >> My view is that user analysis has three possible pathways which make
>>> >> sense to consider:
>>> >>
>>> >> 1) Pure Java analysis using lcsim and outputting histograms to AIDA
>>> >> files, viewable in JAS.
>>> >>
>>> >> 2) LCIO/ROOT analysis, reading in the LCIO recon files, looping over
>>> >> these events, and making histograms from a ROOT script.
>>> >>
>>> >> 3) Pure ROOT analysis, operating on a ROOT DST file.
>>> >>
>>> >> I don't really think that we need a DST containing all of the
>>> >> information which is already present in the final LCIO recon file.
>>> >> This level of duplication is not desirable.  Rather, the ROOT DST
>>> >> should contain physics objects only, e.g. the equivalent of LCIO
>>> >> ReconstructedParticles, Tracks, and Clusters, along with event
>>> >> information.  This should be sufficient for doing a pure physics
>>> >> analysis, i.e. good enough for most users.  It is also likely that it
>>> >> could be represented using simple arrays rather than classes, which to
>>> >> me is desirable for this kind of format.
>>> >>
>>> >> If one wants to look at the associated hits of the tracks, or
>>> >> something similarly detailed, then it seems to me that it would be
>>> >> better to use the #1 and #2 approaches, as we can then avoid
>>> >> "reinventing the wheel" by making ROOT files that mimic the structure
>>> >> of the existing LCIO output.  This approach would require working from
>>> >> the LCIO output, but I really don't see a problem there.  It is not
>>> >> onerous at all.  The API is straightforward and well-documented, and
>>> >> examples can be provided.  There is already a simple analysis script in
>>> >> my examples that you linked which plots information from Tracks in an
>>> >> LCIO file using ROOT histogramming.  Similar plots could easily be made
>>> >> for the hits, etc.
>>> >>
>>> >> I suppose one could demand that all this data be put into ROOT
>>> >> including the hits, but you're left with the same problem.  Someone
>>> >> still has to learn the API of whatever classes are used to store the
>>> >> data, and the class headers also need to be loaded to interpret the
>>> >> data.  Whether that format is LCIO or ROOT, it is essentially the same
>>> >> level of knowledge that would be required.  My feeling is actually that
>>> >> this will be more difficult/cumbersome to work with in ROOT rather than
>>> >> LCIO.  I wonder why we can't just go with what we already have, i.e.
>>> >> the LCIO API, rather than invent something analogous which does not
>>> >> seem to serve a very clear purpose.  One can already use what's there
>>> >> in the linked example to look at the full events, so can we start there
>>> >> and see how far we get?
>>> >>
>>> >> If someone has a clear use case where pure ROOT data is needed at the
>>> >> lowest level of detail, I would consider this request, but I have seen
>>> >> nothing concrete so far along these lines.
>>> >>
>>> >> --Jeremy
>>> >>
>>> >> -----Original Message-----
>>> >> From: Homer [mailto:[log in to unmask]]
>>> >> Sent: Wednesday, March 06, 2013 2:51 PM
>>> >> To: Jaros, John A.; Graham, Mathew Thomas; McCormick, Jeremy I.;
>>> Graf, Norman A.; Moreno, Omar; Nelson, Timothy Knight
>>> >> Subject: DSTs and work on slcio files using C++
>>> >>
>>> >> Hi,
>>> >>
>>> >> I decided not to comment during the meeting because it might have
>>> >> created more contention and I also wanted to hear Jeremy's, Norman's
>>> >> and Omar's responses first before throwing this out there. That said,
>>> >> from the point of view of someone who has been doing lcsim SiD analysis
>>> >> on slcio files I find the problems with using the two formats in HPS a
>>> >> little strange. For SiD we take slcio files and then run jet clustering
>>> >> and flavor tagging using C++ code in the lcfi and lcfi+ packages. For
>>> >> the flavor tagging we write out root files for lcfi+ running the TMVA
>>> >> training and then for both the jet clustering and the flavor tagging we
>>> >> write out slcio files. I believe Malachi has done his whole analysis in
>>> >> C++ as a Marlin processor. I had also successfully tested reading slcio
>>> >> files in ROOT using a recipe provided by Jeremy. I dropped using it when
>>> >> I realized that it was quite simple to write the analysis in java.
>>> >> Perhaps one solution is to stick to doing all development, even for the
>>> >> DST, in java/lcsim and to just provide examples of how to access the
>>> >> data from C++/ROOT reading slcio files. Jeremy had documented much of
>>> >> this long ago at:
>>> >>
>>> >> https://confluence.slac.stanford.edu/display/hpsg/Loading+LCIO+Files+into+ROOT
>>> >>
>>> >> If we just provide some examples, wouldn't that help to at least put
>>> >> out the current fires? This would also avoid having to support numerous
>>> >> extra sets of data (DSTs and microDSTs in both formats with multiple
>>> >> passes and subsets).
>>> >> Maybe I'm wrong but I think one can provide simple recipes or modules
>>> >> for accessing any of the slcio file contents in ROOT.
>>> >>
>>> >>     Homer
>>> >>
>>> >>
>>> >>
>>> ########################################################################
>>> >> Use REPLY-ALL to reply to list
>>> >>
>>> >> To unsubscribe from the HPS-SOFTWARE list, click the following link:
>>> >> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=HPS-SOFTWARE&A=1
>>> >
>>>  > <dst.pdf>
>>>
>>>
>>
>
