I don't think the issue is the LCIO overhead; I think we are just storing a lot of collections we can do without.  For example, at the moment the final reconstructed LCIO file contains a collection of all raw tracker hits, a collection of all fitted raw tracker hits, and a collection containing all of the fit parameters from the fit to the six samples of each hit.  This is a lot of extra information that we probably don't want to save in the final recon LCIO.

I took a quick look at the contents of the reconstructed LCIO file and found that the noisy channels weren't being filtered, which explains the huge file size.  I reran the reconstruction, this time filtering noisy channels, and the file size dropped to 5 GB.  It's still a bit large, but I'm sure we can bring the size down quite a bit once we remove all these extra unneeded collections.

--Omar Moreno

On Wed, Mar 6, 2013 at 7:54 PM, Stepan Stepanyan <[log in to unmask]> wrote:
Hi Omar,

Thanks for the quick response. It will be very important to know what the
final size of the reconstructed event really is. The number you have is x10
larger than the original event size. In the proposal we have x5 inflation
of the event size after reconstruction. At the meeting today Matt explained
that the number in the proposal was not well motivated, but what you have
seems like good motivation. Even with removal of the FPGA data I am not
sure the size will go down by x10, or by the x50 we probably want.
Is this large size due to the overhead of the LCIO format?

We are having these discussions about formats and analysis, and I think
event size will play an important role in them. I do not think analysis
of the data that HPS will get can be done on events that are x10 or even
x5 larger than the original event.

Regards, Stepan

On 3/6/13 10:18 PM, Omar Moreno wrote:

The original EVIO file is 1.5 GB, but I only ran reconstruction on half the file.  There is a lot of extra information being stored in the final reconstructed LCIO file, such as FPGA data, that should be removed, so I'm sure the file size is a bit inflated.  I'm sure that once we filter out junk events and remove some unnecessary collections the file size will decrease significantly.

--Omar Moreno
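
The inflation factor under discussion can be estimated from the numbers quoted in this thread. The sketch below is only a rough reading of the emails: it assumes the sizes are decimal gigabytes, that "half the file" means half of the 1.5 GB of raw input, and that the 5.4 GB (unfiltered) and 5 GB (noisy channels filtered) reconstructed files both correspond to that same half. None of these assumptions is confirmed in the thread.

```python
# Rough size inflation of the reconstructed LCIO file relative to the raw EVIO input.
# All inputs are figures quoted in the thread; how they relate to each other is assumed.
RAW_EVIO_GB = 1.5          # size of the full original EVIO file
FRACTION_PROCESSED = 0.5   # reconstruction was run on half the file
PROPOSAL_FACTOR = 5.0      # inflation assumed in the proposal
RECON_GB = {
    "unfiltered": 5.4,
    "noisy channels filtered": 5.0,
}


def inflation(recon_gb, raw_gb, fraction):
    """Ratio of reconstructed size to the raw size that was actually processed."""
    return recon_gb / (raw_gb * fraction)


factors = {
    label: inflation(size, RAW_EVIO_GB, FRACTION_PROCESSED)
    for label, size in RECON_GB.items()
}

for label, factor in factors.items():
    print(f"{label}: x{factor:.1f} (proposal assumes x{PROPOSAL_FACTOR:.0f})")
```

Under these assumptions the inflation comes out at roughly x7, which is above the x5 in the proposal and of the same order as the x10 mentioned in the reply.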

On Wed, Mar 6, 2013 at 7:05 PM, Stepan Stepanyan <[log in to unmask]> wrote:

How big is the original file, before reconstruction?

Thanks, Stepan

On 3/6/13 9:03 PM, Omar Moreno wrote:
Hello Everyone, 

Just to give everyone an idea, a micro DST with basic track information, hit information and Ecal cluster info is approx. 29 MB per 500,000 test run events.  The reconstructed LCIO file used to generate the ROOT file was approx. 5.4 GB and it took about 4 minutes.  I expect the size to increase for data from an electron run, but it shouldn't be by much.  I'll go ahead and study this using MC data and see how much bigger the file gets.

--Omar Moreno
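
To put the figures above on a per-event basis, here is a small arithmetic sketch. It assumes the quoted sizes are decimal megabytes and gigabytes (10^6 and 10^9 bytes) and that the reconstructed LCIO file holds the same 500,000 events as the micro DST; the thread does not state either point explicitly.

```python
# Per-event sizes from the figures quoted in the message above.
# Assumptions: decimal units, and both files cover the same 500,000 events.
N_EVENTS = 500_000
MICRO_DST_BYTES = 29e6     # micro DST size, read as 29 megabytes
RECON_LCIO_BYTES = 5.4e9   # reconstructed LCIO size, read as 5.4 gigabytes


def bytes_per_event(total_bytes, n_events):
    """Average on-disk size of one event."""
    return total_bytes / n_events


micro_per_event = bytes_per_event(MICRO_DST_BYTES, N_EVENTS)
recon_per_event = bytes_per_event(RECON_LCIO_BYTES, N_EVENTS)
reduction = RECON_LCIO_BYTES / MICRO_DST_BYTES

print(f"micro DST : {micro_per_event:.0f} bytes/event")
print(f"recon LCIO: {recon_per_event / 1e3:.1f} kB/event")
print(f"reduction : about {reduction:.0f}x")
```

With these inputs the micro DST averages about 58 bytes per event against roughly 10.8 kB per event for the reconstructed file, a reduction of a little under 190x.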

On Wed, Mar 6, 2013 at 4:29 PM, Nelson, Timothy Knight <[log in to unmask]> wrote:
Hi Stepan,

I agree 100%.  I think we want exactly what you proposed a year ago: a format with physics objects suitable for physics analysis (the proposed "micro-DST").  This kind of thing is relatively easy to provide and will be a very useful thing to have.  In fact, the kind of "flat ntuple" format that Omar began with can, I believe, be read in and operated on with PAW, since the .rz format is the same.  However, if he goes the next step, as has been recommended in the software group, and writes classes to the ROOT file that require a dictionary to read back, the data format will be ROOT only.

A couple of points that are important to understand...

1) Homer brings up an important point, which is the fact that the only way we have to write these ROOT files is to use the LCIO C++ API.  That is to say, one does the Java reconstruction in lcsim that creates LCIO objects and writes out an LCIO file.  Then one runs a separate C++ program that reads in the LCIO objects with the LCIO C++ API and outputs this Ntuple using ROOT classes.  Therefore, no information that is not currently persisted in the LCIO EDM by our reconstruction will ever be available in the ROOT Ntuple.  So, this business of writing out text files for vertexing and other information not currently being written to LCIO does not go away by creating ROOT Ntuples.  The only way to eliminate that issue is to improve the completeness of our LCIO-based EDM.  For example, Matt has been writing out vertexing information to text files and reading it back into ROOT.  However, LCIO DOES include vertex objects, and if we created these during reconstruction, we would get that information in the LCIO file automatically, and it would then be easily accessible later on via LCIO.  There are a few examples of data types we might want to persist that don't have an LCIO class, but LCIO includes a "Generic Object" class that can be used to encapsulate anything we might want to add.  Again, only by getting the data we want in LCIO will it ever be accessible in ROOT.  So, in my opinion, this is where we should be focusing our attention.

2) As far as how to do ROOT-based analysis, Homer again touched on the heart of the matter.  One can create a ROOT Ntuple and perform analysis on that.  In practice, this rarely means using ROOT on the command line, or even CINT macros, since ROOT's C interpreter is so badly broken that it is not really usable for anything other than making final plots from already-analyzed data.  Instead, one usually runs some standalone compiled C++ that uses the ROOT libraries to do the analysis on a ROOT DST.  For this reason, it is just as easy to have that compiled C++ use the LCIO C++ API to access the LCIO objects directly from the LCIO DST, and then use all of the familiar ROOT tools in that code to do the analysis, writing out whatever final histograms or post-analysis ntuples one might want into a ROOT file for later plotting.  The only difference is that in the former scenario, one learns the ROOT EDM that we invent for the DST, and in the latter, one learns the LCIO EDM.  To the extent that one is a mirror reflection of the other, one has to do just as much work writing the C++ analysis code either way.  That is why it doesn't make any sense to duplicate the entire LCIO EDM in ROOT (one file for the price of two!) and why we should really only be considering creation of a new ROOT-based "micro-DST" format aimed at physics analysis that will be much slimmer than the LCIO.  Those that need more than is in the "micro-DST" can very easily run their C++/ROOT analysis code accessing the data directly from LCIO using the LCIO C++ API.
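
One way to picture the flat, physics-objects-only "micro-DST" discussed in this thread is as a set of plain arrays: a per-event object count plus flattened per-object values. The sketch below is only an illustration of that layout. It uses plain Python lists rather than ROOT or LCIO, and every field name in it (`n_tracks`, `track_px`, and so on) is hypothetical.

```python
# Flat, array-based event layout (illustrative only; field names are hypothetical).
from itertools import accumulate


def flatten_events(events):
    """Turn a list of events (each a list of track dicts) into flat arrays."""
    flat = {"n_tracks": [], "track_px": [], "track_py": [], "track_pz": []}
    for tracks in events:
        flat["n_tracks"].append(len(tracks))
        for track in tracks:
            flat["track_px"].append(track["px"])
            flat["track_py"].append(track["py"])
            flat["track_pz"].append(track["pz"])
    return flat


def tracks_in_event(flat, i):
    """Recover the (px, py, pz) tuples of event i from the flat arrays."""
    # Running sum of the counts gives the start offset of each event.
    offsets = [0] + list(accumulate(flat["n_tracks"]))
    lo, hi = offsets[i], offsets[i + 1]
    return list(zip(flat["track_px"][lo:hi],
                    flat["track_py"][lo:hi],
                    flat["track_pz"][lo:hi]))


# Three toy events with one, zero and two tracks.
events = [
    [{"px": 0.1, "py": 0.0, "pz": 1.2}],
    [],
    [{"px": -0.2, "py": 0.1, "pz": 0.9}, {"px": 0.3, "py": -0.1, "pz": 1.5}],
]
flat = flatten_events(events)
print(flat["n_tracks"])
print(tracks_in_event(flat, 2))
```

Because every column is a plain array of numbers, a layout like this can be read back without any class definitions, which is the trade-off against dictionary-backed classes raised earlier in this message.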


On Mar 6, 2013, at 3:49 PM, Stepan Stepanyan <[log in to unmask]> wrote:

> Hello Homer and Jeremy,
> It seems we all have the right ideas, and very similar ones, about how analysis of the data must be done.
> The confusion, it seems to me, comes from the definitions of "analysis" and "DST". When I brought up the question of DSTs about a year ago, and even sent out a possible format (attached document), I basically wanted what Jeremy said in the second sentence after (3): physics objects only. What Omar showed today was very different from what I would describe as DSTs. I understand Matt's point that in some cases you will need fine details, but I am not sure everyone will need that level of detail.
> So I still think that if we are talking about DSTs, the format should be "physics objects only". And if Omar can make use of what I proposed a year ago, that will be great.
> As for general analysis, if we stick with (1), then we will make a large number of collaborators who are used to doing analysis in ROOT quite unhappy. I understand that duplicating processed data in many formats is also not a reasonable approach. So, if (2) means (sorry for my ignorance) we can have some kind of "portal" that can connect the LCIO recon file to ROOT, then it is probably the best way to go.
> Again, sorry if I am misinterpreting the issue and/or repeating what was already clear from your emails.
> Regards, Stepan
> On 3/6/13 6:10 PM, McCormick, Jeremy I. wrote:
>> Hi, Homer.
>> Thanks for the thoughts.
>> My view is that user analysis has three possible pathways which make sense to consider:
>> 1) Pure Java analysis using lcsim and outputting histograms to AIDA files, viewable in JAS.
>> 2) LCIO/ROOT analysis, reading in the LCIO recon files, looping over these events, and making histograms from a ROOT script.
>> 3) Pure ROOT analysis, operating on a ROOT DST file.
>> I don't really think that we need a DST containing all of the information which is already present in the final LCIO recon file.  This level of duplication is not desirable.  Rather, the ROOT DST should contain physics objects only, e.g. the equivalent of LCIO ReconstructedParticles, Tracks, and Clusters, along with event information.  This should be sufficient for doing a pure physics analysis, e.g. good enough for most users.  It is also likely that it could be represented using simple arrays rather than classes, which to me is desirable for this kind of format.
>> If one wants to look at the associated hits of the tracks, or something similarly detailed, then it seems to me that it would be better to use the #1 and #2 approaches, as we can then avoid "reinventing the wheel" by making ROOT files that mimic the structure of the existing LCIO output.  This approach would require working from the LCIO output, but I really don't see a problem there.  It is not onerous at all.  The API is straightforward and well-documented, and examples can be provided.  There is already a simple analysis script in my examples that you linked which plots information from Tracks in an LCIO file using ROOT histogramming.  Similar plots could easily be made for the hits, etc.
>> I suppose one could demand that all this data be put into ROOT including the hits, but you're left with the same problem.  Someone still has to learn the API of whatever classes are used to store the data, and the class headers also need to be loaded to interpret the data.  Whether that format is LCIO or ROOT, it is essentially the same level of knowledge that would be required.  My feeling is actually that this will be more difficult/cumbersome to work with in ROOT rather than LCIO.  I wonder why we can't just go with what we already have, e.g. the LCIO API, rather than invent something analogous which does not seem to serve a very clear purpose.  One can already use what's there in the linked example to look at the full events, so can we start there and see how far we get?
>> If someone has a clear use case where pure ROOT data is needed at the lowest level of detail, I would consider this request, but I have seen nothing concrete so far along these lines.
>> --Jeremy
>> -----Original Message-----
>> From: Homer [mailto:[log in to unmask]]
>> Sent: Wednesday, March 06, 2013 2:51 PM
>> To: Jaros, John A.; Graham, Mathew Thomas; McCormick, Jeremy I.; Graf, Norman A.; Moreno, Omar; Nelson, Timothy Knight
>> Subject: DSTs and work on slcio files using C++
>> Hi,
>> I decided not to comment during the meeting because it might have created more contention and I also wanted to hear Jeremy's, Norman's and Omar's responses first before throwing this out there. That said, from the point of view of someone who has been doing lcsim SiD analysis on slcio files I find the problems with using the two formats in HPS a little strange. For SiD we take slcio files and then run jet clustering and flavor tagging using C++ code in the lcfi and lcfi+ packages. For the flavor tagging we write out root files for lcfi+ running the TMVA training and then for both the jet clustering and the flavor tagging we write out slcio files. I believe Malachi has done his whole analysis in C++ as a Marlin processor. I had also successfully tested reading slcio files in ROOT using a recipe provided by Jeremy. I dropped using it when I realized that it was quite simple to write the analysis in java. Perhaps one solution is to stick to doing all development, even for the DST, in java/lcsim and to just provide examples of how to access the data from C++/ROOT reading slcio files. Jeremy had documented much of this long ago at:
>> https://confluence.slac.stanford.edu/display/hpsg/Loading+LCIO+Files+into+ROOT
>> If we just provide some examples, wouldn't that help to at least put out the current fires? This would also avoid having to support numerous extra sets of data (DSTs and microDSTs in both formats with multiple passes and subsets)??
>> Maybe I'm wrong but I think one can provide simple recipes or modules for accessing any of the slcio file contents in ROOT.
>>     Homer
>> ########################################################################
>> Use REPLY-ALL to reply to list
>> To unsubscribe from the HPS-SOFTWARE list, click the following link:
>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=HPS-SOFTWARE&A=1
> <dst.pdf>
