I don't think the issue is LCIO overhead; I think we are just storing a lot of collections we can do without. For example, at the moment the final reconstructed LCIO file contains a collection of all raw tracker hits, a collection of all fitted raw tracker hits, and a collection containing all of the fit parameters from the fit to the six samples of each hit. This is a lot of extra information that we probably don't want to save in the final recon LCIO file.

I took a quick look at the contents of the reconstructed LCIO file and found that the noisy channels weren't being filtered, which explains the huge file size. I went ahead and reran the reconstruction, this time filtering noisy channels, and the file size dropped to 5 GB. It's still a bit large, but I'm sure we can drop the size down quite a bit more once we remove these extra, unneeded collections.

--Omar Moreno

On Wed, Mar 6, 2013 at 7:54 PM, Stepan Stepanyan <[log in to unmask]> wrote:

> Hi Omar,
>
> Thanks for the quick response. It will be very important to know what the
> final size of the reconstructed event really is. The number you have is
> x10 larger than the original event size. In the proposal we have a x5
> inflation of the event size after reconstruction. At the meeting today
> Matt explained that the number in the proposal was not well motivated,
> but what you have seems like a good motivation. Even with the removal of
> FPGA data I am not sure the size will go down by x10, or, as we probably
> want, x50. Is this large size due to the overhead of the LCIO format?
>
> We are having these discussions about formats and analysis, and I think
> event size will play an important role in them. I do not think analysis
> of the data that HPS will take can be done on events that are x10 or
> even x5 larger than the original events.
>
> Regards, Stepan
>
> On 3/6/13 10:18 PM, Omar Moreno wrote:
>
> Stepan,
>
> The original EVIO file is 1.5 GB, but I only ran reconstruction on half
> the file.
> There is a lot of extra information being stored in the final
> reconstructed LCIO file, such as the FPGA data, that should be removed,
> so I'm sure the file size is a bit inflated. Once we filter out junk
> events and remove some unnecessary collections, the file size will
> decrease significantly.
>
> --Omar Moreno
>
> On Wed, Mar 6, 2013 at 7:05 PM, Stepan Stepanyan <[log in to unmask]> wrote:
>
>> Omar,
>>
>> How big is the original file, before reconstruction?
>>
>> Thanks, Stepan
>>
>> On 3/6/13 9:03 PM, Omar Moreno wrote:
>>
>> Hello Everyone,
>>
>> Just to give everyone an idea, a micro-DST with basic track
>> information, hit information, and Ecal cluster info is approx. 29 MB
>> per 500,000 test run events. The reconstructed LCIO file used to
>> generate the ROOT file was approx. 5.4 GB, and it took about 4 minutes
>> to generate. I expect the size to increase for data from an electron
>> run, but it shouldn't be by much. I'll go ahead and study this using MC
>> data and see how much bigger the file gets.
>>
>> --Omar Moreno
>>
>> On Wed, Mar 6, 2013 at 4:29 PM, Nelson, Timothy Knight <[log in to unmask]> wrote:
>>
>>> Hi Stepan,
>>>
>>> I agree 100%. I think we want exactly what you proposed a year ago: a
>>> format with physics objects suitable for physics analysis (the
>>> proposed "micro-DST"). This kind of thing is relatively easy to
>>> provide and will be a very useful thing to have. In fact, the kind of
>>> "flat ntuple" format that Omar began with can, I believe, be read in
>>> and operated on with PAW, since the .rz format is the same. However,
>>> if he goes the next step, as has been recommended in the software
>>> group, and writes classes to the ROOT file that require a dictionary
>>> to read back, the data format will be ROOT-only.
>>>
>>> A couple of points that are important to understand...
>>>
>>> 1) Homer brings up an important point: the only way we have to write
>>> these ROOT files is to use the LCIO C++ API. That is to say, one runs
>>> the Java reconstruction in lcsim, which creates LCIO objects and
>>> writes out an LCIO file. Then one runs a separate C++ program that
>>> reads in the LCIO objects with the LCIO C++ API and outputs the ntuple
>>> using ROOT classes. Therefore, no information that is not currently
>>> persisted in the LCIO EDM by our reconstruction will ever be available
>>> in the ROOT ntuple. So this business of writing out text files for
>>> vertexing and other information not currently being written to LCIO
>>> does not go away by creating ROOT ntuples. The only way to eliminate
>>> that issue is to improve the completeness of our LCIO-based EDM. For
>>> example, Matt has been writing out vertexing information to text files
>>> and reading it back into ROOT. However, LCIO DOES include vertex
>>> objects, and if we created these during reconstruction, we would get
>>> that information in the LCIO file automatically; it would then be
>>> easily accessible later on via LCIO. There are a few examples of data
>>> types we might want to persist that don't have an LCIO class, but LCIO
>>> includes a "Generic Object" class that can be used to encapsulate
>>> anything we might want to add. Again, only by getting the data we want
>>> into LCIO will it ever be accessible in ROOT. So, in my opinion, this
>>> is where we should be focusing our attention.
>>>
>>> 2) As far as how to do ROOT-based analysis, Homer again touched on the
>>> heart of the matter. One can create a ROOT ntuple and perform analysis
>>> on that. In practice, this rarely means using ROOT on the command
>>> line, or even CINT macros, since ROOT's C interpreter is so badly
>>> broken that it is not really usable for anything other than making
>>> final plots from already-analyzed data.
>>> In practice, one usually runs standalone compiled C++ that uses the
>>> ROOT libraries to do the analysis on a ROOT DST. For this reason, it
>>> is just as easy to have that compiled C++ use the LCIO C++ API to
>>> access the LCIO objects directly from the LCIO DST, and then use all
>>> of the familiar ROOT tools in that code to do the analysis, writing
>>> out whatever final histograms or post-analysis ntuples one might want
>>> into a ROOT file for later plotting. The only difference is that in
>>> the former scenario one learns the ROOT EDM that we invent for the
>>> DST, and in the latter one learns the LCIO EDM. To the extent that one
>>> is a mirror reflection of the other, one has to do just as much work
>>> writing the C++ analysis code either way. That is why it doesn't make
>>> any sense to duplicate the entire LCIO EDM in ROOT (one file for the
>>> price of two!), and why we should really only be considering creation
>>> of a new ROOT-based "micro-DST" format aimed at physics analysis that
>>> will be much slimmer than the LCIO. Those who need more than is in the
>>> "micro-DST" can very easily run their C++/ROOT analysis code accessing
>>> the data directly from LCIO using the LCIO C++ API.
>>>
>>> Cheers,
>>> Tim
>>>
>>> On Mar 6, 2013, at 3:49 PM, Stepan Stepanyan <[log in to unmask]> wrote:
>>>
>>> > Hello Homer and Jeremy,
>>> >
>>> > It seems we all have the right ideas, and very similar ideas, on how
>>> > the analysis of the data must be done. The confusion, it looks to
>>> > me, comes from the definitions of "analysis" and "DST". When I
>>> > brought up the question of DSTs about a year ago, and even sent out
>>> > a possible format (attached document), I basically wanted what
>>> > Jeremy said in the second sentence after (3): physics objects only.
>>> > What Omar showed today was very different from what I would describe
>>> > as a DST.
>>> > I understand Matt's point that in some cases you will need the fine
>>> > details, but I am not sure everyone will need that level of detail.
>>> > So I still think that if we are talking about DSTs, the format
>>> > should be "physics objects only". And if Omar can make use of what I
>>> > proposed a year ago, that will be great.
>>> >
>>> > As for general analysis, if we stick with (1), then we will make a
>>> > large number of collaborators who are used to doing analysis in ROOT
>>> > quite unhappy. I understand that duplicating processed data in many
>>> > formats is also not a reasonable approach. So, if (2) means (sorry
>>> > for my ignorance) that we can have some kind of "portal" that
>>> > connects the LCIO recon files to ROOT, then it is probably the best
>>> > way to go.
>>> >
>>> > Again, sorry if I am misinterpreting the issue and/or repeating what
>>> > was already clear from your emails.
>>> >
>>> > Regards, Stepan
>>> >
>>> > On 3/6/13 6:10 PM, McCormick, Jeremy I. wrote:
>>> >> Hi, Homer.
>>> >>
>>> >> Thanks for the thoughts.
>>> >>
>>> >> My view is that user analysis has three possible pathways which
>>> >> make sense to consider:
>>> >>
>>> >> 1) Pure Java analysis using lcsim, outputting histograms to AIDA
>>> >> files viewable in JAS.
>>> >>
>>> >> 2) LCIO/ROOT analysis: reading in the LCIO recon files, looping
>>> >> over the events, and making histograms from a ROOT script.
>>> >>
>>> >> 3) Pure ROOT analysis, operating on a ROOT DST file.
>>> >>
>>> >> I don't really think that we need a DST containing all of the
>>> >> information which is already present in the final LCIO recon file.
>>> >> This level of duplication is not desirable. Rather, the ROOT DST
>>> >> should contain physics objects only, e.g. the equivalent of LCIO
>>> >> ReconstructedParticles, Tracks, and Clusters, along with event
>>> >> information. This should be sufficient for doing a pure physics
>>> >> analysis, i.e. good enough for most users.
>>> >> It is also likely that this could be represented using simple
>>> >> arrays rather than classes, which to me is desirable for this kind
>>> >> of format.
>>> >>
>>> >> If one wants to look at the associated hits of the tracks, or
>>> >> something similarly detailed, then it seems to me that it would be
>>> >> better to use approaches #1 and #2, as we can then avoid
>>> >> "reinventing the wheel" by making ROOT files that mimic the
>>> >> structure of the existing LCIO output. This approach would require
>>> >> working from the LCIO output, but I really don't see a problem
>>> >> there. It is not onerous at all. The API is straightforward and
>>> >> well-documented, and examples can be provided. There is already a
>>> >> simple analysis script in my examples that you linked which plots
>>> >> information from Tracks in an LCIO file using ROOT histogramming.
>>> >> Similar plots could easily be made for the hits, etc.
>>> >>
>>> >> I suppose one could demand that all this data be put into ROOT,
>>> >> including the hits, but then you're left with the same problem.
>>> >> Someone still has to learn the API of whatever classes are used to
>>> >> store the data, and the class headers also need to be loaded to
>>> >> interpret the data. Whether that format is LCIO or ROOT, essentially
>>> >> the same level of knowledge is required. My feeling is actually
>>> >> that this will be more difficult and cumbersome to work with in
>>> >> ROOT than in LCIO. I wonder why we can't just go with what we
>>> >> already have, i.e. the LCIO API, rather than invent something
>>> >> analogous which does not seem to serve a very clear purpose. One
>>> >> can already use what's in the linked example to look at the full
>>> >> events, so can we start there and see how far we get?
>>> >>
>>> >> If someone has a clear use case where pure ROOT data is needed at
>>> >> the lowest level of detail, I would consider this request, but I
>>> >> have seen nothing concrete so far along these lines.
>>> >>
>>> >> --Jeremy
>>> >>
>>> >> -----Original Message-----
>>> >> From: Homer [mailto:[log in to unmask]]
>>> >> Sent: Wednesday, March 06, 2013 2:51 PM
>>> >> To: Jaros, John A.; Graham, Mathew Thomas; McCormick, Jeremy I.; Graf, Norman A.; Moreno, Omar; Nelson, Timothy Knight
>>> >> Subject: DSTs and work on slcio files using C++
>>> >>
>>> >> Hi,
>>> >>
>>> >> I decided not to comment during the meeting because it might have
>>> >> created more contention, and I also wanted to hear Jeremy's,
>>> >> Norman's, and Omar's responses first before throwing this out
>>> >> there. That said, from the point of view of someone who has been
>>> >> doing lcsim SiD analysis on slcio files, I find the problems with
>>> >> using the two formats in HPS a little strange. For SiD we take
>>> >> slcio files and then run jet clustering and flavor tagging using
>>> >> C++ code in the lcfi and lcfi+ packages. For the flavor tagging we
>>> >> write out ROOT files for lcfi+, which runs the TMVA training, and
>>> >> then for both the jet clustering and the flavor tagging we write
>>> >> out slcio files. I believe Malachi has done his whole analysis in
>>> >> C++ as a Marlin processor. I had also successfully tested reading
>>> >> slcio files in ROOT using a recipe provided by Jeremy; I dropped it
>>> >> when I realized that it was quite simple to write the analysis in
>>> >> Java. Perhaps one solution is to stick to doing all development,
>>> >> even for the DST, in Java/lcsim and just provide examples of how to
>>> >> access the data from C++/ROOT by reading slcio files. Jeremy
>>> >> documented much of this long ago at:
>>> >>
>>> >> https://confluence.slac.stanford.edu/display/hpsg/Loading+LCIO+Files+into+ROOT
>>> >>
>>> >> If we just provide some examples, wouldn't that help to at least
>>> >> put out the current fires? This would also avoid having to support
>>> >> numerous extra sets of data (DSTs and micro-DSTs in both formats,
>>> >> with multiple passes and subsets).
>>> >> Maybe I'm wrong, but I think one can provide simple recipes or
>>> >> modules for accessing any of the slcio file contents in ROOT.
>>> >>
>>> >> Homer
>>> >>
>>> >> ########################################################################
>>> >> Use REPLY-ALL to reply to list
>>> >>
>>> >> To unsubscribe from the HPS-SOFTWARE list, click the following link:
>>> >> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=HPS-SOFTWARE&A=1
>>> >
>>> > <dst.pdf>