Print

Print


Hello Jeremy,

I use the data sample "hps_003206.evio.0" as the input for my tests. You can see the text based checks I did in the code; they are located under the "VERBOSE :: " comments and separated into "if(verbose)" conditional statements. The EVIO file should be in the data at JLab.

Thanks,

Kyle

On Wed, Dec 17, 2014 at 2:09 PM, McCormick, Jeremy I. <[log in to unmask]> wrote:
Can you point me to a good data sample for testing your clustering driver?  MC or data, doesn't matter....

On Dec 16, 2014, at 8:59 PM, Holly Vance <[log in to unmask]> wrote:

Hi,

I thought I would mention a comment on this-the energy corrections no longer exist at all in the offline recon clustering. The only driver where this gets called is in ReconParticleDriver (which occurs separately from clustering but uses the particle id in track cluster matching). I am working to add another driver that could be used after clustering and separately from the SVT. This would allow someone to run clustering and then run a cluster analysis driver which could perhaps make a good guess about particle id in a loose pairs trigger and then apply corrections. 

-Holly

On Tue, Dec 16, 2014 at 9:42 PM, Kyle McCarty <[log in to unmask]> wrote:
Hello Jeremy,

Okay, I think that answers most of my concerns. I will let you know if I have any further questions, probably after I have had a chance to really look over the actual code.

I do think we should make the wrapper classes for the clustering drivers to allow for regular parameter names in the steering files and use this as the default, though. Otherwise, we actually make it harder for users to figure out what is going on even compared to the inconsistent parameter names we have now. Otherwise, I think this should work; we'll just want to make sure we do some side-by-side tests with the old driver versions to make sure everything is running smoothly.

Thanks,

Kyle

On Tue, Dec 16, 2014 at 9:12 PM, McCormick, Jeremy I. <[log in to unmask]> wrote:
Wow Kyle, you type fast.  :)

Let me address these questions individually below...

> For the API, are you basically just passing the events to the clustering algorithm and then mandating that they have a means to get a list of cluster objects?

The Clusterer classes must implement this method...

List<Cluster> createClusters(EventHeader event, List<CalorimeterHit> hits);

This returns a list of clusters from the hits and may access/use event data as needed.  The ClusterDriver calls this automatically from its process method.

> i.e., the API is essentially an extra layer between the event stream and the, previously, diver object? This would alleviate most the issues with trying to standardize the clustering methods, since it would allow them to remain significantly different internally.

Yes, this is exactly right.  It is an extra layer so that the clustering algorithms can vary independently of the common Driver settings.  The Driver, for instance, determines the output collection name, input hit collection nam,e logging, etc.  Each of the algorithms can be completely different internally.

> Also, how does a driver specify to the API what parameters it needs?

There are two ways to do this, one generic that uses the double array, and one specific that uses the Driver API.

I have implemented in the Clusterer interface a way to set generic numerical cuts by passing a double array.  In this way, the common ClusterDriver can generically pass these to the clustering algorithm without needing specific "set" methods for each one.  Internally, the Clusterer may set instance variable values from these cuts in its initialize method, which is called from the Driver's startOfData method.  These cuts also have names so you can call for instance setCut("someCut", 1.0) rather than needing to know the numerical indices.

The other way is writing a specialized implementation of ClusterDriver, that will directly pass arguments to its Clusterer object.  I have done this already for the IC Clusterer, so that can provide an example.  It is pretty straightforward, and I don't think it is confusing.  These Drivers may also add any additional "set" methods that are needed as well as override/augment any of the standard Driver methods like process, startOfData, etc.

> If that is the case, do we still gain much cleanliness of code if we need almost as many wrapper classes as we have drivers?

Yes, I think the code is still much cleaner even if there is a wrapper Driver for every Clusterer, because these Driver classes will then only implement required specializations for a particular Clusterer class.  This is much better than having every clustering algorithm extend Driver directly, which has already lead to a lot of code duplication and inconsistency between them.

> One of my concerns with this is with steering file input. [etc.]

I agree that the double array of cuts is perhaps less obvious and more prone to error than having individual "set" methods which are accessed via the steering file.  This is a side effect of having the Clusterer not implement Driver, but in practice it can be easily alleviated by writing a simple ClusterDriver speciailization class (as described above) that has these setters on it and passes the values to its Clusterer.  This requires writing very little code.

> I'm not against this, but it is worth noting that Holly's corrections are for full-sized clusters with shared hits while the other algorithms use the hardware's 3x3 spatial window. The sampling fraction between is also different for this reason. (My plots
> suggest it is around 0.86 for the GTP clustering algorithm, though this is extracted from distribution plots and not a formal study.) Also, it's important not to apply energy corrections for offline analysis algorithms to the hardware algorithms, as this would
> not accurately represent what the hardware actually does anymore.

Okay, thanks for the information here.  As far as which sampling numbers are applied in which algorithms, I have not changed any of this core behavior, and all the algorithms I have ported to ecal.cluster are left intact.

I still think it would be useful to pull out the energy correction code so that if desired one can easily look at the uncorrected energies.  Currently, I don't think it is really possible to do this because the corrections occur inline with the actual clustering.

By the way, I'm not looking at the implementations in ecal.cluster as some kind of "set in stone" API.  We can certainly change and add features as needed, and your suggestions/questions are quite helpful here.  Thanks!

--Jeremy

-----Original Message-----
From: Kyle McCarty [mailto:[log in to unmask]]
Sent: Tuesday, December 16, 2014 5:50 PM
To: McCormick, Jeremy I.
Cc: hps-software
Subject: Re: FW: cleaning up the ECAL clustering code

Hello Jeremy,


The driver sounds good to me. That is a lot of useful options and it would be better to have them centralized. I have a few more questions about the API, but this sounds good to have regardless.

For the API, are you basically just passing the events to the clustering algorithm and then mandating that they have a means to get a list of cluster objects? i.e., the API is essentially an extra layer between the event stream and the, previously, diver object? This would alleviate most the issues with trying to standardize the clustering methods, since it would allow them to remain significantly different internally.

Also, how does a driver specify to the API what parameters it needs? Each driver, especially the GTP algorithm and Holly's algorithm, will need certain fields that are unique to them. Is accounting for this what you meant to use a wrapper class for? If that is the case, do we still gain much cleanliness of code if we need almost as many wrapper classes as we have drivers?


One of my concerns with this is with steering file input. You mentioned using an array of doubles as an input. While this is acceptable internally, since only code experts need to understand it, I think we should avoid using this as the input to a steering file. Non-experts need to make steering files for customized studies or for testing different settings and it is very non-intuitive to input a series of numbers with no indication what those represent. For example, if this were the case with the trigger, we would have input like "0.125 1.300 0.200 1.700 2 0.500 2.000 1.200 1.0 30" which, while I can understand it because I tend to use the same sequence for trigger cuts, it is completely non-obvious to anyone else. Additionally, it makes user input error a significant concern, since it is difficult to check whether everything is in the right position. I think it is important for the steering file to take parameter entries that are descriptive so anyone looking at the steering file can have some idea what it is they are looking at.



        I really would like to pull out the energy corrections code from EcalClusterIC which I've already discussed with Holly a bit (I think this is in progress actually).



I'm not against this, but it is worth noting that Holly's corrections are for full-sized clusters with shared hits while the other algorithms use the hardware's 3x3 spatial window. The sampling fraction between is also different for this reason. (My plots suggest it is around 0.86 for the GTP clustering algorithm, though this is extracted from distribution plots and not a formal study.) Also, it's important not to apply energy corrections for offline analysis algorithms to the hardware algorithms, as this would not accurately represent what the hardware actually does anymore.


Thanks,

Kyle


On Tue, Dec 16, 2014 at 7:51 PM, McCormick, Jeremy I. <[log in to unmask]> wrote:

        Hi, Kyle.

        Thanks for the response.  This is a very good discussion, and you make valid points.

        I have already written a Driver that can run any clustering algorithm which implements the Clusterer interface, and it is working on several of the simpler algorithms (EcalClusterICBasic and the legacy test run algorithm).

        ecal-recon/src/main/java/org/hps/recon/ecal/cluster/ClusterDriver.java

        This class implements the common Driver API for clustering and includes...

        -Setting up the detector and conditions related references in detectorChanged.

        -Setting of input hit and output cluster collection names.

        -Creating the clustering algorithm object from a factory class.

        -Flag to indicate whether empty cluster collections should be created and added to the event.

        -Flag to raise an error if the input hit collection is not found.

        -Flag to skip events where no clusters are created.

        -Flag to indicate whether clusters should be stored to the event or marked as transient.

        -Setting of numerical cuts in a double array that are passed to the Clusterer via the Driver's setCuts method.

        -Default initialization of the Clusterer in startOfData.

        -Logging to a logger its config and event processing info.

        The Driver implements a default process method which will perform clustering according to the above parameters.

        The Clusterer API has a createClusters method, as well as methods for setting/getting of numerical cuts by index or string.  (You can have an empty array if there are no cuts.)

        "This seems to me like it is greatly complicating the clustering code."

        What I've described is perhaps more complicated than implementing clustering directly via the Driver API, in that it is another level of abstraction, but it allows for far more flexibility and ease of use, and it is better organized.  It allows one to easily swap out a clustering algorithm by changing one argument to a Driver (the Clusterer name), rather than needing to understand exactly what a certain clustering Driver has implemented in its API (e.g. collection names, various flags, etc.).  If specialized behavior is needed then a specific Driver can be written to wrap that algorithm.  Having a common Driver that can run all clustering algorithms also easily allows debugging to be done in an organized/standard way.

        Also, the algorithms are not different in terms of their input and output.  They all take a list of hits and turn them into a collection of clusters.  I have also included a reference to the event in the API method as this might be useful/required by some algorithms.  All the algorithms we have now also include cuts.  So to my mind it useful to have an API that actually implements this "contract".  And the clustering algorithms themselves should not need to know all the Driver related parameters that I described above which can be commonly configured.

        "This same argument applies to making a utility class. etc."

        There are only a few utility methods I pulled out so far, but I think this can be greatly expanded.  As far as code duplication, I don't know for sure yet as this requires going over each algorithm in detail.  I really would like to pull out the energy corrections code from EcalClusterIC which I've already discussed with Holly a bit (I think this is in progress actually).

        --Jeremy

        -----Original Message-----
        From: Kyle McCarty [mailto:[log in to unmask]]
        Sent: Tuesday, December 16, 2014 4:13 PM
        To: McCormick, Jeremy I.
        Cc: hps-software
        Subject: Re: FW: cleaning up the ECAL clustering code

        Hello Jeremy,

        I wanted to respond to your proposal in JIRA, but the JIRA comments are not really very good for more detailed or long discussions, so I am putting here instead. Just so the software list doesn't have to reference JIRA to know what I am talking about, let me copy your proposal here.


                *       Move clustering related classes and Drivers to recon.ecal.cluster. (This is already started.)
                *       Create a Clusterer API which implements only the clustering algorithms. It would not extend Driver. It should have an API for generically setting cuts and their names, so that a Driver can easily configure it from an input array of doubles.
                *       Convert existing clustering algorithms from Drivers to extension classes of the new Clusterer interface and its abstract implementation.
                *       Add a generic ClusterDriver that is able to load different clusterer algorithms from a name string.
                *       Add specific clustering Drivers if they are needed to wrap certain clusterers (e.g. if the default behavior of ClusterDriver needs to be overridden).
                *       Add a ClusterUtilities class with common static utility methods that are used in many different clusterers.
                *       Try to remove as much as possible code duplication between the different types of clusterers (GTP, CTP, IC, cosmic, etc.).
                *       Move cosmic clustering Drivers from analysis.ecal.cosmic to the new ecal-recon clustering package.
                *       Fully document each of the clustering algorithms, and include relevant links in the javadoc. (Such as links to CLAS or HPS notes, etc.)



        I will respond to several of these points below:


                *       Move clustering related classes and Drivers to recon.ecal.cluster. (This is already started.)

        I think that this entirely reasonable. Grouping the clustering code together can make the package less messy.


                *       Add a ClusterUtilities class with common static utility methods that are used in many different clusterers.
                *       Try to remove as much as possible code duplication between the different types of clusterers (GTP, CTP, IC, cosmic, etc.).

        For this, I think we need have a serious look at the individual clustering algorithms and see how much shared code actually exists between them. I do think that the two GTP algorithms can be abstracted a lot more, and actually intended to do that once I finish the one that is in-progress. However, I'm not sure that there is really a lot of overlap between the remaining clustering algorithms. Obviously they all have "setHitCollectionName" or something along those lines, but what else do they have? Most of them likely have seed energy cuts, but does the cosmic clustering algorithm even have that? Before we start abstracting all clustering algorithms, we will need to compile a list of things that they all have in common and make sure that making an abstract class would actually cut down on the code and not just produce a new class with very few methods. If there is a lot of overlap, I agree that keeping all code centralized is useful because it means that updates or fixes can be propagated to all affected classes without having to manually make sure that they align, but I'm not sure that there is enough overlap for this to work.


        This same argument applies to making a utility class. We need to make sure that there is enough overlap to justify it.


                *       Move cosmic clustering Drivers from analysis.ecal.cosmic to the new ecal-recon clustering package.

        This would make sense. All the clustering code should probably be kept together.


                *       Fully document each of the clustering algorithms, and include relevant links in the JavaDoc. (Such as links to CLAS or HPS notes, etc.)

        I always support thorough documentation, so I am all for this. Just let me know if I need to add anything to mine. I agree with Holly's statement on JIRA that it should probably be the code developer's job to actually write the documentation.



        I grouped the last few points together.


                *       Create a Clusterer API which implements only the clustering algorithms. It would not extend Driver. It should have an API for generically setting cuts and their names, so that a Driver can easily configure it from an input array of doubles.
                *       Convert existing clustering algorithms from Drivers to extension classes of the new Clusterer interface and its abstract implementation.
                *       Add a generic ClusterDriver that is able to load different clusterer algorithms from a name string.
                *       Add specific clustering Drivers if they are needed to wrap certain clusterers (e.g. if the default behavior of ClusterDriver needs to be overridden).


        This seems to me like it is greatly complicating the clustering code. It creates a totally new API for clustering and tries to fit a bunch of what I see as fairly disparate algorithms and classes into a single box. I feel like this is going to be difficult to accomplish and take a fair amount of work and testing, but lacks an obvious advantage that I can see. Can you explain what the goal/benefit you are aiming for with this? Maybe I am misunderstanding what you are trying to do.


        Thanks,

        Kyle


        On Tue, Dec 16, 2014 at 6:12 PM, McCormick, Jeremy I. <[log in to unmask]> wrote:

                Thanks...very useful information!


                -----Original Message-----
                From: Kyle McCarty [mailto:[log in to unmask]]
                Sent: Tuesday, December 16, 2014 2:55 PM
                To: McCormick, Jeremy I.
                Cc: Holly Vance
                Subject: Re: cleaning up the ECAL clustering code

                Hello Jeremy,


                The two clustering algorithms that are mine are GTPEcalClusterer and GTPOnlineEcalClusterer. These are both implementations of the hardware clustering algorithm that is most current. The GTPEcalClusterer is the original algorithm and is used in the readout simulation to simulate the hardware clustering on Monte Carlo data. The GTPOnlineEcalClusterer is a work-in-progress version that is designed to run on readout data instead. The reason there are two is because the clustering algorithm uses a time window to analyze hits and determine which one falls into a cluster and which do not. For Monte Carlo, we treat each event as a 2 ns window, so the algorithm builds its time buffer of hits by storing events and treating each one as 2 ns. The readout just outputs a large number of hits that were within a certain time window and each individual event does not represent any particular time length. This means that each event must be considered independently and a time buffer must be generated from the hits within the event using their time stamp instead. Since this is a fairly significant difference in a fundamental aspect of the algorithm, I felt that it was not reasonable to try and make one algorithm that worked for both. This is particularly true because the simulation clusterer has already been tested thoroughly and added to the steering files, so changing it drastically now would risk breaking the Monte Carlo simulation.


                It might be better, when the online algorithm is finished, to rename them something like "GTPMonteCarloEcalClusterer" and "GTPReadoutEcalClusterer" since these more accurately represent their function, but I was holding off on renaming them until the online algorithm is working. Currently, it can not be completed because it crashes when building clusters due to the fact that "addHit" is HPSEcalCluster uses "getRawEnergy," and as we have been discussing on the mailing list, that is a problem. Once this issue is resolved, the algorithm will be completed and tested. Also, at this point I will see if I can abstract the two drivers at all to cut down on repeated code. I did this already for the trigger drivers, but it is trickier for the clustering.


                CTPEcalClusterer is the old clustering algorithm from the last run. I believe it is retained largely for legacy and reference purposes. I do not know if it is reasonable to keep. Perhaps it should be moved to a "test-run" package so that it doesn't clutter up the active code?

                All of the "IC" clustering codes are Holly's and she would be able to explain them better than I would.


                I do agree that it would be most reasonable to have one cluster object if that is possible, but I am not highly familiar with the regular HPSEcalCluster and only loosely familiar with Holly's version. Perhaps she could offer more insight into whether this is possible?


                Let me know if I can help with anything,

                Kyle


                On Tue, Dec 16, 2014 at 3:28 PM, McCormick, Jeremy I. <[log in to unmask]> wrote:

                        Hi,

                        I was looking at cleaning up the ECAL clustering code with some changes to packages etc.  Right now it is a bit of a mess, because there is quite a lot of code duplication between algorithms, as well as Drivers that are all doing the same thing (setting basic collection arguments, setting common cuts, etc.)

                        For more details, see this JIRA item where I have outlined a proposal to clean this up and do a heavy restructuring of the existing code.

                        https://jira.slac.stanford.edu/browse/HPSJAVA-363

                        I see in ecal.recon these clustering Drivers...

                        CTPEcalClusterer
                        EcalClusterIC
                        EcalClusterICBasic
                        GTPEcalClusterer
                        GTPOnlineClusterer
                        HPSEcalCluster

                        Could we get a brief description of each clustering Driver for some basic documentation that I can work from to try and do this?  This can go on the JIRA page.

                        I would also like some information about what are the different types of cuts these are using, a brief description of how the algorithm works, etc.

                        It is also not clear to me that we need or want so many different clustering engines in our recon.  Holly suggests discussing this in detail so we can identify common algorithms, and I agree with this.

                        Then there are now two types of clusters implemented...

                        HPSEcalCluster
                        HPSEcalClusterIC

                        I think we should be working from one cluster class, not two.  So I would propose merging them unless there is some technical reason not to do this.

                        Long term, I'd like to move everything to the new ecal.cluster sub-package and abandon/deprecate/remove the existing Drivers.  (I also have a few cosmic clustering Drivers that I will move to ecal.cluster too.)

                        If you need to make immediate changes (this week) to clustering code for the reconstruction to work, please just modify/fix the classes in ecal.recon for now.  I am very aware we need not break anything with the current data taking and recon steering files, so I am not modifying any of the existing Drivers in place.  Meanwhile, I'm working on making a sub-package where things can be reimplemented in a more structured way, including pulling out the core algorithms from the actual Driver classes.  As we verify each of the clustering algorithms with tests, we can move to the re-implementation class in the sub-package and then abandon the old Driver.

                        Any concerns/comments then please send to hps-software or write comments on the JIRA item.

                        Thanks.

                        --Jeremy








Use REPLY-ALL to reply to list

To unsubscribe from the HPS-SOFTWARE list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=HPS-SOFTWARE&A=1



Use REPLY-ALL to reply to list

To unsubscribe from the HPS-SOFTWARE list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=HPS-SOFTWARE&A=1