LISTSERV 16.5 - HPS-SOFTWARE Archives

Hello Jeremy,

I wanted to respond to your proposal in JIRA, but the JIRA comments are not really very good for longer or more detailed discussions, so I am putting it here instead. So that the software list doesn't have to reference JIRA to know what I am talking about, let me copy your proposal here.

Move clustering related classes and Drivers to recon.ecal.cluster. (This is already started.)
Create a Clusterer API which implements only the clustering algorithms. It would not extend Driver. It should have an API for generically setting cuts and their names, so that a Driver can easily configure it from an input array of doubles.
Convert existing clustering algorithms from Drivers to extension classes of the new Clusterer interface and its abstract implementation.
Add a generic ClusterDriver that is able to load different clusterer algorithms from a name string.
Add specific clustering Drivers if they are needed to wrap certain clusterers (e.g. if the default behavior of ClusterDriver needs to be overridden).
Add a ClusterUtilities class with common static utility methods that are used in many different clusterers.
Try to remove as much as possible code duplication between the different types of clusterers (GTP, CTP, IC, cosmic, etc.).
Move cosmic clustering Drivers from analysis.ecal.cosmic to the new ecal-recon clustering package.
Fully document each of the clustering algorithms, and include relevant links in the javadoc. (Such as links to CLAS or HPS notes, etc.)

I will respond to several of these points below:

Move clustering related classes and Drivers to recon.ecal.cluster. (This is already started.)

I think that this is entirely reasonable. Grouping the clustering code together can make the package less messy.

Add a ClusterUtilities class with common static utility methods that are used in many different clusterers.
Try to remove as much as possible code duplication between the different types of clusterers (GTP, CTP, IC, cosmic, etc.).

For this, I think we need to have a serious look at the individual clustering algorithms and see how much shared code actually exists between them. I do think that the two GTP algorithms can be abstracted a lot more, and I actually intended to do that once I finish the one that is in progress. However, I'm not sure that there is really a lot of overlap between the remaining clustering algorithms. Obviously they all have "setHitCollectionName" or something along those lines, but what else do they have? Most of them likely have seed energy cuts, but does the cosmic clustering algorithm even have that? Before we start abstracting all clustering algorithms, we will need to compile a list of things that they all have in common and make sure that an abstract class would actually cut down on the code and not just produce a new class with very few methods. If there is a lot of overlap, I agree that keeping all code centralized is useful, because it means that updates or fixes can be propagated to all affected classes without having to manually make sure that they align, but I'm not sure that there is enough overlap for this to work.

This same argument applies to making a utility class. We need to make sure that there is enough overlap to justify it.

Move cosmic clustering Drivers from analysis.ecal.cosmic to the new ecal-recon clustering package.

This would make sense. All the clustering code should probably be kept together.

Fully document each of the clustering algorithms, and include relevant links in the JavaDoc. (Such as links to CLAS or HPS notes, etc.)

I always support thorough documentation, so I am all for this. Just let me know if I need to add anything to mine. I agree with Holly's statement on JIRA that it should probably be the code developer's job to actually write the documentation.

I grouped the last few points together.

Create a Clusterer API which implements only the clustering algorithms. It would not extend Driver. It should have an API for generically setting cuts and their names, so that a Driver can easily configure it from an input array of doubles.
Convert existing clustering algorithms from Drivers to extension classes of the new Clusterer interface and its abstract implementation.
Add a generic ClusterDriver that is able to load different clusterer algorithms from a name string.
Add specific clustering Drivers if they are needed to wrap certain clusterers (e.g. if the default behavior of ClusterDriver needs to be overridden).
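
To make the four points above concrete, here is one minimal way they could be sketched in Java. Everything in it is hypothetical and not taken from the HPS code base: the names Clusterer, AbstractClusterer, ToySeedClusterer, ClustererFactory, and the "seedEnergyMin" cut are invented for illustration, hits are reduced to bare energies, and the toy algorithm is not any of the real clusterers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical algorithm-only interface; it does not extend Driver. */
interface Clusterer {
    String[] getCutNames();
    void setCuts(double[] values);
    double getCut(String name);
    /** Toy signature: hit energies in, one list of energies per cluster out. */
    List<List<Double>> createClusters(List<Double> hitEnergies);
}

/** Hypothetical abstract base that holds the shared cut bookkeeping. */
abstract class AbstractClusterer implements Clusterer {
    private final String[] cutNames;
    private final double[] cutValues;

    protected AbstractClusterer(String[] cutNames, double[] defaults) {
        if (cutNames.length != defaults.length) {
            throw new IllegalArgumentException("Cut names and defaults differ in length");
        }
        this.cutNames = cutNames.clone();
        this.cutValues = defaults.clone();
    }

    @Override
    public String[] getCutNames() {
        return cutNames.clone();
    }

    @Override
    public void setCuts(double[] values) {
        if (values.length != cutValues.length) {
            throw new IllegalArgumentException("Expected " + cutValues.length + " cut values");
        }
        System.arraycopy(values, 0, cutValues, 0, values.length);
    }

    @Override
    public double getCut(String name) {
        for (int i = 0; i < cutNames.length; i++) {
            if (cutNames[i].equals(name)) {
                return cutValues[i];
            }
        }
        throw new IllegalArgumentException("Unknown cut: " + name);
    }
}

/** Toy algorithm: every hit at or above the seed cut becomes a one-hit cluster. */
class ToySeedClusterer extends AbstractClusterer {
    ToySeedClusterer() {
        super(new String[] {"seedEnergyMin"}, new double[] {0.05});
    }

    @Override
    public List<List<Double>> createClusters(List<Double> hitEnergies) {
        double seedMin = getCut("seedEnergyMin");
        List<List<Double>> clusters = new ArrayList<>();
        for (double energy : hitEnergies) {
            if (energy >= seedMin) {
                List<Double> cluster = new ArrayList<>();
                cluster.add(energy);
                clusters.add(cluster);
            }
        }
        return clusters;
    }
}

/** Hypothetical name-to-algorithm lookup that a generic ClusterDriver could use. */
class ClustererFactory {
    private static final Map<String, Supplier<Clusterer>> REGISTRY = new HashMap<>();

    static {
        REGISTRY.put("ToySeedClusterer", ToySeedClusterer::new);
    }

    static Clusterer create(String name) {
        Supplier<Clusterer> supplier = REGISTRY.get(name);
        if (supplier == null) {
            throw new IllegalArgumentException("Unknown clusterer: " + name);
        }
        return supplier.get();
    }
}
```

Under these assumptions, a generic ClusterDriver would only need to call ClustererFactory.create(name) with the name string it was given and pass its input array of doubles straight to setCuts.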

This seems to me like it is greatly complicating the clustering code. It creates a totally new API for clustering and tries to fit a bunch of what I see as fairly disparate algorithms and classes into a single box. I feel like this is going to be difficult to accomplish and will take a fair amount of work and testing, but it lacks an obvious advantage that I can see. Can you explain what goal/benefit you are aiming for with this? Maybe I am misunderstanding what you are trying to do.

Thanks,

Kyle

On Tue, Dec 16, 2014 at 6:12 PM, McCormick, Jeremy I. <[log in to unmask]> wrote:

Thanks...very useful information!

-----Original Message-----
From: Kyle McCarty [mailto:[log in to unmask]]
Sent: Tuesday, December 16, 2014 2:55 PM
To: McCormick, Jeremy I.
Cc: Holly Vance
Subject: Re: cleaning up the ECAL clustering code

Hello Jeremy,

The two clustering algorithms that are mine are GTPEcalClusterer and GTPOnlineEcalClusterer. These are both implementations of the most current hardware clustering algorithm. The GTPEcalClusterer is the original algorithm and is used in the readout simulation to simulate the hardware clustering on Monte Carlo data. The GTPOnlineEcalClusterer is a work-in-progress version that is designed to run on readout data instead. The reason there are two is that the clustering algorithm uses a time window to analyze hits and determine which ones fall into a cluster and which do not. For Monte Carlo, we treat each event as a 2 ns window, so the algorithm builds its time buffer of hits by storing events and treating each one as 2 ns. The readout just outputs a large number of hits that were within a certain time window, and each individual event does not represent any particular time length. This means that each event must be considered independently and a time buffer must be generated from the hits within the event using their time stamps instead. Since this is a fairly significant difference in a fundamental aspect of the algorithm, I felt that it was not reasonable to try and make one algorithm that worked for both. This is particularly true because the simulation clusterer has already been tested thoroughly and added to the steering files, so changing it drastically now would risk breaking the Monte Carlo simulation.
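
The difference between the two time buffers can be sketched roughly as follows. This is only an illustration and not the actual GTP code: TimedHit, TimeBuffers, pushEvent, and sliceByTime are hypothetical names, and binning the readout hits into slices of the same 2 ns width is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical stand-in for a calorimeter hit: energy plus time stamp in ns. */
class TimedHit {
    final double energy;
    final double time;

    TimedHit(double energy, double time) {
        this.energy = energy;
        this.time = time;
    }
}

class TimeBuffers {
    /** Width of one time slice in ns (the Monte Carlo event length described above). */
    static final double SLICE_NS = 2.0;

    /**
     * Monte Carlo style: each event counts as one slice, so the buffer is
     * built across events and only the newest maxSlices events are kept.
     */
    static void pushEvent(Deque<List<TimedHit>> buffer, List<TimedHit> eventHits, int maxSlices) {
        buffer.addFirst(eventHits);
        while (buffer.size() > maxSlices) {
            buffer.removeLast();
        }
    }

    /**
     * Readout style: the buffer is built within a single event by sorting
     * its hits into slices according to their time stamps (assumed binning).
     */
    static TreeMap<Long, List<TimedHit>> sliceByTime(List<TimedHit> eventHits) {
        TreeMap<Long, List<TimedHit>> slices = new TreeMap<>();
        for (TimedHit hit : eventHits) {
            long index = (long) Math.floor(hit.time / SLICE_NS);
            slices.computeIfAbsent(index, k -> new ArrayList<>()).add(hit);
        }
        return slices;
    }
}
```

In the first case the buffer persists between events; in the second it is rebuilt from scratch for every event, which is the part that makes a single shared implementation awkward.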

It might be better, when the online algorithm is finished, to rename them something like "GTPMonteCarloEcalClusterer" and "GTPReadoutEcalClusterer" since these more accurately represent their function, but I was holding off on renaming them until the online algorithm is working. Currently, it cannot be completed because it crashes when building clusters: "addHit" in HPSEcalCluster uses "getRawEnergy," and as we have been discussing on the mailing list, that is a problem. Once this issue is resolved, the algorithm will be completed and tested. At that point I will also see if I can abstract the two drivers at all to cut down on repeated code. I did this already for the trigger drivers, but it is trickier for the clustering.

CTPEcalClusterer is the old clustering algorithm from the last run. I believe it is retained largely for legacy and reference purposes. I do not know if it is reasonable to keep. Perhaps it should be moved to a "test-run" package so that it doesn't clutter up the active code?

All of the "IC" clustering codes are Holly's and she would be able to explain them better than I would.

I do agree that it would be most reasonable to have one cluster object if that is possible, but I am not highly familiar with the regular HPSEcalCluster and only loosely familiar with Holly's version. Perhaps she could offer more insight into whether this is possible?

Let me know if I can help with anything,

Kyle

On Tue, Dec 16, 2014 at 3:28 PM, McCormick, Jeremy I. <[log in to unmask]> wrote:

Hi,

I was looking at cleaning up the ECAL clustering code with some changes to packages etc. Right now it is a bit of a mess, because there is quite a lot of code duplication between algorithms, as well as Drivers that are all doing the same thing (setting basic collection arguments, setting common cuts, etc.)

For more details, see this JIRA item where I have outlined a proposal to clean this up and do a heavy restructuring of the existing code.

https://jira.slac.stanford.edu/browse/HPSJAVA-363

I see in ecal.recon these clustering Drivers...

CTPEcalClusterer
EcalClusterIC
EcalClusterICBasic
GTPEcalClusterer
GTPOnlineClusterer
HPSEcalCluster

Could we get a brief description of each clustering Driver for some basic documentation that I can work from to try and do this? This can go on the JIRA page.

I would also like some information about the different types of cuts these are using, a brief description of how each algorithm works, etc.

It is also not clear to me that we need or want so many different clustering engines in our recon. Holly suggests discussing this in detail so we can identify common algorithms, and I agree with this.

Then there are now two types of clusters implemented...

HPSEcalCluster
HPSEcalClusterIC

I think we should be working from one cluster class, not two. So I would propose merging them unless there is some technical reason not to do this.

Long term, I'd like to move everything to the new ecal.cluster sub-package and abandon/deprecate/remove the existing Drivers. (I also have a few cosmic clustering Drivers that I will move to ecal.cluster too.)

If you need to make immediate changes (this week) to clustering code for the reconstruction to work, please just modify/fix the classes in ecal.recon for now. I am very aware that we must not break anything with the current data taking and recon steering files, so I am not modifying any of the existing Drivers in place. Meanwhile, I'm working on making a sub-package where things can be reimplemented in a more structured way, including pulling out the core algorithms from the actual Driver classes. As we verify each of the clustering algorithms with tests, we can move to the re-implementation class in the sub-package and then abandon the old Driver.

Please send any concerns/comments to hps-software or write comments on the JIRA item.

Thanks.

--Jeremy