On Dec 5, 2017, at 3:47 PM, Nathan Baltzell <[log in to unmask]> wrote:

By the way, in the future can we try to use the HPS software
list or one of the Slack channels for these types of discussions?

I vote for software mailing list!  



Hello All,

So, on the mailing list it is.

From the very limited look I have had at this problem so far, I think we are looking at two fairly rare situations that cause recon to get stuck.

So far I have spotted two distinct cases (I have not had the time to look at more files; I will later):

  1. Recon gets stuck in a loop: there is no more output, but CPU continues to be consumed. It is truly stuck on a single event. This is the case for hps_005783.evio.105.
  2. We get “stuck” in a loop that occasionally still spits something out (hps_005796.evio.212 does this). However, it appears recon is not truly stuck; it is just in a funk, taking forever (as in 3 to 4 hours) to process a single event. Here is the console message that comes out:
    1. 2017-12-06 08:54:14 [WARNING] org.hps.recon.tracking.TrackerReconDriver process :: 5796 53569438 Discarding track with bad HelicalTrackHit (correction distance 0.000000, chisq penalty 0.000000)
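
To tell the two cases apart from a saved console log, one option is to count how often this warning repeats for the same event. The following is only a sketch: it assumes the two numbers after `::` are the run number and the event number (consistent with run 5796 above, but not verified against the logging code), and the function name is made up for illustration.

```python
import re
from collections import Counter

# ASSUMPTION: the line layout is "<date> <time> [WARNING] <class> <method> :: <run> <event> <message>".
# The meaning of the two integers (run, event) is inferred from the example line, not confirmed.
WARNING_RE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\[WARNING\] (?P<cls>\S+) (?P<method>\S+) :: "
    r"(?P<run>\d+) (?P<event>\d+) (?P<msg>.*)$"
)


def count_warnings_per_event(lines):
    """Count matching warning lines for each (run, event) pair (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            key = (int(match.group("run")), int(match.group("event")))
            counts[key] += 1
    return counts
```

A single (run, event) pair with a very large count would point to the second case, where recon keeps discarding tracks on one event rather than hanging silently.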

I am digging into the first case first. Here I am seeing the following, perhaps somewhat obscure, behavior. I tried to isolate the problem event, in this case event 27207416, by copying the evio file while skipping the first 222545 events, so that recon starts on event 27207411. In that case the code DOES NOT get stuck. If, however, I start on event 27207410 or earlier, the code DOES get stuck.
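
The search for the boundary between "gets stuck" and "does not get stuck" can be automated with a bisection over the starting event. This is a minimal sketch, not the procedure actually used: `hangs_from` is a hypothetical callback that would run recon from a given starting event (for example in a subprocess with a timeout) and report whether it hung, and the search assumes the behavior is monotonic, i.e. every start at or before some event hangs and every later start does not.

```python
def last_hanging_start(hangs_from, lo, hi):
    """Return the largest start in [lo, hi) for which hangs_from(start) is True.

    ASSUMPTIONS: hangs_from(lo) is True, hangs_from(hi) is False, and the
    hang is monotonic in the starting event. hangs_from is a hypothetical
    callback supplied by the caller.
    """
    if not hangs_from(lo):
        raise ValueError("expected a hang when starting at lo")
    if hangs_from(hi):
        raise ValueError("expected no hang when starting at hi")
    # Invariant: starting at lo hangs, starting at hi does not.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if hangs_from(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

Each probe is a full recon run, so the logarithmic number of probes matters: a window of a few hundred events needs fewer than ten runs.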

This kind of dependence of the processing of an event on prior state looks really bad to me. Properly written analysis code should treat each event 100% independently, so the outcome, to extremely high precision, should not depend at all on what came before, nor on the time of day, etc.
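
The independence requirement can be turned into a simple check: process the target event once with a freshly constructed processor and once with a processor that has already seen some preceding events, then compare the results. The sketch below uses toy stand-ins; `make_processor`, `process`, and the argument names are hypothetical and are not the hps-java interfaces.

```python
def is_event_independent(make_processor, events, target_index, warmup):
    """Return True if the target event gives the same result with and without history.

    make_processor is a hypothetical factory returning an object with a
    process(event) method; warmup is how many preceding events to replay.
    """
    # Result with no history at all.
    fresh_result = make_processor().process(events[target_index])

    # Result after replaying the preceding events through the same instance.
    warmed = make_processor()
    first = max(0, target_index - warmup)
    for event in events[first:target_index]:
        warmed.process(event)
    warmed_result = warmed.process(events[target_index])

    return fresh_result == warmed_result
```

A processor that leaks state between events, for example a hit list that is never cleared, fails this check, while a stateless one passes it.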

I will dig into this deeper as time permits and try to identify the exact location in the code where it gets stuck, but a deeper “rabbit hole”, to quote Miriam, is that we need to investigate in detail why our code does not do the same thing consistently on the same event.

Best,
Maurik


