OK, I will try again with smaller tcl files.

Thanks for the tip,

sheila

On Tue, 7 Feb 2006, Roberto Sacco wrote:

> Hi Sheila,
>
> I also had problems with jobs dying some time ago. Eventually I traced
> it back to having too many events to process, as this may lead to the
> memory swap being too large. I notice from your log messages that you
> have run over more than 100000 events per job - in my jobs, I decided
> to limit the number of events to 70000 per tcl file for MC in the
> kanga queue. This was in analysis-23, but should help for analysis-30
> as well.
>
> Hope this helps,
>
> Roberto
>
> > I have been trying to produce some VubRecoilUser
> > ntuples. Unfortunately, a very large fraction
> > of my jobs crashed.
> >
> > My code is in the analysis-30 test release:
> >
> > ~penguin/vubrecoil/vub30
> >
> > I did edit VubXlnu.cc a bit to make it keep events
> > even if there was no best lepton, so that I could
> > study the breco sample before and after the lepton
> > requirement. However, the code did compile and link,
> > and SOME of my jobs ran OK, so I don't think that's the
> > problem.
> >
> > For SP-1235 and SP-1237, most of the errors were
> > exit code 134. This usually means "aborted and core dumped."
> > I have posted a sample of my core dump messages at:
> >
> > http://www.slac.stanford.edu/~penguin/cores.html
> >
> > The most common pre-core-dump message was:
> >
> > VubXlnu::VubRecoilHelper.cc(256):reco/recoil MC association is mixed
> >
> > However, this message appears frequently in the log files
> > for the successful jobs, as well.
> >
> > For SP-2575, SP-3037, SP-6333, SP-6334, SP-3429, and SP-1005,
> > the most common error was that the job simply exited without
> > processing any events. A ROOT file is produced, but it is empty.
> >
> > Most of the data jobs ran successfully.
> >
> > My log files are in:
> >
> > ~penguin/vubrecoil/vub30/workdir/log
> >
> > You can see the results of all the jobs in:
> >
> > ~penguin/vub30/workdir/chklog.txt
> >
> > which is the output of the chklog script in:
> >
> > /nfs/farm/babar/AWG11/PID/users/penguin/owl/workdir/chklog
> >
> > run over my log files.
> >
> > I tried debugging one of the core dumped jobs,
> > but as I had removed the actual core files this meant
> > running the job interactively in gdb, and after
> > two hours it still hadn't crashed, so I killed it.
> >
> > Does anyone know why so many of my jobs crashed?
> >
> > Thanks,
> >
> > sheila
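
For reference, the two numbers discussed in this thread (the 70000-event cap per tcl file and exit code 134) can be illustrated with a short Python sketch. This is only a sketch under stated assumptions: the helper names split_events and describe_exit_code are hypothetical and not part of any BaBar tool, the cap is simply the value Roberto quotes for MC in the kanga queue, and the decoding relies on the common Unix shell convention that an exit status above 128 means the process was killed by signal number (status - 128).

```python
import signal

# Cap quoted in the thread for MC jobs in the kanga queue (an assumption
# for any other setup).
MAX_EVENTS_PER_JOB = 70000


def split_events(total_events, max_per_job=MAX_EVENTS_PER_JOB):
    """Hypothetical helper: return (first_event, n_events) ranges that
    cover total_events with at most max_per_job events in each range."""
    if total_events < 0 or max_per_job <= 0:
        raise ValueError("need total_events >= 0 and max_per_job > 0")
    return [(start, min(max_per_job, total_events - start))
            for start in range(0, total_events, max_per_job)]


def describe_exit_code(code):
    """Hypothetical helper: decode a shell-style exit status.
    By convention, a status above 128 means 'killed by signal code - 128'."""
    if code > 128:
        try:
            return "killed by " + signal.Signals(code - 128).name
        except ValueError:
            return "exit status %d (unknown signal)" % code
    return "exit status %d" % code
```

With these definitions, split_events(100000) gives two ranges, (0, 70000) and (70000, 30000), so a 100000-event job would become two tcl files. On Linux, describe_exit_code(134) reports SIGABRT (signal 6), which matches the "aborted and core dumped" reading above.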