OK, I will try again with smaller tcl files.

Thanks for the tip,

sheila


On Tue, 7 Feb 2006, Roberto Sacco wrote:

> Hi Sheila,
> 
> I also had problems with jobs dying some time ago. Eventually I traced
> it back to having too many events to process per job, which can push
> the job's memory and swap usage too high. I notice from your log
> messages that you have run over more than 100,000 events per job. In my
> jobs, I decided to limit the number of events to 70,000 per tcl file
> for MC in the kanga queue. This was in analysis-23, but should help for
> analysis-30 as well.
> 
> Hope this helps,
> 
> Roberto
> 
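A minimal sketch of the splitting suggested above, assuming each job can be given a starting event offset and an event count. How those two numbers are written into a tcl file is not shown in this thread, so the function name `split_events` and the (offset, count) convention are hypothetical; only the 70000-event cap comes from the message.

```python
def split_events(total_events, max_per_job=70000):
    """Split a collection of total_events into (first_event, n_events) chunks.

    max_per_job defaults to the 70000-event limit quoted in the thread.
    The (offset, count) pair is an illustrative convention, not a
    framework API.
    """
    if max_per_job <= 0:
        raise ValueError("max_per_job must be positive")
    chunks = []
    first = 0
    while first < total_events:
        # The last chunk may be shorter than max_per_job.
        n = min(max_per_job, total_events - first)
        chunks.append((first, n))
        first += n
    return chunks


if __name__ == "__main__":
    # Example: a 150000-event collection becomes three jobs.
    for first, n in split_events(150000):
        print(f"job starting at event {first}, processing {n} events")
```

Each chunk would then correspond to one smaller tcl file, so that no single job runs over more than the chosen limit.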
> > I have been trying to produce some VubRecoilUser 
> > ntuples.  Unfortunately, a very large fraction 
> > of my jobs crashed.
> > 
> > My code is in the analysis-30 test release:
> > 
> > ~penguin/vubrecoil/vub30
> > 
> > I did edit VubXlnu.cc a bit to make it keep events 
> > even if there was no best lepton, so that I could 
> > study the breco sample before and after the lepton 
> > requirement.  However, the code did compile and link, 
> > and SOME of my jobs ran OK, so I don't think that's the 
> > problem.
> > 
> > For SP-1235 and SP-1237, most of the errors were 
> > exit code 134.  This usually means "aborted and core dumped."
> > I have posted a sample of my core dump messages at:
> > 
> > http://www.slac.stanford.edu/~penguin/cores.html
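For reference, exit code 134 fits the usual shell convention in which a process killed by signal N is reported as 128 + N; 134 - 128 = 6, which is SIGABRT on Linux. The sketch below decodes an exit code under that assumption. The helper name `describe_exit_code` is hypothetical, and whether the batch system reports codes this way for every failure is assumed, not confirmed in this thread.

```python
import signal


def describe_exit_code(code):
    """Interpret an exit code using the common 128 + signal convention."""
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name of the signal on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"unknown signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(134))
```

An abort of this kind is typically raised by the program itself (for example through `abort()` or a failed assertion) and not by the batch system, which matches the "aborted and core dumped" reading.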
> > 
> > The most common pre-core-dump message was:
> > 
> > VubXlnu::VubRecoilHelper.cc(256):reco/recoil MC association is mixed
> > 
> > However, this message appears frequently in the log files 
> > for the successful jobs, as well.
> > 
> > For SP-2575, SP-3037, SP-6333, SP-6334, SP-3429, and SP-1005, 
> > the most common error was that the job simply exited without 
> > processing any events.  A ROOT file is produced, but it is empty.
> > 
> > Most of the data jobs ran successfully.
> > 
> > My log files are in:
> > 
> > ~penguin/vubrecoil/vub30/workdir/log
> > 
> > You can see the results of all the jobs in:
> > 
> > ~penguin/vub30/workdir/chklog.txt
> > 
> > which is the output of the chklog script in:
> > 
> > /nfs/farm/babar/AWG11/PID/users/penguin/owl/workdir/chklog
> > 
> > run over my log files.
> > 
> > I tried debugging one of the core-dumped jobs, 
> > but since I had already removed the actual core files, this meant 
> > running the job interactively in gdb, and after 
> > two hours it still hadn't crashed, so I killed it.
> > 
> > Does anyone know why so many of my jobs crashed?
> > 
> > Thanks,
> > 
> > sheila