Hi Sheila,
I had similar problems with jobs dying some time ago. Eventually I
traced them back to running over too many events in a single job, as
this may make the job's memory use (and hence swap) grow too large. I
notice from your log messages that you have run over more than 100000
events per job. In my jobs, I decided to limit the number of events to
70000 per tcl file for MC in the kanga queue. That was in analysis-23,
but the same limit should help for analysis-30 as well.
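
In case it is useful, here is a rough sketch of how the splitting
could be worked out. It is plain Python, not tcl, and the function
name and the (first event, number of events) layout are only my own
illustration, not part of any release tool. How the ranges end up in
your tcl files depends on your setup.

```python
def split_events(total_events, max_per_job=70000):
    """Split total_events into chunks of at most max_per_job events.

    Returns a list of (first_event, n_events) pairs, one per job.
    Both the name and the return layout are hypothetical.
    """
    if total_events < 0 or max_per_job <= 0:
        raise ValueError("need total_events >= 0 and max_per_job > 0")
    chunks = []
    first = 0
    while first < total_events:
        # The last chunk may be shorter than max_per_job.
        n = min(max_per_job, total_events - first)
        chunks.append((first, n))
        first += n
    return chunks


if __name__ == "__main__":
    # One line per job: which event to start at and how many to process.
    for first, n in split_events(100000):
        print(first, n)
```

So a 100000-event collection would become two jobs, one of 70000
events and one of 30000.
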
Hope this helps,
Roberto
> I have been trying to produce some VubRecoilUser
> ntuples. Unfortunately, a very large fraction
> of my jobs crashed.
>
> My code is in the analysis-30 test release:
>
> ~penguin/vubrecoil/vub30
>
> I did edit VubXlnu.cc a bit to make it keep events
> even if there was no best lepton, so that I could
> study the breco sample before and after the lepton
> requirement. However, the code did compile and link,
> and SOME of my jobs ran OK, so I don't think that's the
> problem.
>
> For SP-1235 and SP-1237, most of the errors were
> exit code 134. This usually means "aborted and core dumped."
> I have posted a sample of my core dump messages at:
>
> http://www.slac.stanford.edu/~penguin/cores.html
>
> The most common pre-core-dump message was:
>
> VubXlnu::VubRecoilHelper.cc(256):reco/recoil MC association is mixed
>
> However, this message appears frequently in the log files
> for the successful jobs, as well.
>
> For SP-2575, SP-3037, SP-6333, SP-6334, SP-3429, and SP-1005,
> the most common error was that the job simply exited without
> processing any events. A ROOT file is produced, but it is empty.
>
> Most of the data jobs ran successfully.
>
> My log files are in:
>
> ~penguin/vubrecoil/vub30/workdir/log
>
> You can see the results of all the jobs in:
>
> ~penguin/vub30/workdir/chklog.txt
>
> which is the output of the chklog script in:
>
> /nfs/farm/babar/AWG11/PID/users/penguin/owl/workdir/chklog
>
> run over my log files.
>
> I tried debugging one of the core dumped jobs,
> but as I had removed the actual core files this meant
> running the job interactively in gdb, and after
> two hours it still hadn't crashed, so I killed it.
>
> Does anyone know why so many of my jobs crashed?
>
> Thanks,
>
> sheila