On Wed, Sep 22, 2010 at 10:04 PM, Amir Farbin <[log in to unmask]> wrote:
> Maybe I'm wrong, but I think we are seeing the whole call stack. It starts with TTree:Write() (of the python wrapper) in our python analysis code, which then goes to the PyRoot TTree wrapper, which then calls ROOT's usual TTree:Write(), which ends up eventually trying to write via Xrd.
In the stack trace you can see 5 threads. Four of them are xroot
client threads, but they are doing nothing: they are sleeping, waiting
for an event such as new data arriving on a socket or another thread
raising a semaphore. The crash in thread #1 does indeed occur in
TFile::Write, but before it reaches any xroot-specific code.
> Few other pieces of info... I have noticed that when we use Xrd to read/write, we have a huge memory leak (~ 1 GB after reading ~100K events). Exact same job reading/writing the exact same files on local-disk or via NFS (or AFP) show no memory leak.
This is probably a caching issue. The xroot client fetches data in
advance and/or splits the request between many streams, then
reassembles the pieces and keeps all of this in memory to compensate
for network access latency. This is an issue that I am currently
working on, because we are too generous in the cache allocation
routines, and this is a big problem particularly when you open many
files. The immediate workaround would be to close the files as soon as
you're done with them. I am not sure how the TChain code handles that.
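
A minimal sketch of the "close the files as soon as you're done"
workaround, in Python. It is only an illustration: the helper names
(opened, process_files) and the opener/handle arguments are hypothetical,
and it assumes the file object exposes a Close() method, as ROOT's TFile
does. It does not show how TChain manages its files.

```python
from contextlib import contextmanager


@contextmanager
def opened(path, opener):
    """Open `path` with `opener` and always call Close() on the way out."""
    f = opener(path)
    if not f:
        # Assumption: a failed open yields a falsy (null) object.
        raise IOError("could not open %s" % path)
    try:
        yield f
    finally:
        # Closing promptly lets the client release whatever it cached
        # for this file before the next one is opened.
        f.Close()


def process_files(paths, opener, handle):
    """Process files one at a time so at most one is open at any moment."""
    results = []
    for path in paths:
        with opened(path, opener) as f:
            results.append(handle(f))
    return results
```

With PyROOT this could be called as process_files(urls, ROOT.TFile.Open,
analyse), where urls and analyse stand in for your own file list and
per-file analysis function.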
Is this the "memory leak" that Alden was mentioning in one of his
> Also, crashes like this only happen to some fraction of jobs when we have lots of simultaneous jobs running (usually 24 jobs reading/writing different files). And I'm fairly sure that I've noticed synchronized crashes, meaning I see more than one job crash at the same time.
You're probably running out of memory because of the issue that I have