On Wed, Sep 22, 2010 at 10:04 PM, Amir Farbin <[log in to unmask]> wrote:
> Maybe I'm wrong, but I think we are seeing the whole call stack. It starts
> with TTree::Write() (of the python wrapper) in our python analysis code,
> which then goes to the PyROOT TTree wrapper, which then calls ROOT's usual
> TTree::Write(), which eventually ends up trying to write via Xrd.

Hi Amir,

In the stack trace you can see 5 threads; 4 of them are the xroot client
threads, but they are doing nothing (sleeping), waiting for an event such as
new data arriving on a socket or another thread raising a semaphore. The
crash in thread #1 does indeed occur in TFile::Write, but before it reaches
any xroot-specific code.

> A few other pieces of info... I have noticed that when we use Xrd to
> read/write, we have a huge memory leak (~1 GB after reading ~100K events).
> The exact same job reading/writing the exact same files on local disk or
> via NFS (or AFP) shows no memory leak.

This is probably a caching issue. The xroot client fetches data in advance
and/or splits a request across many streams, then reassembles the pieces and
keeps all of this in memory to compensate for network access latency. This
is an issue I am currently working on: we are too generous in the cache
allocation routines, which becomes a big problem particularly when you open
many files. The immediate workaround is to close each file as soon as you
are done with it. I am not sure how the TChain machinery handles that. Is
this the "memory leak" that Alden mentioned in one of his previous emails?

> Also, crashes like this only happen to some fraction of jobs when we have
> lots of simultaneous jobs running (usually 24 jobs reading/writing
> different files). And I'm fairly sure that I've noticed synchronized
> crashes, meaning I see more than one job crash at the same time.

You have probably run out of memory because of the issue mentioned above.

Cheers,
Lukasz
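To make the workaround concrete: each open remote file pins its own
read-ahead cache in memory, so the peak footprint depends on how many files
are open at once, not on how many you process in total. Below is a toy
sketch in pure Python (no ROOT involved; the 32 MB per-file figure and the
`run_*.root` file names are made up for illustration) showing why closing
each file as soon as you are done with it bounds the peak:

```python
# Toy model of per-file read-ahead caches: "opening" a file allocates a
# cache, "closing" it frees the cache. Numbers are illustrative only.
CACHE_PER_FILE = 32  # MB of prefetch cache per open file (assumed figure)

def process(filenames, close_when_done):
    """Return the peak cache memory (MB) held while processing the files."""
    open_caches = {}  # filename -> MB of cache pinned by that open file
    peak = 0
    for name in filenames:
        open_caches[name] = CACHE_PER_FILE  # "open": cache allocated
        peak = max(peak, sum(open_caches.values()))
        # ... read/write events from `name` here ...
        if close_when_done:
            del open_caches[name]  # "close": cache freed immediately
    return peak

files = [f"run_{i}.root" for i in range(24)]
print(process(files, close_when_done=False))  # 768: all 24 caches live at once
print(process(files, close_when_done=True))   # 32: only one cache at a time
```

With 24 simultaneous jobs each behaving like the `close_when_done=False`
case, it is easy to see how the node runs out of memory and several jobs die
at roughly the same time.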