Hi Andy,

> Yes, in thread 1 it is the whole call stack, which is very deep. That said,
> when you say "use xrd", is that using the xroot client (e.g., xroot://...)?

Yes. I chain files with root://... as the prefix. Note that the prefix is not
xroot. Is there a difference?

> If so, what version are you using?
>

Good point... stupid me never thought that the client might be the problem.
I'm using whatever version of xrootd comes with ROOT 5.26/00. Do you recommend
a dev version of ROOT?

Thanks,
Amir

> Andy
>
> On Wed, 22 Sep 2010, Amir Farbin wrote:
>
>> Hi,
>>
>> Maybe I'm wrong, but I think we are seeing the whole call stack. It starts
>> with TTree::Write() (of the python wrapper) in our python analysis code,
>> which then goes to the PyROOT TTree wrapper, which then calls ROOT's usual
>> TTree::Write(), which eventually ends up trying to write via Xrd.
>>
>> A few other pieces of info... I have noticed that when we use Xrd to
>> read/write, we have a huge memory leak (~1 GB after reading ~100K events).
>> The exact same job reading/writing the exact same files on local disk or
>> via NFS (or AFP) shows no memory leak.
>>
>> Also, crashes like this only happen to some fraction of jobs when we have
>> lots of simultaneous jobs running (usually 24 jobs reading/writing
>> different files). And I'm fairly sure that I've noticed synchronized
>> crashes, meaning I see more than one job crash at the same time.
>>
>> Amir
>>
>> On Sep 22, 2010, at 8:52 PM, Andrew Hanushevsky wrote:
>>
>>> Hi Alden,
>>>
>>> It looks like the SEGV happens in the ROOT package. Given that this occurs
>>> only intermittently and that the call stack at the time is quite deep, I
>>> suspect that the thread stack size may be too small. There is a way to
>>> increase that in ROOT, but I am not a ROOT expert, so others should weigh
>>> in.
>>>
>>> Andy
>>>
>>> On Wed, 22 Sep 2010, Alden Stradling wrote:
>>>
>>>> We're trying to track down some errors on our OS X cluster (10.6.4, Mac
>>>> Pro Early 2009). Most of the runs go great, but we are seeing occasional
>>>> segfaults, as seen below. I will have a larger assortment of error logs
>>>> soon, but in the meantime -- does this look familiar to anyone?
>>>>
>>>> Thanks,
>>>>
>>>> Alden
>>>>
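(Aside: a minimal PyROOT sketch of the access pattern discussed above, not the
actual SPyRoot code. The server, file paths, and tree name are hypothetical. It
chains files with the root:// prefix and prints the process RSS periodically,
which is one way to quantify the ~1 GB growth mentioned earlier.)

import resource
import ROOT

print(ROOT.gROOT.GetVersion())   # which ROOT build (and bundled xrootd client) the job uses

chain = ROOT.TChain("CollectionTree")                       # hypothetical tree name
chain.Add("root://xrootd.example.org//data/sample1.root")   # hypothetical root:// URLs
chain.Add("root://xrootd.example.org//data/sample2.root")

for i in range(chain.GetEntries()):
    chain.GetEntry(i)
    if i % 10000 == 0:
        # ru_maxrss is reported in bytes on OS X and in kilobytes on Linux
        rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print("entry %d  max RSS %d" % (i, rss))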
>>>> ===========================================================
>>>> There was a crash.
>>>> This is the entire stack trace of all threads:
>>>> ===========================================================
>>>>
>>>> Thread 5 (process 807):
>>>> #0 0x00007fff845f0eca in poll ()
>>>> #1 0x000000010574f312 in XrdClientSock::RecvRaw ()
>>>> #2 0x00000001057703b5 in XrdClientPhyConnection::ReadRaw ()
>>>> #3 0x000000010576d5c4 in XrdClientMessage::ReadRaw ()
>>>> #4 0x000000010576fa48 in XrdClientPhyConnection::BuildMessage ()
>>>> #5 0x000000010577012b in SocketReaderThread ()
>>>> #6 0x000000010579bcf6 in XrdSysThread_Xeq ()
>>>> #7 0x00007fff845a7456 in _pthread_start ()
>>>> #8 0x00007fff845a7309 in thread_start ()
>>>>
>>>> Thread 4 (process 807):
>>>> #0 0x00007fff845f0eca in poll ()
>>>> #1 0x000000010574f312 in XrdClientSock::RecvRaw ()
>>>> #2 0x00000001057703b5 in XrdClientPhyConnection::ReadRaw ()
>>>> #3 0x000000010576d5c4 in XrdClientMessage::ReadRaw ()
>>>> #4 0x000000010576fa48 in XrdClientPhyConnection::BuildMessage ()
>>>> #5 0x000000010577012b in SocketReaderThread ()
>>>> #6 0x000000010579bcf6 in XrdSysThread_Xeq ()
>>>> #7 0x00007fff845a7456 in _pthread_start ()
>>>> #8 0x00007fff845a7309 in thread_start ()
>>>>
>>>> Thread 3 (process 807):
>>>> #0 0x00007fff845f0eca in poll ()
>>>> #1 0x000000010574f312 in XrdClientSock::RecvRaw ()
>>>> #2 0x00000001057703b5 in XrdClientPhyConnection::ReadRaw ()
>>>> #3 0x000000010576d5c4 in XrdClientMessage::ReadRaw ()
>>>> #4 0x000000010576fa48 in XrdClientPhyConnection::BuildMessage ()
>>>> #5 0x000000010577012b in SocketReaderThread ()
>>>> #6 0x000000010579bcf6 in XrdSysThread_Xeq ()
>>>> #7 0x00007fff845a7456 in _pthread_start ()
>>>> #8 0x00007fff845a7309 in thread_start ()
>>>>
>>>> Thread 2 (process 807):
>>>> #0 0x00007fff845a8eb6 in __semwait_signal ()
>>>> #1 0x00007fff845a8d45 in nanosleep ()
>>>> #2 0x00007fff845f5b14 in sleep ()
>>>> #3 0x00000001057667ac in GarbageCollectorThread ()
>>>> #4 0x000000010579bcf6 in XrdSysThread_Xeq ()
>>>> #5 0x00007fff845a7456 in _pthread_start ()
>>>> #6 0x00007fff845a7309 in thread_start ()
>>>>
>>>> Thread 1 (process 807):
>>>> #0 0x00007fff845ebc90 in wait4 ()
>>>> #1 0x00007fff8460023e in system ()
>>>> #2 0x000000010110b782 in TUnixSystem::StackTrace ()
>>>> #3 0x000000010110a26a in TUnixSystem::DispatchSignals ()
>>>> #4 <signal handler called>
>>>> #5 0x00000001020950cb in TKey::Create ()
>>>> #6 0x000000010209849d in TKey::TKey ()
>>>> #7 0x0000000102077075 in TDirectoryFile::WriteKeys ()
>>>> #8 0x00000001020768fb in TDirectoryFile::SaveSelf ()
>>>> #9 0x0000000102079337 in TDirectoryFile::Write ()
>>>> #10 0x00000001020815f0 in TFile::Write ()
>>>> #11 0x0000000101293f7e in G__G__Base2_10_0_53 ()
>>>> #12 0x000000010190d9ea in Cint::G__CallFunc::Execute ()
>>>> #13 0x00000001005b38c1 in PyROOT::TIntExecutor::Execute ()
>>>> #14 0x00000001005b9762 in PyROOT::TMethodHolder<PyROOT::TScopeAdapter, PyROOT::TMemberAdapter>::CallSafe ()
>>>> #15 0x00000001005b9916 in PyROOT::TMethodHolder<PyROOT::TScopeAdapter, PyROOT::TMemberAdapter>::Execute ()
>>>> #16 0x00000001005b73f5 in PyROOT::TMethodHolder<PyROOT::TScopeAdapter, PyROOT::TMemberAdapter>::operator() ()
>>>> #17 0x00000001005bdf1b in PyROOT::(anonymous namespace)::mp_call ()
>>>> #18 0x000000010000aff3 in PyObject_Call ()
>>>> #19 0x000000010008a51a in PyEval_EvalFrameEx ()
>>>> #20 0x00000001000892e1 in PyEval_EvalFrameEx ()
>>>> #21 0x00000001000892e1 in PyEval_EvalFrameEx ()
>>>> #22 0x000000010008acce in PyEval_EvalCodeEx ()
>>>> #23 0x000000010008935e in PyEval_EvalFrameEx ()
>>>> #24 0x000000010008acce in PyEval_EvalCodeEx ()
>>>> #25 0x000000010008935e in PyEval_EvalFrameEx ()
>>>> #26 0x00000001000892e1 in PyEval_EvalFrameEx ()
>>>> #27 0x000000010008acce in PyEval_EvalCodeEx ()
>>>> #28 0x000000010008ad61 in PyEval_EvalCode ()
>>>> #29 0x00000001000a265a in Py_CompileString ()
>>>> #30 0x00000001000a2723 in PyRun_FileExFlags ()
>>>> #31 0x0000000100083196 in _PyBuiltin_Init ()
>>>> #32 0x0000000100089187 in PyEval_EvalFrameEx ()
>>>> #33 0x000000010008acce in PyEval_EvalCodeEx ()
>>>> #34 0x000000010008ad61 in PyEval_EvalCode ()
>>>> #35 0x00000001000a265a in Py_CompileString ()
>>>> #36 0x00000001000a2723 in PyRun_FileExFlags ()
>>>> #37 0x00000001000a423d in PyRun_SimpleFileExFlags ()
>>>> #38 0x00000001000b0286 in Py_Main ()
>>>> #39 0x0000000100000e6c in start ()
>>>> ===========================================================
>>>> The lines below might hint at the cause of the crash.
>>>> If they do not help you then please submit a bug report at
>>>> http://root.cern.ch/bugs. Please post the ENTIRE stack trace
>>>> from above as an attachment in addition to anything else
>>>> that might help us fixing this issue.
>>>> ===========================================================
>>>> #5 0x00000001020950cb in TKey::Create ()
>>>> #6 0x000000010209849d in TKey::TKey ()
>>>> #7 0x0000000102077075 in TDirectoryFile::WriteKeys ()
>>>> #8 0x00000001020768fb in TDirectoryFile::SaveSelf ()
>>>> #9 0x0000000102079337 in TDirectoryFile::Write ()
>>>> #10 0x00000001020815f0 in TFile::Write ()
>>>> ===========================================================
>>>>
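(Aside on the thread-stack-size suspicion quoted above: a quick way to see
what stack limit the batch jobs actually run with, pure Python stdlib, no
ROOT. This is a diagnostic sketch only, not a confirmed ROOT/xrootd setting.
It covers only the process/main-thread limit, says nothing about the stacks
of the reader threads the xrootd client creates internally, and on OS X the
main stack is sized at process start, so any increase normally has to be made
in the shell or batch wrapper that launches the job.)

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("stack rlimit: soft=%s hard=%s" % (soft, hard))

try:
    # Try to raise the soft limit to the hard limit; for an already-running
    # process this may have no effect on the main-thread stack (see note above).
    resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
except (ValueError, resource.error):
    print("could not raise the stack limit from inside the process")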
>>>>
>>>> ==> 18632.arnor.cern.ch.afarbin.output <==
>>>> 11800
>>>>
>>>> ==> 18519.arnor.cern.ch.afarbin.output <==
>>>> Traceback (most recent call last):
>>>>   File "/Volumes/DataA_1/afarbin/Runs/SPyRoot/trunk/bin/PyRootBatch", line 70, in <module>
>>>>     execfile(sys.argv[1])
>>>>   File "Do0LeptonAnalysis.py", line 434, in <module>
>>>>     RH= RunThisStep(NEvents,SampleNumber)
>>>>   File "Do0LeptonAnalysis.py", line 428, in RunThisStep
>>>>     RH.Loop(MaxEntries=NEvents, doPickle=True, pickleDir=NFSPath+out_dir)
>>>>   File "/Volumes/DataA_1/afarbin/Runs/SPyRoot/trunk/python/RunHandler.py", line 62, in Loop
>>>>     res = self.Algo.Loop(newSample, MaxEntries, gd, firstEntry)
>>>>   File "/Volumes/DataA_1/afarbin/Runs/SPyRoot/trunk/python/TTreeAlgorithm.py", line 377, in Loop
>>>>
>>>> ==> 18824.arnor.cern.ch.afarbin.output <==
>>>> 4400
>>>>
>>>> ==> 18519.arnor.cern.ch.afarbin.output <==
>>>>     if not self.finalize(TheSample, AllEntriesData, GlobalData):
>>>>   File "/Volumes/DataA_1/afarbin/Runs/SPyRoot/trunk/python/TTreeAlgorithm.py", line 236, in finalize
>>>>     if not Alg.finalize(TheSample, AllEntriesData, GlobalData):
>>>>   File "/Volumes/DataA_1/afarbin/Runs/SPyRoot/trunk/python/WriterAlgorithm.py", line 179, in finalize
>>>>     self.file.Write()
>>>> TypeError: none of the 2 overloaded methods succeeded. Full details:
>>>>   problem in C++; program state has been reset
>>>>   problem in C++; program state has been reset
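(Not the root cause, but since the TypeError above comes from the bare
self.file.Write() in WriterAlgorithm.finalize(), a small defensive wrapper
would at least record which job and which output file failed. This is a
sketch only; safe_write is a hypothetical helper, and it assumes self.file is
a ROOT.TFile.)

import ROOT

def safe_write(tfile):
    """Write a ROOT.TFile and report failures instead of letting them pass silently."""
    try:
        nbytes = tfile.Write()           # returns the number of bytes written
    except TypeError as err:             # what PyROOT raised in the job above
        print("Write() failed for %s: %s" % (tfile.GetName(), err))
        return False
    if nbytes == 0 or tfile.IsZombie():  # heuristic: nothing written, or file in error state
        print("Write() wrote no data for %s" % tfile.GetName())
        return False
    return True

In finalize() this would replace the bare self.file.Write() call, so a write
that fails over xrootd without crashing the process still shows up in the job
output.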