Hi Andy,
> 
> Yes, in thread 1 it is the whole call stack, which is very deep. That said, when you say "use xrd", is that using the xroot client (e.g., xroot://....)?

Yes. I chain files with root://... as prefix.  Note the prefix is not xroot. Is there a difference?

> If so, what version are you using?
> 

Good point... stupid me never thought that the client might be the problem. I'm using whatever version of xrootd that comes with ROOT 5.26/00. Do you recommend a dev version of ROOT?

Thanks,
Amir
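
Editorial sketch (not part of the original message): the memory growth described in the quoted message below (roughly 1 GB after about 100K events over Xrd, none on local disk or NFS) could be tracked with something like the following. It uses only the Python standard library. The names peak_rss_mb and track_memory are hypothetical helpers, not part of ROOT, PyROOT, or SPyRoot, and the reporting interval is an arbitrary choice.

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def track_memory(events, every=10000, report=print):
    """Yield events unchanged, reporting peak RSS once per `every` events."""
    for i, event in enumerate(events):
        if i % every == 0:
            report("event %d: peak RSS %.1f MiB" % (i, peak_rss_mb()))
        yield event
```

Wrapping the event loop's iterable in track_memory(...) and running the same job once against root:// files and once against local copies would give two growth curves to compare. Peak RSS is a high-water mark, so it shows growth but never shrinks.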

> Andy
> 
> On Wed, 22 Sep 2010, Amir Farbin wrote:
> 
>> Hi,
>> 
>> Maybe I'm wrong, but I think we are seeing the whole call stack. It starts with TTree::Write() (of the python wrapper) in our python analysis code, which then goes to the PyRoot TTree wrapper, which then calls ROOT's usual TTree::Write(), which eventually ends up trying to write via Xrd.
>> 
>> A few other pieces of info... I have noticed that when we use Xrd to read/write, we have a huge memory leak (~ 1 GB after reading ~100K events). The exact same job reading/writing the exact same files on local disk or via NFS (or AFP) shows no memory leak.
>> 
>> Also, crashes like this only happen to some fraction of jobs when we have lots of simultaneous jobs running (usually 24 jobs reading/writing different files). And I'm fairly sure that I've noticed synchronized crashes, meaning I see more than one job crash at the same time.
>> 
>> Amir
>> 
>> On Sep 22, 2010, at 8:52 PM, Andrew Hanushevsky wrote:
>> 
>>> Hi Alden,
>>> 
>>> Looks like the SEGV happens in the root package. Given that this occurs only occasionally and that the call stack at the time is quite deep, I suspect that the thread stack size may be too small. There is a way to increase that in root but I am not a root expert, so others should weigh in.
>>> 
>>> Andy
>>> 
>>> On Wed, 22 Sep 2010, Alden Stradling wrote:
>>> 
>>>> We're trying to track down some errors on our OS X cluster (10.6.4, Mac Pro Early 2009). Most of the runs go great, but we are seeing occasional segfaults, as seen below. I will have a larger assortment of error logs soon, but in the meantime -- does this look familiar to anyone?
>>>> 
>>>> 
>>>> Thanks,
>>>> 
>>>> Alden
>>>> 
>>>> ===========================================================
>>>> There was a crash.
>>>> This is the entire stack trace of all threads:
>>>> ===========================================================
>>>> 
>>>> Thread 5 (process 807):
>>>> #0  0x00007fff845f0eca in poll ()
>>>> #1  0x000000010574f312 in XrdClientSock::RecvRaw ()
>>>> #2  0x00000001057703b5 in XrdClientPhyConnection::ReadRaw ()
>>>> #3  0x000000010576d5c4 in XrdClientMessage::ReadRaw ()
>>>> #4  0x000000010576fa48 in XrdClientPhyConnection::BuildMessage ()
>>>> #5  0x000000010577012b in SocketReaderThread ()
>>>> #6  0x000000010579bcf6 in XrdSysThread_Xeq ()
>>>> #7  0x00007fff845a7456 in _pthread_start ()
>>>> #8  0x00007fff845a7309 in thread_start ()
>>>> 
>>>> Thread 4 (process 807):
>>>> #0  0x00007fff845f0eca in poll ()
>>>> #1  0x000000010574f312 in XrdClientSock::RecvRaw ()
>>>> #2  0x00000001057703b5 in XrdClientPhyConnection::ReadRaw ()
>>>> #3  0x000000010576d5c4 in XrdClientMessage::ReadRaw ()
>>>> #4  0x000000010576fa48 in XrdClientPhyConnection::BuildMessage ()
>>>> #5  0x000000010577012b in SocketReaderThread ()
>>>> #6  0x000000010579bcf6 in XrdSysThread_Xeq ()
>>>> #7  0x00007fff845a7456 in _pthread_start ()
>>>> #8  0x00007fff845a7309 in thread_start ()
>>>> 
>>>> Thread 3 (process 807):
>>>> #0  0x00007fff845f0eca in poll ()
>>>> #1  0x000000010574f312 in XrdClientSock::RecvRaw ()
>>>> #2  0x00000001057703b5 in XrdClientPhyConnection::ReadRaw ()
>>>> #3  0x000000010576d5c4 in XrdClientMessage::ReadRaw ()
>>>> #4  0x000000010576fa48 in XrdClientPhyConnection::BuildMessage ()
>>>> #5  0x000000010577012b in SocketReaderThread ()
>>>> #6  0x000000010579bcf6 in XrdSysThread_Xeq ()
>>>> #7  0x00007fff845a7456 in _pthread_start ()
>>>> #8  0x00007fff845a7309 in thread_start ()
>>>> 
>>>> Thread 2 (process 807):
>>>> #0  0x00007fff845a8eb6 in __semwait_signal ()
>>>> #1  0x00007fff845a8d45 in nanosleep ()
>>>> #2  0x00007fff845f5b14 in sleep ()
>>>> #3  0x00000001057667ac in GarbageCollectorThread ()
>>>> #4  0x000000010579bcf6 in XrdSysThread_Xeq ()
>>>> #5  0x00007fff845a7456 in _pthread_start ()
>>>> #6  0x00007fff845a7309 in thread_start ()
>>>> 
>>>> Thread 1 (process 807):
>>>> #0  0x00007fff845ebc90 in wait4 ()
>>>> #1  0x00007fff8460023e in system ()
>>>> #2  0x000000010110b782 in TUnixSystem::StackTrace ()
>>>> #3  0x000000010110a26a in TUnixSystem::DispatchSignals ()
>>>> #4  <signal handler called>
>>>> #5  0x00000001020950cb in TKey::Create ()
>>>> #6  0x000000010209849d in TKey::TKey ()
>>>> #7  0x0000000102077075 in TDirectoryFile::WriteKeys ()
>>>> #8  0x00000001020768fb in TDirectoryFile::SaveSelf ()
>>>> #9  0x0000000102079337 in TDirectoryFile::Write ()
>>>> #10 0x00000001020815f0 in TFile::Write ()
>>>> #11 0x0000000101293f7e in G__G__Base2_10_0_53 ()
>>>> #12 0x000000010190d9ea in Cint::G__CallFunc::Execute ()
>>>> #13 0x00000001005b38c1 in PyROOT::TIntExecutor::Execute ()
>>>> #14 0x00000001005b9762 in PyROOT::TMethodHolder<PyROOT::TScopeAdapter, PyROOT::TMemberAdapter>::CallSafe ()
>>>> #15 0x00000001005b9916 in PyROOT::TMethodHolder<PyROOT::TScopeAdapter, PyROOT::TMemberAdapter>::Execute ()
>>>> #16 0x00000001005b73f5 in PyROOT::TMethodHolder<PyROOT::TScopeAdapter, PyROOT::TMemberAdapter>::operator() ()
>>>> #17 0x00000001005bdf1b in PyROOT::(anonymous namespace)::mp_call ()
>>>> #18 0x000000010000aff3 in PyObject_Call ()
>>>> #19 0x000000010008a51a in PyEval_EvalFrameEx ()
>>>> #20 0x00000001000892e1 in PyEval_EvalFrameEx ()
>>>> #21 0x00000001000892e1 in PyEval_EvalFrameEx ()
>>>> #22 0x000000010008acce in PyEval_EvalCodeEx ()
>>>> #23 0x000000010008935e in PyEval_EvalFrameEx ()
>>>> #24 0x000000010008acce in PyEval_EvalCodeEx ()
>>>> #25 0x000000010008935e in PyEval_EvalFrameEx ()
>>>> #26 0x00000001000892e1 in PyEval_EvalFrameEx ()
>>>> #27 0x000000010008acce in PyEval_EvalCodeEx ()
>>>> #28 0x000000010008ad61 in PyEval_EvalCode ()
>>>> #29 0x00000001000a265a in Py_CompileString ()
>>>> #30 0x00000001000a2723 in PyRun_FileExFlags ()
>>>> #31 0x0000000100083196 in _PyBuiltin_Init ()
>>>> #32 0x0000000100089187 in PyEval_EvalFrameEx ()
>>>> #33 0x000000010008acce in PyEval_EvalCodeEx ()
>>>> #34 0x000000010008ad61 in PyEval_EvalCode ()
>>>> #35 0x00000001000a265a in Py_CompileString ()
>>>> #36 0x00000001000a2723 in PyRun_FileExFlags ()
>>>> #37 0x00000001000a423d in PyRun_SimpleFileExFlags ()
>>>> #38 0x00000001000b0286 in Py_Main ()
>>>> #39 0x0000000100000e6c in start ()
>>>> ===========================================================
>>>> The lines below might hint at the cause of the crash.
>>>> If they do not help you then please submit a bug report at
>>>> http://root.cern.ch/bugs. Please post the ENTIRE stack trace
>>>> from above as an attachment in addition to anything else
>>>> that might help us fixing this issue.
>>>> ===========================================================
>>>> #5  0x00000001020950cb in TKey::Create ()
>>>> #6  0x000000010209849d in TKey::TKey ()
>>>> #7  0x0000000102077075 in TDirectoryFile::WriteKeys ()
>>>> #8  0x00000001020768fb in TDirectoryFile::SaveSelf ()
>>>> #9  0x0000000102079337 in TDirectoryFile::Write ()
>>>> #10 0x00000001020815f0 in TFile::Write ()
>>>> ===========================================================
>>>> 
>>>> 
>>>> 
>>>> ==> 18632.arnor.cern.ch.afarbin.output <==
>>>> 11800
>>>> 
>>>> ==> 18519.arnor.cern.ch.afarbin.output <==
>>>> Traceback (most recent call last):
>>>> File "/Volumes/DataA_1/afarbin/Runs/SPyRoot/trunk/bin/PyRootBatch", line 70, in <module>
>>>>  execfile(sys.argv[1])
>>>> File "Do0LeptonAnalysis.py", line 434, in <module>
>>>>  RH= RunThisStep(NEvents,SampleNumber)
>>>> File "Do0LeptonAnalysis.py", line 428, in RunThisStep
>>>>  RH.Loop(MaxEntries=NEvents, doPickle=True, pickleDir=NFSPath+out_dir)
>>>> File "/Volumes/DataA_1/afarbin/Runs/SPyRoot/trunk/python/RunHandler.py", line 62, in Loop
>>>>  res = self.Algo.Loop(newSample, MaxEntries, gd, firstEntry)
>>>> File "/Volumes/DataA_1/afarbin/Runs/SPyRoot/trunk/python/TTreeAlgorithm.py", line 377, in Loop
>>>> 
>>>> ==> 18824.arnor.cern.ch.afarbin.output <==
>>>> 4400
>>>> 
>>>> ==> 18519.arnor.cern.ch.afarbin.output <==
>>>>  if not self.finalize(TheSample, AllEntriesData, GlobalData):
>>>> File "/Volumes/DataA_1/afarbin/Runs/SPyRoot/trunk/python/TTreeAlgorithm.py", line 236, in finalize
>>>>  if not Alg.finalize(TheSample, AllEntriesData, GlobalData):
>>>> File "/Volumes/DataA_1/afarbin/Runs/SPyRoot/trunk/python/WriterAlgorithm.py", line 179, in finalize
>>>>  self.file.Write()
>>>> TypeError: none of the 2 overloaded methods succeeded. Full details:
>>>> problem in C++; program state has been reset
>>>> problem in C++; program state has been reset
>>>> 
>>>> 
>> 
>>