Hi Amir,

Yes, in thread 1 it is the whole call stack, which is very deep. That said,
when you say "use xrd", does that mean using the xroot client (e.g.,
xroot://....)? If so, what version are you using?
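
To check the stack-size theory from the earlier mail, a minimal sketch for printing the stack limits a job actually runs with. It assumes Python 3 and only the standard resource and threading modules; the helper names are made up for illustration. Since the crash is in thread 1, the process limit (what "ulimit -s" reports) is the one looked at here. Whether the stack size is really the cause is not confirmed.

```python
import resource
import threading


def stack_limits():
    """Return the (soft, hard) stack size limits of this process, in bytes."""
    return resource.getrlimit(resource.RLIMIT_STACK)


def describe(limit):
    """Format a limit in KiB, or 'unlimited' if no limit is set."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return "%d KiB" % (limit // 1024)


soft, hard = stack_limits()
print("process stack soft limit:", describe(soft))
print("process stack hard limit:", describe(hard))
# threading.stack_size() returns 0 when the platform default is in use.
print("stack size for new Python threads:", threading.stack_size())
```

Running this at the top of the batch script on a worker node, next to a failing job, would show whether the jobs see a smaller limit than an interactive shell does.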

Andy

On Wed, 22 Sep 2010, Amir Farbin wrote:

> Hi,
>
> Maybe I'm wrong, but I think we are seeing the whole call stack. It starts with TTree::Write() (of the python wrapper) in our python analysis code, which then goes to the PyRoot TTree wrapper, which then calls ROOT's usual TTree::Write(), which eventually ends up trying to write via Xrd.
>
> A few other pieces of info... I have noticed that when we use Xrd to read/write, we have a huge memory leak (~1 GB after reading ~100K events). The exact same job reading/writing the exact same files on local disk or via NFS (or AFP) shows no memory leak.
>
> Also, crashes like this only happen to some fraction of jobs when we have lots of simultaneous jobs running (usually 24 jobs reading/writing different files). And I'm fairly sure that I've noticed synchronized crashes, meaning I see more than one job crash at the same time.
>
> Amir
>
> On Sep 22, 2010, at 8:52 PM, Andrew Hanushevsky wrote:
>
>> Hi Alden,
>>
>> Looks like the SEGV happens in the root package. Given that this occurs only occasionally and that the call stack at the time is quite deep, I suspect that the thread stack size may be too small. There is a way to increase that in root but I am not a root expert, so others should weigh in.
>>
>> Andy
>>
>> On Wed, 22 Sep 2010, Alden Stradling wrote:
>>
>>> We're trying to track down some errors on our OS X cluster (10.6.4, Mac Pro Early 2009). Most of the runs go great, but we are seeing occasional segfaults, as seen below. I will have a larger assortment of error logs soon, but in the meantime -- does this look familiar to anyone?
>>>
>>>
>>> Thanks,
>>>
>>> Alden
>>>
>>> ===========================================================
>>> There was a crash.
>>> This is the entire stack trace of all threads:
>>> ===========================================================
>>>
>>> Thread 5 (process 807):
>>> #0  0x00007fff845f0eca in poll ()
>>> #1  0x000000010574f312 in XrdClientSock::RecvRaw ()
>>> #2  0x00000001057703b5 in XrdClientPhyConnection::ReadRaw ()
>>> #3  0x000000010576d5c4 in XrdClientMessage::ReadRaw ()
>>> #4  0x000000010576fa48 in XrdClientPhyConnection::BuildMessage ()
>>> #5  0x000000010577012b in SocketReaderThread ()
>>> #6  0x000000010579bcf6 in XrdSysThread_Xeq ()
>>> #7  0x00007fff845a7456 in _pthread_start ()
>>> #8  0x00007fff845a7309 in thread_start ()
>>>
>>> Thread 4 (process 807):
>>> #0  0x00007fff845f0eca in poll ()
>>> #1  0x000000010574f312 in XrdClientSock::RecvRaw ()
>>> #2  0x00000001057703b5 in XrdClientPhyConnection::ReadRaw ()
>>> #3  0x000000010576d5c4 in XrdClientMessage::ReadRaw ()
>>> #4  0x000000010576fa48 in XrdClientPhyConnection::BuildMessage ()
>>> #5  0x000000010577012b in SocketReaderThread ()
>>> #6  0x000000010579bcf6 in XrdSysThread_Xeq ()
>>> #7  0x00007fff845a7456 in _pthread_start ()
>>> #8  0x00007fff845a7309 in thread_start ()
>>>
>>> Thread 3 (process 807):
>>> #0  0x00007fff845f0eca in poll ()
>>> #1  0x000000010574f312 in XrdClientSock::RecvRaw ()
>>> #2  0x00000001057703b5 in XrdClientPhyConnection::ReadRaw ()
>>> #3  0x000000010576d5c4 in XrdClientMessage::ReadRaw ()
>>> #4  0x000000010576fa48 in XrdClientPhyConnection::BuildMessage ()
>>> #5  0x000000010577012b in SocketReaderThread ()
>>> #6  0x000000010579bcf6 in XrdSysThread_Xeq ()
>>> #7  0x00007fff845a7456 in _pthread_start ()
>>> #8  0x00007fff845a7309 in thread_start ()
>>>
>>> Thread 2 (process 807):
>>> #0  0x00007fff845a8eb6 in __semwait_signal ()
>>> #1  0x00007fff845a8d45 in nanosleep ()
>>> #2  0x00007fff845f5b14 in sleep ()
>>> #3  0x00000001057667ac in GarbageCollectorThread ()
>>> #4  0x000000010579bcf6 in XrdSysThread_Xeq ()
>>> #5  0x00007fff845a7456 in _pthread_start ()
>>> #6  0x00007fff845a7309 in thread_start ()
>>>
>>> Thread 1 (process 807):
>>> #0  0x00007fff845ebc90 in wait4 ()
>>> #1  0x00007fff8460023e in system ()
>>> #2  0x000000010110b782 in TUnixSystem::StackTrace ()
>>> #3  0x000000010110a26a in TUnixSystem::DispatchSignals ()
>>> #4  <signal handler called>
>>> #5  0x00000001020950cb in TKey::Create ()
>>> #6  0x000000010209849d in TKey::TKey ()
>>> #7  0x0000000102077075 in TDirectoryFile::WriteKeys ()
>>> #8  0x00000001020768fb in TDirectoryFile::SaveSelf ()
>>> #9  0x0000000102079337 in TDirectoryFile::Write ()
>>> #10 0x00000001020815f0 in TFile::Write ()
>>> #11 0x0000000101293f7e in G__G__Base2_10_0_53 ()
>>> #12 0x000000010190d9ea in Cint::G__CallFunc::Execute ()
>>> #13 0x00000001005b38c1 in PyROOT::TIntExecutor::Execute ()
>>> #14 0x00000001005b9762 in PyROOT::TMethodHolder<PyROOT::TScopeAdapter, PyROOT::TMemberAdapter>::CallSafe ()
>>> #15 0x00000001005b9916 in PyROOT::TMethodHolder<PyROOT::TScopeAdapter, PyROOT::TMemberAdapter>::Execute ()
>>> #16 0x00000001005b73f5 in PyROOT::TMethodHolder<PyROOT::TScopeAdapter, PyROOT::TMemberAdapter>::operator() ()
>>> #17 0x00000001005bdf1b in PyROOT::(anonymous namespace)::mp_call ()
>>> #18 0x000000010000aff3 in PyObject_Call ()
>>> #19 0x000000010008a51a in PyEval_EvalFrameEx ()
>>> #20 0x00000001000892e1 in PyEval_EvalFrameEx ()
>>> #21 0x00000001000892e1 in PyEval_EvalFrameEx ()
>>> #22 0x000000010008acce in PyEval_EvalCodeEx ()
>>> #23 0x000000010008935e in PyEval_EvalFrameEx ()
>>> #24 0x000000010008acce in PyEval_EvalCodeEx ()
>>> #25 0x000000010008935e in PyEval_EvalFrameEx ()
>>> #26 0x00000001000892e1 in PyEval_EvalFrameEx ()
>>> #27 0x000000010008acce in PyEval_EvalCodeEx ()
>>> #28 0x000000010008ad61 in PyEval_EvalCode ()
>>> #29 0x00000001000a265a in Py_CompileString ()
>>> #30 0x00000001000a2723 in PyRun_FileExFlags ()
>>> #31 0x0000000100083196 in _PyBuiltin_Init ()
>>> #32 0x0000000100089187 in PyEval_EvalFrameEx ()
>>> #33 0x000000010008acce in PyEval_EvalCodeEx ()
>>> #34 0x000000010008ad61 in PyEval_EvalCode ()
>>> #35 0x00000001000a265a in Py_CompileString ()
>>> #36 0x00000001000a2723 in PyRun_FileExFlags ()
>>> #37 0x00000001000a423d in PyRun_SimpleFileExFlags ()
>>> #38 0x00000001000b0286 in Py_Main ()
>>> #39 0x0000000100000e6c in start ()
>>> ===========================================================
>>> The lines below might hint at the cause of the crash.
>>> If they do not help you then please submit a bug report at
>>> http://root.cern.ch/bugs. Please post the ENTIRE stack trace
>>> from above as an attachment in addition to anything else
>>> that might help us fixing this issue.
>>> ===========================================================
>>> #5  0x00000001020950cb in TKey::Create ()
>>> #6  0x000000010209849d in TKey::TKey ()
>>> #7  0x0000000102077075 in TDirectoryFile::WriteKeys ()
>>> #8  0x00000001020768fb in TDirectoryFile::SaveSelf ()
>>> #9  0x0000000102079337 in TDirectoryFile::Write ()
>>> #10 0x00000001020815f0 in TFile::Write ()
>>> ===========================================================
>>>
>>>
>>>
>>> ==> 18632.arnor.cern.ch.afarbin.output <==
>>> 11800
>>>
>>> ==> 18519.arnor.cern.ch.afarbin.output <==
>>> Traceback (most recent call last):
>>> File "/Volumes/DataA_1/afarbin/Runs/SPyRoot/trunk/bin/PyRootBatch", line 70, in <module>
>>>   execfile(sys.argv[1])
>>> File "Do0LeptonAnalysis.py", line 434, in <module>
>>>   RH= RunThisStep(NEvents,SampleNumber)
>>> File "Do0LeptonAnalysis.py", line 428, in RunThisStep
>>>   RH.Loop(MaxEntries=NEvents, doPickle=True, pickleDir=NFSPath+out_dir)
>>> File "/Volumes/DataA_1/afarbin/Runs/SPyRoot/trunk/python/RunHandler.py", line 62, in Loop
>>>   res = self.Algo.Loop(newSample, MaxEntries, gd, firstEntry)
>>> File "/Volumes/DataA_1/afarbin/Runs/SPyRoot/trunk/python/TTreeAlgorithm.py", line 377, in Loop
>>>
>>> ==> 18824.arnor.cern.ch.afarbin.output <==
>>> 4400
>>>
>>> ==> 18519.arnor.cern.ch.afarbin.output <==
>>>   if not self.finalize(TheSample, AllEntriesData, GlobalData):
>>> File "/Volumes/DataA_1/afarbin/Runs/SPyRoot/trunk/python/TTreeAlgorithm.py", line 236, in finalize
>>>   if not Alg.finalize(TheSample, AllEntriesData, GlobalData):
>>> File "/Volumes/DataA_1/afarbin/Runs/SPyRoot/trunk/python/WriterAlgorithm.py", line 179, in finalize
>>>   self.file.Write()
>>> TypeError: none of the 2 overloaded methods succeeded. Full details:
>>> problem in C++; program state has been reset
>>> problem in C++; program state has been reset
>>>
>>>
>
>