Hi Lukasz,
I've run the test. The proxy server now runs OK after streams are
disconnected once the TTL elapses.
Thanks,
Alya
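
PS. For anyone finding this thread later: Andy's fix described below
(erase the socket from the registration table while still holding the
lock, then release the lock before any callouts) amounts to roughly the
following sketch. The type and member names here (Poller, Channel,
pSockets) are stand-ins modelled on the thread, not the actual XrdCl
source:

```cpp
#include <map>
#include <mutex>
#include <cassert>

// Hypothetical stand-ins for the real XrdCl types -- sketch only.
struct Channel
{
  bool eventsDisabled = false;
  // In the real code this may message the poller thread and block until
  // acknowledged, so it must NOT be called while scopedLock is held.
  void DisableEvents() { eventsDisabled = true; }
};

struct Poller
{
  std::mutex              pMutex;    // the mutex behind "scopedLock"
  std::map<int, Channel*> pSockets;  // fd -> channel registration table

  // Erase the registration under the lock, release the lock, and only
  // then call out to the channel and delete it. Holding the lock across
  // DisableEvents() is what produced the deadlock.
  bool RemoveSocket( int fd )
  {
    Channel *channel = 0;
    {
      std::lock_guard<std::mutex> scopedLock( pMutex );
      auto it = pSockets.find( fd );
      if( it == pSockets.end() )
        return false;
      channel = it->second;
      pSockets.erase( it );          // erase right after the lookup
    }                                // lock released here

    channel->DisableEvents();        // safe: no poller lock held
    delete channel;                  // nobody can look it up any more
    return true;
  }
};
```

Since the entry is erased before the lock is dropped, a concurrent
callback that grabs the lock simply fails to find the socket instead of
deadlocking against the remover.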
On 03/06/14 08:47, Lukasz Janyst wrote:
> Alja,
>
> can you please check if you still see the problem in master?
>
> Cheers,
> Lukasz
>
> On 14.02.2014 21:05, Andrew Hanushevsky wrote:
>> Hi Lukasz,
>>
>> That would work as well but my proposed solution is much simpler. There
>> really is no reason to hold on to the scope lock once you delete it from
>> the registration table. After that you can safely delete it. Minimum
>> change on your part :-)
>>
>> Andy
>>
>> On Fri, 14 Feb 2014, Lukasz Janyst wrote:
>>
>>> OK, I see the problem. Does it help if I remove the socket from inside
>>> the callback?
>>>
>>> Lukasz
>>>
>>> On 14.02.2014 02:03, Andrew Hanushevsky wrote:
>>>> Yes, this appears to be where the poller was an innocent bystander in a
>>>> lockdown. So, here is the scenario:
>>>>
>>>> 1) A socket's TTL has been reached and AsyncSocketHandler has decided
>>>> to close and delete the channel associated with the socket. So,
>>>> RemoveSocket() is called, which immediately gets the "scopedLock" to
>>>> find the socket. It then calls the channel to disable event
>>>> notifications (which is a no-op as they were already disabled) and
>>>> deletes the channel object. Notice the scopedLock is still held.
>>>>
>>>> 2) The channel sees that the socket being removed from the pollset
>>>> does not belong to the current thread, as it is not a poller thread.
>>>> It then unlocks all of its internal locks and sends a message to the
>>>> correct poller thread (let us call it thread X) telling it that one of
>>>> its sockets is gone and it should recalibrate the timeout. This thread
>>>> will wait until the message is acted upon. Note that the scopedLock is
>>>> still locked.
>>>>
>>>> 3) In the meantime thread X, our wonderful poller, is in the midst of
>>>> doing callbacks. One of these happens to require the intervention of
>>>> XrdClPollerBuiltin, which needs to look up the socket, which means it
>>>> needs to get the "scopedLock". Thread X will therefore deadlock: it
>>>> can't get the scopedLock and act upon the waiting message, because the
>>>> message sender (see 1 above) holds the "scopedLock" and is waiting for
>>>> thread X to acknowledge the message.
>>>>
>>>> This can be solved by moving the code in PollerBuiltIn::RemoveSocket
>>>> that does the pSocket.erase(it) to immediately after the iterator it
>>>> is looked up, and releasing the lock before doing any callouts,
>>>> especially to the poller. In fact, one should look very closely at all
>>>> of the code that gets the scopedLock, because this lock seems to be
>>>> held for arbitrarily long sequences and across calls to other objects,
>>>> making it difficult to say with any certainty that other such
>>>> deadlocks could not occur.
>>>>
>>>> Andy
>>>> -----Original Message----- From: Lukasz Janyst
>>>> Sent: Thursday, February 13, 2014 1:29 AM
>>>> To: Alja Mrak-Tadel ; [log in to unmask]
>>>> Subject: Re: master branch:: dead lock in proxy client
>>>>
>>>> OK, it's not a bug in the client but in the poller. One stream
>>>> (0x7fd0e4001070) is trying to send some data, and simultaneously
>>>> another stream (0x7fcfb80048b0) is being removed due to TTL
>>>> expiration.
>>>>
>>>> Andy, can you please have a look?
>>>>
>>>> Alja, if it disturbs your testing, you can temporarily switch to
>>>> libevent by playing with the XRD_POLLERPREFERENCE envvar.
>>>>
>>>> Cheers,
>>>> Lukasz
>>>>
>>>> On 13.02.2014 10:17, Lukasz Janyst wrote:
>>>>> Strange. I took care to avoid just this kind of deadlock. I will
>>>>> have a look.
>>>>>
>>>>> Lukasz
>>>>>
>>>>> PS. for the future, please report these as issues on github.
>>>>>
>>>>> On 13.02.2014 07:09, Alja Mrak-Tadel wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm running the proxy server from the master branch and I
>>>>>> consistently get a deadlock after running jobs for more than
>>>>>> 15 min, e.g.,
>>>>>> http://uaf-2.t2.ucsd.edu/~alja/traffic.png
>>>>>>
>>>>>> The lock seems to be related to removing sockets in the client
>>>>>> after the XRD_DATASERVERTTL of the proxy elapses. This is what I
>>>>>> inferred from debugging the proxy with gdb (I picked a locked-up
>>>>>> thread and followed the owners of the contended locks):
>>>>>>
>>>>>> https://gist.github.com/alja/2e6a69c48f864cccf1d9#file-xrdcl-pollerbuiltin-removesocket-bt-txt-L12
>>>>>>
>>>>>> and server debug messages
>>>>>> http://uaf-2.t2.ucsd.edu/~alja/proxy-lock.log
>>>>>> If I set the TTL high enough, the proxy server can be 100%
>>>>>> efficient and stable.
>>>>>>
>>>>>> The lock is not simple to reproduce -- I'm running 600 jobs with 10%
>>>>>> probability of replicated file paths. The gcore after lock-up is in
>>>>>> noric38.slac.stanford.edu:/usr/work/matevz/gcore.12905
>>>>>>
>>>>>> Thanks,
>>>>>> Alya
>>>>>>
>>>>>
>>>>
>>>
>
########################################################################
Use REPLY-ALL to reply to list
To unsubscribe from the XROOTD-DEV list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1