Alja,

    can you please check if you still see the problem in master?

Cheers,
    Lukasz

On 14.02.2014 21:05, Andrew Hanushevsky wrote:
> Hi Lukasz,
>
> That would work as well, but my proposed solution is much simpler. There
> really is no reason to hold on to the scope lock once you delete the
> entry from the registration table. After that you can safely delete the
> channel. Minimum change on your part :-)
>
> Andy
>
> On Fri, 14 Feb 2014, Lukasz Janyst wrote:
>
>> OK, I see the problem. Does it help if I remove the socket from inside
>> the callback?
>>
>>   Lukasz
>>
>> On 14.02.2014 02:03, Andrew Hanushevsky wrote:
>>> Yes, this appears to be where the poller was an innocent bystander in a
>>> lockdown. So, here is the scenario:
>>>
>>> 1) A socket's TTL has been reached and AsyncSocketHandler has decided to
>>> close and delete the channel associated with the socket. So,
>>> RemoveSocket() is called, which immediately takes the "scopedLock" to find
>>> the socket. It then calls the channel to disable event notifications
>>> (which is a no-op as they were already disabled) and deletes the channel
>>> object. Notice the scopedLock is still held.
>>>
>>> 2) The channel sees that the socket being removed from the pollset does
>>> not belong to the current thread, as it is not a poller thread. It then
>>> unlocks all of its internal locks and sends a message to the correct
>>> poller thread (let us call it thread X) telling it that one of its
>>> sockets is gone and it should recalibrate the timeout. This thread then
>>> waits until the message is acted upon. Note that the scopedLock is still
>>> held.
>>>
>>> 3) In the meantime thread X, our wonderful poller, is in the midst of
>>> doing callbacks. One of these happens to require the intervention of
>>> XrdClPollerBuiltin, which needs to look up the socket, which means it
>>> needs to take the "scopedLock". Thread X therefore deadlocks: it cannot
>>> take the scopedLock and act upon the message waiting for it, because
>>> the message sender (see 1 above) holds the "scopedLock" and is waiting
>>> for thread X to acknowledge the message.
>>>
>>> This can be solved by moving the pSocket.erase(it) in
>>> PollerBuiltIn::RemoveSocket to immediately after the iterator is looked
>>> up, and releasing the lock before doing any callouts, especially to the
>>> poller. In fact, one should look very closely at all of the code that
>>> takes the scopedLock, because this lock seems to be held across
>>> arbitrarily long sequences and across calls to other objects, making it
>>> difficult to say with any certainty that other such deadlocks could not
>>> occur.
>>>
>>> Andy
>>> -----Original Message----- From: Lukasz Janyst
>>> Sent: Thursday, February 13, 2014 1:29 AM
>>> To: Alja Mrak-Tadel ; [log in to unmask]
>>> Subject: Re: master branch:: dead lock in proxy client
>>>
>>> OK, it's not a bug in the client but in the poller. One stream
>>> (0x7fd0e4001070) is trying to send some data, and simultaneously another
>>> stream (0x7fcfb80048b0) is being removed due to TTL expiration.
>>>
>>> Andy, can you please have a look?
>>>
>>> Alja, if it disturbs your testing, you can temporarily switch to
>>> libevent via the XRD_POLLERPREFERENCE envvar.
>>>
>>> Cheers,
>>>     Lukasz
>>>
>>> On 13.02.2014 10:17, Lukasz Janyst wrote:
>>>> Strange. I took care to avoid just this kind of deadlock. I will have a
>>>> look.
>>>>
>>>>     Lukasz
>>>>
>>>> PS. for the future, please report these as issues on github.
>>>>
>>>> On 13.02.2014 07:09, Alja Mrak-Tadel wrote:
>>>>> Hi,
>>>>>
>>>>> I'm running a proxy server from the master branch and I consistently
>>>>> get a deadlock after running jobs for more than 15 min, e.g.,
>>>>> http://uaf-2.t2.ucsd.edu/~alja/traffic.png
>>>>>
>>>>> The deadlock seems to be related to removing sockets in the client
>>>>> after the proxy's XRD_DATASERVERTTL elapses. This is what I inferred
>>>>> from debugging the proxy with gdb (I picked a locked-up thread and
>>>>> followed the owners of the contended locks):
>>>>>
>>>>> https://gist.github.com/alja/2e6a69c48f864cccf1d9#file-xrdcl-pollerbuiltin-removesocket-bt-txt-L12
>>>>>
>>>>> and server debug messages are at
>>>>> http://uaf-2.t2.ucsd.edu/~alja/proxy-lock.log
>>>>>
>>>>> If I set the TTL high enough, the proxy server is 100% efficient and
>>>>> stable.
>>>>>
>>>>> The deadlock is not simple to reproduce -- I'm running 600 jobs with 10%
>>>>> probability of replicated file paths. The gcore after lock-up is in
>>>>> noric38.slac.stanford.edu:/usr/work/matevz/gcore.12905
>>>>>
>>>>> Thanks,
>>>>> Alja
>>>>>
>>>>
>>>
>>

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-DEV list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1