OK, I see the problem. Does it help if I remove the socket from inside
the callback?
Lukasz
On 14.02.2014 02:03, Andrew Hanushevsky wrote:
> Yes, this appears to be where the poller was an innocent bystander in a
> lockdown. So, here is the scenario:
>
> 1) A socket's TTL has been reached and AsyncSocketHandler has decided to
> close and delete the channel associated with the socket. So,
> RemoveSocket() is called, which immediately acquires the "scopedLock" to
> find the socket. It then calls the channel to disable event notifications
> (which is a no-op, as they were already disabled) and deletes the channel
> object. Notice that the scopedLock is still held.
>
> 2) The channel sees that the socket being removed from the pollset does
> not belong to the current thread, as it is not a poller thread. It then
> unlocks all of its internal locks and sends a message to the correct
> poller thread (let us call it thread X), telling it that one of its
> sockets is gone and it should recalibrate the timeout. This thread will
> wait until the message is acted upon. Note that the scopedLock is still
> locked.
>
> 3) In the meantime thread X, our wonderful poller, is in the midst of
> doing callbacks. One of these happens to require the intervention of
> XrdClPollerBuiltin, which needs to look up the socket, which means it
> needs to get the "scopedLock". Thread X will therefore deadlock: it
> cannot get the scopedLock and act upon the message waiting for it,
> because the message sender (see 1 above) holds the "scopedLock" and is
> waiting for thread X to acknowledge the message.
>
> This can be solved by moving the code in PollerBuiltIn::RemoveSocket
> that does the pSocket.erase(it) to immediately after the iterator is
> looked up, and releasing the lock before doing any callouts, especially
> to the poller. In fact, one should look very closely at all of the code
> that takes the scopedLock, because this lock seems to be held for
> arbitrarily long sequences and across calls to other objects, making it
> difficult to say with any certainty that other such deadlocks could not
> occur.
>
> Andy
> -----Original Message----- From: Lukasz Janyst
> Sent: Thursday, February 13, 2014 1:29 AM
> To: Alja Mrak-Tadel ; [log in to unmask]
> Subject: Re: master branch:: dead lock in proxy client
>
> OK, it's not a bug in the client but in the poller. One stream
> (0x7fd0e4001070) is trying to send some data, and simultaneously another
> stream (0x7fcfb80048b0) is being removed due to TTL expiration.
>
> Andy, can you please have a look?
>
> Alja, if it disturbs your testing you can temporarily switch to libevent
> by playing with XRD_POLLERPREFERENCE envvar.
>
> Cheers,
> Lukasz
>
> On 13.02.2014 10:17, Lukasz Janyst wrote:
>> Strange. I took care to avoid just this kind of deadlock. I will have a
>> look.
>>
>> Lukasz
>>
>> PS. for the future, please report these as issues on github.
>>
>> On 13.02.2014 07:09, Alja Mrak-Tadel wrote:
>>> Hi,
>>>
>>> I'm running a proxy server from the master branch and I consistently
>>> get a deadlock after running jobs for more than 15 min, e.g.,
>>> http://uaf-2.t2.ucsd.edu/~alja/traffic.png
>>>
>>> The lock seems to be related to removing sockets in the client after
>>> the proxy's XRD_DATASERVERTTL elapses. This is what I inferred from
>>> debugging the proxy with gdb (I picked a locked-up thread and followed
>>> the owners of the contended locks):
>>>
>>> https://gist.github.com/alja/2e6a69c48f864cccf1d9#file-xrdcl-pollerbuiltin-removesocket-bt-txt-L12
>>>
>>>
>>> and server debug messages http://uaf-2.t2.ucsd.edu/~alja/proxy-lock.log
>>> If I set the TTL high enough the proxy server can be 100% efficient and
>>> stable.
>>>
>>> The deadlock is not simple to reproduce -- I'm running 600 jobs with a
>>> 10% probability of replicated file paths. The gcore after lock-up is in
>>> noric38.slac.stanford.edu:/usr/work/matevz/gcore.12905
>>>
>>> Thanks,
>>> Alja
>>>
>>> ########################################################################
>>> Use REPLY-ALL to reply to list
>>>
>>> To unsubscribe from the XROOTD-DEV list, click the following link:
>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1
>>
>