Hi Lukasz,

If that results in a global simplification, then I am all for it. Recall 
that we reworked the poller to make sure that no locks are held during a 
callback, and that no cross-thread locks are held if the poller is called 
from a non-callback thread. Of course, the poller doesn't know about any 
locks outside of its own context.
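
That discipline looks roughly like this (a minimal sketch with 
illustrative names, not the actual XrdCl classes): the handler is copied 
out under the lock, and the lock is dropped before the callback runs.

    // Minimal sketch of the "no locks held during a callback" rule.
    // Poller, pMutex and pHandlers are illustrative names only.
    #include <functional>
    #include <map>
    #include <mutex>

    class Poller
    {
      std::mutex                           pMutex;    // guards pHandlers only
      std::map<int, std::function<void()>> pHandlers; // fd -> callback

    public:
      void Dispatch( int fd )
      {
        std::function<void()> cb;
        {
          std::lock_guard<std::mutex> lock( pMutex ); // look the handler up...
          auto it = pHandlers.find( fd );
          if( it == pHandlers.end() )
            return;
          cb = it->second;                            // ...and copy it out
        }                                             // lock released here
        cb();                                         // callback runs lock-free
      }
    };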

Also, looking at the code, I can simplify it for Linux and Solaris (not 
MacOS), since on either platform we really don't need to wait for an 
acknowledgement from the poller thread when we add or remove a socket. 
That also would have prevented the deadlock, but it's good that we found 
it, as we now see a global simplification :-)
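
Roughly, that works because on those platforms the kernel lets any thread 
update the interest set while the poller thread is blocked polling, so 
AddSocket/RemoveSocket can return immediately. In epoll terms (Solaris 
has an analogous mechanism), a hedged sketch with illustrative helpers, 
not the actual poller code:

    // Why Linux needs no acknowledgement handshake: epoll_ctl() may be
    // called from any thread while another thread blocks in epoll_wait().
    // Illustrative helpers, not the actual XrdCl poller code.
    #include <sys/epoll.h>

    void AddSocket( int epfd, int fd )
    {
      epoll_event ev{};
      ev.events  = EPOLLIN;
      ev.data.fd = fd;
      epoll_ctl( epfd, EPOLL_CTL_ADD, fd, &ev ); // safe from any thread
    }

    void RemoveSocket( int epfd, int fd )
    {
      // No message to the poller thread and no wait for an ack:
      // the kernel updates the interest set atomically.
      epoll_ctl( epfd, EPOLL_CTL_DEL, fd, nullptr );
    }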

Andy


On Fri, 14 Feb 2014, Lukasz Janyst wrote:

> Hi Andy,
>
>   You were right. Requiring that the poller state can be altered both from 
> inside and from outside of the callbacks results in too much craziness. I 
> will rework things so that every poller command is always issued from inside 
> a callback (including socket additions). I would still require that you not 
> hold any poller locks while callbacks are processed. Will this work for you? 
> Doing so should simplify both the poller and the client code considerably.
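>
> Roughly something like this (a sketch with illustrative names, not the
> final code): outside threads never touch poller state directly, they only
> enqueue a command; the poller thread drains the queue between callbacks,
> so every state change effectively happens in callback context:
>
>     // Rough sketch: all poller-state changes funnel through a queue
>     // that only the poller thread drains. Illustrative names only.
>     #include <functional>
>     #include <mutex>
>     #include <queue>
>
>     class Poller
>     {
>       std::mutex                        pQueueMutex; // guards pCommands only
>       std::queue<std::function<void()>> pCommands;   // pending state changes
>
>     public:
>       // May be called from any thread; never touches poller state.
>       void Enqueue( std::function<void()> cmd )
>       {
>         std::lock_guard<std::mutex> lock( pQueueMutex );
>         pCommands.push( std::move( cmd ) );
>         // wake the poller thread here (pipe/eventfd), omitted for brevity
>       }
>
>       // Runs on the poller thread between callbacks, with no locks held.
>       void DrainCommands()
>       {
>         std::queue<std::function<void()>> local;
>         {
>           std::lock_guard<std::mutex> lock( pQueueMutex );
>           local.swap( pCommands );
>         }
>         while( !local.empty() ) { local.front()(); local.pop(); }
>       }
>     };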
>
> Cheers,
>   Lukasz
>
> On 14.02.2014 10:48, Lukasz Janyst wrote:
>> OK, I see the problem. Does it help if I remove the socket from inside
>> the callback?
>>
>>     Lukasz
>> 
>> On 14.02.2014 02:03, Andrew Hanushevsky wrote:
>>> Yes, this appears to be where the poller was an innocent bystander in a
>>> lockdown. So, here is the scenario:
>>> 
>>> 1) A socket's TTL has been reached and AsyncSocketHandler has decided to
>>> close and delete the channel associated with the socket. So,
>>> RemoveSocket() is called, which immediately acquires the "scopedLock" to
>>> find the socket. It then calls the channel to disable event notifications
>>> (which is a no-op, as they were already disabled) and deletes the channel
>>> object. Notice that the scopedLock is still held.
>>> 
>>> 2) The channel sees that the socket being removed from the pollset does
>>> not belong to the current thread, as the current thread is not a poller
>>> thread. It then unlocks all of its internal locks and sends a message to
>>> the correct poller thread (let us call it thread X) telling it that one
>>> of its sockets is gone and that it should recalibrate the timeout. The
>>> calling thread then waits until the message is acted upon. Note that the
>>> scopedLock is still held.
>>> 
>>> 3) In the meantime thread X, our wonderful poller, is in the midst of
>>> doing callbacks. One of these happens to require the intervention of
>>> XrdClPollerBuiltin, which needs to look up the socket, which means it
>>> needs to acquire the "scopedLock". So thread X deadlocks: it cannot get
>>> the scopedLock, and it can never act upon the message waiting for it,
>>> because the message sender (see 1 above) holds the "scopedLock" and is
>>> waiting for thread X to acknowledge the message (sketched in code below).
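>>>
>>> In code form, the two threads line up roughly like this (a minimal
>>> reconstruction with illustrative names, not the actual sources):
>>>
>>>     #include <condition_variable>
>>>     #include <mutex>
>>>
>>>     std::mutex              scopedLock;  // the lock from step 1
>>>     std::mutex              msgMutex;    // protects the ack flag below
>>>     std::condition_variable msgAcked;
>>>     bool                    acked = false;
>>>
>>>     void RemoveSocket()                  // client thread, steps 1 and 2
>>>     {
>>>       std::lock_guard<std::mutex> lock( scopedLock );
>>>       // ...find the socket, delete the channel, message thread X...
>>>       std::unique_lock<std::mutex> ml( msgMutex );
>>>       msgAcked.wait( ml, []{ return acked; } ); // waits forever: see below
>>>     }
>>>
>>>     void Callback()                      // poller thread X, step 3
>>>     {
>>>       std::lock_guard<std::mutex> lock( scopedLock ); // blocks forever,
>>>       // so X never gets back to its loop to ack the message and set
>>>       // 'acked' for the thread stuck in RemoveSocket() above
>>>     }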
>>> 
>>> This can be solved by moving the code in PollerBuiltIn::RemoveSocket so
>>> that the pSocket.erase(it) happens immediately after the iterator is
>>> looked up, and by releasing the lock before doing any callouts,
>>> especially to the poller. In fact, one should look very closely at all
>>> of the code that acquires the scopedLock, because this lock seems to be
>>> held across arbitrarily long sequences and across calls to other
>>> objects, making it difficult to say with any certainty that other such
>>> deadlocks could not occur.
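>>>
>>> As a sketch of the fix (shape only; the types and signatures here are
>>> illustrative, not the real PollerBuiltIn declarations):
>>>
>>>     #include <map>
>>>     #include <mutex>
>>>
>>>     struct Socket;                      // stand-ins for the real types
>>>     struct Channel { void Disable(); };
>>>
>>>     struct PollerBuiltIn
>>>     {
>>>       std::mutex                  pMutex;
>>>       std::map<Socket*, Channel*> pSocket;
>>>       void RemoveSocket( Socket *socket );
>>>     };
>>>
>>>     void PollerBuiltIn::RemoveSocket( Socket *socket )
>>>     {
>>>       Channel *channel = 0;
>>>       {
>>>         std::lock_guard<std::mutex> scopedLock( pMutex ); // lookup only
>>>         auto it = pSocket.find( socket );
>>>         if( it == pSocket.end() )
>>>           return;
>>>         channel = it->second;
>>>         pSocket.erase( it );  // erase right after the lookup...
>>>       }                       // ...and drop scopedLock before any callout
>>>       channel->Disable();     // callouts now run with no poller lock held,
>>>       delete channel;         // so thread X cannot block behind us
>>>     }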
>>> 
>>> Andy
>>> -----Original Message----- From: Lukasz Janyst
>>> Sent: Thursday, February 13, 2014 1:29 AM
>>> To: Alja Mrak-Tadel ; [log in to unmask]
>>> Subject: Re: master branch:: dead lock in proxy client
>>> 
>>> OK, it's not a bug in the client but in the poller. One stream
>>> (0x7fd0e4001070) is trying to send some data, and simultaneously another
>>> stream (0x7fcfb80048b0) is being removed due to TTL expiration.
>>> 
>>> Andy, can you please have a look?
>>> 
>>> Alja, if it disturbs your testing you can temporarily switch to libevent
>>> by playing with XRD_POLLERPREFERENCE envvar.
>>> 
>>> Cheers,
>>>     Lukasz
>>> 
>>> On 13.02.2014 10:17, Lukasz Janyst wrote:
>>>> Strange. I took care to avoid just this kind of deadlock. I will have a
>>>> look.
>>>>
>>>>     Lukasz
>>>> 
>>>> PS. For the future, please report these as issues on GitHub.
>>>> 
>>>> On 13.02.2014 07:09, Alja Mrak-Tadel wrote:
>>>>> Hi,
>>>>> 
>>>>> I'm running a proxy server from the master branch and I consistently
>>>>> get a deadlock after running jobs for more than 15 min, e.g.,
>>>>> http://uaf-2.t2.ucsd.edu/~alja/traffic.png
>>>>> 
>>>>> The deadlock seems to be related to removing sockets in the client
>>>>> after the proxy's XRD_DATASERVERTTL elapses. This is what I inferred
>>>>> from debugging the proxy with gdb (I picked a locked-up thread and
>>>>> followed the owners of the contended locks):
>>>>> 
>>>>> https://gist.github.com/alja/2e6a69c48f864cccf1d9#file-xrdcl-pollerbuiltin-removesocket-bt-txt-L12
>>>>>
>>>>> and server debug messages at http://uaf-2.t2.ucsd.edu/~alja/proxy-lock.log
>>>>> If I set the TTL high enough, the proxy server is 100% efficient and
>>>>> stable.
>>>>> 
>>>>> The deadlock is not simple to reproduce -- I'm running 600 jobs with a
>>>>> 10% probability of replicated file paths. The gcore after the lock-up
>>>>> is in noric38.slac.stanford.edu:/usr/work/matevz/gcore.12905
>>>>> 
>>>>> Thanks,
>>>>> Alja
>>>>> 
>>>> 
>>> 
>> 
>

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-DEV list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1