Alja, can you please check if you still see the problem in master?

Cheers,
Lukasz

On 14.02.2014 21:05, Andrew Hanushevsky wrote:
> Hi Lukasz,
>
> That would work as well, but my proposed solution is much simpler. There
> really is no reason to hold on to the scoped lock once you delete it from
> the registration table. After that you can safely delete it. Minimum
> change on your part :-)
>
> Andy
>
> On Fri, 14 Feb 2014, Lukasz Janyst wrote:
>
>> OK, I see the problem. Does it help if I remove the socket from inside
>> the callback?
>>
>> Lukasz
>>
>> On 14.02.2014 02:03, Andrew Hanushevsky wrote:
>>> Yes, this appears to be where the poller was an innocent bystander in a
>>> lockdown. So, here is the scenario:
>>>
>>> 1) A socket's TTL has been reached and AsyncSocketHandler has decided to
>>> close and delete the channel associated with the socket. So,
>>> RemoveSocket() is called, which immediately gets the "scopedLock" to find
>>> the socket. It then calls the channel to disable event notifications
>>> (which is a no-op, as they were already disabled) and deletes the channel
>>> object. Notice the scopedLock is still held.
>>>
>>> 2) The channel sees that the socket being removed from the pollset does
>>> not belong to the current thread, as it is not a poller thread. It then
>>> unlocks all of its internal locks and sends a message to the correct
>>> poller thread (let us call it thread X), telling it that one of its
>>> sockets is gone and it should recalibrate the timeout. This thread will
>>> wait until the message is acted upon. Note that the scopedLock is still
>>> locked.
>>>
>>> 3) In the meantime, thread X, our wonderful poller, is in the midst of
>>> doing callbacks.
>>> One of these happens to require the intervention of
>>> XrdClPollerBuiltin, which needs to look up the socket, which means it
>>> needs to get the "scopedLock", which means thread X will deadlock: it
>>> cannot get the scopedLock and act upon the message waiting for it,
>>> because the message sender (see 1 above) holds the "scopedLock" and is
>>> waiting for thread X to acknowledge the message.
>>>
>>> This can be solved by moving the code in PollerBuiltIn::RemoveSocket
>>> that does the pSocket.erase(it) to immediately after the iterator is
>>> looked up, and releasing the lock before doing any callouts, especially
>>> to the poller. In fact, one should look very closely at all of the code
>>> that gets the scopedLock, because this lock seems to be held for
>>> arbitrarily long sequences and across calls to other objects, making it
>>> difficult to say with any certainty that other such deadlocks could not
>>> occur.
>>>
>>> Andy
>>>
>>> -----Original Message-----
>>> From: Lukasz Janyst
>>> Sent: Thursday, February 13, 2014 1:29 AM
>>> To: Alja Mrak-Tadel ; [log in to unmask]
>>> Subject: Re: master branch:: dead lock in proxy client
>>>
>>> OK, it's not a bug in the client but in the poller. One stream
>>> (0x7fd0e4001070) is trying to send some data, and simultaneously another
>>> stream (0x7fcfb80048b0) is being removed due to TTL expiration.
>>>
>>> Andy, can you please have a look?
>>>
>>> Alja, if it disturbs your testing, you can temporarily switch to libevent
>>> by playing with the XRD_POLLERPREFERENCE envvar.
>>>
>>> Cheers,
>>> Lukasz
>>>
>>> On 13.02.2014 10:17, Lukasz Janyst wrote:
>>>> Strange. I took care to avoid just this kind of deadlock. I will have a
>>>> look.
>>>>
>>>> Lukasz
>>>>
>>>> PS. For the future, please report these as issues on GitHub.
>>>>
>>>> On 13.02.2014 07:09, Alja Mrak-Tadel wrote:
>>>>> Hi,
>>>>>
>>>>> I'm running the proxy server from the master branch and I consistently
>>>>> get a deadlock after running jobs for more than 15 min, e.g.,
>>>>> http://uaf-2.t2.ucsd.edu/~alja/traffic.png
>>>>>
>>>>> The lock seems to be related to removing sockets in the client after
>>>>> the XRD_DATASERVERTTL of the proxy elapses. This is what I inferred
>>>>> from debugging the proxy with gdb (I picked a locked-up thread and
>>>>> followed the owners of the contended locks):
>>>>>
>>>>> https://gist.github.com/alja/2e6a69c48f864cccf1d9#file-xrdcl-pollerbuiltin-removesocket-bt-txt-L12
>>>>>
>>>>> and server debug messages:
>>>>> http://uaf-2.t2.ucsd.edu/~alja/proxy-lock.log
>>>>>
>>>>> If I set the TTL high enough, the proxy server can be 100% efficient
>>>>> and stable.
>>>>>
>>>>> The lock is not simple to reproduce -- I'm running 600 jobs with a 10%
>>>>> probability of replicated file paths. The gcore after the lock-up is in
>>>>> noric38.slac.stanford.edu:/usr/work/matevz/gcore.12905
>>>>>
>>>>> Thanks,
>>>>> Alja
>>>>>
>>>>> ########################################################################
>>>>> Use REPLY-ALL to reply to list
>>>>>
>>>>> To unsubscribe from the XROOTD-DEV list, click the following link:
>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1
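The lock-scoping fix Andy proposes above — erase the entry from the registration table while the scoped lock is held, release the lock, and only then call out to the channel and the poller — can be sketched roughly as follows. This is a minimal illustration, not the actual XrdCl code; Poller, Channel, DisableEvents, and the member names are hypothetical stand-ins.

```cpp
// Sketch of the fix: keep the scoped lock only for lookup + erase,
// and make all callouts (which may block on a poller thread) lock-free.
#include <cassert>
#include <map>
#include <mutex>
#include <string>

struct Channel {
  // Stand-in for the real channel; in the deadlock scenario this call
  // may synchronize with a poller thread, so it must not run under the lock.
  void DisableEvents() {}
};

class Poller {
 public:
  void AddSocket(const std::string &name, Channel *ch) {
    std::lock_guard<std::mutex> scopedLock(pMutex);
    pSockets[name] = ch;
  }

  bool RemoveSocket(const std::string &name) {
    Channel *ch = nullptr;
    {
      // Hold the scoped lock only long enough to look up and erase.
      std::lock_guard<std::mutex> scopedLock(pMutex);
      auto it = pSockets.find(name);
      if (it == pSockets.end()) return false;
      ch = it->second;
      pSockets.erase(it);  // erase immediately, *before* any callouts
    }                      // scopedLock released here
    // A poller thread that needs pMutex for its own lookup can now
    // proceed, so waiting for it to acknowledge a message cannot deadlock.
    ch->DisableEvents();
    delete ch;
    return true;
  }

 private:
  std::mutex pMutex;
  std::map<std::string, Channel *> pSockets;
};
```

The key point is the inner scope: once the entry is out of the table, no other thread can reach the channel through the poller, so deleting it without the lock is safe and the callouts no longer nest inside the critical section.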