Hi Serge,

The czar dying (either being killed or losing its TCP
connection) definitely causes any worker that has an in-progress query
to die as well. This is a known problem and is on the schedule to fix.
It is a qserv issue: the worker session object does not know how to
stop an in-progress query, nor does it keep track of the fact that it
should deep-six the query's results after it finishes.
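The gap described above can be illustrated with a minimal sketch (in Python for illustration only; qserv itself is C++, and these class and method names are hypothetical, not qserv's actual API). The idea is the bookkeeping the worker session would need: a cancellation flag set when the czar's connection drops, checked before results are sent back.

```python
import threading

class WorkerSession:
    """Hypothetical per-czar worker session (illustrative, not qserv code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cancelled = False

    def on_czar_disconnect(self):
        # Called when the czar dies or its TCP connection is lost:
        # remember that any in-progress query must be discarded.
        with self._lock:
            self._cancelled = True

    def finish_query(self, result):
        # Instead of blindly sending results back to a possibly-dead
        # czar (and crashing on torn-down xrootd objects), check the
        # flag and deep-six the result if the session was cancelled.
        with self._lock:
            if self._cancelled:
                return None  # discard the result
        return result
```

With this bookkeeping, a query that finishes after the czar has gone away is quietly dropped rather than triggering a send to a dead peer.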

Andy

On Fri, 14 Aug 2015, Serge Monkewitz wrote:

> I stared at the aftermath for a while. One problem is that the OOM killer blew away the czar. If I'm interpreting the conversation in DM-3432 correctly, that then causes xrootd level cancellations to eventually propagate to workers. And I guess this causes deletion/cleanup of xrootd level objects in a way that is either not communicated properly to the higher level worker code, or not responded to properly by that code. The workers then segfault when they try to use objects that no longer (or only partially) exist, e.g. when trying to send results back to the dead czar. I'm not at all sure if that explains all the worker failures, but likely at least some of them.
>
> As for how much memory the czar was using, here's what the OOM killer saw before swinging the hatchet (rss and total_vm are in units of 4KiB pages):
>
> [5068353.773850] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
> [5068353.774048] [11688]  1000 11688  4179102  2267789    7958  1696364             0 python
>
> so ~8.7GiB. The machine has 16GiB of RAM, but at the time of czar death, mysql-proxy was using 4.3 GiB and mmfsd (GPFS daemon) 2.1GiB. So there's probably a czar-side memory leak, or maybe some really inefficient use of resources when sub-chunking. I'll have to look more next week. But I also think that killing the czar should not result in all workers processing queries from it segfaulting.
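> For reference, the ~8.7 GiB figure above comes straight from the OOM-killer numbers, which report rss and total_vm in 4 KiB pages. A quick sanity check of the conversion:
>
> ```python
> PAGE_SIZE = 4096  # bytes per page, as reported by the OOM killer
>
> rss_pages = 2267789       # from the kernel log line above
> total_vm_pages = 4179102
>
> rss_gib = rss_pages * PAGE_SIZE / 2**30
> total_vm_gib = total_vm_pages * PAGE_SIZE / 2**30
>
> print(f"rss      ~ {rss_gib:.2f} GiB")   # ~8.65 GiB, i.e. ~8.7
> print(f"total_vm ~ {total_vm_gib:.2f} GiB")
> ```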
>
>
>> On Aug 14, 2015, at 7:57 AM, Becla, Jacek <[log in to unmask]> wrote:
>>
>> Serge
>>
>> I tried running a small number of queries. It looks like, pretty early on, my script failed to connect to the czar; the last message it printed was that it was about to run
>>
>> Running: select o1.ra as ra1, o2.ra as ra2, o1.decl as decl1, o2.decl as decl2, scisql_angSep(o1.ra, o1.decl,o2.ra, o2.decl) AS theDistance from Object o1, Object o2 where qserv_areaspec_box(90.299197, -66.468216, 98.762526, -56.412851) and scisql_angSep(o1.ra, o1.decl, o2.ra, o2.decl) < 0.015
>>
>>
>> That query started OK by hand earlier, so the syntax is fine as far as I can tell.
>>
>> Looks like czar and most xrootd servers are down. It happened before earlier last night too.
>>
>> I am leaving things as they are. If you have time, you might want to peek at it.
>>
>> Feel free to restart services
>>
>> Thanks,
>> Jacek
>>
>> Use REPLY-ALL to reply to list
>>
>> To unsubscribe from the QSERV-L list, click the following link:
>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1 <https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1>
>
>
