I stared at the aftermath for a while. One problem is that the OOM killer blew away the czar. If I'm interpreting the conversation in DM-3432 correctly, that then causes xrootd-level cancellations to eventually propagate to workers. I'm guessing this triggers deletion/cleanup of xrootd-level objects in a way that is either not communicated properly to the higher-level worker code, or not responded to properly by that code. The workers then segfault when they try to use objects that no longer (or only partially) exist, e.g. when trying to send results back to the dead czar. I'm not at all sure that explains all the worker failures, but it likely explains at least some of them.

As for how much memory the czar was using, here's what the OOM killer saw before swinging the hatchet (rss and total_vm are in units of 4KiB pages):

[5068353.773850] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[5068353.774048] [11688]  1000 11688  4179102  2267789    7958  1696364             0 python

so the czar's RSS was ~8.7 GiB. The machine has 16 GiB of RAM, but at the time of the czar's death, mysql-proxy was using 4.3 GiB and mmfsd (the GPFS daemon) 2.1 GiB. So there's probably a czar-side memory leak, or maybe some really inefficient use of resources when sub-chunking. I'll have to look more next week. But I also think that killing the czar should not cause all the workers processing its queries to segfault.
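For anyone who wants to redo the arithmetic: the OOM report gives rss and total_vm in 4 KiB pages, so the conversion to GiB is a one-liner. A quick sketch (the helper name is mine, not anything in Qserv):

```python
PAGE_SIZE_KIB = 4  # the OOM killer reports counts of 4 KiB pages

def pages_to_gib(pages):
    """Convert a page count from the OOM report to GiB."""
    return pages * PAGE_SIZE_KIB / (1024 * 1024)

# Values for the czar (python) process from the report above.
rss_pages = 2267789
total_vm_pages = 4179102

print(f"rss      ~ {pages_to_gib(rss_pages):.2f} GiB")       # ~8.65 GiB
print(f"total_vm ~ {pages_to_gib(total_vm_pages):.2f} GiB")  # ~15.94 GiB
```

Note that total_vm for the czar alone is nearly the machine's entire 16 GiB of RAM, though much of that virtual space was presumably not resident.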


> On Aug 14, 2015, at 7:57 AM, Becla, Jacek <[log in to unmask]> wrote:
> 
> Serge
> 
> I tried running a small number of queries; it looks like pretty early on my script failed to connect to the czar, and the last message it printed was that it was about to run
> 
> Running: select o1.ra as ra1, o2.ra as ra2, o1.decl as decl1, o2.decl as decl2, scisql_angSep(o1.ra, o1.decl,o2.ra, o2.decl) AS theDistance from Object o1, Object o2 where qserv_areaspec_box(90.299197, -66.468216, 98.762526, -56.412851) and scisql_angSep(o1.ra, o1.decl, o2.ra, o2.decl) < 0.015
> 
> 
> That query started ok by hand earlier, so the syntax is fine as far as I can tell.
> 
> Looks like the czar and most xrootd servers are down. The same thing happened earlier last night too.
> 
> I am leaving things as they are. If you have time, you might want to peek at it…
> 
> Feel free to restart services
> 
> Thanks,
> Jacek
> 
> Use REPLY-ALL to reply to list
> 
> To unsubscribe from the QSERV-L list, click the following link:
> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1 <https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1>
