LISTSERV mailing list manager LISTSERV 16.5

QSERV-L Archives

QSERV-L@LISTSERV.SLAC.STANFORD.EDU

QSERV-L August 2015
Subject: Re: cluster
From: Fabrice Jammes <[log in to unmask]>
Reply-To: General discussion for qserv (LSST prototype baseline catalog)
Date: Mon, 17 Aug 2015 07:58:55 +0200
Content-Type: text/plain
Parts/Attachments: text/plain (109 lines)

Hi Serge,

Thanks for this clear report. It is not clear why mmfsd uses 2 GB of RAM, 
as GPFS should only be used during Qserv installation, not during 
Qserv execution. Does the czar try to access resources stored in 
/sps because of a configuration issue?

Please let me know if I can help.

Fabrice

On 08/15/2015 03:18 AM, Andrew Hanushevsky wrote:
> Hi Serge,
>
> The czar dying (either being killed or losing its TCP connection) 
> definitely causes any worker that has an in-progress query to die as 
> well. This is a known problem and is on the schedule to fix. This is a 
> qserv issue: the worker session object does not know how to stop an 
> in-progress query, nor does it keep track of the fact that it should 
> deep-six the query after it finishes.
>
> Andy
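The missing cancellation path Andy describes can be illustrated with a minimal sketch. All names here are hypothetical; this is not the actual qserv worker code, just a toy model of a session object that does track the cancellation it currently cannot:

```python
import threading

class WorkerSession:
    """Toy model of a worker session that CAN stop an in-progress
    query once the czar goes away (the behavior described as missing)."""

    def __init__(self):
        self._cancelled = threading.Event()

    def cancel(self):
        # Would be called when the czar's TCP connection drops.
        self._cancelled.set()

    def run_query(self, chunks):
        results = []
        for chunk in chunks:
            if self._cancelled.is_set():
                # Deep-six the query instead of touching torn-down state.
                return None
            results.append(chunk * 2)  # stand-in for real per-chunk work
        # Only hand back results if the czar is still there.
        return None if self._cancelled.is_set() else results
```

With a flag like this checked at each step, a worker would drop the query cleanly rather than segfault trying to send results to a dead czar.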
>
> On Fri, 14 Aug 2015, Serge Monkewitz wrote:
>
>> I stared at the aftermath for a while. One problem is that the OOM 
>> killer blew away the czar. If I'm interpreting the conversation in 
>> DM-3432 correctly, that then causes xrootd level cancellations to 
>> eventually propagate to workers. And I guess this causes 
>> deletion/cleanup of xrootd level objects in a way that is either not 
>> communicated properly to the higher level worker code, or not 
>> responded to properly by that code. The workers then segfault when 
>> they try to use objects that no longer (or only partially) exist, 
>> e.g. when trying to send results back to the dead czar. I'm not at 
>> all sure if that explains all the worker failures, but likely at 
>> least some of them.
>>
>> As for how much memory the czar was using, here's what the OOM killer 
>> saw before swinging the hatchet (rss and total_vm are in units of 
>> 4KiB pages):
>>
>> [5068353.773850] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
>> [5068353.774048] [11688]  1000 11688  4179102  2267789    7958  1696364             0 python
>>
>> so ~8.7GiB. The machine has 16GiB of RAM, but at the time of czar 
>> death, mysql-proxy was using 4.3 GiB and mmfsd (GPFS daemon) 2.1GiB. 
>> So there's probably a czar-side memory leak, or maybe some really 
>> inefficient use of resources when sub-chunking. I'll have to look 
>> more next week. But I also think that killing the czar should not 
>> result in all workers processing queries from it segfaulting.
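The page-to-GiB arithmetic in Serge's numbers can be double-checked with a short sketch (the OOM killer reports `total_vm` and `rss` in 4 KiB pages):

```python
PAGE_SIZE = 4096  # bytes; OOM-killer reports count 4 KiB pages

def pages_to_gib(pages: int) -> float:
    """Convert a page count from an OOM-killer report to GiB."""
    return pages * PAGE_SIZE / 2**30

rss_gib = pages_to_gib(2267789)   # czar resident set: ~8.65 GiB
vm_gib = pages_to_gib(4179102)    # czar virtual size: ~15.9 GiB
# Together with mysql-proxy (4.3 GiB) and mmfsd (2.1 GiB), the czar's
# ~8.7 GiB RSS roughly fills the machine's 16 GiB of RAM.
print(f"rss={rss_gib:.2f} GiB, total_vm={vm_gib:.2f} GiB")
```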
>>
>>
>>> On Aug 14, 2015, at 7:57 AM, Becla, Jacek <[log in to unmask]> 
>>> wrote:
>>>
>>> Serge
>>>
>>> I tried running a small number of queries, it looks like pretty 
>>> early on my script failed to connect to czar and the last message 
>>> that it printed was that it was about to run
>>>
>>> Running: select o1.ra as ra1, o2.ra as ra2, o1.decl as decl1,
>>>   o2.decl as decl2,
>>>   scisql_angSep(o1.ra, o1.decl, o2.ra, o2.decl) AS theDistance
>>> from Object o1, Object o2
>>> where qserv_areaspec_box(90.299197, -66.468216, 98.762526, -56.412851)
>>>   and scisql_angSep(o1.ra, o1.decl, o2.ra, o2.decl) < 0.015
>>>
>>>
>>> That query started ok by hand earlier, so the syntax is fine as 
>>> far as I can tell.
>>>
>>> Looks like czar and most xrootd servers are down. It happened before 
>>> earlier last night too.
>>>
>>> I am leaving things as they are. If you have time, you might want to 
>>> peek at it…
>>>
>>> Feel free to restart services
>>>
>>> Thanks,
>>> Jacek
>>>
>>> Use REPLY-ALL to reply to list
>>>
>>> To unsubscribe from the QSERV-L list, click the following link:
>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1
>>
>

