LISTSERV mailing list manager LISTSERV 16.5

QSERV-L Archives

QSERV-L@LISTSERV.SLAC.STANFORD.EDU

QSERV-L August 2015
Subject: Re: cluster
From: Fabrice Jammes <[log in to unmask]>
Reply-To: General discussion for qserv (LSST prototype baseline catalog)
Date: Mon, 17 Aug 2015 07:58:55 +0200
Content-Type: text/plain
Parts/Attachments: text/plain (109 lines)

Hi Serge,

Thanks for this clear report. It is not clear why mmfsd uses 2 GB of RAM, 
as GPFS should only be used during Qserv installation, not during 
Qserv execution. Does the czar try to access resources stored in 
/sps because of a configuration issue?

Please let me know if I can help.

Fabrice

On 08/15/2015 03:18 AM, Andrew Hanushevsky wrote:
> Hi Serge,
>
> The czar dying (either being killed or losing its TCP connection) 
> definitely causes any worker that has an in-progress query to die as 
> well. This is a known problem and is on the schedule to fix. This is a 
> qserv issue: the worker session object does not know how to stop an 
> in-progress query, nor does it keep track of the fact that it should 
> deep-six the query after it finishes.
>
> Andy
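The missing cancellation path Andy describes can be illustrated with a minimal sketch. All names here are hypothetical; this is not the actual qserv worker code, just a toy model of a session object that does track the cancellation it currently cannot:

```python
import threading

class WorkerSession:
    """Toy model of a worker session that CAN stop an in-progress
    query once the czar goes away (the behavior described as missing)."""

    def __init__(self):
        self._cancelled = threading.Event()

    def cancel(self):
        # Would be called when the czar's TCP connection drops.
        self._cancelled.set()

    def run_query(self, chunks):
        results = []
        for chunk in chunks:
            if self._cancelled.is_set():
                # Deep-six the query instead of touching torn-down state.
                return None
            results.append(chunk * 2)  # stand-in for real per-chunk work
        # Only hand back results if the czar is still there.
        return None if self._cancelled.is_set() else results
```

With a flag like this checked at each step, a worker would drop the query cleanly rather than segfault trying to send results to a dead czar.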
>
> On Fri, 14 Aug 2015, Serge Monkewitz wrote:
>
>> I stared at the aftermath for a while. One problem is that the OOM 
>> killer blew away the czar. If I'm interpreting the conversation in 
>> DM-3432 correctly, that then causes xrootd level cancellations to 
>> eventually propagate to workers. And I guess this causes 
>> deletion/cleanup of xrootd level objects in a way that is either not 
>> communicated properly to the higher level worker code, or not 
>> responded to properly by that code. The workers then segfault when 
>> they try to use objects that no longer (or only partially) exist, 
>> e.g. when trying to send results back to the dead czar. I'm not at 
>> all sure if that explains all the worker failures, but likely at 
>> least some of them.
>>
>> As for how much memory the czar was using, here's what the OOM killer 
>> saw before swinging the hatchet (rss and total_vm are in units of 
>> 4KiB pages):
>>
>> [5068353.773850] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
>> [5068353.774048] [11688]  1000 11688  4179102  2267789    7958  1696364             0 python
>>
>> so ~8.7GiB. The machine has 16GiB of RAM, but at the time of czar 
>> death, mysql-proxy was using 4.3 GiB and mmfsd (GPFS daemon) 2.1GiB. 
>> So there's probably a czar-side memory leak, or maybe some really 
>> inefficient use of resources when sub-chunking. I'll have to look 
>> more next week. But I also think that killing the czar should not 
>> result in all workers processing queries from it segfaulting.
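The page-to-GiB arithmetic in Serge's numbers can be double-checked with a short sketch (the OOM killer reports `total_vm` and `rss` in 4 KiB pages):

```python
PAGE_SIZE = 4096  # bytes; OOM-killer reports count 4 KiB pages

def pages_to_gib(pages: int) -> float:
    """Convert a page count from an OOM-killer report to GiB."""
    return pages * PAGE_SIZE / 2**30

rss_gib = pages_to_gib(2267789)   # czar resident set: ~8.65 GiB
vm_gib = pages_to_gib(4179102)    # czar virtual size: ~15.9 GiB
# Together with mysql-proxy (4.3 GiB) and mmfsd (2.1 GiB), the czar's
# ~8.7 GiB RSS roughly fills the machine's 16 GiB of RAM.
print(f"rss={rss_gib:.2f} GiB, total_vm={vm_gib:.2f} GiB")
```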
>>
>>
>>> On Aug 14, 2015, at 7:57 AM, Becla, Jacek <[log in to unmask]> 
>>> wrote:
>>>
>>> Serge
>>>
>>> I tried running a small number of queries, it looks like pretty 
>>> early on my script failed to connect to czar and the last message 
>>> that it printed was that it was about to run
>>>
>>> Running: select o1.ra as ra1, o2.ra as ra2, o1.decl as decl1,
>>>   o2.decl as decl2,
>>>   scisql_angSep(o1.ra, o1.decl, o2.ra, o2.decl) AS theDistance
>>> from Object o1, Object o2
>>> where qserv_areaspec_box(90.299197, -66.468216, 98.762526, -56.412851)
>>>   and scisql_angSep(o1.ra, o1.decl, o2.ra, o2.decl) < 0.015
>>>
>>>
>>> That query started ok by hand earlier, so the syntax is fine as 
>>> far as I can tell.
>>>
>>> Looks like czar and most xrootd servers are down. It happened before 
>>> earlier last night too.
>>>
>>> I am leaving things as they are. If you have time, you might want to 
>>> peek at it…
>>>
>>> Feel free to restart services
>>>
>>> Thanks,
>>> Jacek
>>>
>>> Use REPLY-ALL to reply to list
>>>
>>> To unsubscribe from the QSERV-L list, click the following link:
>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1
>>
>

