QSERV-L Archives

QSERV-L@LISTSERV.SLAC.STANFORD.EDU


Subject: Re: 2 new fault tolerance tickets
From: Jacek Becla
Reply-To: General discussion for qserv (LSST prototype baseline catalog)
Date: Tue, 24 Sep 2013 14:51:36 -0700


Bill,

Sure! Please go ahead.

Jacek


On 09/22/2013 12:32 PM, Bill Chickering wrote:
> Hi Jacek -
>
> I'd like to create two new tickets in Trac. They relate to my fault tolerance work and build on ticket 2531, which implements status/error messages and basic fault tolerance but not recovery (note that 2531 has been submitted for review). Development is complete for both of the additional tickets, which cover:
>
> 1) Logging: This has been very helpful when debugging. It includes both the logging mechanism and its integration throughout the qserv master. For example, references to std::cout within the C++ layer of qserv are replaced with the logging stream. It is implemented using Boost filtering streams and is very lightweight. Features include:
>      -- Thread safe.
>      -- Severity levels allow the verbosity of logging output to be easily throttled (helpful when debugging).
>      -- Automatically includes timestamp, thread id, and severity level.
>      -- Integrated into error messaging (ticket 2531) so all error messages are automatically logged.
>      -- Buffers each line to minimize output from different threads "stepping on each other".
>      -- Placed in the common directory so it can be used by both master and worker.
>      -- Provides a SWIG-enabled interface so it is accessible from the Python layer.
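
To illustrate the logging features listed above, here is a minimal Python sketch. It is not the qserv implementation, which is described as C++ built on Boost filtering streams. The names Severity, LineLogger, and demo are hypothetical. The sketch assumes that "buffers each line" means building the full line first and emitting it with a single write under a lock.

```python
import io
import sys
import threading
import time
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity levels; the real level names are not given in the thread."""
    DEBUG = 0
    INFO = 1
    WARNING = 2
    ERROR = 3


class LineLogger:
    """Illustrative logger: severity threshold plus one locked write per line."""

    def __init__(self, sink=sys.stderr, threshold=Severity.INFO):
        self._sink = sink
        self._threshold = threshold
        self._lock = threading.Lock()

    def log(self, severity, message):
        # Messages below the threshold are dropped, which throttles verbosity.
        if severity < self._threshold:
            return False
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        thread_id = threading.get_ident()
        # Build the complete line before writing so threads cannot interleave
        # fragments of their output.
        line = f"{stamp} [{thread_id}] {severity.name}: {message}\n"
        with self._lock:
            self._sink.write(line)
            self._sink.flush()
        return True


def demo():
    """Log from four threads into an in-memory sink and return the lines."""
    sink = io.StringIO()
    logger = LineLogger(sink=sink, threshold=Severity.WARNING)

    def worker(n):
        logger.log(Severity.DEBUG, f"chunk {n} dispatched")  # filtered out
        logger.log(Severity.ERROR, f"chunk {n} failed")

    threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink.getvalue().splitlines()
```

Calling demo() returns one intact line per thread, each carrying a timestamp, thread id, and severity level; the DEBUG messages are suppressed by the WARNING threshold.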
>
> 2) Error recovery: Builds on both ticket 2531 and logging features. This is the basic chunkQuery-level error recovery described in the review:
>> Consider the event of a disk failure. Qserv's worker logic is not equipped to manage such a failure on localized regions of disk and would behave as if a software fault had occurred. The worker process would therefore crash and all chunk queries belonging to that worker would be lost. The in-flight queries on its local mysqld would be cleaned up and have resources freed. The Qserv master's requests to retrieve these chunk queries via XRootD would then return an error code. The master responds by re-initializing the chunk queries and re-submitting them to XRootD. Ideally, duplicate data associated with the chunk queries exists on other nodes. In this case, XRootD silently re-routes the request(s) to the surviving node(s) and all associated queries are completed as usual. In the event that duplicate data does not exist for one or more chunk queries, XRootD would again return an error code. The master will re-initialize and re-submit a chunk query a fixed number of times (determined by a parameter within Qserv) before giving up, logging information about the failure, and returning an error message to the user in response to the associated query.
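
The retry behaviour described in the quoted review text can be sketched roughly as follows. This is a hedged Python illustration, not the qserv master code. ChunkQueryError, run_chunk_query, make_flaky_submit, and the max_attempts parameter are hypothetical names, and the sketch assumes the configured limit counts total attempts, which the thread does not specify. The submit callable stands in for dispatch through XRootD.

```python
import logging

logger = logging.getLogger("qserv.sketch")  # hypothetical logger name


class ChunkQueryError(Exception):
    """Hypothetical error standing in for an error code returned on dispatch."""


def run_chunk_query(chunk_id, submit, max_attempts=3):
    """Submit a chunk query, re-submitting until max_attempts is reached.

    Returns (result, None) on success, or (None, error_message) after giving up.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # Each attempt starts from scratch, mirroring "re-initialize and re-submit".
            return submit(chunk_id), None
        except ChunkQueryError as exc:
            last_error = exc
            logger.warning("chunk %s: attempt %d/%d failed: %s",
                           chunk_id, attempt, max_attempts, exc)
    # Give up: log the failure and hand an error message back to the caller.
    message = f"chunk {chunk_id} failed after {max_attempts} attempts: {last_error}"
    logger.error(message)
    return None, message


def make_flaky_submit(failures):
    """Return a stand-in dispatcher that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def submit(chunk_id):
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ChunkQueryError("worker unavailable")
        return f"rows for chunk {chunk_id}"

    return submit
```

With make_flaky_submit(2) and max_attempts=3 the third attempt succeeds, which corresponds to the case where a surviving node holds duplicate data. With make_flaky_submit(5) the function gives up and returns an error message, which corresponds to the case where no replica exists.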
>
> -- Bill
>
