QSERV-L Archives

QSERV-L@LISTSERV.SLAC.STANFORD.EDU



Subject: Re: data set for large scale test
From: Fabrice Jammes
Reply-To: General discussion for qserv (LSST prototype baseline catalog)
Date: Fri, 27 Feb 2015 10:55:03 -0800


Hi Serge,

Please see my answer below.

On 02/26/2015 05:35 PM, Serge Monkewitz wrote:
> On Feb 26, 2015, at 4:24 PM, Fabrice Jammes wrote:
>
>> On 02/26/2015 04:14 PM, Serge Monkewitz wrote:
>>> On Feb 26, 2015, at 12:39 PM, Daniel L. Wang wrote:
>>>
>>>> I would like to note that the current system requires the extra raObject and declObject columns in ForcedSource, so that table's size will be proportionally larger than it would be in production.
>>> This is not quite true. The position of associated director table rows must be present in the CSV input to the partitioner. However, recall that the partitioner and data duplicator have the ability to drop columns while partitioning. Even if the data loader doesn’t quite support it yet, we should be able to produce something pretty close to the baseline ForcedSource schema (i.e. without object position or any of the other non-baseline columns that were produced by forced source measurement for stripe82).
>>> ########################################################################
>>> Use REPLY-ALL to reply to list
>>>
>>> To unsubscribe from the QSERV-L list, click the following link:
>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1
>> OK, but wouldn't running an SQL query against the secondary index, to retrieve a source's chunk from its objectId, avoid having to run the partitioner against the Source data? This might be faster, and it could also easily be map-reduced. Don't you think so?
> Partitioning as it currently stands boils down to evaluating a function that computes a (chunk ID, sub-chunk ID) pair per input row (in other words, map), followed by a sort on chunk ID, and then finished by breaking the output into files by chunk ID (reduce). As far as I can tell, what is being proposed here replaces an analytic function for computing chunk IDs with a database lookup (+ extra processing so that the database lookup isn’t per row and performance isn’t totally horrendous).
>
> So I don’t follow how the proposal is substantially different from what currently happens. It adds some complexity, both in terms of code and because it introduces load order constraints (you cannot load a forced source before you’ve loaded, or at least seen, the corresponding object). If you have those, then as far as I can see, you might as well proceed by object batch. In other words, partition a batch of objects, remember the chunks and object IDs you saw, then switch to the various child tables, and never query some central db.
>
> If we cannot do that for whatever reason, then I guess we ingest/process all director rows before looking at any child rows. Even in that case, there have been threads on this list discussing custom indexes (external to the db) that would both involve minimal seeking for searches and very good data compression (for the likely LSST object ID generation strategy). I guess we can put the index into a (no-)SQL database instead, but… does that actually buy us very much?
>
> Finally, pre-sorting child tables according to director table PK isn’t necessarily a win. While that will lead to a small and localized read footprint on the objectID->chunk ID mapping, object IDs that are nearby in “ID-space” could be scattered all over the sky, leading to lots of small writes instead.
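
The map / sort / reduce description of partitioning quoted above can be sketched roughly as follows. This is only an illustration, not the actual Qserv partitioner: the fixed 10-degree grid, the constants, and the names `locate` and `partition` are all assumptions made up for the example. For a child table such as ForcedSource, the position passed in would be that of the associated director (Object) row, as noted earlier in the thread.

```python
from collections import defaultdict

# Hypothetical grid, NOT the real Qserv chunking scheme.
NUM_STRIPES = 18         # 10-degree declination stripes
CHUNKS_PER_STRIPE = 36   # 10-degree RA chunks in every stripe
SUB_PER_AXIS = 4         # 4 x 4 sub-chunks inside each chunk


def locate(ra, decl):
    """Map step: compute (chunk_id, sub_chunk_id) analytically from a position."""
    stripe_h = 180.0 / NUM_STRIPES
    chunk_w = 360.0 / CHUNKS_PER_STRIPE
    d = decl + 90.0          # shift declination to [0, 180]
    r = ra % 360.0           # wrap right ascension to [0, 360)
    stripe = min(int(d / stripe_h), NUM_STRIPES - 1)
    chunk = min(int(r / chunk_w), CHUNKS_PER_STRIPE - 1)
    # Offset inside the chunk, scaled to the sub-chunk grid.
    sub_row = min(int((d - stripe * stripe_h) / stripe_h * SUB_PER_AXIS),
                  SUB_PER_AXIS - 1)
    sub_col = min(int((r - chunk * chunk_w) / chunk_w * SUB_PER_AXIS),
                  SUB_PER_AXIS - 1)
    return stripe * CHUNKS_PER_STRIPE + chunk, sub_row * SUB_PER_AXIS + sub_col


def partition(rows):
    """rows: iterable of (row_id, ra, decl) tuples.

    Returns {chunk_id: [(row_id, sub_chunk_id), ...]}; each dict entry
    stands in for one per-chunk output file.
    """
    # Map: tag every row with its (chunk ID, sub-chunk ID) pair.
    tagged = [(locate(ra, decl), row_id) for row_id, ra, decl in rows]
    # Sort on chunk ID only (stable, so input order is kept within a chunk).
    tagged.sort(key=lambda t: t[0][0])
    # Reduce: break the sorted output into groups by chunk ID.
    chunks = defaultdict(list)
    for (chunk_id, sub_id), row_id in tagged:
        chunks[chunk_id].append((row_id, sub_id))
    return dict(chunks)


if __name__ == "__main__":
    rows = [(1, 15.0, 5.0), (2, 200.0, -45.0), (3, 16.0, 6.0)]
    for chunk_id, members in partition(rows).items():
        print(chunk_id, members)
```

Replacing `locate` with a lookup of objectId in a secondary index would change only the map step; the sort and the split by chunk ID would stay the same, which is the point made in the quoted reply.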
Thanks for these interesting leads; I fully understand your point. There
is still a lot to discover and explore here, and optimizing this
hard but interesting problem will require experimenting with big data.
I was told there will be 40 billion objects in the director table of
DR1. This document, which may be outdated:
https://dev.lsstcorp.org/trac/wiki/db/tests/SchemaEvolution, says that
DR1 will contain 0.6 million objects. So, in MySQL for example, the
secondary index size will be around (if we concatenate chunkId and
subChunkId into the same int):

0.6 million objects * (sizeof(BIGINT) + sizeof(INT)) = 0.6 * 10^6 *
(8 + 4) bytes = 7.2 MB
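
As a quick check of this estimate, here is a small sketch that repeats the arithmetic for both object counts mentioned above (0.6 million and 40 billion). It assumes one 8-byte BIGINT objectId plus one 4-byte INT holding the combined chunkId/subChunkId per row, and it counts the raw payload only; storage-engine and index overhead are ignored. The function name is made up for the example.

```python
BIGINT_BYTES = 8   # MySQL BIGINT storage size
INT_BYTES = 4      # MySQL INT storage size


def index_payload_bytes(num_objects):
    """Raw size of (objectId, packed chunk/sub-chunk) pairs, with no overhead."""
    return num_objects * (BIGINT_BYTES + INT_BYTES)


if __name__ == "__main__":
    for label, count in [("0.6 million", 600_000),
                         ("40 billion", 40_000_000_000)]:
        size = index_payload_bytes(count)
        print(f"{label} objects: {size:,} bytes = {size / 1e6:,.1f} MB")
```

With 0.6 million objects this gives 7.2 MB, as above; with 40 billion objects the same formula gives 480,000 MB, i.e. about 480 GB (decimal units).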

This size seems reasonable. Furthermore, I fully agree with you: the
index lookup could still be optimized a lot.

Cheers,

Fabrice

