LISTSERV 16.5 - QSERV-L Archives

QSERV-L@LISTSERV.SLAC.STANFORD.EDU

Subject: Re: data set for large scale test
From: Fabrice Jammes <[log in to unmask]>
Reply-To: General discussion for qserv (LSST prototype baseline catalog)
Date: Fri, 27 Feb 2015 10:55:03 -0800


Hi Serge,

Please see my answer below.

On 02/26/2015 05:35 PM, Serge Monkewitz wrote:
> On Feb 26, 2015, at 4:24 PM, Fabrice Jammes <[log in to unmask]> wrote:
>
>> On 02/26/2015 04:14 PM, Serge Monkewitz wrote:
>>> On Feb 26, 2015, at 12:39 PM, Daniel L. Wang <[log in to unmask]> wrote:
>>>
>>>> I would like to note that the current system requires the extra raObject and declObject columns in ForcedSource, so that table's size will be proportionally larger than it would be in production.
>>> This is not quite true. The position of associated director table rows must be present in the CSV input to the partitioner. However, recall that the partitioner and data duplicator have the ability to drop columns while partitioning. Even if the data loader doesn’t quite support it yet, we should be able to produce something pretty close to the baseline ForcedSource schema (i.e. without object position or any of the other non-baseline columns that were produced by forced source measurement for stripe82).
>>> ########################################################################
>>> Use REPLY-ALL to reply to list
>>>
>>> To unsubscribe from the QSERV-L list, click the following link:
>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1
>> OK, but wouldn't launching an SQL query against the secondary index to retrieve a source's chunk from its objectId avoid running the partitioner against the Source data? This might be faster and could also easily be map-reduced. Don't you think so?
> Partitioning as it currently stands boils down to evaluating a function that computes a (chunk ID, sub-chunk ID) pair per input row (in other words, map), followed by a sort on chunk ID, and then finished by breaking the output into files by chunk ID (reduce). As far as I can tell, what is being proposed here replaces an analytic function for computing chunk IDs with a database lookup (+ extra processing so that the database lookup isn’t per row and performance isn’t totally horrendous).
>
> So I don’t follow how the proposal is substantially different from what currently happens. It adds some complexity, both in terms of code and because it introduces load order constraints (you cannot load a forced source before you’ve loaded, or at least seen, the corresponding object). If you have those, then as far as I can see, you might as well proceed by object batch. In other words, partition a batch of objects, remember the chunks and object IDs you saw, then switch to the various child tables, and never query some central db.
>
> If we cannot do that for whatever reason, then I guess we ingest/process all director rows before looking at any child rows. Even in that case, there have been threads on this list discussing custom indexes (external to the db) that would both involve minimal seeking for searches and very good data compression (for the likely LSST object ID generation strategy). I guess we can put the index into a (no-)SQL database instead, but… does that actually buy us very much?
>
> Finally, pre-sorting child tables according to director table PK isn’t necessarily a win. While that will lead to a small and localized read footprint on the objectID->chunk ID mapping, object IDs that are nearby in “ID-space” could be scattered all over the sky, leading to lots of small writes instead.
>
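The map / sort / reduce flow described in the quoted message, and the
object-batch variant (partition a batch of objects, remember the chunks and
object IDs seen, then place child rows without querying a central db), can be
sketched in Python. This is only an illustration and not the Qserv
partitioner: the fixed grid in locate(), its constants, and the dictionary
row layout are assumptions made up for the example.

```python
from itertools import groupby

# Hypothetical fixed grid; the real partitioner uses its own scheme.
NUM_STRIPES = 18         # declination bands, 10 degrees each
CHUNKS_PER_STRIPE = 36   # chunks per band, 10 degrees of RA each
SUB_PER_AXIS = 4         # 4 x 4 sub-chunks inside every chunk


def locate(ra, decl):
    """Analytic function: position in degrees -> (chunk_id, sub_chunk_id)."""
    stripe_pos = (decl + 90.0) / (180.0 / NUM_STRIPES)
    column_pos = (ra % 360.0) / (360.0 / CHUNKS_PER_STRIPE)
    # Clamp so that positions on the upper edge fall in the last cell.
    stripe = min(int(stripe_pos), NUM_STRIPES - 1)
    column = min(int(column_pos), CHUNKS_PER_STRIPE - 1)
    sub_row = min(int((stripe_pos - stripe) * SUB_PER_AXIS), SUB_PER_AXIS - 1)
    sub_col = min(int((column_pos - column) * SUB_PER_AXIS), SUB_PER_AXIS - 1)
    return stripe * CHUNKS_PER_STRIPE + column, sub_row * SUB_PER_AXIS + sub_col


def partition(rows, position_of):
    """Map each row to its chunk, sort by chunk ID, split the output by chunk."""
    tagged = [(locate(*position_of(row)), row) for row in rows]   # map
    tagged.sort(key=lambda item: item[0][0])                      # sort
    chunks = {}
    for chunk_id, group in groupby(tagged, key=lambda item: item[0][0]):
        chunks[chunk_id] = [(sub, row) for (_, sub), row in group]  # reduce
    return chunks


def partition_batch(objects, forced_sources):
    """Partition a batch of objects, then place child rows by in-memory lookup."""
    object_chunks = partition(objects, lambda o: (o["ra"], o["decl"]))
    # Remember where each object of this batch ended up.
    location = {}
    for chunk_id, members in object_chunks.items():
        for sub, obj in members:
            location[obj["objectId"]] = (chunk_id, sub)
    # Child rows carry no position; they follow their object.
    source_chunks = {}
    for src in forced_sources:
        chunk_id, sub = location[src["objectId"]]
        source_chunks.setdefault(chunk_id, []).append((sub, src))
    return object_chunks, source_chunks


objects = [
    {"objectId": 1, "ra": 5.0, "decl": -85.0},
    {"objectId": 2, "ra": 15.0, "decl": -85.0},
    {"objectId": 3, "ra": 6.0, "decl": -84.0},
]
forced_sources = [{"objectId": 2, "flux": 1.0}, {"objectId": 1, "flux": 2.0}]
object_chunks, source_chunks = partition_batch(objects, forced_sources)
print(sorted(object_chunks), sorted(source_chunks))
```

The lookup in partition_batch() fails with a KeyError for a forced source
whose object is not in the batch, which is the load order constraint
mentioned in the quoted message.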
Thanks for these interesting leads; I fully understand your point. There
are still a lot of things to discover and explore here, and optimizing this
hard but interesting problem will require experimenting with big data.
I was told there will be 40 billion objects in the director table of
DR1. This document, which may be outdated:
https://dev.lsstcorp.org/trac/wiki/db/tests/SchemaEvolution, says that
DR1 will contain 0.6 million objects. So, in MySQL for example, the
secondary index size will be around (if we pack chunkId and subChunkId
into the same int):

0.6 million objects * (sizeof(BIGINT) + sizeof(INT)) = 0.6 * 10^6 *
(8+4) bytes = 7.2 MB

This size seems reasonable. Furthermore, I fully agree with you: the
index lookup could still be optimized a lot.
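
The estimate above can be checked with a few lines of Python, using both
object counts mentioned in this message. This is a rough sketch: it counts
only the raw column payload (8-byte BIGINT objectId plus one 4-byte INT) and
ignores any per-row or index overhead of the storage engine. The pack() and
unpack() helpers and the 12-bit split are hypothetical, shown only to
illustrate storing chunkId and subChunkId in a single int.

```python
BIGINT_BYTES = 8   # MySQL BIGINT, used for objectId
INT_BYTES = 4      # MySQL INT, holding chunkId and subChunkId together
SUB_BITS = 12      # hypothetical: low bits reserved for subChunkId


def index_size_bytes(num_objects):
    """Raw payload of the objectId -> (chunkId, subChunkId) index."""
    return num_objects * (BIGINT_BYTES + INT_BYTES)


def pack(chunk_id, sub_chunk_id):
    """Store both IDs in one integer (assumes sub_chunk_id < 2**SUB_BITS)."""
    return (chunk_id << SUB_BITS) | sub_chunk_id


def unpack(packed):
    """Recover (chunk_id, sub_chunk_id) from a packed integer."""
    return packed >> SUB_BITS, packed & ((1 << SUB_BITS) - 1)


for label, count in [("0.6 million", 600_000), ("40 billion", 40_000_000_000)]:
    size = index_size_bytes(count)
    print(f"{label} objects: {size / 1e6:,.1f} MB ({size / 1e9:,.4f} GB)")
```

With the 0.6 million figure this gives 7.2 MB; with the 40 billion figure
the same formula gives 480 GB, so the conclusion depends heavily on which
object count is correct.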

Cheers,

Fabrice

