Daniel,
I have a question inline below.
On 02/26/2015 03:32 PM, Daniel L. Wang wrote:
> On 02/26/2015 01:45 PM, Kian-Tat Lim wrote:
>>> The general case is very expensive (lookup position and chunk for each
>>> position!?), and we are only going to get away with it because our
>>> bulk-loads for ForcedSource will be spatially-restricted.
>> I'm pretty sure that ForcedSource and (final) Object tables will
>> be available at the same time, so the partitioner could look up the
>> coordinates based on objectId using a simple merge. I thought you were
>> already going to have a central objectId-to-chunk (or even subchunk?)
>> index anyway; scanning an input ForcedSource table for all its objectIds
>> and then doing a single query (or at least batching objectIds to reduce
>> the number of queries by an order of magnitude) to get the mappings
>> doesn't sound ridiculous.
> I think we're on the same page. The general case is truly expensive:
> loading arbitrary child table rows requires lookups on the director
> table. We can certainly batch this, but again, if child table rows
> come in an arbitrary order, and don't have the one-to-many
> (object<->forcedsource) relationship, it's really expensive.
>
> But yes, batching should be really effective because of the shape of
> our data, and the patterns in which we produce it. I don't think a
> coordinated multi-table partitioning action is scalable: it means that
> the director and all its child table rows need to be available *at the
> same time*. (Oh, want to add another child table? Oh, I guess I need
> to repartition the director and the other 4 child tables and reload
> them. Uh, no.)
>
> My point is not that we don't know how to scale it, but that the
> processing model is different from what we do now. The current
> partitioner and loader can load director and child tables one at a
> time, without checking existing data, sharing only partitioning
> parameters (stripes/substripes). The catch is that it requires the
> child table to be pre-joined (effectively) to the director table in
> order to have partitioning coordinates. We know this is insufficient.
Using the secondary index (built from the Object table) to compute the
chunk of a given Source from its objectId column would avoid this join,
wouldn't it?
For example, if a source i has an objectId field equal to j, then we can
query the secondary index on objectId=j to get the chunk of that source.
Of course, we would have to build the secondary index prior to this
operation.
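To make the idea concrete, here is a minimal sketch of that lookup, batched as KT suggested to keep the number of index queries down. The table and column names (secondary_index, objectId, chunkId) are assumptions for illustration, and sqlite3 stands in for whatever store actually backs the secondary index:

```python
import sqlite3  # stand-in for the real secondary-index store


def chunks_for_sources(conn, object_ids, batch_size=1000):
    """Map each objectId to its chunkId via batched IN queries
    against a hypothetical objectId -> chunkId secondary index."""
    mapping = {}
    ids = sorted(set(object_ids))  # dedupe; one lookup per object
    for i in range(0, len(ids), batch_size):
        batch = ids[i:i + batch_size]
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            "SELECT objectId, chunkId FROM secondary_index "
            "WHERE objectId IN (%s)" % placeholders,
            batch)
        mapping.update(dict(rows))
    return mapping
```

The partitioner would first scan the input ForcedSource table for its objectIds, call something like this once, and then route each row to its chunk from the returned mapping.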
Cheers
>
> The general case (optimized with batching) needs to be handled anyway,
> because L3 usage will want to create partitioned tables for joining
> with L2 data. LSST use cases should be optimized nicely, and we'll
> just have a note in the usage guide for completely different data
> domains that, hey, sort your input by the director's key column or
> your ingest performance could be terrible.
>
> -Daniel
>
> ########################################################################
> Use REPLY-ALL to reply to list
>
> To unsubscribe from the QSERV-L list, click the following link:
> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1