Daniel,

I have a question inline below.

On 02/26/2015 03:32 PM, Daniel L. Wang wrote:
> On 02/26/2015 01:45 PM, Kian-Tat Lim wrote:
>>> The general case is very expensive (lookup position and chunk for each
>>> position!?), and we are only going to get away with it because our
>>> bulk-loads for ForcedSource will be spatially-restricted.
>>     I'm pretty sure that ForcedSource and (final) Object tables will
>> be available at the same time, so the partitioner could look up the
>> coordinates based on objectId using a simple merge.  I thought you were
>> already going to have a central objectId-to-chunk (or even subchunk?)
>> index anyway; scanning an input ForcedSource table for all its objectIds
>> and then doing a single query (or at least batching objectIds to reduce
>> the number of queries by an order of magnitude) to get the mappings
>> doesn't sound ridiculous.
> I think we're on the same page. The general case is truly expensive: 
> loading arbitrary child table rows requires lookups on the director 
> table. We can certainly batch this, but again, if child table rows 
> come in an arbitrary order, and don't have the one-to-many 
> (object<->forcedsource) relationship, it's really expensive.
>
> But yes, batching should be really effective because of the shape of 
> our data, and the patterns in which we produce it. I don't think a 
> coordinated multi-table partitioning action is scalable: it means that 
> the director and all its child table rows need to be available *at the 
> same time*. (Oh, want to add another child table? Oh, I guess I need 
> to repartition the director and the other 4 child tables and reload 
> them. Uh, no.)
>
> My point is not that we don't know how to scale it, but that the 
> processing model is different from what we do now. The current 
> partitioner and loader can load director and child tables one at a 
> time, without checking existing data, sharing only partitioning 
> parameters (stripes/substripes). The catch is that it requires the 
> child table to be pre-joined (effectively) to the director table in 
> order to have partitioning coordinates. We know this is insufficient.
Wouldn't using the secondary index (built from the Object table) to compute 
the chunk of a given Source from its objectId column avoid this join?
For example, if a source i has an objectId equal to j, then we can query the 
secondary index with objectId=j to get the source's chunk.
Of course, the secondary index would have to be built before this operation.
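To make the idea concrete, here is a minimal sketch of the batched lookup, 
assuming a hypothetical secondary-index table mapping objectId to chunkId 
(an in-memory SQLite table stands in for the real index, and the schema and 
batch size are illustrative, not Qserv's actual ones):

```python
import sqlite3

# Hypothetical secondary index: objectId -> chunkId.
# An in-memory SQLite table stands in for the real index here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sec_index (objectId INTEGER PRIMARY KEY,"
             " chunkId INTEGER)")
conn.executemany("INSERT INTO sec_index VALUES (?, ?)",
                 [(1, 100), (2, 100), (3, 205)])

def chunks_for(object_ids, batch_size=1000):
    """Resolve objectId -> chunkId, one query per batch of ids."""
    mapping = {}
    ids = sorted(set(object_ids))  # dedupe: many sources share an object
    for i in range(0, len(ids), batch_size):
        batch = ids[i:i + batch_size]
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            "SELECT objectId, chunkId FROM sec_index"
            " WHERE objectId IN (%s)" % placeholders, batch)
        mapping.update(rows)
    return mapping

# Sources referencing objects 1, 1, 3 land in chunks 100 and 205.
print(chunks_for([1, 1, 3]))  # -> {1: 100, 3: 205}
```

Deduplicating and batching the ids is what keeps the number of index queries 
per input ForcedSource table small, as discussed above.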

Cheers
>
> The general case (optimized with batching) needs to be handled anyway, 
> because L3 usage will want to create partitioned tables for joining 
> with L2 data.  LSST use cases should be optimized nicely, and we'll 
> just have a note in the usage guide for completely different data 
> domains that, hey, sort your input by the director's key column or 
> your ingest performance could be terrible.
>
> -Daniel
>

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the QSERV-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1