On 02/26/2015 01:45 PM, Kian-Tat Lim wrote:
>> The general case is very expensive (lookup position and chunk for each
>> position!?), and we are only going to get away with it because our
>> bulk-loads for ForcedSource will be spatially-restricted.
> 	I'm pretty sure that ForcedSource and (final) Object tables will
> be available at the same time, so the partitioner could look up the
> coordinates based on objectId using a simple merge.  I thought you were
> already going to have a central objectId-to-chunk (or even subchunk?)
> index anyway; scanning an input ForcedSource table for all its objectIds
> and then doing a single query (or at least batching objectIds to reduce
> the number of queries by an order of magnitude) to get the mappings
> doesn't sound ridiculous.
I think we're on the same page. The general case is truly expensive: 
loading arbitrary child table rows requires lookups on the director 
table. We can certainly batch those lookups, but if child table rows 
arrive in arbitrary order and lack the one-to-many 
(object <-> forcedsource) relationship, it stays really expensive.

But yes, batching should be really effective because of the shape of our 
data, and the patterns in which we produce it. I don't think a 
coordinated multi-table partitioning action is scalable: it means that 
the director and all its child table rows need to be available *at the 
same time*. (Oh, want to add another child table? Oh, I guess I need to 
repartition the director and the other 4 child tables and reload them. 
Uh, no.)

My point is not that we don't know how to scale it, but that the 
processing model is different from what we do now. The current 
partitioner and loader can load director and child tables one at a time, 
without checking existing data, sharing only partitioning parameters 
(stripes/substripes). The catch is that it requires the child table to 
be effectively pre-joined to the director table, so that each row 
carries its own partitioning coordinates. We know this is insufficient.
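For concreteness, "having partitioning coordinates" just means each row carries an (ra, dec) from which its chunk follows using nothing but the shared stripe parameters. Here's a simplified, purely illustrative version of that kind of computation; the formula and chunk numbering are not Qserv's actual algorithm:

```python
import math

def chunk_for(ra_deg, dec_deg, num_stripes=85):
    """Illustrative stripe-based chunking: split the sky into latitude
    stripes, then split each stripe into roughly equal-area chunks in
    RA. Needs only the row's coordinates plus shared parameters."""
    stripe_height = 180.0 / num_stripes
    stripe = min(int((dec_deg + 90.0) / stripe_height), num_stripes - 1)
    # fewer chunks per stripe near the poles so chunk areas stay comparable
    mid_dec = -90.0 + (stripe + 0.5) * stripe_height
    num_chunks = max(1, round(360.0 * math.cos(math.radians(mid_dec))
                              / stripe_height))
    chunk = min(int(ra_deg / (360.0 / num_chunks)), num_chunks - 1)
    return stripe * 1000 + chunk  # illustrative chunk-id encoding
```

The point is that this is a pure function of one row: no peeking at other tables, which is exactly why the pre-join requirement sneaks in for child tables that don't carry coordinates themselves.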

The general case (optimized with batching) needs to be handled anyway, 
because L3 usage will want to create partitioned tables for joining with 
L2 data.  LSST's use cases should be optimized nicely, and we'll just 
add a note in the usage guide for completely different data domains: 
hey, sort your input by the director's key column, or your ingest 
performance could be terrible.
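As a sketch of why sorted input is the cheap path: once both the child table and the director are sorted by the director's key, attaching partitioning coordinates is a single streaming merge with no random lookups at all. Field names here are assumptions for illustration:

```python
def merge_coords(forced_rows, object_rows):
    """Attach the director's (ra, dec) to each child row via a
    streaming merge. Both inputs must be sorted by objectId."""
    objects = iter(object_rows)
    obj = next(objects, None)
    for row in forced_rows:
        # advance the director cursor; many child rows share one object
        while obj is not None and obj["objectId"] < row["objectId"]:
            obj = next(objects, None)
        if obj is None or obj["objectId"] != row["objectId"]:
            raise KeyError(f"no director row for objectId {row['objectId']}")
        yield {**row, "ra": obj["ra"], "dec": obj["dec"]}
```

Unsorted input degrades this to the per-row (or batched) index lookups discussed above, which is where the "could be terrible" comes from.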

-Daniel

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the QSERV-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1