Daniel, I have a question between the lines below.

On 02/26/2015 03:32 PM, Daniel L. Wang wrote:
> On 02/26/2015 01:45 PM, Kian-Tat Lim wrote:
>>> The general case is very expensive (lookup position and chunk for each
>>> position!?), and we are only going to get away with it because our
>>> bulk-loads for ForcedSource will be spatially-restricted.
>>
>> I'm pretty sure that ForcedSource and (final) Object tables will
>> be available at the same time, so the partitioner could look up the
>> coordinates based on objectId using a simple merge. I thought you were
>> already going to have a central objectId-to-chunk (or even subchunk?)
>> index anyway; scanning an input ForcedSource table for all its objectIds
>> and then doing a single query (or at least batching objectIds to reduce
>> the number of queries by an order of magnitude) to get the mappings
>> doesn't sound ridiculous.
>
> I think we're on the same page. The general case is truly expensive:
> loading arbitrary child table rows requires lookups on the director
> table. We can certainly batch this, but again, if child table rows
> come in an arbitrary order, and don't have the one-to-many
> (object<->forcedsource) relationship, it's really expensive.
>
> But yes, batching should be really effective because of the shape of
> our data, and the patterns in which we produce it. I don't think a
> coordinated multi-table partitioning action is scalable: it means that
> the director and all its child table rows need to be available *at the
> same time*. (Oh, want to add another child table? Oh, I guess I need
> to repartition the director and the other 4 child tables and reload
> them. Uh, no.)
>
> My point is not that we don't know how to scale it, but that the
> processing model is different from what we do now. The current
> partitioner and loader can load director and child tables one at a
> time, without checking existing data, sharing only partitioning
> parameters (stripes/substripes).
> The catch is that it requires the
> child table to be pre-joined (effectively) to the director table in
> order to have partitioning coordinates.

We know this is insufficient. Wouldn't using the secondary index (built
from the Object table) to compute the chunk of a given Source from its
objectId column avoid this join? For example, if a source i has an
objectId field equal to j, then we can query the secondary index on
objectId=j to get that source's chunk. Of course, we have to build the
secondary index prior to this operation.

Cheers

> The general case (optimized with batching) needs to be handled anyway,
> because L3 usage will want to create partitioned tables for joining
> with L2 data. LSST use cases should be optimized nicely, and we'll
> just have a note in the usage guide for completely different data
> domains that, hey, sort your input by the director's key column or
> your ingest performance could be terrible.
>
> -Daniel
>
> ########################################################################
> Use REPLY-ALL to reply to list
>
> To unsubscribe from the QSERV-L list, click the following link:
> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1
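P.S. To make the question above concrete, here is a minimal sketch of the
secondary-index idea: assign each child-table (e.g. ForcedSource) row to a
chunk by looking up its objectId in an index built beforehand from the
director (Object) table, with lookups batched as Kian-Tat suggested. All
names here (build_secondary_index, assign_chunks) are invented for
illustration; this is not the actual partitioner or any Qserv API, and the
real index would live in a database rather than an in-memory dict.

```python
from collections import defaultdict

def build_secondary_index(object_rows):
    """Build the objectId -> chunkId index from director (Object) rows."""
    return {object_id: chunk_id for object_id, chunk_id in object_rows}

def assign_chunks(source_rows, index, batch_size=1000):
    """Assign each child row (sourceId, objectId) to its director's chunk.

    Rows are processed in batches so that, against a real index service,
    one batched query replaces one lookup per row.
    """
    chunks = defaultdict(list)
    for start in range(0, len(source_rows), batch_size):
        batch = source_rows[start:start + batch_size]
        # In a real loader this would be a single batched query
        # (e.g. WHERE objectId IN (...)) against the secondary index.
        for source_id, object_id in batch:
            chunks[index[object_id]].append(source_id)
    return chunks

# Director table rows: (objectId, chunkId).
index = build_secondary_index([(1, 10), (2, 10), (3, 11)])
# Child table rows: (sourceId, objectId), arriving in arbitrary order.
sources = [(101, 3), (102, 1), (103, 2), (104, 1)]
print(dict(assign_chunks(sources, index)))
# -> {11: [101], 10: [102, 103, 104]}
```

The point of the sketch is that no pre-join of the child table to the
director is needed: only the objectId -> chunk mapping has to exist before
the child table is loaded.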