On 02/26/2015 03:46 PM, Fabrice Jammes wrote:
> Daniel,
>
> I have a question between the lines below.
>
> On 02/26/2015 03:32 PM, Daniel L. Wang wrote:
>> On 02/26/2015 01:45 PM, Kian-Tat Lim wrote:
>>>> The general case is very expensive (lookup position and chunk for each
>>>> position!?), and we are only going to get away with it because our
>>>> bulk-loads for ForcedSource will be spatially-restricted.
>>> I'm pretty sure that ForcedSource and (final) Object tables will
>>> be available at the same time, so the partitioner could look up the
>>> coordinates based on objectId using a simple merge. I thought you were
>>> already going to have a central objectId-to-chunk (or even subchunk?)
>>> index anyway; scanning an input ForcedSource table for all its objectIds
>>> and then doing a single query (or at least batching objectIds to reduce
>>> the number of queries by an order of magnitude) to get the mappings
>>> doesn't sound ridiculous.
>> I think we're on the same page. The general case is truly expensive:
>> loading arbitrary child table rows requires lookups on the director
>> table. We can certainly batch this, but again, if child table rows
>> come in an arbitrary order, and don't have the one-to-many
>> (object<->forcedsource) relationship, it's really expensive.
>>
>> But yes, batching should be really effective because of the shape of
>> our data, and the patterns in which we produce it. I don't think a
>> coordinated multi-table partitioning action is scalable: it means
>> that the director and all its child table rows need to be available
>> *at the same time*. (Oh, want to add another child table? Oh, I guess
>> I need to repartition the director and the other 4 child tables and
>> reload them. Uh, no.)
>>
>> My point is not that we don't know how to scale it, but that the
>> processing model is different from what we do now.
>> The current partitioner and loader can load director and child tables
>> one at a time, without checking existing data, sharing only partitioning
>> parameters (stripes/substripes). The catch is that it requires the
>> child table to be pre-joined (effectively) to the director table in
>> order to have partitioning coordinates. We know this is insufficient.
> Using the secondary index (built using Object table) to compute the
> chunk of a given Source w.r.t. its objectId column would avoid this
> join, isn't it?
> For example, if a source i has objectId field equal to j, then we can
> query the secondary index on objectId=j to get the chunk of the
> source, this should work.
> Of course we have to build the secondary index prior to this operation.

This is effectively a join, no? I'm not suggesting sending a SQL join
query into the czar's normal pipeline. But looking up chunkId with the
secondary index is an index-only join.

I think we might still want to create a smaller lookup table for each
batch of child table rows, depending on how fast we can make the full
index lookups: Is it faster to do 10 million full-index lookups (on
disk), or to do 100k full-index lookups, build a 100k-entry hash table,
and then do 10 million lookups on the in-memory table? I don't know yet.

-Daniel
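
To make the two options in the last paragraph concrete, here is a minimal
Python sketch. Everything in it is an assumption made for illustration:
`FullIndex`, `assign_chunks_direct`, and `assign_chunks_batched` are
hypothetical names, not Qserv APIs, and a plain dict stands in for the
on-disk objectId-to-chunkId secondary index. It only counts how often the
full index is hit; it says nothing about which option is actually faster.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, int]               # (child row id, objectId)
PlacedRow = Tuple[int, int, int]    # (child row id, objectId, chunkId)


class FullIndex:
    """Hypothetical stand-in for the on-disk objectId -> chunkId index."""

    def __init__(self, mapping: Dict[int, int]) -> None:
        self._mapping = mapping
        self.lookups = 0  # number of times the "expensive" index was hit

    def lookup(self, object_id: int) -> int:
        self.lookups += 1
        return self._mapping[object_id]


def assign_chunks_direct(index: FullIndex, rows: Iterable[Row]) -> List[PlacedRow]:
    """Option 1: one full-index lookup per child row."""
    return [(row_id, oid, index.lookup(oid)) for row_id, oid in rows]


def assign_chunks_batched(index: FullIndex, rows: Iterable[Row]) -> List[PlacedRow]:
    """Option 2: one full-index lookup per distinct objectId in the batch,
    then resolve every child row against a small in-memory hash table."""
    rows = list(rows)
    distinct_ids = {oid for _, oid in rows}
    local = {oid: index.lookup(oid) for oid in distinct_ids}
    return [(row_id, oid, local[oid]) for row_id, oid in rows]


if __name__ == "__main__":
    # Made-up data: 1000 objects, 100 child rows per object.
    mapping = {oid: oid % 7 for oid in range(1000)}
    rows = [(i, i % 1000) for i in range(100_000)]

    direct_index = FullIndex(mapping)
    batched_index = FullIndex(mapping)
    direct = assign_chunks_direct(direct_index, rows)
    batched = assign_chunks_batched(batched_index, rows)

    assert direct == batched  # both options place rows identically
    print("full-index lookups, direct: ", direct_index.lookups)
    print("full-index lookups, batched:", batched_index.lookups)
```

Both functions return the same placement; the difference is that the direct
path hits the full index once per child row, while the batched path hits it
once per distinct objectId and pays for building the in-memory table
instead. Whether that trade wins depends on the real cost of a full-index
lookup, which is the open question above.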