On 02/26/2015 03:46 PM, Fabrice Jammes wrote:
> Daniel,
>
> I have a question between the lines below.
>
> On 02/26/2015 03:32 PM, Daniel L. Wang wrote:
>> On 02/26/2015 01:45 PM, Kian-Tat Lim wrote:
>>>> The general case is very expensive (lookup position and chunk for each
>>>> position!?), and we are only going to get away with it because our
>>>> bulk-loads for ForcedSource will be spatially-restricted.
>>>     I'm pretty sure that ForcedSource and (final) Object tables will
>>> be available at the same time, so the partitioner could look up the
>>> coordinates based on objectId using a simple merge.  I thought you were
>>> already going to have a central objectId-to-chunk (or even subchunk?)
>>> index anyway; scanning an input ForcedSource table for all its objectIds
>>> and then doing a single query (or at least batching objectIds to reduce
>>> the number of queries by an order of magnitude) to get the mappings
>>> doesn't sound ridiculous.
>> I think we're on the same page. The general case is truly expensive: 
>> loading arbitrary child table rows requires lookups on the director 
>> table. We can certainly batch this, but again, if child table rows 
>> come in an arbitrary order, and don't have the one-to-many 
>> (object<->forcedsource) relationship, it's really expensive.
>>
>> But yes, batching should be really effective because of the shape of 
>> our data, and the patterns in which we produce it. I don't think a 
>> coordinated multi-table partitioning action is scalable: it means 
>> that the director and all its child table rows need to be available 
>> *at the same time*. (Oh, want to add another child table? Oh, I guess 
>> I need to repartition the director and the other 4 child tables and 
>> reload them. Uh, no.)
>>
>> My point is not that we don't know how to scale it, but that the 
>> processing model is different from what we do now. The current 
>> partitioner and loader can load director and child tables one at a 
>> time, without checking existing data, sharing only partitioning 
>> parameters (stripes/substripes). The catch is that it requires the 
>> child table to be pre-joined (effectively) to the director table in 
>> order to have partitioning coordinates. We know this is insufficient.
> Using the secondary index (built from the Object table) to compute the 
> chunk of a given Source from its objectId column would avoid this 
> join, wouldn't it?
> For example, if a source i has an objectId field equal to j, then we can 
> query the secondary index on objectId=j to get the chunk of the 
> source. This should work.
> Of course, we have to build the secondary index prior to this operation.
This is effectively a join, no? I'm not suggesting sending a SQL join 
query into the czar's normal pipeline. But looking up chunkId with the 
secondary index is an index-only join. I think we might still want to 
create a smaller lookup table for each batch of child table rows, 
depending on how fast we can make the full index lookups: Is it faster 
to do 10 million full-index lookups (on disk), or to do 100k full-index 
lookups, build a 100k-entry hash table, and then do 10 million lookups 
on the in-memory table? I don't know yet.

  -Daniel
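
The batch-local lookup table discussed above can be sketched roughly as
follows. This is only an illustration under stated assumptions, not Qserv
code: the function assign_chunks, the (objectId, payload) row layout, and
the fake_lookup stand-in for the secondary index are all hypothetical. The
idea is to collect the distinct objectIds in a batch of child rows, resolve
them with one batched index query, and then do every per-row lookup against
the resulting in-memory table.

```python
from collections import defaultdict


def assign_chunks(child_rows, lookup_chunk_ids):
    """Group child-table rows by chunkId using a batch-local lookup table.

    child_rows: iterable of (objectId, payload) tuples (hypothetical layout).
    lookup_chunk_ids: callable that takes a list of distinct objectIds and
        returns a dict objectId -> chunkId. It stands in for the secondary
        index; how that index is really queried is not specified here.
    """
    rows = list(child_rows)
    # Distinct objectIds in this batch: far fewer than rows when many
    # child rows share one director (Object) row.
    distinct_ids = sorted({object_id for object_id, _ in rows})
    # One batched query against the full index builds the small table.
    local_index = lookup_chunk_ids(distinct_ids)
    # Every per-row lookup now hits the in-memory table only.
    by_chunk = defaultdict(list)
    for object_id, payload in rows:
        by_chunk[local_index[object_id]].append((object_id, payload))
    return dict(by_chunk)


# Toy stand-in for the secondary index (made-up objectId -> chunkId values).
SECONDARY_INDEX = {1: 100, 2: 100, 3: 205}
calls = []


def fake_lookup(object_ids):
    calls.append(list(object_ids))  # record each batched index query
    return {i: SECONDARY_INDEX[i] for i in object_ids}


rows = [(1, "a"), (3, "b"), (1, "c"), (2, "d"), (3, "e")]
result = assign_chunks(rows, fake_lookup)
```

In this toy batch, five child rows need only three index lookups, issued as
a single batched call. Whether this beats per-row lookups on the full index
depends on the ratio of child rows to distinct objectIds and on the cost of
a full-index lookup, which is the open question raised above.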

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the QSERV-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1