
On Feb 26, 2015, at 4:24 PM, Fabrice Jammes <[log in to unmask]> wrote:

> On 02/26/2015 04:14 PM, Serge Monkewitz wrote:
>> On Feb 26, 2015, at 12:39 PM, Daniel L. Wang <[log in to unmask]> wrote:
>> 
>>> I would like to note that the current system requires the extra raObject and declObject columns in ForcedSource, so that table's size will be proportionally larger than it would be in production.
>> This is not quite true. The position of associated director table rows must be present in the CSV input to the partitioner. However, recall that the partitioner and data duplicator have the ability to drop columns while partitioning. Even if the data loader doesn’t quite support it yet, we should be able to produce something pretty close to the baseline ForcedSource schema (i.e. without object position or any of the other non-baseline columns that were produced by forced source measurement for stripe82).
> Ok, but wouldn't launching a SQL query against the secondary index to retrieve a source chunk w.r.t. its objectId avoid running the partitioner against the Source data? This might be faster and could also easily be map-reduced. Don't you think so?

Partitioning as it currently stands boils down to evaluating a function that computes a (chunk ID, sub-chunk ID) pair per input row (in other words, map), followed by a sort on chunk ID, and then finished by breaking the output into files by chunk ID (reduce). As far as I can tell, what is being proposed here replaces an analytic function for computing chunk IDs with a database lookup (+ extra processing so that the database lookup isn’t per row and performance isn’t totally horrendous).
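To make the map/sort/reduce framing concrete, here is a minimal sketch in Python. The chunking function below is a stand-in with made-up constants; the real partitioner derives (chunk ID, sub-chunk ID) from (ra, dec) using its actual spherical-box geometry, which is elided here.

```python
from collections import defaultdict

NUM_STRIPES = 85  # illustrative only, not the production value

def compute_chunk(ra, dec):
    """Map step: an analytic function from position to (chunkId, subChunkId).
    This is a toy stand-in for the real spherical chunking math."""
    stripe = int((dec + 90.0) / 180.0 * NUM_STRIPES)
    chunk_in_stripe = int(ra / 360.0 * (2 * NUM_STRIPES))
    return stripe * 1000 + chunk_in_stripe, 0  # sub-chunk math elided

def partition(rows):
    """Sort + reduce steps: order rows by chunk ID, then split the
    output into per-chunk groups (files, in the real partitioner).
    Each row is assumed to start with (ra, dec)."""
    keyed = sorted(rows, key=lambda r: compute_chunk(r[0], r[1])[0])
    out = defaultdict(list)
    for row in keyed:
        out[compute_chunk(row[0], row[1])[0]].append(row)
    return dict(out)
```

The proposal under discussion would replace `compute_chunk` with a lookup against the secondary index; everything after the map step stays the same.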

So I don’t follow how the proposal is substantially different from what currently happens. It adds some complexity, both in terms of code and because it introduces load order constraints (you cannot load a forced source before you’ve loaded, or at least seen, the corresponding object). If you have those, then as far as I can see, you might as well proceed by object batch. In other words, partition a batch of objects, remember the chunks and object IDs you saw, then switch to the various child tables, and never query some central db.
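The batch-ordered approach could look something like the sketch below: partition a batch of director (Object) rows first, remembering objectId -> chunkId in memory, then route child (ForcedSource) rows from that map, with no central database in the loop. The function name and row layouts are illustrative, not the actual loader API.

```python
def load_batch(objects, child_rows, chunk_of):
    """Partition a batch of director rows, then route child rows.

    objects:    iterable of (objectId, ra, dec) director rows
    child_rows: iterable of rows whose first field is objectId
    chunk_of:   the analytic (ra, dec) -> chunkId function
    """
    # Pass 1: partition the directors, remembering what we saw.
    seen = {}  # objectId -> chunkId
    for obj_id, ra, dec in objects:
        seen[obj_id] = chunk_of(ra, dec)
    # Pass 2: child rows carry only the objectId; positions need
    # not be present in the child CSV at all.
    routed = {}
    for row in child_rows:
        routed.setdefault(seen[row[0]], []).append(row)
    return routed
```

This enforces the load-order constraint naturally: a child row whose objectId was never seen in the batch raises a `KeyError` instead of being silently mis-routed.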

If we cannot do that for whatever reason, then I guess we ingest/process all director rows before looking at any child rows. Even in that case, there have been threads on this list discussing custom indexes (external to the db) that would both involve minimal seeking for searches and very good data compression (for the likely LSST object ID generation strategy). I guess we can put the index into a (no-)SQL database instead, but… does that actually buy us very much?
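For a sense of how simple such an external index can be, here is a minimal sketch: two parallel sorted arrays queried by binary search. Lookups are O(log n) with no per-row database round trip, and if object IDs are generated monotonically (as seems likely for LSST), the ID array delta-encodes very compactly. This is an illustration of the idea, not any existing Qserv component.

```python
import bisect

class ChunkIndex:
    """Compact objectId -> chunkId index: sorted parallel arrays."""

    def __init__(self, pairs):
        # pairs: iterable of (objectId, chunkId)
        pairs = sorted(pairs)
        self.ids = [p[0] for p in pairs]
        self.chunks = [p[1] for p in pairs]

    def lookup(self, object_id):
        # Binary search for the object ID; minimal seeking even
        # when the arrays are memory-mapped from disk.
        i = bisect.bisect_left(self.ids, object_id)
        if i < len(self.ids) and self.ids[i] == object_id:
            return self.chunks[i]
        raise KeyError(object_id)
```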

Finally, pre-sorting child tables according to the director table PK isn’t necessarily a win. While that leads to a small, localized read footprint on the objectId -> chunkId mapping, object IDs that are nearby in “ID-space” could be scattered all over the sky, leading to lots of small writes instead.

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the QSERV-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1