> I think we're on the same page. The general case is truly expensive:
> loading arbitrary child table rows requires lookups on the director
> table. We can certainly batch this, but again, if child table rows
> come in an arbitrary order, and don't have the one-to-many
> (object<->forcedsource) relationship, it's really expensive.

> and we'll just have a note in the usage guide for completely
> different data domains that, hey, sort your input by the director's
> key column or your ingest performance could be terrible.

	First, you can remove redundant queries by querying only once
for each objectId.  You can do this by caching the results, by
pre-sorting the child table rows by objectId, or by pre-scanning the
child table rows and accumulating a list of distinct objectIds.  (The
partitioner should be allowed to do whatever is necessary to the input
data, including reading it multiple times; if we're streaming the data
in, it could be allowed to store a copy on disk.)
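
	As a rough sketch of the pre-scanning option (not the
partitioner's actual code), the following Python collects the distinct
objectIds from child rows that arrive in arbitrary order.  The function
name distinct_object_ids and the dict-per-row layout are assumptions
made for illustration only.

```python
def distinct_object_ids(child_rows, key="objectId"):
    """Return the distinct director keys in first-seen order.

    child_rows is assumed to be an iterable of dict-like rows; the
    real partitioner input format may well differ.
    """
    seen = set()
    ordered = []
    for row in child_rows:
        oid = row[key]
        if oid not in seen:
            seen.add(oid)        # remember the key so it is looked up once
            ordered.append(oid)
    return ordered


# Hypothetical child rows in arbitrary order.
rows = [
    {"objectId": 7, "flux": 1.0},
    {"objectId": 3, "flux": 2.0},
    {"objectId": 7, "flux": 1.5},
]
ids = distinct_object_ids(rows)  # two lookups needed instead of three
```

	The set costs memory proportional to the number of distinct
objectIds; for inputs too large for that, pre-sorting on disk gets the
same result.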

	Second, you can reduce the total number of queries by batching
multiple objectIds into the same query -- "SELECT ra, decl FROM
secondaryIndex WHERE objectId IN (..., ..., ...)" or its equivalent if
you're using a non-SQL index.
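
	Here is a minimal sketch of that batching, using Python's
sqlite3 module purely as a stand-in for whatever actually holds the
secondary index.  The lookup_positions helper, the batch size, and the
in-memory table are all assumptions; I also select objectId alongside
ra and decl so that each result can be matched back to its key.

```python
import sqlite3


def lookup_positions(conn, object_ids, batch_size=500):
    """Fetch (ra, decl) for many objectIds, one query per batch."""
    ids = list(object_ids)
    positions = {}
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        # One "?" placeholder per id keeps the query parameterized.
        placeholders = ",".join("?" * len(batch))
        sql = ("SELECT objectId, ra, decl FROM secondaryIndex "
               f"WHERE objectId IN ({placeholders})")
        for oid, ra, decl in conn.execute(sql, batch):
            positions[oid] = (ra, decl)
    return positions


# Hypothetical in-memory index, only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE secondaryIndex "
             "(objectId INTEGER PRIMARY KEY, ra REAL, decl REAL)")
conn.executemany("INSERT INTO secondaryIndex VALUES (?, ?, ?)",
                 [(1, 10.0, -5.0), (2, 20.0, 5.0), (3, 30.0, 15.0)])
result = lookup_positions(conn, [1, 3, 2], batch_size=2)  # 2 queries, not 3
```

	The right batch size depends on the backend; most engines cap
the number of bound parameters or the length of the statement.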

	As you say, for the LSST use cases with up to 1000 child table
entries per director table entry, this can reduce the number of queries
by several orders of magnitude.  But even for arbitrary inputs I think
you can still reduce the queries by one or two orders of magnitude.
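
	To put rough numbers on this, here is a back-of-the-envelope
calculation.  The row counts and the batch size of 100 are made-up
figures for illustration, not measurements.

```python
import math


def query_count(n_distinct_ids, batch_size):
    """Queries needed when each distinct id is looked up once, in batches."""
    return math.ceil(n_distinct_ids / batch_size)


n_child_rows = 1_000_000      # hypothetical input size
naive = n_child_rows          # one lookup per child row

# LSST-like case: 1000 child rows per director row -> 1000 distinct ids.
lsst_like = query_count(n_child_rows // 1000, batch_size=100)

# Worst case: every child row refers to a different director row,
# so only the batching helps.
arbitrary = query_count(n_child_rows, batch_size=100)

print(naive, lsst_like, arbitrary)  # 1000000 10 10000
```

	In this example the LSST-like input drops five orders of
magnitude, while the fully arbitrary input still drops two from
batching alone.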

-- 
Kian-Tat Lim, LSST Data Management, [log in to unmask]
