> I think we're on the same page. The general case is truly expensive:
> loading arbitrary child table rows requires lookups on the director
> table. We can certainly batch this, but again, if child table rows
> come in an arbitrary order, and don't have the one-to-many
> (object<->forcedsource) relationship, it's really expensive.

> and we'll just have a note in the usage guide for completely
> different data domains that, hey, sort your input by the director's
> key column or your ingest performance could be terrible.

First, you can remove redundant queries by querying only once for each
objectId. This can be done by caching the results, by pre-sorting the
child table rows by objectId, or by just pre-scanning the child table
rows and accumulating a list of distinct objectIds. (The partitioner
should be allowed to do whatever is necessary to the input data,
including reading it multiple times; if we're streaming the data in, it
could be allowed to store a copy on disk.)

Second, you can reduce the total number of queries by batching multiple
objectIds into the same query -- "SELECT ra, decl FROM secondaryIndex
WHERE objectId IN (..., ..., ...)" or its equivalent if you're using a
non-SQL index.

As you say, for the LSST use cases with up to 1000 child table entries
per director table entry, this can reduce the number of queries by
several orders of magnitude. But even for arbitrary inputs I think you
can still reduce the queries by one or two orders of magnitude.

--
Kian-Tat Lim, LSST Data Management
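
The two reductions described in the message (look up each objectId only once, then batch several objectIds into one IN query) can be sketched roughly as follows. This is an illustration only, not the partitioner's actual code: it uses an in-memory SQLite table as a stand-in for the secondary index, and the helper names (distinct_ids, batched, lookup_positions, demo), the batch size, and the sample data are all assumptions made for the example. The table and column names (secondaryIndex, objectId, ra, decl) are taken from the query quoted in the message.

```python
import sqlite3
from itertools import islice


def distinct_ids(child_rows, key="objectId"):
    """Pre-scan the child rows and keep each director key once (first-seen order)."""
    return list(dict.fromkeys(row[key] for row in child_rows))


def batched(ids, size):
    """Yield successive lists of at most `size` ids."""
    it = iter(ids)
    while chunk := list(islice(it, size)):
        yield chunk


def lookup_positions(conn, ids, batch_size=500):
    """Return ({objectId: (ra, decl)}, query_count), issuing one IN query per batch.

    batch_size is an assumed tuning knob; it must stay below the database's
    limit on bound parameters per statement.
    """
    positions = {}
    queries = 0
    for chunk in batched(ids, batch_size):
        placeholders = ",".join("?" * len(chunk))
        sql = (
            "SELECT objectId, ra, decl FROM secondaryIndex "
            f"WHERE objectId IN ({placeholders})"
        )
        for object_id, ra, decl in conn.execute(sql, chunk):
            positions[object_id] = (ra, decl)
        queries += 1
    return positions, queries


def demo():
    """Build a toy index and child table, then count the queries needed."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE secondaryIndex "
        "(objectId INTEGER PRIMARY KEY, ra REAL, decl REAL)"
    )
    conn.executemany(
        "INSERT INTO secondaryIndex VALUES (?, ?, ?)",
        [(i, i * 0.1, -i * 0.1) for i in range(10)],
    )
    # 3000 child rows in a scrambled order that reference only 10 objects.
    child_rows = [{"objectId": (i * 7) % 10, "flux": float(i)} for i in range(3000)]
    ids = distinct_ids(child_rows)
    _, queries = lookup_positions(conn, ids, batch_size=4)
    return len(child_rows), len(ids), queries
```

In this toy case, demo() returns (3000, 10, 3): a naive per-row lookup would issue 3000 queries, de-duplication alone brings that to 10, and batching four ids per query brings it to 3. Real ratios depend on how many child rows share a director key and on the batch size the index tolerates.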