Print

Print


Srini,

> >So I have a catalog of 600 million stars. The data that I have is :
> >
> >ra, dec, muRa, muDec, …. some mag columns etc etc.
> >
> >I want to add a few columns,
> >
> >1. Add column for each row that tells me how many stars I have in the
> >neighborhood with distance x.
> >2. Add column for each row that tells me how many stars I have in the
> >neighborhood with distance x + 1.

This is essentially a spatial (self-)cross-match plus GROUP BY (or
subquery) and COUNT(*) aggregation.

SELECT A.id, COUNT(*) AS neighbors_x FROM Stars AS A, Stars AS B
WHERE A.id != B.id AND SPH_DIST(A.ra, A.dec, B.ra, B.dec) <= x
GROUP BY A.id

Executing this directly will typically be very inefficient, of course.

Adding RA and dec ranges to the WHERE clause (like B.ra BETWEEN A.ra - x
AND A.ra + x, but handling wraparound) could speed things up
significantly if you've B-tree indexed on RA and dec and your database
engine can handle the range indexing well.  I think MySQL does *not* do
this well, SQL Server does (hence its usage in SDSS), and I'm not sure
about PostGreSQL.  Computing zones (RA bands within dec stripes) or HTM
trixel indexes can also help optimize the query.  An R-tree index is
another possibility, but it may require that you transform to x,y,z
coordinates for best speed.

Qserv's chunk/subchunk scheme is supposed to make queries like this more
efficient.  You could try running on a single-node instance.

-- 
Kian-Tat Lim, LSST Data Management, [log in to unmask]

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the QSERV-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1