Dear Serge,

Thanks for your answer; it's both interesting and useful.

I've just added

- Dominique and Osman from CC-IN2P3,
- Philippe Gris and Bogdan Vulpescu, physicists from LPC-IN2P3

to the discussion.

My answers are below:

On 01/17/2014 02:54 AM, Serge Monkewitz wrote:
> Hi Fabrice,
>
>      OK, I read through that page. I just want to point out that you will have to be careful to partition based on the deep source position, not the deep forced source positions. This is because duplicate forced sources are not guaranteed to have identical (ra, decl) coordinates. They will however be associated with the same deep source (have identical deepSourceId values). Thus, to run the deduplication procedure from the trac page on chunks, you'll need to ensure that duplicates always end up in the same chunk.
>
> I’m not sure whether the deep forced source data includes the ra,dec of the deep source it was derived from.
The RunDeepForcedSource table doesn't seem to contain this information.
The only fields in this table that could hold this spatial information 
are:

         cluster_id BIGINT NULL,
         cluster_coord_ra DOUBLE NULL,
         cluster_coord_decl DOUBLE NULL,


Bogdan, Dominique, or Philippe, could you please confirm that the 
RunDeepForcedSource table doesn't contain the ra,dec of the DeepSource 
entities it references?
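
A quick way to double-check (a minimal sketch; I'm assuming any such 
column would have "coord" or "ra" in its name):

         -- list any column of RunDeepForcedSource whose name
         -- mentions a coordinate
         SHOW COLUMNS FROM RunDeepForcedSource LIKE '%coord%';
         SHOW COLUMNS FROM RunDeepForcedSource LIKE '%ra%';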

>
> If it does not, things will get “interesting”. It should be the case that duplicates have positions that are extremely close to one another. So the first thing to do would be to go ahead and partition with the deep forced source position anyway. To check that no duplicates were split across chunks, you’ll want to set up the partitioner such that each chunk contains exactly one sub-chunk, and such that the overlap radius is non-zero but small (let’s say an arc-minute). This way, the partitioner will split input into chunks, and, for each chunk, provide nearby rows (the overlap). If the two deep forced sources in a duplicate pair are assigned to different chunks, then one will be in the overlap of the chunk for the other, and vice-versa.
OK, very good idea; I fully agree with it and will rely on it.
One question: could a set of duplicate RunDeepForcedSource entities 
have a cardinality greater than 2?
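(This is something we could also check directly on the loaded data; a 
minimal sketch, assuming duplicates share the same deepSourceId as you 
described:)

         -- list duplicate sets with more than two members
         SELECT deepSourceId, COUNT(*) AS n
         FROM RunDeepForcedSource
         GROUP BY deepSourceId
         HAVING COUNT(*) > 2;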
>
> So, load all chunk and chunk overlap tables, and check for the existence of split duplicate pairs by testing whether equi-joining a chunk and its overlap on deep source ID yields any rows.
OK, I understand.
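To make sure I follow, the test would look something like this (a 
sketch; the chunk and overlap table names are hypothetical):

         -- any row returned means a duplicate pair was split
         -- across chunks
         SELECT c.deepSourceId
         FROM chunk_1234 AS c
         JOIN chunk_1234_overlap AS o
           ON c.deepSourceId = o.deepSourceId;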
>
> Hopefully you will not encounter any cases where this actually happens, in which case you can just drop all the overlap tables. But if you are unlucky, you’ll need to deal with the annoyance of picking a chunk for each split duplicate pair (I would just assign the pair to the chunk with the smaller ID), and adding/removing rows from chunks to reflect your decisions.
Maybe we could tell scientists that, for entities placed near the edge 
of a chunk, related entities like the DeepSource of a DeepForcedSource 
may be placed in the nearest neighboring chunk?
Furthermore, Osman plans to use the MySQL MERGE engine to manage chunk 
unions (see the sketch below); shouldn't this technology solve this 
side effect, except at the border of the zone covered by the union of 
contiguous chunks?
This would give scientists consistent data over a significant area 
(except at its border), which could be enough for this first test. 
Don't you think so?
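For reference, here is roughly what Osman has in mind (a minimal 
sketch; the table names are hypothetical, and the MERGE engine requires 
the underlying chunk tables to be MyISAM with identical schemas):

         -- create an empty table with the structure of a chunk,
         -- then turn it into a MERGE table over the contiguous
         -- chunks (inserts disabled)
         CREATE TABLE RunDeepForcedSource_union LIKE chunk_1234;
         ALTER TABLE RunDeepForcedSource_union
             ENGINE=MERGE UNION=(chunk_1234, chunk_1235)
             INSERT_METHOD=NO;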

If not, I think two new questions arise:
- Couldn't we assign the whole pair to the chunk containing the 
DeepSource entity referenced by the pair (see the sketch after this 
list)? I think this would keep spatial queries on DeepSource (ra,dec) 
consistent.
- If a user issues a spatial query on DeepForcedSource (ra,dec), they 
could miss an element of a pair if it has been placed in a contiguous 
chunk (the one containing the other element of the pair). Is this a 
problem for scientists? I don't know.
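For the first option, identifying the rows to move could look like this 
(a minimal sketch; deep_source_chunk is a hypothetical mapping from 
each DeepSource to its chunk, which we could build while partitioning 
the DeepSource table):

         -- overlap rows whose DeepSource lives in chunk 1234:
         -- these should be moved into chunk_1234
         SELECT o.deepSourceId
         FROM chunk_1234_overlap AS o
         JOIN deep_source_chunk AS m
           ON m.deepSourceId = o.deepSourceId
         WHERE m.chunkId = 1234;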

In the future, Qserv may solve these questions, since it manages chunk 
overlaps. Maybe, for now, we could let the MERGE engine do the job? It 
seems this would give consistent data everywhere except on the border 
of the merged chunk union.

>
> I’m happy to help if you run into any problems!

Thanks again. I will soon try to use the partitioner; it is well 
documented and seems easy to use. Nevertheless, I'll let you know if I 
run into any problems.

Thanks,

Fabrice

> Cheers,
> Serge
>
> On Jan 16, 2014, at 2:23 PM, Fabrice Jammes <[log in to unmask]> wrote:
>
>> Hello Serge,
>>
>> Interesting information is available here:
>> https://dev.lsstcorp.org/trac/wiki/Summer2013/ConfigAndStackTestingPlans/DedupeForcedSources
>>
>> Thanks for your offer to help with the partitioner; I may contact you soon.
>>
>> Have a nice day,
>>
>> Fabrice
