QSERV-L Archives (QSERV-L@LISTSERV.SLAC.STANFORD.EDU), October 2013

Subject: notes from data ingest discussion (Oct 29)
From: Jacek Becla <[log in to unmask]>
Reply-To: General discussion for qserv (LSST prototype baseline catalog)
Date: Tue, 29 Oct 2013 17:30:54 -0700

Douglas, K-T, Serge, Fabrice, Bill, Jacek

"internal publishing" = providing limited access to a small
selected group that does QA. No science, no publications
based on that data. Data can change

types of updates on internally published data?
  - typically just adding quality flags, but can't preclude
    larger changes, e.g. might need to fix code and rerun
    parts of analysis on full data or subset of data,
    output data might change
  - so, set reasonable restrictions, document what we
    can do/support


hardware used for data loading?
  - all dedicated, don't mix with production servers
  --> ACTION: need to capture in storage model [Jacek]


non-trivial issue that we need to deal with:
if one of the nodes goes down while we are running the
partitioning and deleting a chunk, we want to continue and
not wait for that node. When that node later comes back up,
we need to clean up the data that was supposed to be deleted.
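
rough sketch (python; the journal file and the delete_chunk callback are
made up, not existing qserv code) of one way to journal deletions that
missed a downed node and replay them when it rejoins:

import json
from pathlib import Path

# assumed location of a simple cluster-wide journal of deferred deletions
JOURNAL = Path("/tmp/pending_chunk_deletions.json")

def record_pending_delete(node: str, chunk_id: int) -> None:
    """Remember a chunk deletion that could not reach a downed node."""
    pending = json.loads(JOURNAL.read_text()) if JOURNAL.exists() else []
    pending.append({"node": node, "chunk": chunk_id})
    JOURNAL.write_text(json.dumps(pending))

def replay_pending_deletes(node: str, delete_chunk) -> None:
    """When 'node' rejoins, apply the deletions it missed and clear them."""
    pending = json.loads(JOURNAL.read_text()) if JOURNAL.exists() else []
    remaining = []
    for entry in pending:
        if entry["node"] == node:
            delete_chunk(node, entry["chunk"])  # caller supplies the real delete
        else:
            remaining.append(entry)
    JOURNAL.write_text(json.dumps(remaining))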


when adding data to existing chunks, use the merge
engine; each underlying table can have a different version.
Then combine all underlying tables for each merged table,
preferably at the end when all data is QA'ed and ready
to be released for public use.
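
rough sketch of what the combine step could look like (python generating
MySQL MERGE-engine statements; the Object_<chunkId>_v<version> naming is
made up, and the underlying tables must all be MyISAM with an identical
schema):

def merge_chunk_sql(chunk_id, versions):
    """Emit statements that combine per-version tables of one chunk."""
    parts = ", ".join(f"Object_{chunk_id}_v{v}" for v in versions)
    template = f"Object_{chunk_id}_v{versions[0]}"
    return [
        f"DROP TABLE IF EXISTS Object_{chunk_id};",
        # copy the column definitions from one of the underlying tables
        f"CREATE TABLE Object_{chunk_id} LIKE {template};",
        # turn the new table into a MERGE table over all versions
        f"ALTER TABLE Object_{chunk_id} ENGINE=MERGE, "
        f"UNION=({parts}), INSERT_METHOD=LAST;",
    ]

# example: combine three QA'ed versions of chunk 1234
for stmt in merge_chunk_sql(1234, [1, 2, 3]):
    print(stmt)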


Will partitioner be sending data to qms?
  - yes, e.g. largest angular separation between source
    and object
  - it could also produce the empty chunk list

watch out for this issue: an empty chunk can have a non-empty
overlap table. That complicates generating the objectId index.
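
small sketch of deriving the empty chunk list while flagging that case
(python; the row-count inputs are made up, in practice the partitioner
would report them):

def empty_chunks(chunk_rows, overlap_rows):
    """Return (empty chunk ids, empty chunks whose overlap is non-empty)."""
    empty, needs_care = [], []
    for chunk_id, n in chunk_rows.items():
        if n == 0:
            empty.append(chunk_id)
            if overlap_rows.get(chunk_id, 0) > 0:
                # empty chunk, non-empty overlap: objectId index generation
                # has to account for this
                needs_care.append(chunk_id)
    return empty, needs_care

# example: chunk 7 has no objects but 12 rows in its overlap table
print(empty_chunks({5: 100, 7: 0}, {5: 3, 7: 12}))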


partitioner and table prep - are they distributed? Yes!


Feeding data to the partitioner - through GPFS


What if we lose the cache managed by the TablePrepMgr on one or
a small number of machines?
  - are we keeping 2 replicas? (effectively doubling storage)
  - or should we rerun and recover using the input FITS table data?
  - probably the latter

data produced by DRP: many complete files, not a stream of data

yes, input files will have good spatial locality,
it'd be best if we batch groups of files when loading


So, the plan: DRP keeps producing files and dumping them to
GPFS; we consume them in batches, say a new batch every
day or week. Batching helps with segregating writes and
optimizing disk I/O.
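
rough sketch of daily batching (python; the GPFS drop directory and the
*.fits pattern are assumptions):

from collections import defaultdict
from datetime import date
from pathlib import Path

def batch_by_day(incoming_dir):
    """Group files dropped on GPFS into one batch per day of arrival."""
    batches = defaultdict(list)
    for path in Path(incoming_dir).glob("*.fits"):
        day = date.fromtimestamp(path.stat().st_mtime)
        batches[day].append(path)
    return batches

# example: inspect what arrived so far, load one day's batch at a time
# batches = batch_by_day("/gpfs/drp/incoming")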

If we reprocess after finding problems during QA, before we
make data available to the public, we will end up with two versions
of the same objects (positions can change, everything can
change, objectIds stay), so we need to throw away the existing
chunks corresponding to the reprocessed data and reinsert the data.


Avoid merging while we are still doing QA, because it might
complicate capturing provenance.

but that means we will have to deal with many files:

20K partitions, each ~1 sq deg, so a few hundred input
files per chunk. That is:

20K x ~5 tables x 3 files per table x say 300 merge tables
= 90 million files, almost 100 million! (distributed, but
still a lot)


Create db and empty tables before loading. That is a separate
step. Don't do a "special first load that creates tables"


all data fed to the data loader should be in ready-to-load
format, no astronomy-related math


Expected schema of the data that the data loader gets from DRP?
  - same as the baseline schema
  - but we will need a few extra columns, like a procHistoryId
    column or chunkId. The loader should add these
  - the loader should also ensure the schema and data match
    (sketch below)
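
sketch of those two loader responsibilities (python; the column and table
names are illustrative, only chunkId/procHistoryId come from the notes):

BASELINE_OBJECT_COLUMNS = ["objectId", "ra", "decl", "flags"]  # assumed subset
LOADER_ADDED_COLUMNS = ["chunkId", "procHistoryId"]

def extra_column_sql(table):
    """Statements the loader could run to add its bookkeeping columns."""
    return [f"ALTER TABLE {table} ADD COLUMN {col} BIGINT;"
            for col in LOADER_ADDED_COLUMNS]

def check_schema(incoming_columns):
    """Fail loudly if the incoming data does not match the expected schema."""
    expected, got = set(BASELINE_OBJECT_COLUMNS), set(incoming_columns)
    if got != expected:
        raise ValueError(f"schema mismatch: missing={expected - got}, "
                         f"extra={got - expected}")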


But the data products we are ingesting are used for other
internal things in production, so forcing apps code
to use the units we want in the database is not a good idea.
We might need a conversion step to realign units. So,
we are proposing:
  - write a separate converter that transforms output
    from DRP to the desired schema
  - the data loader provides a plugin API, the apps team
    implements the plugin (rough sketch of such an
    interface below)
  - this needs to be discussed with the rest of DM
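
rough sketch of what such a plugin interface might look like (python;
this is just an illustration of the proposal, not an agreed API):

from abc import ABC, abstractmethod
import math

class ConverterPlugin(ABC):
    """Transforms one DRP output record into a ready-to-load row."""

    @abstractmethod
    def convert(self, drp_record: dict) -> dict:
        ...

class ExampleUnitConverter(ConverterPlugin):
    """Hypothetical apps-team plugin: realign ra/decl from radians to degrees."""

    def convert(self, drp_record: dict) -> dict:
        row = dict(drp_record)
        for col in ("ra", "decl"):
            if col in row:
                row[col] = math.degrees(row[col])
        return row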

Also, we need to deal with name conversion/mapping
of different fields.
  - discuss with the rest of DM


Partitioner should be flexible enough to handle
any schema that it gets; this will be important
for testing/commissioning, etc.


Is the schema for chunks always the same, independent
of data quality? Yes!


processing history id will be recorded by the orch layer
and stored in provenance tables
  - the procHistoryId will come with the data files
  - FITS metadata is a good place to put it (sketch below)
  - we will need to capture additional provenance that
    captures how the data was loaded, on which machines, etc.
    - create a new procHistoryId for that
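
sketch of pulling the id out of FITS metadata (python/astropy; the PROCHIST
keyword name is made up, only "FITS metadata is a good place to put it"
comes from the discussion):

from astropy.io import fits

def proc_history_id(path, keyword="PROCHIST"):
    """Read the processing-history id from the primary FITS header."""
    with fits.open(path) as hdul:
        return str(hdul[0].header[keyword])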

need to model provenance for loader, or
use task/taskGraph for that

--> ACTION: merge "data ingest" trac page with
     "data loading" page, and update the page to
     capture what was discussed at this meeting
     [Jacek]

Jacek
