Douglas, K-T, Serge, Fabrice, Bill, Jacek

"Internal publishing" = providing limited access to a small selected group that does QA. No science, no publications based on that data.

What types of updates can happen on internally published data?
- typically just adding quality flags, but we can't preclude larger changes, e.g. we might need to fix code and rerun parts of the analysis on the full data or a subset of it; output data might change
- so: set reasonable restrictions, and document what we can do/support

Hardware used for data loading?
- all dedicated, don't mix with production servers
--> ACTION: need to capture in storage model [Jacek]

Non-trivial issue we need to deal with: if one of the nodes goes down while we are deleting a chunk during partitioning, we want to continue and not wait for that node. When the node later comes back up, we need to clean up the data that was supposed to be deleted.

When adding data to existing chunks, use the merge engine; each underlying table can have a different version. Then combine all underlying tables for each merged table, preferably at the end, when all data is QA'ed and ready to be released for public use.

Will the partitioner be sending data to qms?
- yes, e.g. the largest angular separation between source and object
- it could also produce the empty-chunk list

Watch out for this issue: an empty chunk can have a non-empty overlap table. That complicates generating the objectId index.

Are the partitioner and table prep distributed? Yes!

Feeding data to the partitioner: through gpfs.

What if we lose the cache managed by the TablePrepMgr on one or a small number of machines?
- are we keeping 2 replicas? (effectively doubling storage)
- or should we rerun and recover using the input fits table data?
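If we go with rerun-and-recover rather than keeping a second replica, the recovery path could look roughly like the sketch below: re-run the partitioner over the original input files and keep only the chunks that lived on the failed machines. This is only an illustration; the partitioner stand-in and all names are hypothetical.

```python
# Sketch: recover a lost TablePrepMgr cache by re-partitioning the
# original input files, instead of reading from a second replica.
# All names here are hypothetical.

def partition(input_files):
    """Stand-in for the real partitioner: map input rows to chunks."""
    chunks = {}
    for path, rows in input_files.items():
        for chunk_id, row in rows:
            chunks.setdefault(chunk_id, []).append(row)
    return chunks

def recover_cache(lost_chunk_ids, input_files):
    """Rebuild only the chunks that lived on the failed machines."""
    rebuilt = partition(input_files)
    return {cid: rows for cid, rows in rebuilt.items() if cid in lost_chunk_ids}

# Toy input: each "file" holds (chunkId, row) pairs.
files = {
    "visit-001.fits": [(7, "src-a"), (8, "src-b")],
    "visit-002.fits": [(7, "src-c")],
}
print(recover_cache({7}, files))  # -> {7: ['src-a', 'src-c']}
```

The trade-off is exactly the one raised above: CPU time for the rerun versus doubled storage for a replica; good spatial locality of the input files makes the rerun cheap per chunk.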
- probably the latter (rerun and recover)

Data produced by DRP: many complete files, not a stream of data. Input files will have good spatial locality, so it would be best to batch groups of files when loading.

So, the plan: DRP keeps producing files and dumping them to gpfs; we consume them in batches, say a new batch every day or week. Batching helps with segregating writes and optimizing disk IO.

If we reprocess after finding problems during QA, before we make the data available to the public, we will end up with two versions of the same objects (positions can change, everything can change, objectIds stay). So we need to throw away the existing chunks corresponding to the reprocessed data and reinsert the data.

Avoid merging while we are still doing QA, because it might complicate capturing provenance. But that means we will have to deal with many files: 20K partitions, so ~1 sq deg each, so ~a few hundred input files per chunk. That is: 20K chunks x ~5 tables x 3 files per table x say 300 merge tables = 90 million, almost 100 million files! (distributed, but still a lot)

Create the db and empty tables before loading. That is a separate step. Don't do a "special first load that creates tables".

All data fed to the data loader should be in ready-to-load format, with no astronomy-related math.

Expected schema of the data that the data loader gets from DRP?
- same as the baseline schema
- but we will need a few extra columns, like a procHistoryId column or chunkId; the loader should add these
- the loader should also ensure the schema and data match

But the data products we are ingesting are used for other internal things in production, so forcing apps code to use the units we want in the database is not a good idea. We might need a conversion step to realign units. So, we are proposing:
- write a separate converter that transforms output from DRP to the desired schema
- the data loader provides a plugin api, and the apps team implements the plugin
- this needs to be discussed with the rest of DM

Also, we need to deal with name conversion/mapping of different fields.
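The converter-plugin proposal above could be sketched as follows: the loader exposes a plugin hook, the apps team supplies a plugin that renames DRP fields and realigns units to the baseline schema, and the loader itself adds the loader-owned columns (chunkId, procHistoryId). Every name, field, and conversion factor below is a hypothetical illustration, not the agreed API.

```python
# Sketch of the proposed converter step. All names are hypothetical.

RAD_TO_DEG = 57.29577951308232

class DrpToBaselinePlugin:
    """Example apps-team plugin: rename fields, convert radians to degrees."""
    FIELD_MAP = {"coord_ra": "ra", "coord_decl": "decl"}  # name mapping
    UNIT_FIX = {"ra": RAD_TO_DEG, "decl": RAD_TO_DEG}     # unit realignment

    def convert(self, drp_row):
        row = {self.FIELD_MAP.get(k, k): v for k, v in drp_row.items()}
        for field, factor in self.UNIT_FIX.items():
            row[field] = row[field] * factor
        return row

def load(rows, plugin, chunk_id, proc_history_id):
    """Loader side: run the plugin, then add the loader-owned columns."""
    out = []
    for r in rows:
        r = plugin.convert(r)
        r["chunkId"] = chunk_id            # added by the loader, not DRP
        r["procHistoryId"] = proc_history_id
        out.append(r)
    return out
```

Keeping the schema/unit knowledge in the plugin, and the chunkId/procHistoryId bookkeeping in the loader, matches the split proposed above: apps code keeps its own units internally, and only the converter knows the database's units.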
- name mapping also needs to be discussed with the rest of DM

The partitioner should be flexible enough to handle any schema it gets; this will be important for testing/commissioning, etc.

Is the schema for chunks always the same, independently of data quality? Yes!

The processing history id will be recorded by the orch layer and stored in provenance tables:
- the procHistoryId will come with the data files
- fits metadata is a good place to put it
- we will need to capture additional provenance recording how the data was loaded, on which machines, etc
- create a new procHistoryId for that

Need to model provenance for the loader, or use task/taskGraph for that.

--> ACTION: merge the "data ingest" trac page with the "data loading" page, and update the page to capture what was discussed at this meeting [Jacek]

Jacek

########################################################################

Use REPLY-ALL to reply to list

To unsubscribe from the QSERV-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1