It also may be the case that shared scans on the worker are ordered simply by chunkId. This may be fine while the primary tables are Object, Object_Extra, Source, or Forced Source in most cases, but as the number of possible tables grows, we may want to change the ordering to be by chunkId and then by primary table name.

-John

On 02/05/16 14:07, Becla, Jacek wrote:
> John
>
> Let's talk about it on the list.
>
> The issue of scanning older releases is a big one; it is still
> unclear. It is pretty much about balancing the cost and guessing what
> users will want. I have RFC-134 open about it, and I am scheduled
> to talk to Mario on Monday to understand it better.
>
> The thinking behind keeping the current and next chunk in memory was
> that we fetch chunk x+1 while we are processing chunk x, which is
> already in memory (purely CPU, no disk I/O). This would help us keep
> the disks busy.
>
> BTW, you might watch RFC-132 as well. It'd get pretty nasty if we had
> to do efficient joins between data releases. Also, RFC-133 is
> interesting; we might need an additional scan for Dia* tables
> (level 1 data processed through DRP). It is unclear yet whether that is
> needed (I am guessing it will be), and whether we'd need to do any
> joins between Dia* and, say, Object. I'm checking with Mario on that.
>
> I'll keep you all updated on the input I get from Mario. I do want to
> sort out the requirements side of things (older DRs, cross-DR joins,
> etc.) in the next few weeks.
>
> thanks,
> Jacek
>
>
>> On Feb 5, 2016, at 1:52 PM, Gates, John H <[log in to unmask]> wrote:
>>
>> Hi Jacek,
>>
>> I was looking through LDM-135 and came across a couple of things.
>>
>>   * one synchronized full table scan of Object_Narrow and
>>     Object_Extra every 8 hours for the latest and _previous data
>>     releases._
>>
>> This is the only scan that specifically mentions the previous data
>> releases. I'm not sure why it is special like that.
>> It also looks
>> like we aren't making any promises at all about other scans for older
>> releases.
>>
>>   * For self-joins, a single shared scan will be sufficient;
>>     however, each node must have sufficient memory to hold 2 chunks at
>>     any given time (the processed chunk and the next chunk). Refer to
>>     the sizing model [LDM-141] for further details on the cost of
>>     shared scans.
>>
>> I'm not sure why a node needs enough memory to hold both the processed
>> chunk and the next chunk. I would think it only needs enough memory
>> to deal with the chunk it is working on right now, assuming this is
>> per shared scan.
>>
>> I also had a couple of thoughts on the sizing model and shared scans.
>> It looks like we're heading for one cluster per release, which is
>> simple but means doing joins across releases is not easy. If, on the
>> other hand, it is decided to put all releases on the same cluster and
>> to put all the same chunks for all the releases on one node, the
>> worker scheduler can be pretty easily modified to order scans by
>> chunkId and then by release number. Provided worker nodes and czars
>> are added with each new release, it should work. Joins on Object_Extra
>> between releases could require a fair bit of memory, but if one
>> release is underutilized, the other queries should be faster.
>>
>> -John
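[Editor's note] The scheduling ideas discussed above — order a worker's shared-scan tasks by chunkId, then by release number (if several data releases share a node), then by primary table name — amount to a compound sort key. A minimal sketch, assuming a hypothetical `ScanTask` shape (these are not Qserv's actual scheduler types):

```python
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class ScanTask:
    """One pending shared-scan task. Field order defines priority:
    chunkId first, then release number, then primary table name.
    (Hypothetical sketch, not Qserv's real task representation.)"""
    chunk_id: int
    release: int
    table: str
    query_id: str = field(compare=False, default="")

def schedule(tasks):
    """Drain tasks in shared-scan order, so queries touching the same
    chunk (and ideally the same table) run back to back and share I/O."""
    heap = list(tasks)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

tasks = [
    ScanTask(7, 2, "Object", "q1"),
    ScanTask(3, 2, "Source", "q2"),
    ScanTask(3, 1, "Object", "q3"),
    ScanTask(3, 2, "Object", "q4"),
]
for t in schedule(tasks):
    print(t.chunk_id, t.release, t.table, t.query_id)
# chunk 3 tasks group together, ordered by release then table,
# before any chunk 7 work starts
```

Whether release number or table name takes precedence after chunkId is an open design choice; the sketch picks one ordering only to make the idea concrete.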
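[Editor's note] Jacek's rationale for holding two chunks per node — fetch chunk x+1 from disk while the CPU works on chunk x, which is already in memory — is a double-buffering pattern. A minimal sketch; `load_chunk` and `process_chunk` are stand-ins, not real Qserv calls:

```python
from concurrent.futures import ThreadPoolExecutor

def load_chunk(chunk_id):
    """Stand-in for the disk read of one chunk (I/O-bound)."""
    return f"data-for-chunk-{chunk_id}"

def process_chunk(data):
    """Stand-in for the CPU-bound query work on an in-memory chunk."""
    return f"result({data})"

def shared_scan(chunk_ids):
    """Process chunks in order, kicking off the read of chunk x+1
    before processing chunk x, so disk I/O overlaps CPU work and the
    disks stay busy. At most two chunks are resident at once: the one
    being processed and the one being fetched."""
    if not chunk_ids:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_chunk, chunk_ids[0])
        for nxt in chunk_ids[1:]:
            current = pending.result()             # chunk x now in memory
            pending = io.submit(load_chunk, nxt)   # start fetching chunk x+1
            results.append(process_chunk(current)) # CPU work overlaps the I/O
        results.append(process_chunk(pending.result()))
    return results

print(shared_scan([1, 2, 3]))
```

This illustrates why the sizing model budgets memory for two chunks per node: dropping to one chunk serializes the disk read and the CPU work, leaving the disks idle during processing.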