OK. I mean you are right, there is no point in duplicating the effort. I felt
that a clean, self-contained implementation would be more maintainable in the
long run. I will probably call at noon.

Andy

On Tue, 1 Dec 2015, Jacek Becla wrote:

> [sending to the list]
>
> Andy,
>
> We sort of realized you were thinking about implementing the scheduler
> all on your side; that is why I asked so prominently :), there is no
> point in wasting effort. Our perhaps naive reaction was "but the
> design of the existing ScanScheduler is not too bad, maybe we should
> at least reuse it?" But I think we can be talked out of it! If you
> call in tomorrow, maybe we can discuss all this a bit.
>
> Thanks
> Jacek
>
>
> -------- Forwarded Message --------
> Subject: Re: shared scan
> Date: Tue, 1 Dec 2015 02:23:21 -0800
> From: Andrew Hanushevsky <[log in to unmask]>
> To: Becla, Jacek <[log in to unmask]>
> CC: Gates, John H <[log in to unmask]>, Mueller, Fritz <[log in to unmask]>
>
> Hi All,
>
> Well, it may seem that we are working at cross-purposes. I was under the
> impression that the new (proposed) shared scan scheduler would replace what
> is in Qserv at the moment. So, it would seem to me that devoting a lot of
> time to further improving what's there would be misplaced. I am proposing a
> rather self-contained interface. The shared scan scheduler works on all of
> the worker nodes, trying to maximize the use of locked memory while
> minimizing the amount that is locked. That is not an easy task. If you
> look closely at AddQuery() it should be apparent that the scheduler wants
> to know which tables a query will need to access and whether those tables
> need to be locked. The tables you pass into Create() can be optionally
> locked up front (I would assume that at least the Object table would
> always be locked -- otherwise it doesn't make much sense resource-wise).
> Anyway, it would be good to straighten all of this out.
> I am not in favor of further massaging the existing code.
>
> Andy
>
> On Tue, 1 Dec 2015, Becla, Jacek wrote:
>
>> John: thanks for writing this up.
>>
>> Andy, see below.
>>
>> The #1 question is: on which side of the fence are we doing scheduling?
>> Your side, or the Qserv side (ScanScheduler)? If yours, why?
>>
>> I inserted more comments below.
>>
>>> On Nov 30, 2015, at 3:41 PM, Gates, John H <[log in to unmask]> wrote:
>>>
>>> Jacek, Fritz,
>>>
>>> Please look this over and see if I missed anything or got something
>>> wrong.
>>>
>>>
>>> Hi Andy,
>>>
>>> We (Jacek, Fritz, Nate, and John) had a discussion about the scan
>>> scheduler today. We'd like to know a bit more about what you have in
>>> mind, and let you know what we already have.
>>>
>>> There is currently a scan scheduler (wsched::ScanScheduler). UserQueries
>>> are broken into TaskMsgs by the czar and sent to the worker, which turns
>>> them into Tasks. The Tasks are given to the BlendScheduler, which gives
>>> any Tasks with scantable_size > 0 to the ScanScheduler. The czar does
>>> all of the query analysis at this time.
>>>
>>> The ScanScheduler has an active heap and a pending heap, both of which
>>> are minimum-value heaps. It tracks the chunk id that is currently being
>>> read in from disk (lastChunk). If a new Task with a chunk id higher
>>> than lastChunk is added, it goes on the active heap. If it is less than
>>> or equal to lastChunk, it goes on the pending heap. Once the active
>>> heap is empty, the pending heap is swapped with the active heap and
>>> lastChunk is set equal to the top element of the new active heap.
>>>
>>> The ScanScheduler is currently allowed to advance to the next chunk id
>>> as soon as ANY query on lastChunk finishes. This is pretty naive and
>>> will need to change. The current ScanScheduler is concerned with disk
>>> i/o and not concerned with memory constraints.
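[Editor's note: the active/pending heap ordering described above can be sketched roughly as follows. This is an illustrative toy, not the actual wsched code; the class and member names are invented, and a Task is reduced to just its chunk id.]

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Sketch of the two-heap chunk ordering: Tasks ahead of the current scan
// position go on the active heap; Tasks at or behind it wait on the
// pending heap for the next pass.
class ChunkScheduler {
public:
    void add(int chunkId) {
        if (started_ && chunkId <= lastChunk_)
            pending_.push(chunkId);   // behind the scan: defer to next pass
        else
            active_.push(chunkId);    // ahead of the scan: serve this pass
    }

    // Pop the next Task in ascending chunk order; -1 means nothing queued.
    int next() {
        if (active_.empty()) {
            std::swap(active_, pending_);  // pending pass becomes active
            if (active_.empty()) return -1;
        }
        int chunkId = active_.top();
        active_.pop();
        lastChunk_ = chunkId;  // scan position advances with each pop
        started_ = true;
        return chunkId;
    }

private:
    using MinHeap =
        std::priority_queue<int, std::vector<int>, std::greater<int>>;
    MinHeap active_;    // chunk ids > lastChunk, served smallest-first
    MinHeap pending_;   // chunk ids <= lastChunk, held for the next pass
    int lastChunk_ = -1;
    bool started_ = false;
};
```

For example, adding chunks 3 and 1, popping 1, then adding 1 again sends the late arrival to the pending heap, so the scheduler serves 2, 3 on the current pass before returning to 1.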
>>> Making the scheduler memory-aware is simply a matter of changing the
>>> _ready() function in the ScanScheduler so that Tasks can be started
>>> only when enough memory is available, or some other criteria are met.
>>>
>>> Scanning always goes by chunk id. There are no separate schedulers for
>>> Source tables and Object tables.
>>> Scan scheduling will need to consider how much memory is available and
>>> the size of the files that would need to be locked.
>>> There are currently 4 different scans that will probably each need
>>> their own scheduler:
>>>   Object                             1hr per full scan
>>>   Object joined with Source          8hr per full scan
>>>   Object joined with Forced Source   8hr per full scan
>>>   Object joined with Object_Extra    12hr per full scan
>>> For each one, the appropriate tables need to be locked: the "Object"
>>> scheduler would lock only the Object table files for its current chunk
>>> id in memory, while the "Object joined with Source" scheduler would
>>> lock the Object and Source tables for its current chunk id.
>>
>> Note that these are just the core production tables; there will be many
>> more. Object will be vertically partitioned into several tables, and
>> there will be many level 3 user tables.
>>
>>> Looking at this, it might be better to go with schedulers that run at
>>> expected rates (1hr/full_scan, 8hr/full_scan, 12hr/full_scan) and have
>>> flags indicating which tables they want to use. The problem is that
>>> the number of permutations of joins gets out of hand quickly. It would
>>> be simple to rank them by chunk id and then group them by which tables
>>> are needed. (Are there Source-only queries? Object_Extra-only queries?
>>> Object, Object_Extra, and Source?)
>>>
>>> It might be desirable to have the Object scheduler be able to identify
>>> slow Tasks, and take all Tasks for that UserQuery and move them to the
>>> Object joined with Source scheduler, so they don't bog down the Object
>>> scheduler.
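[Editor's note: the kind of memory-budget check discussed above, with tables shared between Tasks and freed by reference counting, might look like the sketch below. All names (TableLockRegistry and friends) are invented for illustration and do not come from the Qserv code base.]

```cpp
#include <cassert>
#include <map>
#include <string>

// Sketch: track locked tables against a fixed memory budget. A Task is
// "ready" only if every table it needs fits in the remaining budget;
// tables already locked by other Tasks cost no additional memory.
class TableLockRegistry {
public:
    explicit TableLockRegistry(size_t budgetBytes) : budget_(budgetBytes) {}

    // Could this Task's tables (name -> size in bytes) be locked now?
    bool ready(const std::map<std::string, size_t>& tables) const {
        size_t extra = 0;
        for (const auto& [name, bytes] : tables)
            if (locked_.find(name) == locked_.end())
                extra += bytes;  // only not-yet-locked tables cost memory
        return lockedBytes_ + extra <= budget_;
    }

    // Lock tables for a starting Task, reference-counting shared use.
    void acquire(const std::map<std::string, size_t>& tables) {
        for (const auto& [name, bytes] : tables) {
            auto& entry = locked_[name];
            if (entry.refs++ == 0) {      // first user actually locks it
                entry.bytes = bytes;
                lockedBytes_ += bytes;
            }
        }
    }

    // Release on Task completion; unlock when the last user finishes.
    void release(const std::map<std::string, size_t>& tables) {
        for (const auto& [name, bytes] : tables) {
            auto it = locked_.find(name);
            if (it != locked_.end() && --it->second.refs == 0) {
                lockedBytes_ -= it->second.bytes;
                locked_.erase(it);
            }
        }
    }

    size_t lockedBytes() const { return lockedBytes_; }

private:
    struct Entry { int refs = 0; size_t bytes = 0; };
    size_t budget_;
    size_t lockedBytes_ = 0;
    std::map<std::string, Entry> locked_;
};
```

So with a 100-unit budget, an "Object" Task (40) and an "Object joined with Source" Task (40 + 50) can run together because Object is shared, while a further 20-unit table would be refused until something is released.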
>>> Moving Tasks between schedulers this way would require a unique
>>> UserQuery id or something similar.
>>>
>>> I don't think this would be difficult to do with the current
>>> BlendScheduler and ScanScheduler. They already contain code to limit
>>> the number of threads spawned by any scheduler type, and easy-to-change
>>> values for controlling their limits at a high level in the code. It's
>>> pretty easy to have multiple schedulers and switch between them at
>>> compile time (or at program start-up if we really want to). Thoughts?
>>>
>>> The table sizes should be something like (first-year size -> size
>>> after 10 years):
>>>   Object          1x
>>>   ObjExtra        10x
>>>   Forced Source   1x -> 10x
>>>   Source          5x -> 40x
>>>
>>> Should we do anything for tables required by the query that don't need
>>> to be locked?
>>
>> Do we need to pass to your functions the tables that do not need to be
>> locked as part of shared scans? Say we have a query:
>>
>>   SELECT <whatever>
>>   FROM Object o
>>   JOIN Source s ON (o.objectId = s.objectId)
>>   JOIN Filter f ON (s.filterId = f.filterId)
>>   WHERE f.filterName = 'r'
>>
>> The table Filter is tiny (6 rows) and there is no need to lock it;
>> should we still pass it? No?
>> I guess not, but the docs need clarification.
>>
>>> We need to ask Mario:
>>> - Will we have queries that want to see sources and forced sources?
>>> - Joining between data releases: do we need to handle all DRs' data
>>>   through the same Qserv instance?
>>
>> I'm going to ask Mario.
>>
>>> For scheduling to work, we will need some information available. This
>>> will need to be part of the interface:
>>> - Which tables are locked in memory?
>>> - How many Tasks are using a particular table locked in memory? (Free
>>>   them by reference counting?)
>>> - How much memory have we locked up?
>>> - What's the most memory we should have locked up?
>>> - Before a table is locked in memory, how much room is it likely to
>>>   take?
>>>
>>> Note that the GroupScheduler is working through its own Tasks.
>>> Its Tasks only involve a couple of chunks, but it still needs some
>>> memory to work with.
>>>
>>>
>>> Concerns/clarifications about anything above?
>>>
>>> What are the arguments for having your code do the scheduling?
>>>
>>> Important details of the file locking?
>>>
>>> Thanks,
>>> John
>>
>
> ########################################################################
> Use REPLY-ALL to reply to list
>
> To unsubscribe from the QSERV-L list, click the following link:
> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1
> ########################################################################