OK. I mean, you are right, there is no point in duplicating the effort. I 
felt that a clean, self-contained implementation would be more maintainable 
in the long run. I will probably call at noon.

Andy

On Tue, 1 Dec 2015, Jacek Becla wrote:

> [sending to the list]
>
> Andy,
>
> We sort of realized you were thinking about implementing the scheduler
> all on your side; that is why I asked so prominently :). There is no
> point in wasting effort. Our perhaps naive reaction was "but the
> design of the existing ScanScheduler is not too bad, maybe we should
> at least reuse it?" But I think we can be talked out of it! If you
> call in tomorrow, maybe we can discuss all this a bit.
>
> Thanks
> Jacek
>
>
>
>
>
>
> -------- Forwarded Message --------
> Subject: Re: shared scan
> Date: Tue, 1 Dec 2015 02:23:21 -0800
> From: Andrew Hanushevsky <[log in to unmask]>
> To: Becla, Jacek <[log in to unmask]>
> CC: Gates, John H <[log in to unmask]>, Mueller, Fritz 
> <[log in to unmask]>
>
> Hi All,
>
> Well, it may seem that we are working at cross-purposes. I was under the
> impression that the new (proposed) shared scan scheduler would replace what
> is in qserv at the moment. So, it would seem to me that devoting a lot of
> time to further improving what's there would be misplaced. I am proposing a
> rather self-contained interface. The shared scan scheduler works on all of
> the worker nodes, trying to maximize the use of locked memory while
> minimizing the amount that is locked. That is not an easy task. If you
> look closely at AddQuery() it should be apparent that the scheduler wants
> to know which tables a query will need to access and whether those tables
> need to be locked. The tables you pass into Create() can optionally be
> locked at the front (and I would assume that at least the Object table
> would always be locked -- otherwise it doesn't make much sense
> resource-wise). Anyway, it would be good to straighten all of this out. I
> am not in favor of further massaging the existing code.
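A minimal sketch of the self-contained interface described above may help frame the discussion. The names Create() and AddQuery() come from the email; everything else (TableRef, the lock flag, the signatures) is an assumption for illustration, not the actual proposed code:

```cpp
// Hypothetical sketch of the proposed shared scan scheduler interface.
// Create() registers the tables a scan may touch (optionally locked at
// the front); AddQuery() tells the scheduler which tables a query will
// access. Signatures are illustrative assumptions only.
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct TableRef {
    std::string name;
    bool lockInMemory;  // e.g. the Object table would normally be locked
};

class SharedScanScheduler {
public:
    // Register the tables a scan may touch; flagged tables are candidates
    // for being locked in memory up front.
    static SharedScanScheduler Create(std::vector<TableRef> tables) {
        return SharedScanScheduler(std::move(tables));
    }

    // The scheduler wants to know which tables a query will access so it
    // can plan locking; here we just record the query.
    void AddQuery(int queryId, std::vector<std::string> const& tablesUsed) {
        _queries.push_back({queryId, tablesUsed});
    }

    std::size_t queryCount() const { return _queries.size(); }

private:
    explicit SharedScanScheduler(std::vector<TableRef> tables)
        : _tables(std::move(tables)) {}

    struct Query { int id; std::vector<std::string> tables; };
    std::vector<TableRef> _tables;
    std::vector<Query> _queries;
};
```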
>
> Andy
>
> On Tue, 1 Dec 2015, Becla, Jacek wrote:
>
>> John: thanks for writing this up.
>> 
>> Andy, see below.
>> 
>> The #1 question is: on which side of the fence are we doing scheduling?
>> Your side, or the Qserv side (ScanScheduler)? If yours, why?
>> 
>> I inserted more comments below
>> 
>> 
>>> On Nov 30, 2015, at 3:41 PM, Gates, John H <[log in to unmask]> 
>>> wrote:
>>> 
>>> Jacek, Fritz,
>>> 
>>> Please look this over and see if I missed anything or got something wrong.
>>> 
>>> 
>>> 
>>> Hi Andy,
>>> 
>>> We (Jacek, Fritz, Nate, and John) had a discussion about the scan 
>>> scheduler today. We'd like to know a bit more about what you have in mind, 
>>> and let you know what we already have.
>>> 
>>> There is currently a scan scheduler (wsched::ScanScheduler). UserQueries 
>>> are broken into TaskMsgs by the czar and sent to the worker, which turns 
>>> them into Tasks. The Tasks are given to the BlendScheduler, which gives 
>>> any Tasks with scantable_size > 0 to the ScanScheduler. The czar does all 
>>> of the query analysis at this time.
>>> 
>>> The ScanScheduler has an active heap and a pending heap, both of which are 
>>> minimum value heaps. It tracks the chunk id that is currently being read in 
>>> from disk (lastChunk). If a new Task arrives with a chunk id higher than 
>>> lastChunk, it goes on the active heap. If its chunk id is less than or 
>>> equal to lastChunk, it goes on the pending heap. Once the active heap is 
>>> empty, the pending heap is swapped with the active heap and lastChunk is 
>>> set equal to the top element of the new active heap.
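The two-heap placement rule above can be sketched in a few lines. This is illustrative code, not the actual wsched::ScanScheduler; the class and member names beyond lastChunk are assumptions:

```cpp
// Sketch of the active/pending two-heap ordering: a Task above lastChunk
// joins the active min-heap, otherwise the pending min-heap; when the
// active heap empties, the heaps swap and lastChunk restarts at the top.
#include <queue>
#include <utility>
#include <vector>

struct Task { int chunkId; };
struct ByChunk {
    bool operator()(Task const& a, Task const& b) const {
        return a.chunkId > b.chunkId;  // invert so the smallest id is on top
    }
};
using MinHeap = std::priority_queue<Task, std::vector<Task>, ByChunk>;

class ScanSchedulerSketch {
public:
    void addTask(Task t) {
        if (t.chunkId > _lastChunk) _active.push(t);
        else _pending.push(t);  // chunk already passed; wait for next sweep
    }

    // Pop the next Task in chunk order, swapping heaps when active empties.
    bool next(Task& out) {
        if (_active.empty()) {
            std::swap(_active, _pending);
            if (_active.empty()) return false;  // nothing left at all
        }
        out = _active.top();
        _active.pop();
        _lastChunk = out.chunkId;
        return true;
    }

private:
    int _lastChunk = -1;  // chunk id currently being read from disk
    MinHeap _active, _pending;
};
```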
>>> 
>>> The ScanScheduler is currently allowed to advance to the next chunk id as 
>>> soon as ANY query on the lastChunk finishes. This is pretty naive and will 
>>> need to change. The current ScanScheduler is concerned with disk i/o and 
>>> not concerned about memory constraints. Changing this is simply a matter 
>>> of changing the _ready() function in the ScanScheduler so that Tasks can 
>>> be started only when enough memory is available, or some other criteria.
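The suggested change to _ready() could be as small as a budget check. The struct and byte figures here are illustrative assumptions, not qserv code:

```cpp
// Sketch of a memory-gated _ready() criterion: a Task may start only
// when its tables would still fit under the locked-memory budget.
#include <cstddef>

struct MemBudget {
    std::size_t limitBytes;   // most memory we allow to be locked
    std::size_t lockedBytes;  // memory currently locked
};

// Candidate replacement logic for ScanScheduler::_ready().
bool ready(MemBudget const& mem, std::size_t taskNeedsBytes) {
    return mem.lockedBytes + taskNeedsBytes <= mem.limitBytes;
}
```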
>>> 
>>> Scanning always goes by chunk id. There are not separate schedulers for 
>>> Source tables and Object tables.
>>> Scan scheduling will need to consider how much memory is available and the 
>>> size of the files that would need to be locked.
>>> There are currently 4 different scans that will probably each need their 
>>> own scheduler:
>>>    Object                               1hr per full scan
>>>    Object joined with Source            8hr per full scan
>>>    Object joined with Forced Source     8hr per full scan
>>>    Object joined with Object_Extra      12hr per full scan
>>> For each one, the appropriate tables need to be locked: the "Object" 
>>> scheduler would only lock the Object table files for its current chunk id 
>>> in memory, while the "Object joined with Source" scheduler would lock the 
>>> Object and Source tables for its current chunk id.
>> 
>> Note that these are just the core production tables; there will be many 
>> more. Object will be vertically partitioned into several tables, and there 
>> will be many level 3 user tables.
>> 
>> 
>>
>>>    Looking at this, it might be better to go with schedulers that run at 
>>> expected rates (1hr/full_scan, 8hr/full_scan, 12hr/full_scan) and have 
>>> flags indicating which tables they want to use. The problem is that the 
>>> number of permutations of joins gets out of hand quickly. It would be 
>>> simple to rank them by chunk id and then group them by which tables are 
>>> needed. (Are there Source-only queries? Object_Extra-only queries? 
>>> Object, Object_Extra, and Source?)
>>> 
>>> It might be desirable to have the Object scheduler be able to identify 
>>> slow Tasks, take all Tasks for that UserQuery, and move them to the 
>>> "Object joined with Source" scheduler, so they don't bog down the Object 
>>> scheduler. This would require a unique user query id or something 
>>> similar.
>>> 
>>> I don't think this would be difficult to do with the current 
>>> BlendScheduler and ScanScheduler. They already contain code to limit the 
>>> number of threads spawned by any scheduler type, and the values 
>>> controlling their limits are easy to change at a high level in the code. 
>>> It's pretty easy to have multiple schedulers and switch between them at 
>>> compile time (or at program startup if we really want to). Thoughts?
>>> 
>>> The table sizes should be something like (first year size -> size 
>>> after 10 years):
>>>   Object         1x
>>>   ObjExtra       10x
>>>   Forced Source  1x -> 10x
>>>   Source         5x -> 40x
>>> 
>>> Should we do anything for tables required for the query that don't need to 
>>> be locked?
>> 
>> 
>> Do we need to pass to your functions the tables that do not need to
>> be locked as part of shared scans? Say we have a query:
>> 
>> SELECT <whatever>
>> FROM Object o
>> JOIN Source s ON o.objectId = s.objectId
>> JOIN Filter f ON s.filterId = f.filterId
>> WHERE f.filterName = 'r'
>> 
>> The table Filter is tiny (6 rows) and there is no need to lock it; should 
>> we still pass it? No?
>> I guess not, but the docs need clarification.
>> 
>> 
>>> 
>>> We need to ask Mario:
>>> - Will we have queries that want to see both sources and forced sources?
>>> - Joining between data releases: do we need to handle all DRs' data 
>>> through the same qserv instance?
>> 
>> I'm going to ask Mario.
>> 
>>> 
>>> For scheduling to work, we will need some information available. This will 
>>> need to be part of the interface.
>>> - Which tables are locked in memory?
>>> - How many Tasks are using a particular table locked in memory? (Free them 
>>> by reference counting?)
>>> - How much memory have we locked up?
>>> - What's the most memory we should have locked up?
>>> - Before a table is locked in memory, how much room is it likely to take?
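The bookkeeping in the list above (which tables are locked, per-table reference counts, total locked bytes against a limit) can be sketched as a small accounting class. All names and the reference-counting scheme here are illustrative assumptions, not existing qserv code:

```cpp
// Sketch of locked-table accounting: tables are locked on first use,
// shared by reference count, and unlocked when the last Task releases
// them, all while tracking total locked bytes against a budget.
#include <cstddef>
#include <map>
#include <string>

class LockAccounting {
public:
    explicit LockAccounting(std::size_t limitBytes) : _limit(limitBytes) {}

    // A Task starts using a table; lock it on first use if it fits.
    bool acquire(std::string const& table, std::size_t sizeBytes) {
        auto it = _locked.find(table);
        if (it != _locked.end()) { ++it->second.refs; return true; }
        if (_lockedBytes + sizeBytes > _limit) return false;  // over budget
        _locked[table] = Entry{1, sizeBytes};
        _lockedBytes += sizeBytes;
        return true;
    }

    // A Task finishes with a table; unlock when the last user releases it.
    void release(std::string const& table) {
        auto it = _locked.find(table);
        if (it == _locked.end()) return;
        if (--it->second.refs == 0) {
            _lockedBytes -= it->second.bytes;
            _locked.erase(it);
        }
    }

    std::size_t lockedBytes() const { return _lockedBytes; }

private:
    struct Entry { int refs; std::size_t bytes; };
    std::map<std::string, Entry> _locked;
    std::size_t _lockedBytes = 0;
    std::size_t _limit;
};
```

A size estimate would still be needed before locking ("how much room is it likely to take?"), which is why acquire() takes the size up front.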
>>> 
>>> Note that the GroupScheduler is working through its own Tasks. Its Tasks 
>>> only involve a couple of chunks, but it still needs some memory to work 
>>> with.
>>> 
>>> 
>>> 
>>> 
>>> Concerns/clarification for anything above?
>>> 
>>> What are the arguments for having your code do the scheduling?
>>> 
>>> What are the important details of the file locking?
>>> 
>>> 
>>> 
>>> Thanks,
>>> John
>> 
>> 
>
> ########################################################################
> Use REPLY-ALL to reply to list
>
> To unsubscribe from the QSERV-L list, click the following link:
> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1
>
