[sending to the list]

Andy,

We sort of realized you were thinking about implementing the scheduler
entirely on your side; that is why I asked so pointedly :). There is no
point in duplicating effort. Our perhaps naive reaction was "but the
design of the existing ScanScheduler is not too bad, maybe we should
at least reuse it?" But I think we can be talked out of it! If you
call in tomorrow, maybe we can discuss all this a bit.

Thanks
Jacek

-------- Forwarded Message --------
Subject: Re: shared scan
Date: Tue, 1 Dec 2015 02:23:21 -0800
From: Andrew Hanushevsky <[log in to unmask]>
To: Becla, Jacek <[log in to unmask]>
CC: Gates, John H <[log in to unmask]>, Mueller, Fritz 
<[log in to unmask]>

Hi All,

Well, it may seem that we are working at cross-purposes. I was under the
impression that the new (proposed) shared scan scheduler would replace what
is in qserv at the moment. So, it would seem to me that devoting a lot of
time to further improve what's there would be misplaced. I am proposing a
rather self-contained interface. The shared scan scheduler works on all of
the worker nodes, trying to maximize the use of locked memory while
minimizing the amount that is locked. That is not an easy task. If you
look closely at AddQuery() it should be apparent that the scheduler wants
to know which tables a query will need to access and whether those tables
need to be locked. The tables you pass into Create() can be optionally
locked up front (I would assume that at least the object table would
always be locked -- otherwise it doesn't make much sense
resource-wise). Anyway, it would be good to straighten all of this out. I
am not in favor of further massaging the existing code.
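
For concreteness, here is a minimal sketch of the shape such an interface
could take. Only the names AddQuery() and Create() are from the proposal;
TableRef, the signatures, and everything else are illustrative assumptions:

    #include <cstdint>
    #include <memory>
    #include <string>
    #include <vector>

    // Illustrative sketch only: the scheduler wants to know, per query,
    // which tables are touched and which of them must be memory-locked.
    struct TableRef {
        std::string name;      // e.g. "Object", "Source"
        bool        mustLock;  // lock this table's chunk files in memory?
    };

    class SharedScanScheduler {
    public:
        // Register the tables a shared scan may serve; tables flagged
        // mustLock are candidates for locking up front.
        static std::unique_ptr<SharedScanScheduler>
        Create(std::vector<TableRef> const& tables);

        // Declare the tables a query needs so the scheduler can maximize
        // reuse of already-locked memory.
        virtual int AddQuery(std::vector<TableRef> const& tables,
                             std::uint32_t chunkId) = 0;

        virtual ~SharedScanScheduler() = default;
    };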

Andy

On Tue, 1 Dec 2015, Becla, Jacek wrote:

> John: thanks for writing this up.
>
> Andy, see below.
>
> The #1 question is, on which side of the fence are we doing scheduling?
> Your side, or the Qserv side (ScanScheduler)? If yours, why?
>
> I inserted more comments below
>
>
>> On Nov 30, 2015, at 3:41 PM, Gates, John H <[log in to unmask]> wrote:
>>
>> Jacek, Fritz,
>>
>> Please look this over and see if I missed anything or got something wrong.
>>
>>
>>
>> Hi Andy,
>>
>> We (Jacek, Fritz, Nate, and John) had a discussion about the scan scheduler today. We'd like to know a bit more about what you have in mind, and let you know what we already have.
>>
>> There is currently a scan scheduler (wsched::ScanScheduler). UserQueries are broken into TaskMsgs by the czar and sent to the worker, which turns them into Tasks. The Tasks are given to the BlendScheduler, which gives any Tasks with scantable_size > 0 to the ScanScheduler. The czar does all of the query analysis at this time.
>>
>> The ScanScheduler has an active heap and a pending heap, both of which are minimum value heaps. It tracks the chunk id that is currently being read in from disk (lastChunk). If a new Task's chunk id is higher than lastChunk, it goes on the active heap; if it is less than or equal to lastChunk, it goes on the pending heap. Once the active heap is empty, the pending heap is swapped with the active heap and lastChunk is set equal to the top element of the new active heap.
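>>
>> In code form, that two-heap ordering is roughly the following (a
>> simplified sketch of the logic as described above, not the literal
>> source; a Task is reduced to its chunk id):
>>
>>     #include <algorithm>
>>     #include <functional>
>>     #include <vector>
>>
>>     // Both heaps are min-heaps on chunk id (std::*_heap + std::greater).
>>     struct ChunkHeaps {
>>         std::vector<int> active, pending;
>>         int lastChunk = -1;
>>
>>         void add(int chunkId) {
>>             // Chunks behind the disk's current position wait for the
>>             // next pass; chunks ahead of it join the current pass.
>>             auto& heap = (chunkId > lastChunk) ? active : pending;
>>             heap.push_back(chunkId);
>>             std::push_heap(heap.begin(), heap.end(), std::greater<int>());
>>         }
>>
>>         int next() {  // assumes at least one chunk is queued
>>             if (active.empty()) {
>>                 std::swap(active, pending);  // start the next pass
>>                 lastChunk = active.front();  // top of the new active heap
>>             }
>>             std::pop_heap(active.begin(), active.end(), std::greater<int>());
>>             int id = active.back();
>>             active.pop_back();
>>             lastChunk = id;
>>             return id;
>>         }
>>     };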
>>
>> The ScanScheduler is currently allowed to advance to the next chunk id as soon as ANY query on the lastChunk finishes. This is pretty naive and will need to change. The current ScanScheduler is concerned with disk i/o and not concerned about memory constraints. Changing this is simply a matter of changing the _ready() function in the ScanScheduler so that Tasks can be started only when enough memory is available, or some other criteria.
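>>
>> A memory-aware gate might look roughly like this; only the name
>> _ready() comes from the existing code, the rest (Task::bytesNeeded,
>> the budget fields) is assumed for illustration:
>>
>>     #include <cstddef>
>>     #include <deque>
>>
>>     struct Task { std::size_t bytesNeeded; };
>>
>>     class ScanSchedulerSketch {
>>         std::deque<Task> _taskQueue;
>>         std::size_t _bytesLocked = 0;  // memory currently locked
>>         std::size_t _budget = 0;       // most we allow to be locked
>>     public:
>>         // A Task is ready only if locking its tables fits the budget.
>>         bool _ready() const {
>>             return !_taskQueue.empty()
>>                 && _bytesLocked + _taskQueue.front().bytesNeeded <= _budget;
>>         }
>>     };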
>>
>> Scanning always goes by chunk id. There are not separate schedulers for Source tables and Object tables.
>> Scan scheduling will need to consider how much memory is available and the size of the files that would need to be locked.
>> There are currently 4 different scans that will probably each need their own scheduler:
>>    Object                               1hr per full scan
>>    Object joined with Source            8hr per full scan
>>    Object joined with Forced Source     8hr per full scan
>>    Object joined with Object_Extra      12hr per full scan
>> For each one, the appropriate tables need to be locked: the "Object" scheduler would lock only the Object table files for its current chunk id in memory, while the "Object joined with Source" scheduler would lock the Object and Source tables for its current chunk id.
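>>
>> One way to keep those lock sets manageable is to make them table-driven.
>> A sketch, using only the scans and rates listed above (the ScanSpec
>> struct itself is an illustrative assumption):
>>
>>     #include <string>
>>     #include <vector>
>>
>>     // One entry per scan scheduler: the tables it must lock for its
>>     // current chunk, and its expected full-scan time.
>>     struct ScanSpec {
>>         std::string name;
>>         std::vector<std::string> lockTables;
>>         int fullScanHours;
>>     };
>>
>>     std::vector<ScanSpec> const scanSpecs = {
>>         {"Object",              {"Object"},                  1},
>>         {"Object+Source",       {"Object", "Source"},        8},
>>         {"Object+ForcedSource", {"Object", "ForcedSource"},  8},
>>         {"Object+ObjectExtra",  {"Object", "Object_Extra"}, 12},
>>     };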
>
> Note that these are just "core" production tables; there will be many more. Object will be vertically partitioned into several tables, and there will be many Level 3 user tables...
>
>
>
>>    Looking at this, it might be better to go with schedulers that run at expected rates (1hr/full_scan, 8hr/full_scan, 12hr/full_scan) and have flags indicating which tables they want to use. The problem is that the number of permutations of joins gets out of hand quickly. It would be simple to rank Tasks by chunk id and then group them by which tables are needed. (Are there Source-table-only queries? Object_Extra-table-only queries? Object, ObjectExtra, and Source?)
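>>
>> That grouping could be expressed directly (a sketch; using the exact
>> set of required table names as the grouping key is an assumption):
>>
>>     #include <functional>
>>     #include <map>
>>     #include <queue>
>>     #include <set>
>>     #include <string>
>>     #include <vector>
>>
>>     // Group Tasks by the set of tables they need; within each group,
>>     // order them by chunk id so each table set gets its own scan stream.
>>     using TableSet = std::set<std::string>;
>>     using ChunkMinHeap =
>>         std::priority_queue<int, std::vector<int>, std::greater<int>>;
>>
>>     std::map<TableSet, ChunkMinHeap> tasksByTables;
>>
>>     void addTask(TableSet const& tables, int chunkId) {
>>         tasksByTables[tables].push(chunkId);  // ranked by chunk id
>>     }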
>>
>> It might be desirable to have the Object scheduler be able to identify slow Tasks and move all Tasks for that UserQuery to the "Object joined with Source" scheduler, so they don't bog down the Object scheduler. This would require a unique per-UserQuery id or something similar.
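>>
>> A sketch of that demotion (everything here is hypothetical; the point
>> is that a per-UserQuery id lets all of a query's Tasks move at once):
>>
>>     #include <algorithm>
>>     #include <utility>
>>     #include <vector>
>>
>>     struct Task { int queryId; int chunkId; };
>>
>>     struct SchedulerSketch {
>>         std::vector<Task> tasks;
>>
>>         // Remove and return every Task belonging to one UserQuery.
>>         std::vector<Task> takeByQueryId(int qid) {
>>             auto it = std::partition(
>>                 tasks.begin(), tasks.end(),
>>                 [qid](Task const& t) { return t.queryId != qid; });
>>             std::vector<Task> out(it, tasks.end());
>>             tasks.erase(it, tasks.end());
>>             return out;
>>         }
>>         void add(Task t) { tasks.push_back(std::move(t)); }
>>     };
>>
>>     // Demote a slow query from the fast scan to the slower joined scan.
>>     void demoteSlowQuery(SchedulerSketch& objScan,
>>                          SchedulerSketch& objSourceScan, int qid) {
>>         for (auto& t : objScan.takeByQueryId(qid)) objSourceScan.add(t);
>>     }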
>>
>> I don't think this would be difficult to do with the current BlendScheduler and ScanScheduler. They already contain code to limit the number of threads spawned by any scheduler type, with easy-to-change values controlling those limits at a high level in the code. It's pretty easy to have multiple schedulers and switch between them at compile time (or at program start-up if we really want to). Thoughts?
>>
>> The table sizes should be something like (first-year size -> size after 10 years):
>>   Object          1x
>>   ObjExtra        10x
>>   Forced Source   1x -> 10x
>>   Source          5x -> 40x
>>
>> Should we do anything for tables required for the query that don't need to be locked?
>
>
> Do we need to pass to your functions the tables that do not need to
> be locked as part of shared scans? Say we have a query:
>
> SELECT <whatever>
> FROM Object o
> JOIN Source s ON (o.objectId = s.objectId)
> JOIN Filter f ON (s.filterId = f.filterId)
> WHERE f.filterName = 'r'
>
> The table Filter is tiny (6 rows) and there is no need to lock it; should we still pass it?
> I guess not, but the docs need clarification.
>
>
>>
>> We need to ask Mario:
>> - Will we have queries that want to see both sources and forced sources?
>> - Joining between data releases: do we need to handle all DRs' data through the same Qserv instance?
>
> I'm going to ask Mario.
>
>>
>> For scheduling to work, we will need some information available. This will need to be part of the interface (a sketch of this bookkeeping follows the list below).
>> - Which tables are locked in memory?
>> - How many Tasks are using a particular table locked in memory? (Free them by reference counting?)
>> - How much memory have we locked up?
>> - What's the most memory we should have locked up?
>> - Before a table is locked in memory, how much room is it likely to take?
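>>
>> The bookkeeping those questions imply could be as simple as a ledger
>> like the following (all names hypothetical; reference counting frees a
>> table's lock when its last Task finishes):
>>
>>     #include <cstddef>
>>     #include <map>
>>     #include <string>
>>
>>     class LockLedger {
>>         struct Entry { std::size_t bytes; int taskCount; };
>>         std::map<std::string, Entry> _locked;  // which tables are locked
>>         std::size_t _bytesLocked = 0;          // how much is locked
>>         std::size_t _budget;                   // most we should lock
>>     public:
>>         explicit LockLedger(std::size_t budget) : _budget(budget) {}
>>
>>         // Before locking: is there likely to be room?
>>         bool wouldFit(std::size_t bytes) const {
>>             return _bytesLocked + bytes <= _budget;
>>         }
>>         // A Task starts using a table; the first user pays the lock cost.
>>         void acquire(std::string const& table, std::size_t bytes) {
>>             auto& e = _locked[table];
>>             if (e.taskCount++ == 0) { e.bytes = bytes; _bytesLocked += bytes; }
>>         }
>>         // A Task finishes with a table; the last user unlocks it.
>>         void release(std::string const& table) {
>>             auto it = _locked.find(table);
>>             if (it != _locked.end() && --it->second.taskCount == 0) {
>>                 _bytesLocked -= it->second.bytes;
>>                 _locked.erase(it);
>>             }
>>         }
>>     };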
>>
>> Note that the GroupScheduler is working through its own Tasks. Its Tasks only involve a couple of chunks, but it still needs some memory to work with.
>>
>>
>>
>>
>> Any concerns or points needing clarification on anything above?
>>
>> What are the arguments for having your code do the scheduling?
>>
>> What are the important details of the file locking?
>>
>>
>>
>> Thanks,
>> John
>
>
