It sounds like there is another meeting on this today; is there a hangout for it? I'd like to join remotely if possible.

thanks,
nate
________________________________________
From: [log in to unmask] <[log in to unmask]> on behalf of Andrew Hanushevsky <[log in to unmask]>
Sent: Tuesday, December 1, 2015 8:52 AM
To: Becla, Jacek
Cc: qserv-l
Subject: Re: [QSERV-L] Fwd: Re: shared scan

OK. I mean, you are right, there is no point in duplicating the effort. I
felt that a clean, self-contained implementation would be more maintainable
in the long run. I will probably call at noon.

Andy

On Tue, 1 Dec 2015, Jacek Becla wrote:

> [sending to the list]
>
> Andy,
>
> We sort of realized you were thinking about implementing the scheduler
> all on your side; that is why I asked so prominently :). There is no
> point in wasting effort. Our perhaps naive reaction was "but the
> design of the existing ScanScheduler is not too bad, maybe we should
> at least reuse it?" But I think we can be talked out of it! If you
> call in tomorrow, maybe we can discuss all this a bit.
>
> Thanks
> Jacek
>
>
>
>
>
>
> -------- Forwarded Message --------
> Subject: Re: shared scan
> Date: Tue, 1 Dec 2015 02:23:21 -0800
> From: Andrew Hanushevsky <[log in to unmask]>
> To: Becla, Jacek <[log in to unmask]>
> CC: Gates, John H <[log in to unmask]>, Mueller, Fritz
> <[log in to unmask]>
>
> Hi All,
>
> Well, it may seem that we are working at cross-purposes. I was under the
> impression that the new (proposed) shared scan scheduler would replace what
> is in qserv at the moment. So, it would seem to me that devoting a lot of
> time to further improving what's there would be misplaced. I am proposing a
> rather self-contained interface. The shared scan scheduler works on all of
> the worker nodes, trying to maximize the use of locked memory while
> minimizing the amount that is locked. That is not an easy task. If you
> look closely at AddQuery(), it should be apparent that the scheduler wants
> to know which tables a query will need to access and whether those tables
> need to be locked. The tables you pass into Create() can optionally be
> locked up front (I would assume that at least the Object table would
> always be locked -- otherwise it doesn't make much sense resource-wise).
> Anyway, it would be good to straighten all of this out. I am not in favor
> of further massaging the existing code.
>
> Andy
>
> On Tue, 1 Dec 2015, Becla, Jacek wrote:
>
>> John: thanks for writing this up.
>>
>> Andy, see below.
>>
>> The #1 question is, on which side of the fence are we doing scheduling?
>> Your side, or the Qserv side (ScanScheduler)? If yours, why?
>>
>> I inserted more comments below
>>
>>
>>> On Nov 30, 2015, at 3:41 PM, Gates, John H <[log in to unmask]>
>>> wrote:
>>>
>>> Jacek, Fritz,
>>>
>>> Please look this over and see if I missed anything or got something wrong.
>>>
>>>
>>>
>>> Hi Andy,
>>>
>>> We (Jacek, Fritz, Nate, and John) had a discussion about the scan
>>> scheduler today. We'd like to know a bit more about what you have in mind,
>>> and let you know what we already have.
>>>
>>> There is currently a scan scheduler (wsched::ScanScheduler). UserQueries
>>> are broken into TaskMsg by the czar and sent to the worker, which turns
>>> them into Tasks. The Tasks are given to the BlendScheduler, which gives
>>> any Tasks with scantable_size > 0 to the ScanScheduler. The czar does all
>>> of the query analysis at this time.
>>>
>>> The ScanScheduler has an active heap and a pending heap, both of which are
>>> minimum-value heaps. It tracks the chunk id that is currently being read
>>> in from disk (lastChunk). If a new Task with a chunk id higher than
>>> lastChunk is added, it goes on the active heap. If it is less than or
>>> equal to lastChunk, it goes on the pending heap. Once the active heap is
>>> empty, the pending heap is swapped with the active heap and lastChunk is
>>> set equal to the top element of the new active heap.
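[For illustration only: a minimal Python sketch of the two-heap logic described above. The class and method names are made up, and the real wsched::ScanScheduler is C++ with much more state; this only shows the active/pending swap and the lastChunk rule.]

```python
# Hypothetical sketch of the two-heap scheduling described above.
# Not the actual wsched::ScanScheduler code.
import heapq
import itertools

class TwoHeapScheduler:
    """Orders Tasks by chunk id so each scan sweeps the chunks in one pass."""

    def __init__(self):
        self._active = []              # min-heap: tasks joining the current sweep
        self._pending = []             # min-heap: tasks waiting for the next sweep
        self._last_chunk = -1          # chunk id currently being read (lastChunk)
        self._seq = itertools.count()  # tie-breaker for equal chunk ids

    def add(self, chunk_id, task):
        # A chunk id above lastChunk can still be reached in this sweep;
        # anything at or below it must wait for the next sweep.
        heap = self._active if chunk_id > self._last_chunk else self._pending
        heapq.heappush(heap, (chunk_id, next(self._seq), task))

    def next_task(self):
        if not self._active:
            if not self._pending:
                return None
            # Active heap exhausted: swap in the pending heap and reset
            # lastChunk to the top element of the new active heap.
            self._active, self._pending = self._pending, self._active
            self._last_chunk = self._active[0][0]
        chunk_id, _, task = heapq.heappop(self._active)
        self._last_chunk = chunk_id
        return task
```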
>>>
>>> The ScanScheduler is currently allowed to advance to the next chunk id as
>>> soon as ANY query on the lastChunk finishes. This is pretty naive and will
>>> need to change. The current ScanScheduler is concerned with disk i/o and
>>> not concerned about memory constraints. Changing this is simply a matter
>>> of changing the _ready() function in the ScanScheduler so that Tasks can
>>> be started only when enough memory is available, or some other criteria.
>>>
>>> Scanning always goes by chunk id. There are not separate schedulers for
>>> Source tables and Object tables.
>>> Scan scheduling will need to consider how much memory is available and the
>>> size of the files that would need to be locked.
>>> There are currently 4 different scans that will probably each need their
>>> own scheduler:
>>>    Object                               1hr per full scan
>>>    Object joined with Source            8hr per full scan
>>>    Object joined with Forced Source     8hr per full scan
>>>    Object joined with Object_Extra      12hr per full scan
>>> For each one, the appropriate tables need to be locked: the "Object"
>>> scheduler would only lock the Object table files for its current chunk id
>>> in memory, while the "Object joined with Source" scheduler would lock the
>>> Object and Source tables for its current chunk id.
>>
>> Note that these are just the core production tables; there will be many
>> more. Object will be vertically partitioned into several tables, and there
>> will be many Level 3 user tables.
>>
>>
>>
>>>    Looking at this, it might be better to go with schedulers that run at
>>> expected rates (1hr/full_scan, 8hr/full_scan, 12hr/full_scan) and have
>>> flags indicating which tables they want to use. The problem is that the
>>> number of permutations of joins gets out of hand quickly. It would be
>>> simple to rank them by chunk id and then group them by which tables are
>>> needed. (Are there Source table only queries? Object_Extra table only
>>> queries? Object, ObjectExtra and Source?)
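[For illustration only: ranking by chunk id and grouping by the set of tables needed could be as simple as the sketch below. group_by_tables is a made-up name, and real Tasks obviously carry far more state than a (chunk id, table set) pair.]

```python
# Hypothetical sketch of "rank them by chunk id and then group them by
# which tables are needed" -- illustrative only.
from collections import defaultdict

def group_by_tables(tasks):
    """Group (chunk_id, tables) pairs by table set; rank each group by chunk id.

    tasks: iterable of (chunk_id, frozenset of table names).
    Returns {table_set: sorted list of chunk ids}.
    """
    groups = defaultdict(list)
    for chunk_id, tables in tasks:
        groups[tables].append(chunk_id)
    return {tables: sorted(chunks) for tables, chunks in groups.items()}
```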
>>>
>>> It might be desirable to have the Object scheduler be able to identify
>>> slow Tasks and take all Tasks for that UserQuery and move them to the
>>> Object joined with Source scheduler, so they don't bog down the Object
>>> scheduler. This would require a unique user query id or something
>>> similar.
>>>
>>> I don't think this would be difficult to do with the current
>>> BlendScheduler and ScanScheduler. They already contain code to limit the
>>> number of threads spawned by any scheduler type, with easy-to-change
>>> values for controlling their limits at a high level in the code. It's
>>> pretty easy to have multiple schedulers and switch between them at
>>> compile time (or at program startup if we really want to). Thoughts?
>>>
>>> The table sizes should be something like (first-year size -> size after
>>> 10 years):
>>>   Object         1x
>>>   ObjExtra       10x
>>>   Forced Source  1x -> 10x
>>>   Source         5x -> 40x
>>>
>>> Should we do anything for tables required for the query that don't need to
>>> be locked?
>>
>>
>> Do we need to pass to your functions the tables that do not need to
>> be locked as part of shared scans? Say we have a query:
>>
>> SELECT <whatever>
>> FROM Object o
>> JOIN Source s ON o.objectId = s.objectId
>> JOIN Filter f ON s.filterId = f.filterId
>> WHERE f.filterName = 'r'
>>
>> The table Filter is tiny (6 rows) and there is no need to lock it; should
>> we still pass it? No?
>> I guess not, but the docs need clarification.
>>
>>
>>>
>>> We need to ask Mario:
>>> - Will we have queries that want to see sources and forced sources?
>>> - Joining between data releases: do we need to handle all DRs' data
>>> through the same qserv instance?
>>
>> I'm going to ask Mario.
>>
>>>
>>> For scheduling to work, we will need some information available. This will
>>> need to be part of the interface.
>>> - Which tables are locked in memory?
>>> - How many Tasks are using a particular table locked in memory? (Free them
>>> by reference counting?)
>>> - How much memory have we locked up?
>>> - What's the most memory we should have locked up?
>>> - Before a table is locked in memory, how much room is it likely to take?
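[For illustration only: the questions above suggest something like a reference-counted tracker for locked tables. The sketch below uses made-up names and plain byte counts; the real accounting would have to deal with file-level locking, estimates vs. actual sizes, and concurrency.]

```python
# Hypothetical reference-counted tracker answering the questions above:
# which tables are locked, how many Tasks use each, how much memory is
# locked, and whether a new lock fits under the limit.
class LockedTableTracker:
    def __init__(self, max_locked_bytes):
        self.max_locked = max_locked_bytes  # most memory we should lock up
        self.locked_bytes = 0               # how much memory we have locked
        self._tables = {}                   # table name -> [size_bytes, refcount]

    def can_lock(self, table, size_bytes):
        """Would locking this table keep us under the memory limit?"""
        if table in self._tables:
            return True  # already locked; just another reference
        return self.locked_bytes + size_bytes <= self.max_locked

    def acquire(self, table, size_bytes):
        # A Task starts using the table: lock it, or bump its refcount.
        if table in self._tables:
            self._tables[table][1] += 1
        else:
            self._tables[table] = [size_bytes, 1]
            self.locked_bytes += size_bytes

    def release(self, table):
        # A Task finished: free the locked memory when the last user is done.
        entry = self._tables[table]
        entry[1] -= 1
        if entry[1] == 0:
            self.locked_bytes -= entry[0]
            del self._tables[table]
```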
>>>
>>> Note that the GroupScheduler is working through its own Tasks. Its Tasks
>>> only involve a couple of chunks, but it still needs some memory to work
>>> with.
>>>
>>>
>>>
>>>
>>> Concerns/clarification for anything above?
>>>
>>> What are the arguments for having your code do the scheduling?
>>>
>>> What are the important details of the file locking?
>>>
>>>
>>>
>>> Thanks,
>>> John
>>
>>
>
> ########################################################################
> Use REPLY-ALL to reply to list
>
> To unsubscribe from the QSERV-L list, click the following link:
> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1
>
