Nate

Not today; we were talking about the qserv hangout tomorrow (Wed) at noon.

Jacek



On 12/01/2015 09:40 AM, Pease, Nathan wrote:
> it sounds like there is another meeting on this today; is there a hangout for it? I'd like to join remotely if possible.
>
> thanks,
> nate
> ________________________________________
> From: [log in to unmask] <[log in to unmask]> on behalf of Andrew Hanushevsky <[log in to unmask]>
> Sent: Tuesday, December 1, 2015 8:52 AM
> To: Becla, Jacek
> Cc: qserv-l
> Subject: Re: [QSERV-L] Fwd: Re: shared scan
>
> OK. I mean you are right, there is no point in duplicating the effort. I
> felt that a clean, self-contained implementation would be more maintainable
> in the long run. I will probably call at noon.
>
> Andy
>
> On Tue, 1 Dec 2015, Jacek Becla wrote:
>
>> [sending to the list]
>>
>> Andy,
>>
>> We sort of realized you were thinking about implementing the scheduler
>> all on your side; that is why I asked so prominently :). There is no
>> point in wasting effort. Our perhaps naive reaction was "but the
>> design of the existing ScanScheduler is not too bad, maybe we should
>> at least reuse it?" But I think we can be talked out of it! If you
>> call in tomorrow maybe we can discuss all this a bit.
>>
>> Thanks
>> Jacek
>>
>>
>>
>>
>>
>>
>> -------- Forwarded Message --------
>> Subject: Re: shared scan
>> Date: Tue, 1 Dec 2015 02:23:21 -0800
>> From: Andrew Hanushevsky <[log in to unmask]>
>> To: Becla, Jacek <[log in to unmask]>
>> CC: Gates, John H <[log in to unmask]>, Mueller, Fritz
>> <[log in to unmask]>
>>
>> Hi All,
>>
>> Well, it may seem that we are working at cross-purposes. I was under the
>> impression that the new (proposed) shared scan scheduler would replace what
>> is in qserv at the moment. So, it would seem to me that devoting a lot of
>> time to further improving what's there would be misplaced. I am proposing a
>> rather self-contained interface. The shared scan scheduler works on all of
>> the worker nodes, trying to maximize the use of locked memory while
>> minimizing the amount that is locked. That is not an easy task. If you
>> look closely at AddQuery() it should be apparent that the scheduler wants
>> to know which tables a query will need to access and whether those tables
>> need to be locked. The tables you pass into Create() can optionally be
>> locked up front (I would assume that at least the Object table would
>> always be locked, otherwise it doesn't make much sense
>> resource-wise). Anyway, it would be good to straighten all of this out. I
>> am not in favor of further massaging the existing code.
>>
>> Andy
>>
>> On Tue, 1 Dec 2015, Becla, Jacek wrote:
>>
>>> John: thanks for writing this up.
>>>
>>> Andy, see below.
>>>
>>> The #1 question is: on which side of the fence are we doing scheduling?
>>> Your side, or the Qserv side (ScanScheduler)? If yours, why?
>>>
>>> I inserted more comments below.
>>>
>>>
>>>> On Nov 30, 2015, at 3:41 PM, Gates, John H <[log in to unmask]>
>>>> wrote:
>>>>
>>>> Jacek, Fritz,
>>>>
>>>> Please look this over and see if I missed anything or got something wrong.
>>>>
>>>>
>>>>
>>>> Hi Andy,
>>>>
>>>> We (Jacek, Fritz, Nate, and John) had a discussion about the scan
>>>> scheduler today. We'd like to know a bit more about what you have in mind,
>>>> and let you know what we already have.
>>>>
>>>> There is currently a scan scheduler (wsched::ScanScheduler). UserQueries
>>>> are broken into TaskMsgs by the czar and sent to the worker, which turns
>>>> them into Tasks. The Tasks are given to the BlendScheduler, which gives
>>>> any Tasks with scantable_size > 0 to the ScanScheduler. The czar does all
>>>> of the query analysis at this time.
>>>>
>>>> The ScanScheduler has an active heap and a pending heap, both of which are
>>>> minimum-value heaps. It tracks the chunk id that is currently being read
>>>> from disk (lastChunk). If a new Task with a chunk id higher than lastChunk
>>>> is added, it goes on the active heap. If it is less than or equal to
>>>> lastChunk, it goes on the pending heap. Once the active heap is empty, the
>>>> pending heap is swapped with the active heap, and lastChunk is set
>>>> equal to the top element of the new active heap.
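[Editorial sketch] The two-heap ordering described above can be illustrated as follows. ChunkScheduler, add(), and next() are hypothetical names for this sketch, not the actual wsched::ScanScheduler interface:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Sketch of the active/pending min-heap ordering described above.
// All names here are illustrative only.
class ChunkScheduler {
public:
    // Route a new Task's chunk id to the active or pending heap,
    // depending on whether the scan has already passed it.
    void add(int chunkId) {
        if (chunkId > _lastChunk) {
            _active.push(chunkId);
        } else {
            _pending.push(chunkId);
        }
    }

    // Pick the next chunk to scan; swap in the pending heap when the
    // active heap runs dry, then advance lastChunk.
    int next() {
        if (_active.empty()) {
            std::swap(_active, _pending);
        }
        _lastChunk = _active.top();
        _active.pop();
        return _lastChunk;
    }

    bool empty() const { return _active.empty() && _pending.empty(); }

private:
    using MinHeap =
        std::priority_queue<int, std::vector<int>, std::greater<int>>;
    MinHeap _active;
    MinHeap _pending;
    int _lastChunk = -1;  // chunk id currently being read from disk
};
```

For example, with chunks 5, 3, 7 queued, next() yields 3, then 5, then 7; a chunk 4 arriving while lastChunk is 7 waits on the pending heap until the active heap empties.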
>>>>
>>>> The ScanScheduler is currently allowed to advance to the next chunk id as
>>>> soon as ANY query on the lastChunk finishes. This is pretty naive and will
>>>> need to change. The current ScanScheduler is concerned with disk i/o and
>>>> not concerned about memory constraints. Changing this is simply a matter
>>>> of changing the _ready() function in the ScanScheduler so that Tasks can
>>>> be started only when enough memory is available, or some other criteria.
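[Editorial sketch] The suggested change to _ready() could gate Task start on memory as well as chunk position. A minimal sketch; taskReady and its parameters are assumed names, not the real _ready() signature:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical memory-aware readiness check, sketching the suggested
// change to _ready(). All names here are illustrative.
bool taskReady(bool chunkIsCurrent, std::size_t memoryNeededBytes,
               std::size_t memoryAvailableBytes) {
    // Start a Task only if it targets the chunk currently being scanned
    // and its tables fit within the remaining lockable memory.
    return chunkIsCurrent && memoryNeededBytes <= memoryAvailableBytes;
}
```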
>>>>
>>>> Scanning always goes by chunk id. There are not separate schedulers for
>>>> Source tables and Object tables.
>>>> Scan scheduling will need to consider how much memory is available and the
>>>> size of the files that would need to be locked.
>>>> There are currently 4 different scans that will probably each need their
>>>> own scheduler:
>>>>     Object                               1hr per full scan
>>>>     Object joined with Source            8hr per full scan
>>>>     Object joined with Forced Source     8hr per full scan
>>>>     Object joined with Object_Extra      12hr per full scan
>>>> For each one, the appropriate tables need to be locked: the "Object"
>>>> scheduler would only lock the Object table files for its current chunk id
>>>> in memory, while the "Object joined with Source" scheduler would lock the
>>>> Object and Source tables for its current chunk id.
>>>
>>> Note that these are just the core production tables; there will be many
>>> more. Object will be vertically partitioned into several tables, and
>>> there will be many Level 3 user tables.
>>>
>>>
>>>
>>>>     Looking at this, it might be better to go with schedulers that run at
>>>> expected rates (1hr/full_scan, 8hr/full_scan, 12hr/full_scan) and have
>>>> flags indicating which tables they want to use. The problem is that the
>>>> number of permutations of joins gets out of hand quickly. It would be
>>>> simple to rank them by chunk id and then group them by which tables are
>>>> needed. (Are there Source-table-only queries? Object_Extra-table-only
>>>> queries? Object, Object_Extra, and Source?)
>>>>
>>>> It might be desirable to have the Object scheduler be able to identify
>>>> slow Tasks and move all Tasks for that UserQuery to the
>>>> Object-joined-with-Source scheduler, so they don't bog down the Object
>>>> scheduler. This would require a unique UserQuery id or something
>>>> similar.
>>>>
>>>> I don't think this would be difficult to do with the current
>>>> BlendScheduler and ScanScheduler. They already contain code to limit the
>>>> number of threads spawned by any scheduler type, and the values
>>>> controlling their limits are easy to change at a high level in the code.
>>>> It's pretty easy to have multiple schedulers and switch between them at
>>>> compile time (or at program start-up if we really want to). Thoughts?
>>>>
>>>> The table sizes should be something like (first-year size -> size
>>>> after 10 years):
>>>>    Object          1x
>>>>    ObjExtra        10x
>>>>    Forced Source   1x -> 10x
>>>>    Source          5x -> 40x
>>>>
>>>> Should we do anything for tables required for the query that don't need to
>>>> be locked?
>>>
>>>
>>> Do we need to pass to your functions the tables that do not need to
>>> be locked as part of shared scans? Say we have a query:
>>>
>>> SELECT <whatever>
>>> FROM Object o
>>> JOIN Source s ON (o.objectId = s.objectId)
>>> JOIN Filter f ON (s.filterId = f.filterId)
>>> WHERE f.filterName = 'r'
>>>
>>> The table Filter is tiny (6 rows) and there is no need to lock it; should
>>> we still pass it? No?
>>> I guess not, but the docs need clarification.
>>>
>>>
>>>>
>>>> We need to ask Mario:
>>>> - Will we have queries that want to see both sources and forced sources?
>>>> - Joining between data releases: do we need to handle data from all DRs
>>>> through the same qserv instance?
>>>
>>> I'm going to ask Mario.
>>>
>>>>
>>>> For scheduling to work, we will need some information available. This will
>>>> need to be part of the interface.
>>>> - Which tables are locked in memory?
>>>> - How many Tasks are using a particular table locked in memory? (Free them
>>>> by reference counting?)
>>>> - How much memory have we locked up?
>>>> - What's the most memory we should have locked up?
>>>> - Before a table is locked in memory, how much room is it likely to take?
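[Editorial sketch] The bookkeeping questions above suggest something like a reference-counted lock registry. A sketch under assumed names; TableLockRegistry is not an existing qserv or XRootD class:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical registry answering: which tables are locked, how many
// Tasks use each, how much memory is locked, and what the cap is.
class TableLockRegistry {
public:
    explicit TableLockRegistry(std::size_t maxBytes) : _maxBytes(maxBytes) {}

    // Lock a table's files in memory if the budget allows; each Task
    // using an already-locked table just bumps the reference count.
    bool acquire(std::string const& table, std::size_t bytes) {
        auto it = _locked.find(table);
        if (it != _locked.end()) {
            ++it->second.refs;
            return true;
        }
        if (_lockedBytes + bytes > _maxBytes) return false;
        _locked[table] = Entry{bytes, 1};
        _lockedBytes += bytes;
        return true;
    }

    // A Task finished with the table; unlock when the last reference drops.
    void release(std::string const& table) {
        auto it = _locked.find(table);
        if (it == _locked.end()) return;
        if (--it->second.refs == 0) {
            _lockedBytes -= it->second.bytes;
            _locked.erase(it);
        }
    }

    std::size_t lockedBytes() const { return _lockedBytes; }

private:
    struct Entry { std::size_t bytes; std::size_t refs; };
    std::map<std::string, Entry> _locked;
    std::size_t _lockedBytes = 0;
    std::size_t _maxBytes;
};
```

The design choice here is that "before a table is locked, how much room will it take" is answered by the caller passing an estimated size, and acquire() refuses rather than exceeding the cap.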
>>>>
>>>> Note that the GroupScheduler is working through its own Tasks. Its Tasks
>>>> only involve a couple of chunks, but it still needs some memory to work
>>>> with.
>>>>
>>>>
>>>>
>>>>
>>>> Any concerns or need for clarification on anything above?
>>>>
>>>> What are the arguments for having your code do the scheduling?
>>>>
>>>> What are the important details of the file locking?
>>>>
>>>>
>>>>
>>>> Thanks,
>>>> John
>>>
>>>
>>
>> ########################################################################
>> Use REPLY-ALL to reply to list
>>
>> To unsubscribe from the QSERV-L list, click the following link:
>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1
>>
>
