Let’s talk about it in the list.
The issue of scanning older releases is a big one, and it is
still unclear. It is largely a matter of balancing cost against
guessing what users will want. I have RFC-134 open about it,
and I am scheduled to talk to Mario on Monday to understand it
better.
The thinking behind keeping the current and next chunks in
memory was that we fetch chunk x+1 while we are processing
chunk x, which is already in memory (that is purely CPU, no
disk I/O). This would help us keep the disks busy.
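For illustration, the overlap of disk I/O and CPU can be sketched with a one-slot prefetch buffer (a minimal sketch only; all names here are hypothetical, not actual Qserv code):

```python
import threading
import queue

def shared_scan(chunk_ids, fetch_chunk, process_chunk):
    """Fetch chunk x+1 (disk I/O) while processing chunk x (pure CPU).
    At most two chunks are resident at once: the one being processed
    and the one sitting in the prefetch buffer."""
    buf = queue.Queue(maxsize=1)  # holds the single prefetched chunk

    def fetcher():
        for cid in chunk_ids:
            buf.put(fetch_chunk(cid))  # blocks if the consumer falls behind
        buf.put(None)                  # sentinel: no more chunks

    threading.Thread(target=fetcher, daemon=True).start()
    results = []
    while (chunk := buf.get()) is not None:
        results.append(process_chunk(chunk))  # chunk is already in memory
    return results
```

With `maxsize=1`, the fetcher can stage chunk x+1 while chunk x is being processed, but blocks on x+2, which matches the "two chunks per node" memory bound discussed below.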
BTW, you might want to watch RFC-132 as well. It'd get
pretty nasty if we had to do efficient joins between data
releases. Also, RFC-133 is interesting: we might need an
additional scan for the Dia* tables (level 1 data processed
through DRP). It is unclear yet whether that is needed (I am
guessing it will be), and whether we'd need to do any joins
between Dia* and, say, Object. I'm checking with Mario on that.
I'll keep you all updated on the input I get from
Mario. I do want to sort out the requirements side of things
(older DRs, cross-DR joins, etc.) in the next few weeks.
thanks,
Jacek
Hi Jacek,
I was looking through LDM-135 and came across a couple of things.
- one synchronized full table scan of Object_Narrow and
  Object_Extra every 8 hours for the latest and previous
  data releases.
This is the only scan that specifically mentions the previous
data releases. I'm not sure why it is special like that. It
also looks like we aren't making any promises at all about
other scans for older releases.
- For self-joins, a single shared scan will be sufficient;
  however, each node must have sufficient memory to hold 2
  chunks at any given time (the processed chunk and the next
  chunk). Refer to the sizing model [LDM-141] for further
  details on the cost of shared scans.
I'm not sure why a node needs enough memory to hold both the
processed chunk and the next chunk. I would think it only
needs enough memory for the chunk it is working on right now,
assuming this is per shared scan.
I also had a couple of thoughts on the sizing model and
shared scans. It looks like we're heading for one cluster per
release, which is simple but means joins across releases are
not easy. If, on the other hand, it is decided to put all
releases on the same cluster, with the same chunks for all
releases placed on one node, the worker scheduler could
fairly easily be modified to order scans by chunkId, then
release number. Provided worker nodes and czars are added
with each release, it should work. Joins on Object_Extra
between releases could require a fair bit of memory, but if
one release is underutilized, the other queries should be
faster.
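The ordering I have in mind is just a sort on a compound key, something like this (a sketch with hypothetical names, not actual Qserv scheduler code):

```python
# With all releases co-located, sort pending scan tasks by
# (chunkId, release) so every release's copy of a chunk is
# scanned back-to-back before moving to the next chunkId.
def order_scans(tasks):
    """tasks: iterable of (chunk_id, release_number) pairs."""
    return sorted(tasks, key=lambda t: (t[0], t[1]))

pending = [(7, 2), (3, 1), (7, 1), (3, 2)]
ordered = order_scans(pending)
# → [(3, 1), (3, 2), (7, 1), (7, 2)]
```

That way a chunk is read once and reused across releases before the scan moves on.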
-John