I'm not sure this still matters, but I'm confused. I thought we were 
talking about the transfer time for a single "chunk". I took "chunk" to 
mean qserv-chunk-of-a-table. This means that we need to transfer a 
number of bytes equal to the on-disk representation of that table. For 
MyISAM, this means the raw row data + MyISAM overhead + index files. 
rowdata+overhead+index all need to be accessible for chunk X of table A 
to be usable. It is not clear to me that transferring rowdata and index 
concurrently is faster unless you are getting around TCP 
window/congestion control (or you have multiple source nodes, multiple 
pipes, and multiple destinations).
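
To put rough numbers on it, here is a back-of-the-envelope sketch 
(purely illustrative; the 10 Gb/s link is just an assumed figure, and 
the 255 GB worst-case chunk and the 2x index factor are taken from the 
numbers further down this thread):

    # Back-of-the-envelope transfer time for one chunk (data + index)
    # over a single pipe. All inputs are assumptions for illustration:
    # an assumed dedicated 10 Gb/s link, the 255 GB worst-case DR11
    # Source chunk quoted below, and an index up to 2x the data size.
    LINK_GBPS = 10.0          # assumed link bandwidth, Gb/s
    data_gb = 255.0           # worst-case chunk (data only), GB
    index_gb = 2.0 * data_gb  # "index is up to 2x larger"

    def hours(gigabytes, gbps=LINK_GBPS):
        return gigabytes * 8.0 / gbps / 3600.0

    # Whether data and index move one after the other or interleaved,
    # the same pipe carries the same total bytes, so the time until the
    # chunk is usable (rowdata + overhead + index) is the same:
    print("one after the other: %.2f h" % (hours(data_gb) + hours(index_gb)))
    print("sharing the pipe:    %.2f h" % hours(data_gb + index_gb))

Both numbers come out identical; concurrency only helps if a single 
stream can't fill the pipe (TCP window/congestion control) or if there 
are disjoint source/pipe/destination paths.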

Still not sure this is worth arguing about. I feel that the larger 
question of whether it's a good idea to have an external qserv cluster 
that has no local (data) storage and uses the LSST cluster as a backing 
store is a bit orthogonal. For the larger question, I am a bit concerned 
that I haven't heard of any implementation, whether research, proprietary, 
or open-source, of an auto-caching HDFS cluster or a distributed DBMS 
cluster that uses another cluster as a backing store. (On the other 
hand, it's probably worth a master's or PhD thesis.)

-Daniel

On 09/24/2013 05:19 PM, Becla, Jacek wrote:
> We are talking here about sizes of *individual*
> chunks that are transferred, my point is that
> data+index are not a single file.
>
> All 20,000 chunks are going through the same
> pipe too, right? So if we consider db and index,
> it's 40,000 chunks.
>
> Jacek
>
>
>
>
> On 9/24/2013 5:15 PM, Wang, Daniel Liwei wrote:
>> Wait, why is it faster in parallel? Same pipe, right? Unless you are
>> thinking disjoint sets of source-pipe-dest.
>>
>> -Daniel
>>
>> On 09/24/2013 04:44 PM, Jacek Becla wrote:
>>> As we just talked, my numbers are for data chunks;
>>> the index is up to 2x larger, so we can use 2x larger
>>> numbers. Data+index come in separate files and can
>>> be transferred in parallel, so I think it'd be
>>> unfair to assume 3x my numbers, though.
>>>
>>> Jacek
>>>
>>>
>>>
>>> On 9/24/2013 3:07 PM, Jacek Becla wrote:
>>>>> 	Chunks are expected to be multiple terabytes in size, which
>>>>> means that downloads are hours long.
>>>> K-T,
>>>>
>>>> Based on the baseline, which assumes a flat 20K chunks per table,
>>>> the largest chunk will be 255 GB. The numbers are (in GB,
>>>> DR1 --> DR11)
>>>>       - Object:    2 -->   4
>>>>       - ObjExtra: 25 -->  69
>>>>       - Source:    9 --> 255
>>>>       - ForcedSrc: 2 -->  98
>>>>
>>>> This is in LDM-141, dbL2, L141 (and nearby)
>>>>
>>>> And, that is before compression.
>>>>
>>>> We talked about keeping the chunk size constant rather than the
>>>> number of chunks constant, which will probably make us go with
>>>> DR1-size chunks, keeping chunk size closer to 25 GB than 1/4 TB.
>>>>
>>>> Jacek
>>>>

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the QSERV-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1