QSERV-L@LISTSERV.SLAC.STANFORD.EDU


QSERV-L  June 2013


Subject: Re: Questions
From: "Daniel L. Wang" <[log in to unmask]>
Reply-To: General discussion for qserv (LSST prototype baseline catalog)
Date: Thu, 6 Jun 2013 12:11:56 -0700
Content-Type: text/plain

Hello,

> * Data loading :
>
>     - Could  you  define  how   data  will  be  generated  and
>       distributed on workers ?
For this test, we will make a copy of a smaller dataset (<100GB) on a 
shared filesystem mounted by the workers. We will then run a 
duplicator/partitioner on each worker which will synthesize the data 
portions that belong on that worker. The synthesizer is coded for a 
particular data distribution.

>
>     - Could  you begin  data  load testing  on  our 4  current
>       machines at CC ?
We can, if they have the same characteristics as the machines in the
target cluster. They must have exactly the same installed packages (OS +
libraries + environment), the same disk configuration (free space
mounted at the same mount points, similar disk characteristics), the
same memory size, and the same kernel tweaks (max threads, max open
files, etc.).
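
As a rough illustration of the comparison described above, the sketch below collects a few of the listed characteristics (open-file and process limits, memory size, free space at given mount points) so that a candidate machine can be compared with a reference cluster node. The function names and the choice of fields are assumptions made for this example, not part of any Qserv tooling, and it does not cover installed packages or disk performance.

```python
import os
import resource
import shutil


def machine_fingerprint(mount_points):
    """Collect a few characteristics worth comparing across machines.

    Hypothetical helper: the fields are illustrative, not exhaustive.
    """
    soft_files, _hard_files = resource.getrlimit(resource.RLIMIT_NOFILE)
    soft_procs, _hard_procs = resource.getrlimit(resource.RLIMIT_NPROC)
    info = {
        "max_open_files": soft_files,
        "max_user_processes": soft_procs,
        "memory_bytes": os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES"),
    }
    for mp in mount_points:
        # Record whether the path is a real mount point and how much is free.
        info["is_mount:" + mp] = os.path.ismount(mp)
        info["free_bytes:" + mp] = (
            shutil.disk_usage(mp).free if os.path.isdir(mp) else None
        )
    return info


def diff_fingerprints(reference, candidate):
    """Return {key: (reference_value, candidate_value)} for every mismatch."""
    keys = set(reference) | set(candidate)
    return {
        k: (reference.get(k), candidate.get(k))
        for k in sorted(keys)
        if reference.get(k) != candidate.get(k)
    }


if __name__ == "__main__":
    # Run on each machine, save the output, and diff the saved dictionaries.
    for key, value in sorted(machine_fingerprint(["/"]).items()):
        print(key, value)
```

Free space will rarely match byte for byte, so in practice one would compare it against a minimum threshold rather than for equality.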

We have found that machines in a cluster run different configurations
than general-purpose development machines (even when the OS is the
same), and it often takes a couple of days to shake out the problems,
provided the sysadmins are responsive (which may be difficult across
time zones). In the past, it has taken more than a week, but that was
before we had installation scripts.

It's probably easier for sysadmins to grant non-exclusive access to a 
few machines (having one or two exposes 80% of the issues) than to 
figure out how to make a dev machine look like a cluster machine.

>     - How far are Qserv installation and data loading
>       independent ?
I defer to Douglas. Installing Qserv is strictly separate from loading
data. Loading data (creating a schema, partitioning the data, loading
it) is still a painful part of making data available via Qserv.


> * Production planning:
>
>     - What happens if a worker crash ? What actions are needed
>       ( error recovery ) ? How to exclude one worker node from
>       Qserv ?
If a worker crashes, we must log in, debug it, and try to reproduce the
failure. Qserv has many bugs and the implementation is quite incomplete,
and much of the value of this testing is to expose problems at scale. At
minimum, we need the log files, the mysqld query log (should be enabled
after data loading), and the stack trace of the xrootd process
(available by saving the corefile).

Worker nodes register themselves, so a worker node that has crashed
simply will not respond to query dispatches. An administrator can look
up the data portions that reside on that worker node and mark them as
bad (empty) on the master. This allows the master to complete queries,
but the results it returns will be incorrect.
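
To make the exclusion step above concrete, here is a minimal sketch of the bookkeeping involved: look up which data portions (chunks) live on the failed worker and add them to the set the master treats as empty. The data structures, function names, and worker names are hypothetical and chosen only for illustration; the message does not describe how Qserv actually stores this information.

```python
def chunks_on_worker(chunk_locations, worker):
    """Return the sorted chunk ids assigned to `worker`.

    `chunk_locations` is an assumed {chunk_id: worker_name} mapping.
    """
    return sorted(c for c, w in chunk_locations.items() if w == worker)


def exclude_worker(chunk_locations, empty_chunks, worker):
    """Return (new_empty_set, lost_chunks) after marking a worker's chunks empty.

    Queries dispatched afterwards would skip the lost chunks, so they
    complete but silently omit the rows stored on that worker.
    """
    lost = chunks_on_worker(chunk_locations, worker)
    return set(empty_chunks) | set(lost), lost


if __name__ == "__main__":
    locations = {101: "worker-a", 102: "worker-b", 103: "worker-a"}
    empty, lost = exclude_worker(locations, {7}, "worker-a")
    print("chunks now treated as empty:", sorted(empty))
    print("chunks lost with the worker:", lost)
```

Keeping the list of lost chunks is useful because it records exactly which part of the data is missing from any result produced while the worker is excluded.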

>     - What exactly do you need to be monitored on workers ?
Log files, mostly. For debugging, we need free memory, free disk space,
CPU load, and the mysqld process list (the output of SHOW PROCESSLIST,
although mytop, http://jeremy.zawodny.com/mysql/mytop/, would be really
nice).
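
A small sketch of how the host-level numbers listed above could be sampled on a Linux worker. It reads free memory from /proc/meminfo, free disk space for a data directory, and the load average. The function names and the default paths are assumptions for this example. The mysqld process list is not covered here, since it requires a MySQL client connection to run SHOW PROCESSLIST.

```python
import os
import shutil
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: integer_value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def snapshot(data_dir="/", meminfo_path="/proc/meminfo"):
    """Take one sample of free memory, free disk, and CPU load (Linux only)."""
    with open(meminfo_path) as f:
        mem = parse_meminfo(f.read())
    load_1min, load_5min, load_15min = os.getloadavg()
    return {
        "time": time.time(),
        "mem_free_kb": mem.get("MemFree"),
        "disk_free_bytes": shutil.disk_usage(data_dir).free,
        "load_1min": load_1min,
        "load_5min": load_5min,
        "load_15min": load_15min,
    }


if __name__ == "__main__":
    print(snapshot())
```

Calling snapshot() periodically and appending the result to a file gives a simple time series to line up against the worker log when something goes wrong.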

>     - We propose to  begin installation on 4, then  10, 50 and
>       finally 300 nodes. What  are the validation steps ? What
>       will be the queries used ?
I'm assuming that 4 = 1+3, 10 = 1+9, and 300 = 1+299, i.e. one master
plus workers. Note that the master node may get heavily loaded in
certain situations. The validation steps are simply to run queries:
full-scan queries, point-lookup queries, and join queries. After each
run, the workers need to be spot-checked to verify that they are
responding as expected, and the master log must be inspected, because
errors may not be noticeable in the query results.
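
Because errors may not show up in the query results, part of the log inspection described above can be automated. The sketch below flags log lines that match a pattern. The pattern and the sample log lines are assumptions made for illustration; the real master log format is not described in this message, so the pattern would need to be adapted.

```python
import re

# Assumed markers of trouble; adjust to the actual log format.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL|Traceback|Segmentation fault)\b")


def suspicious_lines(log_lines, pattern=ERROR_PATTERN):
    """Return (line_number, text) for every line matching `pattern`."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if pattern.search(line)
    ]


if __name__ == "__main__":
    example_log = [
        "query 1 dispatched\n",
        "ERROR: chunk 42 dispatch failed\n",
        "query 1 finished\n",
    ]
    for number, text in suspicious_lines(example_log):
        print(number, text)
```

A query that "finished" while the log contains a flagged line for one of its chunks is exactly the case that needs a manual look.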

> It would be good to have a wiki page for all these so that
> we could refer to it.
I think Douglas has started such a page, but we should check with him on 
tomorrow's call.

Qserv is quite complex and is often difficult to debug. The last test
with 150 nodes would have been impossible without two things: ssh access
to the nodes for setup, validation, and debugging, and responsive
sysadmins to check for OS problems and faulty disks or other hardware.
We had 2-3 disk failures and 1 misbehaving machine in our 150-node test
over 2 weeks, and the more recent test had disk corruption and odd
performance problems even at 16 nodes.

Note that some problems required new features to be quickly developed in 
order to continue testing. We will probably need such patches during 
this test.

We are really grateful for the opportunity to do this testing and hope 
to learn a lot about qserv, its future direction, and, more generally, 
the feasibility of the current plans for LSST catalog management.


Hope this helps,
-Daniel
