This latest 50-node problem seems to be another, different worker
bug. This time the workers are seeing a SEGV due to dereferencing a
NULL boost::shared_ptr. In other words, they are crashing rather than
hanging, and the master then waits around forever.
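
(For context: dereferencing a null boost::shared_ptr is undefined
behavior and typically shows up as exactly this kind of SEGV. Here is a
minimal sketch, not the actual qserv code path - the ChunkResult name is
made up - just to show the crash mode and the cheap defensive check:)

    #include <boost/shared_ptr.hpp>
    #include <iostream>

    struct ChunkResult { int rows; };   // hypothetical stand-in type

    int main() {
        boost::shared_ptr<ChunkResult> result;  // default-constructed: holds NULL

        // result->rows;  // undefined behavior on a null shared_ptr -> typical SEGV

        // Testing the pointer before dereferencing avoids the crash:
        if (result) {
            std::cout << result->rows << std::endl;
        } else {
            std::cerr << "null ChunkResult -- refusing to dereference" << std::endl;
        }
        return 0;
    }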

I haven't been lucky enough to reproduce this inside gdb, so next I'm
going to reconfigure the workers to produce core dumps when they crash.
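
(In case it's useful to anyone else: that mostly amounts to raising
RLIMIT_CORE before the worker faults. A minimal sketch, assuming we can
touch worker startup code - enableCoreDumps is just a name I made up,
and an ulimit -c in the init script would do the same thing from
outside the process:)

    #include <sys/resource.h>
    #include <cstdio>

    // Raise the core-file size soft limit up to the hard limit so that a
    // later SIGSEGV leaves a core file. Whether one actually gets written
    // also depends on the kernel's core_pattern and a writable cwd.
    static void enableCoreDumps() {
        struct rlimit lim;
        if (getrlimit(RLIMIT_CORE, &lim) != 0) {
            std::perror("getrlimit(RLIMIT_CORE)");
            return;
        }
        lim.rlim_cur = lim.rlim_max;
        if (setrlimit(RLIMIT_CORE, &lim) != 0) {
            std::perror("setrlimit(RLIMIT_CORE)");
        }
    }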

1. This suggests that for these immediate tests, one can tell whether
progress is being made by checking that all the worker xrootd
processes are still alive.

2. Douglas - I'm seeing failures of some of your test queries that  
look like:

SELECT count(*) FROM   Object  WHERE  qserv_areaspec_box(1,2,3,4);
ERROR 4120 (Proxy): Error during execution: 'open failed for chunk(s): 1234567890'

when running against the 50-node list. I'm just guessing here, but I
think what's happening is that Daniel keeps around an empty dummy
chunk 1234567890 on some worker node, and when the areaspec doesn't
intersect any non-empty chunk, he issues a query against that dummy
chunk. In this case, none of the workers contain an Object_1234567890
table, and ccqserv007 contains Source_1234567890. I think you can get
around these failures by just creating the appropriately named tables
on some worker node that is common to the 50, 100, 150, 200, 250, and
300 node tests. If that turns out not to be ccqserv007, you'll probably
also want to delete the existing dummy chunk table for Source on that
node.

3. There's another phenomenon I'm seeing where I start up the whole
cluster, and then the first query is incredibly slow despite only
using 50 nodes. At first I thought this was down to the fact that I
shrank the qserv master thread pool sizes a lot (in the interests of
stability). But it actually turns out that there are network reads
inside the xrootd client that are timing out. In one case I looked at
with gdb, there was a master thread waiting 300 sec (5 min) for a
response to an xrootd "Endsess" command that it never received.
After the 5 minutes were up, the xrootd client somehow recovered and
the query (just a simple count(*)) completed successfully. I don't
really know what else to say about that...


On Sep 20, 2013, at 4:00 PM, Douglas Smith wrote:

> Jacek -
>
> So, not sure what to say here.  But the instabilities of the 300-node
> cluster this week are making the testing just not go anywhere.  I can't
> get the cluster to remain stable over a full set of tests, so I can't
> get all the tests that you've asked for.  And a longer test, like a
> near-neighbor query over a 10x10 deg. area, doesn't really make sense,
> since I can't tell whether the query is taking a long time or I am
> waiting on workers that are hanging.
>
> Serge is now looking at the cluster, but it is late in the day on
> Friday here, and you wanted numbers.  I have numbers for test queries
> at 300 nodes, but not at other node counts.
>
> Serge may come back to me here, and testing can start again, and
> you could have lots more numbers by Mon., but I'm not sure what you
> might want us to do now?
>
> Douglas

