This latest 50 node problem seems to be another, different, worker bug. This time workers are seeing a SEGV due to dereferencing a NULL boost::shared_ptr. In other words, they are crashing, not hanging, and the master then waits around forever.

I'm not lucky enough to reproduce inside gdb, so next I'm going to try to reconfigure the workers to produce core dumps when they crash.

1. This suggests that for these immediate tests, one can tell if progress is being made by checking that all the workers xrootd processes are still alive.

2. Douglas - I'm seeing failures of some of your test queries that look like:

SELECT count(*) FROM Object WHERE qserv_areaspec_box(1,2,3,4);

ERROR 4120 (Proxy): Error during execution: 'open failed for chunk(s): 1234567890'

when running against the 50 node list. I'm just guessing here, but I think what's happening is that Daniel keeps around an empty dummy chunk 1234567890 on some worker node and when the areaspec doesn't intersect any non-empty chunk, he issues a query against that dummy chunk. In this case, none of the workers contain an Object_1234567890 table, and ccqserv007 contains Source_1234567890. I think you can get around these failures by just creating the appropriately named tables on some worker node that is common to the 50,100,150,200,250, and 300 node tests. If that turns out not to be ccqserv007, you'll probably also want to delete the existing dummy chunk table for Source on that node.

3. There's another phenomenon I'm seeing where I start up the whole cluster, and then the first query is incredibly slow despite only using 50 nodes. At first I thought this was down to the fact that I shrank the qserv master thread pool sizes a lot (in the interests of stability). But actually, it turns out that there are network reads inside the xrootd client that are timing out. In one case I looked at with gdb, there was a master thread waiting for 300 sec (5 min) for a response to an xrootd "Endsess" command which it never received. After the 5 minutes were up, the xrootd client somehow recovered and the query (just a simple count(*)) completed successfully. I don't really know what else to say about that...

On Sep 20, 2013, at 4:00 PM, Douglas Smith wrote:

Jacek -

So, not sure what to say here. But the instabilities of the 300node
cluster this week, is making the testing just not go anywhere. I can't
get the cluster to remain stable over a full set of tests, so I can't get
all the tests that you've asked for. And a longer test, like of a
10x10deg. area of near neighbor query, doesn't really make sense,
since I can't tell if the query is taking a long time or I am waiting
for the workers that are hanging.

Serge is now looking at the cluster, but it is late in the day on Friday
here, and you wanted numbers. I have numbers for test queries
at 300nodes, but not at other number of nodes.

Serge may come back to me here, and testing can go again, and
you can have lots more numbers by Mon., but not sure what you
might want us to do now?

Douglas

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the QSERV-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1