Douglas,
I said:
This latest 50 node problem seems to be another, different, worker bug. This time workers are seeing a SEGV due to dereferencing a NULL boost::shared_ptr. In other words, they are crashing, not hanging, and the master then waits around forever.
So it turns out I had already fixed this bug while working on concurrency issues on the yili cluster at SLAC. The ccqserv worker nodes have a July 26 checkout of u/fjammes/lst2013_2_installCC. Fabrice has merged from master (including my fixes) since then, but it doesn't look like anyone has pulled and rebuilt on the workers since that initial checkout.
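For anyone following along, the failure mode is roughly the following. This is only an illustrative sketch, not the actual fix that went into master: Task, queryOrEmpty and queryChecked are made-up names, and std::shared_ptr stands in for boost::shared_ptr so the snippet compiles on its own (both support the same boolean null test).

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for whatever object the worker holds by shared pointer.
struct Task {
    std::string chunkQuery;
};

// Returns the query text, or an empty string when no task was handed over.
// The explicit check is what keeps an empty pointer from turning into a SEGV.
std::string queryOrEmpty(const std::shared_ptr<Task>& task) {
    if (!task) {
        return std::string();
    }
    return task->chunkQuery;
}

// Variant for call sites where a null pointer is a programming error:
// the assert fails loudly in debug builds at the point of misuse.
std::string queryChecked(const std::shared_ptr<Task>& task) {
    assert(task && "task must not be null");
    return task->chunkQuery;
}
```

Without a check like the one in queryOrEmpty, a default-constructed (empty) pointer reaching task->chunkQuery is undefined behavior and in practice shows up as the SEGV the workers were hitting.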
I did two things:
1. I pulled on ccqserv001, rebuilt, and then broadcast fresh worker libraries across the cluster. Only the built libraries were pushed out, so the source checkouts on the worker nodes are still stale.
2. To fix the chunk 1234567890 failures, I created an empty Object_1234567890 chunk table on ccqserv007 and added ccqserv007 to /qserv/list_50nodes.txt (ccqserv007 is in all the other node lists already). I did _not_ subtract from the 50-node empty chunk list in any way, so the 50-node test should only hit ccqserv007 for queries involving the dummy chunk. I guess technically it's still 51 nodes, but... I'll leave the fix to you if this really does matter.
After this, /qserv/test_sql_v01.sql runs successfully multiple times in a row, without flakiness, in the 50-node case. Please go ahead and take the cluster back for more testing...
Cheers,
Serge