Douglas,

I said:
> This latest 50 node problem seems to be another, different, worker  
> bug. This time workers are seeing a SEGV due to dereferencing a  
> NULL boost::shared_ptr. In other words, they are crashing, not  
> hanging, and the master then waits around forever.

So it turns out I had already fixed this bug while working on  
concurrency issues on the yili cluster at SLAC. The ccqserv worker  
nodes have a July 26 checkout of u/fjammes/lst2013_2_installCC.  
Fabrice has merged from master (including my fixes) since then, but  
it doesn't look like a pull and rebuild has been performed on the
workers since the initial checkout.

I did two things:

1. I pulled on ccqserv001, rebuilt, and then broadcast fresh worker
libraries across the cluster. Only the libraries were replaced, so the
source checkouts on the worker nodes are still stale.

2. To fix the chunk 1234567890 failures, I created an empty
Object_1234567890 chunk table on ccqserv007 and added ccqserv007 to
/qserv/list_50nodes.txt (ccqserv007 is in all the other node lists
already). I did _not_ subtract from the 50-node empty chunk list in
any way, so the 50-node test should only hit ccqserv007 for queries
involving the dummy chunk. I guess technically it's still 51 nodes,
but... I'll leave the fix to you if this really does matter.

After this, /qserv/test_sql_v01.sql runs successfully multiple times
in a row, without flakiness, in the 50-node case. Please go ahead and
take the cluster back for more testing...

Cheers,
Serge
