Print

Print


Hi John,

I've deployed your code on the in2p3 cluster. Feel free to ask for anything you need. Re-deploying a new version a your code would a bit slow (around 1 hour because of Docker Hub latency), but nearly completely automated and very easy. Nevertheless it isn't available yet on master branch, so you still need to ask mo to do this...

fjammes@ccosvms0070:~/src/qserv-cluster/shmux (master *%=)$ ./ps.sh
ccqserv125.in2p3.fr: 8682d87601f6        qserv/qserv:tickets_DM-2699_master_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   3 minutes ago       Up 3 minutes                            qserv
ccqserv127.in2p3.fr: aadb03cd4afe        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   3 minutes ago       Up 3 minutes                            qserv
ccqserv126.in2p3.fr: 4d95a18745d4        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   3 minutes ago       Up 3 minutes                            qserv
ccqserv133.in2p3.fr: a339cb8d905e        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   3 minutes ago       Up 3 minutes                            qserv
ccqserv134.in2p3.fr: d88cd6a1055b        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   3 minutes ago       Up 3 minutes                            qserv
ccqserv128.in2p3.fr: cdb2ab5fb7fa        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   3 minutes ago       Up 3 minutes                            qserv
ccqserv131.in2p3.fr: ce4cd3e25eea        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   3 minutes ago       Up 3 minutes                            qserv
ccqserv135.in2p3.fr: 76d19c4feab9        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   3 minutes ago       Up 3 minutes                            qserv
ccqserv130.in2p3.fr: 9847460e34e0        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   3 minutes ago       Up 3 minutes                            qserv
ccqserv129.in2p3.fr: 0bfce265cb57        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   3 minutes ago       Up 3 minutes                            qserv
ccqserv132.in2p3.fr: 22eb1cc474b1        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   3 minutes ago       Up 3 minutes                            qserv
ccqserv136.in2p3.fr: 5bd897f7c6bf        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   3 minutes ago       Up 3 minutes                            qserv
ccqserv137.in2p3.fr: c2c803935378        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   2 minutes ago       Up 2 minutes                            qserv
ccqserv141.in2p3.fr: 81a39ef9dad2        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   2 minutes ago       Up 2 minutes                            qserv
ccqserv142.in2p3.fr: eca29186eba3        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   2 minutes ago       Up 2 minutes                            qserv
ccqserv138.in2p3.fr: 3cb904553f8c        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   2 minutes ago       Up 2 minutes                            qserv
ccqserv144.in2p3.fr: 7848550dd7f7        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   2 minutes ago       Up 2 minutes                            qserv
ccqserv140.in2p3.fr: 913b9fc4ca48        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   2 minutes ago       Up 2 minutes                            qserv
ccqserv139.in2p3.fr: 751631b2407b        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   2 minutes ago       Up 2 minutes                            qserv
ccqserv143.in2p3.fr: 56442d0eb252        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   2 minutes ago       Up 2 minutes                            qserv
ccqserv145.in2p3.fr: 772d237f629c        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   2 minutes ago       Up 2 minutes                            qserv
ccqserv146.in2p3.fr: ac5cc927d7be        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   2 minutes ago       Up 2 minutes                            qserv
ccqserv149.in2p3.fr: 49e734e138bb        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   2 minutes ago       Up 2 minutes                            qserv
ccqserv147.in2p3.fr: 2fdbbc5913cf        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   2 minutes ago       Up 2 minutes                            qserv
ccqserv148.in2p3.fr: 413668548ee0        qserv/qserv:tickets_DM-2699_worker_ccqserv125.in2p3.fr   "/bin/sh -c '/qserv/r"   2 minutes ago       Up 2 minutes                            qserv

25 targets processed in 1 second.
Summary: 25 successes

fjammes@ccosvms0070:~/src/qserv-cluster/shmux (master *%=)$
fjammes@ccosvms0070:~/src/qserv-cluster/shmux (master *%=)$ source /qserv/jgates/stack/loadLSST.bash
fjammes@ccosvms0070:~/src/qserv-cluster/shmux (master *%=)$ setup mysql
fjammes@ccosvms0070:~/src/qserv-cluster/shmux (master *%=)$ time mysql --host ccqserv125 --port 4040 --user qsmaster LSST -e "SELECT ra, decl FROM Object WHERE deepSourceId = 2322920177142607;";
+------------------+-------------------+
| ra               | decl              |
+------------------+-------------------+
| 29.3088063472755 | -86.3088404611897 |
+------------------+-------------------+

real    0m0.490s
user    0m0.009s
sys     0m0.008s


On 11/24/2015 10:34 PM, John Gates wrote:
[log in to unmask]" type="cite"> The czar doesn't get anything back from the worker but it looks like the worker is sending it.
The last thing that happens on the czar is that xrootd calls QueryRequest::GetRequest and then it goes silent. I don't know if it passed an invalid payload to xrootd or if something else happened. It looks like xrootd is never calling QueryRequest::ProcessResponse.
I'm hoping Andy Hanushevsky can help me determine if this is the case.

-John


Worker xrootd.log specifically for chunk 5299 on ccqserv148, it looks fine:

[2015-11-24T15:07:42.989Z] [0x7ffbc8244700] INFO  root (core/modules/xrdsvc/SsiSession.cc:61) - ProcessRequest, service=/chk/LSST/5299
  [0x7ffbc8244700] DEBUG root (core/modules/wsched/BlendScheduler.cc:75) - BlendScheduler::queCmd tSeq=109
  [0x7ffbc8244700] DEBUG BlendSched (core/modules/wsched/BlendScheduler.cc:93) - Blend chose group
  [0x7ffbc8244700] DEBUG BlendSched (core/modules/wsched/BlendScheduler.cc:100) - Blend queCmd tSeq=109
  [0x7ffbc8244700] INFO  root (core/modules/xrdsvc/SsiSession.cc:125) - Enqueued TaskMsg for Resource(/chk/LSST/5299) in 0.000465 seconds
...
  [0x7ffbc3fff700] DEBUG Foreman (core/modules/wdb/QueryRunner.cc:120) - QueryRunner::runQuery() tSeq=109
  [0x7ffbc3fff700] DEBUG Foreman (core/modules/wdb/QueryRunner.cc:137) - Exec in flight for Db = q_3c48232a8ba6adfe0b316b74245d8dca
  [0x7ffbc3fff700] WARN  Foreman (core/modules/wdb/QueryRunner.cc:115) - QueryRunner overriding dbName with LSST
  [0x7ffbc3fff700] DEBUG root (core/modules/wdb/QueryRunner.cc:242) - _transmit last=1 tSeq=109
  [0x7ffbc3fff700] DEBUG root (core/modules/wdb/QueryRunner.cc:259) - _transmitHeader
  [0x7ffbc3fff700] INFO  root (core/modules/proto/ProtoHeaderWrap.cc:52) - msgBuf size=256 -> [[0]=40, [1]=13, [2]=2, [3]=0, [4]=0, ..., [251]=48, [252]=48, [253]=48, [254]=48, [255]=48]
  [0x7ffbc3fff700] INFO  root (core/modules/xrdsvc/SsiSession_ReplyChannel.cc:85) - sendStream, checking stream 0 len=256 last=0
  [0x7ffbc3fff700] INFO  root (core/modules/xrdsvc/ChannelStream.cc:91) - last=0 [[0]=40, [1]=13, [2]=2, [3]=0, [4]=0, [5]=0, [6]=21, [7]=47, [8]=0, [9]=0, ..., [246]=48, [247]=48, [248]=48, [249]=48, [250]=48, [251]=48, [252]=48, [253]=48, [254]=48, [255]=48]
  [0x7ffbc3fff700] INFO  root (core/modules/xrdsvc/ChannelStream.cc:94) -  trying to append message (flowing)
  [0x7ffbc3fff700] INFO  root (core/modules/wdb/QueryRunner.cc:253) - _transmit last=1 tSeq=109 resultString=[[0]=8, [1]=0, [2]=16, [3]=4, [4]=26, ..., [42]=51, [43]=51, [44]=56, [45]=16, [46]=0]
  [0x7ffbc3fff700] INFO  root (core/modules/xrdsvc/SsiSession_ReplyChannel.cc:85) - sendStream, checking stream 0x7ffb78008e70 len=47 last=1
  [0x7ffbc3fff700] INFO  root (core/modules/xrdsvc/ChannelStream.cc:91) - last=1 [[0]=8, [1]=0, [2]=16, [3]=4, [4]=26, [5]=29, [6]=10, [7]=27, [8]=10, [9]=9, ..., [37]=10, [38]=6, [39]=50, [40]=53, [41]=56, [42]=51, [43]=51, [44]=56, [45]=16, [46]=0]
  [0x7ffbc3fff700] INFO  root (core/modules/xrdsvc/ChannelStream.cc:94) -  trying to append message (flowing)
  [0x7ffbc3fff700] DEBUG BlendSched (core/modules/wsched/BlendScheduler.cc:132) - BlendScheduler::commandFinish tSeq=109
--- Transmitted message with last=1, that should be fine

On the czar for chunk 5299 and jobId=2639:

2015-11-24T15:07:42.974Z [0x7f06b37fe700] DEBUG root (core/modules/qproc/TaskMsgFactory2.cc:153) - SELECT count(*) AS QS1_COUNT FROM LSST.Object_5299 AS QST_1_
  [0x7f06b37fe700] INFO  root (core/modules/ccontrol/MergingHandler.cc:226) - setError: code: 0, message:
  [0x7f06b37fe700] DEBUG qdisp.Executive (core/modules/qdisp/Executive.cc:106) - Executive::add(job(id=2639 payload.len=123 ru=/chk/LSST/5299))
  [0x7f06b37fe700] DEBUG root (core/modules/qdisp/JobQuery.h:104) - JobQuery JQ_jobId=2639 desc=job(id=2639 payload.len=123 ru=/chk/LSST/5299)
  [0x7f06b37fe700] INFO  qdisp.Executive (core/modules/qdisp/Executive.cc:129) - Executive: Add job with path=/chk/LSST/5299
  [0x7f06b37fe700] DEBUG root (core/modules/qdisp/MessageStore.cc:53) - Add msg: 5299 1200 Executive: Add job with path=/chk/LSST/5299
  [0x7f06b37fe700] DEBUG root (core/modules/qdisp/JobQuery.cc:55) - runJob {job(id=2639 payload.len=123 ru=/chk/LSST/5299) : 2015-11-24T15:07:42+0000, Unknown, 0, }
  [0x7f06b37fe700] INFO  root (core/modules/ccontrol/MergingHandler.cc:226) - setError: code: 0, message:
  [0x7f06b37fe700] DEBUG root (core/modules/qdisp/QueryResource.cc:51) - QueryResource JQ_jobId=2639
  [0x7f06b37fe700] INFO  qproc.QuerySession (core/modules/qproc/QuerySession.cc:383) - Non-subchunkedn
  [0x7f06b37fe700] DEBUG root (core/modules/qproc/TaskMsgFactory2.cc:151) - no nextFragment
...
  [0x7f05dbfff700] INFO  root (core/modules/qdisp/QueryRequest.cc:57) - New QueryRequest with payload(123)
  [0x7f05dbfff700] DEBUG root (core/modules/qdisp/QueryRequest.cc:58) - QueryRequest JQ_jobId=2639
  [0x7f05dbfff700] DEBUG root (core/modules/qdisp/QueryRequest.cc:75) - Requesting, payload size: [123]
  [0x7f05dbfff700] DEBUG root (core/modules/qdisp/QueryResource.cc:55) - ~QueryResource() JQ_jobId=2639
??????? [0x7f05dbfff700] INFO  root (core/modules/qdisp/QueryRequest.cc:57) - New QueryRequest with payload(123)
**** And that's it for jobId=2639. It looks like QueryRequest::ProcessResponse is never called by xrootd for this query.
 

For job 2638 on the czar (which completes):

2015-11-24T15:07:42.973Z [0x7f06b37fe700] DEBUG root (core/modules/qproc/TaskMsgFactory2.cc:153) - SELECT count(*) AS QS1_COUNT FROM LSST.Object_5298 AS QST_1_
  [0x7f06b37fe700] INFO  qdisp.Executive (core/modules/qdisp/Executive.cc:129) - Executive: Add job with path=/chk/LSST/5298
  [0x7f06b37fe700] DEBUG root (core/modules/qdisp/MessageStore.cc:53) - Add msg: 5298 1200 Executive: Add job with path=/chk/LSST/5298
  [0x7f06b37fe700] DEBUG root (core/modules/qdisp/JobQuery.cc:55) - runJob {job(id=2638 payload.len=123 ru=/chk/LSST/5298) : 2015-11-24T15:07:42+0000, Unknown, 0, }
  [0x7f06b37fe700] INFO  root (core/modules/ccontrol/MergingHandler.cc:226) - setError: code: 0, message:
  [0x7f06b37fe700] DEBUG root (core/modules/qdisp/QueryResource.cc:51) - QueryResource JQ_jobId=2638
  [0x7f06b37fe700] INFO  qproc.QuerySession (core/modules/qproc/QuerySession.cc:383) - Non-subchunked
  [0x7f06b37fe700] DEBUG root (core/modules/qproc/TaskMsgFactory2.cc:151) - no nextFragment
...
  [0x7f067a7fc700] DEBUG root (core/modules/qdisp/QueryRequest.cc:58) - QueryRequest JQ_jobId=2638
  [0x7f067a7fc700] DEBUG root (core/modules/qdisp/QueryRequest.cc:75) - Requesting, payload size: [123]
  [0x7f067a7fc700] DEBUG root (core/modules/qdisp/QueryResource.cc:55) - ~QueryResource() JQ_jobId=2638
***  [0x7f067a7fc700] DEBUG root (core/modules/qdisp/QueryRequest.cc:127) - QueryRequest::_importStream buffer.size=256
  [0x7f067a7fc700] DEBUG root (core/modules/qdisp/QueryRequest.cc:130) - _importStream->GetResponseData size=256 0x7f06b46e91c0 [[0]=0, [1]=0, [2]=0, [3]=0, [4]=0, ..., [251]=0, [252]=0, [253]=0, [254]=0, [255]=0]
  [0x7f067a7fc700] DEBUG root (core/modules/qdisp/QueryRequest.cc:132) - Initiated request ok
  [0x7f067a7fc700] INFO  root (core/modules/qdisp/QueryRequest.cc:57) - New QueryRequest with payload(123)
...
  [0x7f05db7fe700] INFO  root (core/modules/qdisp/QueryRequest.cc:155) - ProcessResponseData with buflen=47 (more)
  [0x7f05db7fe700] INFO  root (core/modules/ccontrol/MergingHandler.cc:85) - From:ccqserv147 flush state=RESULT_WAIT blen=47 last=0
  [0x7f05db7fe700] INFO  root (core/modules/ccontrol/MergingHandler.cc:113) - From:ccqserv147 _buffer [[0]=8, [1]=0, [2]=16, [3]=4, [4]=26, ..., [42]=52, [43]=48, [44]=53, [45]=16, [46]=0]
  [0x7f05db7fe700] INFO  root (core/modules/ccontrol/MergingHandler.cc:123) - Messages ends, setting last=true
  [0x7f05db7fe700] INFO  root (core/modules/ccontrol/MergingHandler.cc:127) - Flushed msgContinues=0 last=1 for tableName=r_41ff8f47beaf890993226b08c275692c4_5298_0
   [0x7f05db7fe700] DEBUG rproc.InfileMerger (core/modules/rproc/InfileMerger.cc:279) - Executing InfileMerger::merge(sizes=40, 47, rowcount=1, errCode=0, hasErrorMsg=0)
  [0x7f05db7fe700] INFO  root (core/modules/qdisp/QueryRequest.cc:266) - Finished() ok.
  [0x7f05db7fe700] INFO  qdisp.Executive (core/modules/qdisp/Executive.cc:175) - Executive::markCompleted(2638,1)

 
 

On 11/24/15 10:25, John Gates wrote:
[log in to unmask]" type="cite"> It looks like all the tasks on ccqserv148 have completed, and it is expected that all the threads in the pool will block on getCmd() as there are no Tasks in the scheduler to run. For this query, there were 321 tasks which all seem to have run to completion. The last line in the worker log is

   INFO  root (core/modules/xrdsvc/SsiSession.cc:153) - RequestFinished type=isStream

That seems pretty reasonable. I've logged into a couple of other workers (147 and 140) and their log files look very similar. The thing that bothers me about all three logs is that the SsiSession is not being destroyed (I think it might be waiting on something from the czar but I need to look into it).

For a quick check, I did

  /opt/shmux/bin/shmux -c "sudo -u qserv docker exec -t qserv tail /qserv/run/var/log/worker/xrootd.log" ccqserv{126..146} > ~/junk

and the ends of all the log files look very similar (despite being mashed together by shmux) They all end with at least one RequestFinished and none of them appear to be deleteing SsiSession.

On 11/24/15 08:23, Fabrice Jammes wrote:
[log in to unmask]" type="cite">
On 11/19/2015 12:14 AM, Gates, John H wrote:
[log in to unmask]" type="cite">

Hi Fabrice,


I'd like to test things out on the cluster but I want to make sure that I'm not going to step on your toes. I also want to make sure that it's working before I get into. 


-John

Hi John

A recent Qserv version seems to run on ccqserv125...ccqserv149.
It works fine for small query but stall for large queries, on the worker side it seems...

I tried to diagnose it in this ticket description:
https://jira.lsstcorp.org/browse/DM-4407

Feel free to ask if you need information about connecting to docker machine.


I use:

ssh ccqservXXX
sudo -u qserv bash
docker exec -it qserv bash
# then I'm log in the container
ls /qserv/run/var/log
ps x
ls /qserv/data
...



Use REPLY-ALL to reply to list

To unsubscribe from the QSERV-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1




Use REPLY-ALL to reply to list

To unsubscribe from the QSERV-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1




Use REPLY-ALL to reply to list

To unsubscribe from the QSERV-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1