The czar doesn't get anything back from the worker but it looks like
the worker is sending it.
The last thing that happens on the czar is that xrootd calls
QueryRequest::GetRequest and then it goes silent. I don't know if it
passed an invalid payload to xrootd or if something else happened.
It looks like xrootd is never calling QueryRequest::ProcessResponse.
I'm hoping Andy Hanushevsky can help me determine if this is the
case.
-John
Worker xrootd.log specifically for chunk 5299 on ccqserv148, it
looks fine:
[2015-11-24T15:07:42.989Z] [0x7ffbc8244700] INFO root
(core/modules/xrdsvc/SsiSession.cc:61) - ProcessRequest,
service=/chk/LSST/5299
[0x7ffbc8244700] DEBUG root
(core/modules/wsched/BlendScheduler.cc:75) - BlendScheduler::queCmd
tSeq=109
[0x7ffbc8244700] DEBUG BlendSched
(core/modules/wsched/BlendScheduler.cc:93) - Blend chose group
[0x7ffbc8244700] DEBUG BlendSched
(core/modules/wsched/BlendScheduler.cc:100) - Blend queCmd tSeq=109
[0x7ffbc8244700] INFO root
(core/modules/xrdsvc/SsiSession.cc:125) - Enqueued TaskMsg for
Resource(/chk/LSST/5299) in 0.000465 seconds
...
[0x7ffbc3fff700] DEBUG Foreman
(core/modules/wdb/QueryRunner.cc:120) - QueryRunner::runQuery()
tSeq=109
[0x7ffbc3fff700] DEBUG Foreman
(core/modules/wdb/QueryRunner.cc:137) - Exec in flight for Db =
q_3c48232a8ba6adfe0b316b74245d8dca
[0x7ffbc3fff700] WARN Foreman
(core/modules/wdb/QueryRunner.cc:115) - QueryRunner overriding
dbName with LSST
[0x7ffbc3fff700] DEBUG root (core/modules/wdb/QueryRunner.cc:242)
- _transmit last=1 tSeq=109
[0x7ffbc3fff700] DEBUG root (core/modules/wdb/QueryRunner.cc:259)
- _transmitHeader
[0x7ffbc3fff700] INFO root
(core/modules/proto/ProtoHeaderWrap.cc:52) - msgBuf size=256 ->
[[0]=40, [1]=13, [2]=2, [3]=0, [4]=0, ..., [251]=48, [252]=48,
[253]=48, [254]=48, [255]=48]
[0x7ffbc3fff700] INFO root
(core/modules/xrdsvc/SsiSession_ReplyChannel.cc:85) - sendStream,
checking stream 0 len=256 last=0
[0x7ffbc3fff700] INFO root
(core/modules/xrdsvc/ChannelStream.cc:91) - last=0 [[0]=40, [1]=13,
[2]=2, [3]=0, [4]=0, [5]=0, [6]=21, [7]=47, [8]=0, [9]=0, ...,
[246]=48, [247]=48, [248]=48, [249]=48, [250]=48, [251]=48,
[252]=48, [253]=48, [254]=48, [255]=48]
[0x7ffbc3fff700] INFO root
(core/modules/xrdsvc/ChannelStream.cc:94) - trying to append
message (flowing)
[0x7ffbc3fff700] INFO root (core/modules/wdb/QueryRunner.cc:253)
- _transmit last=1 tSeq=109 resultString=[[0]=8, [1]=0, [2]=16,
[3]=4, [4]=26, ..., [42]=51, [43]=51, [44]=56, [45]=16, [46]=0]
[0x7ffbc3fff700] INFO root
(core/modules/xrdsvc/SsiSession_ReplyChannel.cc:85) - sendStream,
checking stream 0x7ffb78008e70 len=47 last=1
[0x7ffbc3fff700] INFO root
(core/modules/xrdsvc/ChannelStream.cc:91) - last=1 [[0]=8, [1]=0,
[2]=16, [3]=4, [4]=26, [5]=29, [6]=10, [7]=27, [8]=10, [9]=9, ...,
[37]=10, [38]=6, [39]=50, [40]=53, [41]=56, [42]=51, [43]=51,
[44]=56, [45]=16, [46]=0]
[0x7ffbc3fff700] INFO root
(core/modules/xrdsvc/ChannelStream.cc:94) - trying to append
message (flowing)
[0x7ffbc3fff700] DEBUG BlendSched
(core/modules/wsched/BlendScheduler.cc:132) -
BlendScheduler::commandFinish tSeq=109
--- Transmitted message with last=1, that should be fine
On the czar for chunk 5299 and jobId=2639:
2015-11-24T15:07:42.974Z [0x7f06b37fe700] DEBUG root
(core/modules/qproc/TaskMsgFactory2.cc:153) - SELECT count(*) AS
QS1_COUNT FROM LSST.Object_5299 AS QST_1_
[0x7f06b37fe700] INFO root
(core/modules/ccontrol/MergingHandler.cc:226) - setError: code: 0,
message:
[0x7f06b37fe700] DEBUG qdisp.Executive
(core/modules/qdisp/Executive.cc:106) - Executive::add(job(id=2639
payload.len=123 ru=/chk/LSST/5299))
[0x7f06b37fe700] DEBUG root (core/modules/qdisp/JobQuery.h:104) -
JobQuery JQ_jobId=2639 desc=job(id=2639 payload.len=123
ru=/chk/LSST/5299)
[0x7f06b37fe700] INFO qdisp.Executive
(core/modules/qdisp/Executive.cc:129) - Executive: Add job with
path=/chk/LSST/5299
[0x7f06b37fe700] DEBUG root
(core/modules/qdisp/MessageStore.cc:53) - Add msg: 5299 1200
Executive: Add job with path=/chk/LSST/5299
[0x7f06b37fe700] DEBUG root (core/modules/qdisp/JobQuery.cc:55) -
runJob {job(id=2639 payload.len=123 ru=/chk/LSST/5299) :
2015-11-24T15:07:42+0000, Unknown, 0, }
[0x7f06b37fe700] INFO root
(core/modules/ccontrol/MergingHandler.cc:226) - setError: code: 0,
message:
[0x7f06b37fe700] DEBUG root
(core/modules/qdisp/QueryResource.cc:51) - QueryResource
JQ_jobId=2639
[0x7f06b37fe700] INFO qproc.QuerySession
(core/modules/qproc/QuerySession.cc:383) - Non-subchunkedn
[0x7f06b37fe700] DEBUG root
(core/modules/qproc/TaskMsgFactory2.cc:151) - no nextFragment
...
[0x7f05dbfff700] INFO root
(core/modules/qdisp/QueryRequest.cc:57) - New QueryRequest with
payload(123)
[0x7f05dbfff700] DEBUG root
(core/modules/qdisp/QueryRequest.cc:58) - QueryRequest JQ_jobId=2639
[0x7f05dbfff700] DEBUG root
(core/modules/qdisp/QueryRequest.cc:75) - Requesting, payload size:
[123]
[0x7f05dbfff700] DEBUG root
(core/modules/qdisp/QueryResource.cc:55) - ~QueryResource()
JQ_jobId=2639
??????? [0x7f05dbfff700] INFO root
(core/modules/qdisp/QueryRequest.cc:57) - New QueryRequest with
payload(123)
**** And that's it for jobId=2639. It looks like
QueryRequest::ProcessResponse is never called by xrootd for this
query.
For job 2638 on the czar (which completes):
2015-11-24T15:07:42.973Z [0x7f06b37fe700] DEBUG root
(core/modules/qproc/TaskMsgFactory2.cc:153) - SELECT count(*) AS
QS1_COUNT FROM LSST.Object_5298 AS QST_1_
[0x7f06b37fe700] INFO qdisp.Executive
(core/modules/qdisp/Executive.cc:129) - Executive: Add job with
path=/chk/LSST/5298
[0x7f06b37fe700] DEBUG root
(core/modules/qdisp/MessageStore.cc:53) - Add msg: 5298 1200
Executive: Add job with path=/chk/LSST/5298
[0x7f06b37fe700] DEBUG root (core/modules/qdisp/JobQuery.cc:55) -
runJob {job(id=2638 payload.len=123 ru=/chk/LSST/5298) :
2015-11-24T15:07:42+0000, Unknown, 0, }
[0x7f06b37fe700] INFO root
(core/modules/ccontrol/MergingHandler.cc:226) - setError: code: 0,
message:
[0x7f06b37fe700] DEBUG root
(core/modules/qdisp/QueryResource.cc:51) - QueryResource
JQ_jobId=2638
[0x7f06b37fe700] INFO qproc.QuerySession
(core/modules/qproc/QuerySession.cc:383) - Non-subchunked
[0x7f06b37fe700] DEBUG root
(core/modules/qproc/TaskMsgFactory2.cc:151) - no nextFragment
...
[0x7f067a7fc700] DEBUG root
(core/modules/qdisp/QueryRequest.cc:58) - QueryRequest JQ_jobId=2638
[0x7f067a7fc700] DEBUG root
(core/modules/qdisp/QueryRequest.cc:75) - Requesting, payload size:
[123]
[0x7f067a7fc700] DEBUG root
(core/modules/qdisp/QueryResource.cc:55) - ~QueryResource()
JQ_jobId=2638
*** [0x7f067a7fc700] DEBUG root
(core/modules/qdisp/QueryRequest.cc:127) -
QueryRequest::_importStream buffer.size=256
[0x7f067a7fc700] DEBUG root
(core/modules/qdisp/QueryRequest.cc:130) -
_importStream->GetResponseData size=256 0x7f06b46e91c0 [[0]=0,
[1]=0, [2]=0, [3]=0, [4]=0, ..., [251]=0, [252]=0, [253]=0, [254]=0,
[255]=0]
[0x7f067a7fc700] DEBUG root
(core/modules/qdisp/QueryRequest.cc:132) - Initiated request ok
[0x7f067a7fc700] INFO root
(core/modules/qdisp/QueryRequest.cc:57) - New QueryRequest with
payload(123)
...
[0x7f05db7fe700] INFO root
(core/modules/qdisp/QueryRequest.cc:155) - ProcessResponseData with
buflen=47 (more)
[0x7f05db7fe700] INFO root
(core/modules/ccontrol/MergingHandler.cc:85) - From:ccqserv147 flush
state=RESULT_WAIT blen=47 last=0
[0x7f05db7fe700] INFO root
(core/modules/ccontrol/MergingHandler.cc:113) - From:ccqserv147
_buffer [[0]=8, [1]=0, [2]=16, [3]=4, [4]=26, ..., [42]=52, [43]=48,
[44]=53, [45]=16, [46]=0]
[0x7f05db7fe700] INFO root
(core/modules/ccontrol/MergingHandler.cc:123) - Messages ends,
setting last=true
[0x7f05db7fe700] INFO root
(core/modules/ccontrol/MergingHandler.cc:127) - Flushed
msgContinues=0 last=1 for
tableName=r_41ff8f47beaf890993226b08c275692c4_5298_0
[0x7f05db7fe700] DEBUG rproc.InfileMerger
(core/modules/rproc/InfileMerger.cc:279) - Executing
InfileMerger::merge(sizes=40, 47, rowcount=1, errCode=0,
hasErrorMsg=0)
[0x7f05db7fe700] INFO root
(core/modules/qdisp/QueryRequest.cc:266) - Finished() ok.
[0x7f05db7fe700] INFO qdisp.Executive
(core/modules/qdisp/Executive.cc:175) -
Executive::markCompleted(2638,1)
On 11/24/15 10:25, John Gates wrote:
[log in to unmask]"
type="cite">
It looks like all the tasks on ccqserv148 have completed, and it
is expected that all the threads in the pool will block on
getCmd() as there are no Tasks in the scheduler to run. For this
query, there were 321 tasks which all seem to have run to
completion. The last line in the worker log is
INFO root (core/modules/xrdsvc/SsiSession.cc:153) -
RequestFinished type=isStream
That seems pretty reasonable. I've logged into a couple of other
workers (147 and 140) and their log files look very similar. The
thing that bothers me about all three logs is that the SsiSession
is not being destroyed (I think it might be waiting on something
from the czar but I need to look into it).
For a quick check, I did
/opt/shmux/bin/shmux -c "sudo -u qserv docker exec -t qserv tail
/qserv/run/var/log/worker/xrootd.log" ccqserv{126..146} >
~/junk
and the ends of all the log files look very similar (despite being
mashed together by shmux) They all end with at least one
RequestFinished and none of them appear to be deleteing
SsiSession.
On 11/24/15 08:23, Fabrice Jammes
wrote:
[log in to unmask]" type="cite">
On 11/19/2015 12:14 AM, Gates, John
H wrote:
[log in to unmask]"
type="cite">
Hi Fabrice,
I'd like to test things out on the cluster but I want to
make sure that I'm not going to step on your toes. I also
want to make sure that it's working before I get into.
-John
Hi John
A recent Qserv version seems to run on ccqserv125...ccqserv149.
It works fine for small query but stall for large queries, on
the worker side it seems...
I tried to diagnose it in this ticket description:
https://jira.lsstcorp.org/browse/DM-4407
Feel free to ask if you need information about connecting to
docker machine.
I use:
ssh ccqservXXX
sudo -u qserv bash
docker exec -it qserv bash
# then I'm log in the container
ls /qserv/run/var/log
ps x
ls /qserv/data
...
Use REPLY-ALL to reply to list
To unsubscribe from the QSERV-L list, click the
following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1
Use REPLY-ALL to reply to list
To unsubscribe from the QSERV-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1