[+AndyH]

XrdSsiProviderServer is a global that will only be destroyed if you explicitly destruct it, if you call exit(), or if you return normally from main(). Its destructor might also get called if somebody dlclose()s the worker shared library, but I really don’t know for sure what’s guaranteed in that case.

Our worker is being stopped via delivery of SIGTERM to the xrootd process (followed up by SIGKILL if that doesn’t work). Since exit() is not guaranteed to be async-signal-safe and we have no idea who might be using the object at arbitrary points, I don’t think we can tie SIGTERM to a signal handler that does any of those things and call it a day. I’m not sure it’s possible to do this correctly from inside an xrootd plugin, i.e. xrootd might have to provide graceful shutdown for us to be able to assume that our global destructors will get called.

From what I’ve read, one is supposed to do this sort of thing by having main() sigwait() for SIGTERM or SIGINT, or by having main() read from a “termination” pipe that is written to from inside signal handlers. When a graceful shutdown request is received, one runs cleanup code and returns from main() normally. Xrd/XrdMain.cc doesn’t seem to do that, but I could easily be missing another approach or not understanding the code very well.
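A minimal sketch of the sigwait() variant described above, assuming a POSIX system. The function names blockShutdownSignals and waitForShutdownSignal are hypothetical and are not part of xrootd; this only illustrates the general pattern, not how Xrd/XrdMain.cc works or should work.

```cpp
#include <cassert>
#include <csignal>
#include <pthread.h>
#include <signal.h>

// Block SIGTERM and SIGINT in the calling thread. Intended to be called early
// in main(), before any threads are created, so that they inherit the mask.
// Returns 0 on success or an error number on failure.
int blockShutdownSignals(sigset_t* set) {
    assert(set != nullptr);
    sigemptyset(set);
    sigaddset(set, SIGTERM);
    sigaddset(set, SIGINT);
    return pthread_sigmask(SIG_BLOCK, set, nullptr);
}

// Sleep until one of the signals in `set` is pending and return its number.
// This runs in ordinary thread context, not in a signal handler, so the
// caller is free to run arbitrary cleanup code afterwards.
// Returns -1 if sigwait() fails.
int waitForShutdownSignal(const sigset_t* set) {
    int sig = 0;
    if (sigwait(set, &sig) != 0) {
        return -1;
    }
    return sig;
}
```

In this sketch main() would call blockShutdownSignals(), start its threads, call waitForShutdownSignal(), stop the threads, and then return normally, at which point destructors of objects with static storage duration run. No signal handler is installed, so the async-signal-safety restrictions do not come into play.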

Andy - can you advise?

Serge

On Aug 28, 2015, at 12:16 PM, John Gates <[log in to unmask]> wrote:

If that one destructor gets called, it should get everything on the list. The thing is we need to know when xrootd is done with it.

-John

On 08/28/15 12:08, Becla, Jacek wrote:
“a list of destructors”… that sounds like it might be less trivial than I expected. If you think you need more than 1SP go ahead and tweak it today

Jacek


On Aug 28, 2015, at 9:00 AM, Gates, John H <[log in to unmask]> wrote:

Also, there's a long list of destructors that aren't being called, including the one that should release the lock. I think this isn't being deleted.
   XrdSsiProvider *XrdSsiProviderServer = new lsst::qserv::xrdsvc::SsiProviderServer;
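To illustrate the concern, here is a small sketch of how a destructor behaves for an object created with new versus one whose lifetime is owned by a scope. Provider and LockingProvider are hypothetical stand-ins, not the real XrdSsiProvider or SsiProviderServer classes, and the counter stands in for releasing the lock.

```cpp
#include <cassert>

// Hypothetical stand-in for a provider base class.
class Provider {
public:
    virtual ~Provider() = default;
};

// Hypothetical provider whose destructor "releases the lock" by bumping a counter.
class LockingProvider : public Provider {
public:
    explicit LockingProvider(int* releases) : _releases(releases) {
        assert(releases != nullptr);
    }
    ~LockingProvider() override { ++*_releases; }
private:
    int* _releases;
};

struct Counts {
    int beforeDelete;
    int afterDelete;
};

// An object created with new is destroyed only by an explicit delete.
Counts rawPointerCounts() {
    int releases = 0;
    Provider* p = new LockingProvider(&releases);
    Counts c;
    c.beforeDelete = releases;  // nothing has destroyed *p yet
    delete p;                   // the destructor runs here and nowhere else
    c.afterDelete = releases;
    return c;
}

// An object owned by a scope is destroyed when the scope ends, even if
// other code only ever sees it through a plain pointer.
int ownedObjectCount() {
    int releases = 0;
    {
        LockingProvider owned(&releases);
        Provider* p = &owned;  // a plain pointer can still be handed out
        (void)p;
    }
    return releases;
}
```

The same reasoning applies to static storage duration: a global or function-local static object has its destructor run on exit() or a normal return from main(), while an object reached only through a global pointer initialized with new does not, unless something deletes it. Neither case helps if the process is terminated by a signal's default action, since static destructors are skipped then.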

On 08/27/15 20:54, Becla, Jacek wrote:
For the record, I have a fix in DM-3618

John, FYI, AndyS is restarting xrootd in wmgr because apparently that is the only way to force xrootd to refresh chunk inventory (at the moment)

Jacek



On Aug 27, 2015, at 7:39 PM, Becla, Jacek <[log in to unmask]> wrote:

John

Your latest code is giving me trouble. The integration test killed xrootd, and the tail of the log is similar to what you observed intermittently:

[2015-08-27T21:32:13.084-0500] [0x7f92c2df8720] INFO  root (build/xrdsvc/SsiService.cc:142) - Cleaning up scratchDb: qservScratch.
[2015-08-27T21:32:13.086-0500] [0x7f92c2df8720] WARN  root (build/wdb/ChunkResource.cc:302) - memLockStatus LOCKED_OTHER wrong uid. Expected 29962 got 28572 err=
[2015-08-27T21:32:13.086-0500] [0x7f92c2df8720] WARN  root (build/wdb/ChunkResource.cc:316) - Memory tables were not released cleanly! LockStatus=LOCKED_OTHER
[2015-08-27T21:32:13.086-0500] [0x7f92c2df8720] DEBUG root (build/wdb/ChunkResource.cc:272) - execLockSql CREATE DATABASE IF NOT EXISTS q_memoryLockDb;CREATE TABLE IF NOT EXISTS q_memoryLockDb.memoryLockTbl ( keyId INT UNIQUE, uid INT ) ENGINE = MEMORY;
[2015-08-27T21:32:13.086-0500] [0x7f92c2df8720] DEBUG root (build/wdb/ChunkResource.cc:272) - execLockSql INSERT INTO q_memoryLockDb.memoryLockTbl (keyId, uid) VALUES(1, 29962 )
[2015-08-27T21:32:13.086-0500] [0x7f92c2df8720] ERROR root (build/wdb/ChunkResource.cc:373) - Lock failed, exiting. query=INSERT INTO q_memoryLockDb.memoryLockTbl (keyId, uid) VALUES(1, 29962 ) err=Error 1062: Duplicate entry '1' for key 'keyId' Unable to execute query: INSERT INTO q_memoryLockDb.memoryLockTbl (keyId, uid) VALUES(1, 29962 )


I saved full log here:

/home/becla/qserv-run/2015_08/var/log/worker/xrootd.log

I hope we will be able to resolve it very soon. If we can’t, the backup plan is:
a) back off all changes
b) disable the code that is causing xrootd to die while you investigate

I’ll try to debug it tonight

Jacek


Use REPLY-ALL to reply to list

To unsubscribe from the QSERV-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1



