Oh, I saw these “.xrd/=/core” and I ignored them.

Let me see… I’m restarting.



On Aug 7, 2015, at 10:44 PM, Serge Monkewitz <[log in to unmask]> wrote:

Arg. Notice that, e.g., on ccqserv102:

ls -l /qserv/run-jgates/tmp/worker/.xrd/=/core

lrwxrwxrwx 1 qserv qserv 34 Aug  8 03:47 cmsd -> /afs/in2p3.fr/home/j/jgates/worker
lrwxrwxrwx 1 qserv qserv 34 Aug  8 03:47 xrootd -> /afs/in2p3.fr/home/j/jgates/worker

If I `ls` the worker subdir of John’s homedir, I see this:

ls -l
ls: cannot access core.37266: Permission denied
ls: cannot access core.23446: Permission denied
ls: cannot access core.23902: Permission denied
ls: cannot access core.9968: Permission denied
ls: cannot access core.34980: Permission denied
ls: cannot access core.45368: Permission denied
ls: cannot access core.34512: Permission denied
ls: cannot access core.31811: Permission denied
ls: cannot access core.29108: Permission denied
ls: cannot access core.31489: Permission denied
total 0
-rw------- 1 jgates qserv 0 Aug  8 04:19 core.17330
?????????? ? ?      ?     ?            ? core.23446
?????????? ? ?      ?     ?            ? core.23902
?????????? ? ?      ?     ?            ? core.29108
?????????? ? ?      ?     ?            ? core.31489
?????????? ? ?      ?     ?            ? core.31811
?????????? ? ?      ?     ?            ? core.34512
?????????? ? ?      ?     ?            ? core.34980
?????????? ? ?      ?     ?            ? core.37266
?????????? ? ?      ?     ?            ? core.45368
?????????? ? ?      ?     ?            ? core.9968

(Aside: I don’t understand how a process run as qserv was allowed to create an empty file in there, as the directory is not group writeable.)

Also, `cat /tmp/xrootd.worker.env` gives:


Finally, if you look at a node with a running xrootd, you’ll see that /proc/<pid>/cwd is indeed a symlink to that directory. So my guess is there is some sort of problem writing to AFS. If you stop the cluster and start with

/opt/shmux/bin/shmux -c "sudo -u qserv sh -c 'cd /qserv/run-jgates; bin/qserv-start.sh'" ccqserv{100..124}

(which should change the CWD the daemons get, and hence the location in which core files are produced), do you get the core files you want (in /qserv/run-jgates/worker/)?
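To check the reasoning above on a given node, a minimal sketch along these lines may help. It assumes a Linux /proc filesystem; the function name core_diag is made up for illustration, and reading another user's /proc/<pid>/cwd generally requires running as that user or as root. With no argument it inspects the current shell.

```shell
#!/bin/sh
# Report where a process would dump core and whether its limits allow it.
# core_diag is a hypothetical helper name, not part of qserv or xrootd.
# Usage: core_diag [PID]   (defaults to the current shell)
core_diag() {
    pid="${1:-$$}"
    # Working directory of the process; a relative core_pattern is
    # resolved against it when the kernel writes the core file.
    echo "cwd: $(readlink "/proc/$pid/cwd")"
    # Soft and hard core size limits actually in effect for that process.
    grep 'Max core file size' "/proc/$pid/limits"
    # Kernel-wide template for core file names (may be a pipe to a handler).
    echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
}

core_diag "$@"
```

If the cwd line still points into AFS after the restart, the daemons did not pick up the new working directory.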

On Aug 7, 2015, at 8:47 PM, Becla, Jacek <[log in to unmask]> wrote:

So now I made xrootd fail on several machines.

The tail of the log on ccqserv 102, 108, 109, 124 is the same as previously reported, but on ccqserv03, 120, 122 I see:

0808 04:19:45.608 [0x7f37bd949700] INFO  root (build/proto/ProtoHeaderWrap.cc:52) - msgBuf size=256 -> [[0]=40, [1]=13, [2]=2, [3]=0, [4]=0, ..., [251]=48, [252]=48, [253]=48, [254]=48, [255]=48]
0808 04:19:45.608 [0x7f37bd949700] INFO  root (build/xrdsvc/SsiSession_ReplyChannel.cc:85) - sendStream, checking stream 0 len=256 last=0
pure virtual method called
terminate called without an active exception


sudo -u qserv find /qserv/run-jgates/ | grep core

does not find any core file. 
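One thing worth noting about the search above: find does not follow symbolic links by default, so core files behind a symlinked directory would not be listed, and grep core also matches any path that merely contains "core". A hedged alternative is sketched below. The function name find_cores is made up for illustration, and the core.* pattern assumes the core.<pid> naming seen elsewhere in this thread.

```shell
#!/bin/sh
# List core files under a run directory. find_cores is a hypothetical
# helper name. -L makes find follow symlinks, so directories that are
# reached only through a link are searched as well.
find_cores() {
    # Restrict to regular files named core.<something>; errors from
    # unreadable directories are discarded.
    find -L "$1" -type f -name 'core.*' 2>/dev/null
}
```

For example, find_cores /qserv/run-jgates/ run as the qserv user would cover the same tree as the command above.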

Serge, do you want to have a look at the cluster before I run things again?


On Aug 7, 2015, at 7:25 PM, Serge Monkewitz <[log in to unmask]> wrote:

I’ve never heard of Linux treating daemons specially with regard to core dumping. If you set the ulimit in the appropriate init.d script, you should get a core dump as usual. Can you provide a link?
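As a rough sketch of what setting the ulimit in a start script could look like: the wrapper below raises the soft core limit and then runs whatever command it is given, so the limit is inherited by the daemon. The name start_with_cores is hypothetical and this is not the actual qserv init script. It uses -S to change only the soft limit, and falls back to the hard limit when unlimited is not permitted.

```shell
#!/bin/sh
# Hypothetical wrapper: raise the soft core limit, then run the command.
# ulimit is a shell builtin, and the limit it sets is inherited by every
# child process started afterwards.
start_with_cores() {
    # Try unlimited first; if the hard limit forbids that, go as high
    # as the hard limit allows.
    ulimit -S -c unlimited 2>/dev/null || ulimit -S -c "$(ulimit -H -c)"
    "$@"
}

# Example: the child shell prints the core limit it inherited.
start_with_cores sh -c 'ulimit -c'
```

In a real init script, the daemon start command would take the place of the sh -c example.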

Serge

On Aug 7, 2015, at 7:02 PM, Andrew Hanushevsky <[log in to unmask]> wrote:

Oh yes, if the thing runs as a daemon, Linux will still suppress the core file. Does it?

Andy

On Fri, 7 Aug 2015, Fritz Mueller wrote:

I'd vote yes on this, thanks.

On 08/07/2015 06:52 PM, Becla, Jacek wrote:
ok, I am running with unlimited now.
The question is: do we want to add that to all our init scripts?
I’ll create a story and do it.
Jacek
On Aug 7, 2015, at 4:25 PM, Serge Monkewitz <[log in to unmask]> wrote:
ulimit -c unlimited
------------------------------------------------------------------------
Use REPLY-ALL to reply to list
To unsubscribe from the QSERV-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1





