Arg. Notice that e.g. on ccqserv102:

    ls -l /qserv/run-jgates/tmp/worker/.xrd/=/core
    lrwxrwxrwx 1 qserv qserv 34 Aug 8 03:47 cmsd -> /afs/in2p3.fr/home/j/jgates/worker
    lrwxrwxrwx 1 qserv qserv 34 Aug 8 03:47 xrootd -> /afs/in2p3.fr/home/j/jgates/worker

If I `ls` the worker subdir of John’s homedir, I see this:

    ls -l
    ls: cannot access core.37266: Permission denied
    ls: cannot access core.23446: Permission denied
    ls: cannot access core.23902: Permission denied
    ls: cannot access core.9968: Permission denied
    ls: cannot access core.34980: Permission denied
    ls: cannot access core.45368: Permission denied
    ls: cannot access core.34512: Permission denied
    ls: cannot access core.31811: Permission denied
    ls: cannot access core.29108: Permission denied
    ls: cannot access core.31489: Permission denied
    total 0
    -rw------- 1 jgates qserv 0 Aug 8 04:19 core.17330
    ?????????? ? ? ? ? ? core.23446
    ?????????? ? ? ? ? ? core.23902
    ?????????? ? ? ? ? ? core.29108
    ?????????? ? ? ? ? ? core.31489
    ?????????? ? ? ? ? ? core.31811
    ?????????? ? ? ? ? ? core.34512
    ?????????? ? ? ? ? ? core.34980
    ?????????? ? ? ? ? ? core.37266
    ?????????? ? ? ? ? ? core.45368
    ?????????? ? ? ? ? ? core.9968

(Aside: I don’t understand how a process run as qserv was allowed to create an empty file in there, as the directory is not group writeable.)

Also, `cat /tmp/xrootd.worker.env` gives:

    pid=17330&host=ccqserv102.in2p3.fr&inst=worker&ver=xrdssi-1.0.5&cfgfn=/qserv/run-jgates/etc/lsp.cf&cwd=/afs/in2p3.fr/home/j/jgates/worker&apath=/qserv/run-jgates/tmp/worker/&logfn=/qserv/run-jgates/var/log/worker/xrootd.log

Finally, if you look at a node with a running xrootd, you’ll see that /proc/<pid>/cwd is indeed a symlink to that directory. So my guess is that the core files cannot be written properly because the daemons' working directory is on AFS.
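For anyone who wants to repeat the check on another node, here is a minimal sketch, assuming a Linux /proc filesystem and the `&`-separated key=value layout seen in /tmp/xrootd.worker.env above. The function names `env_get` and `core_target` are hypothetical helpers made up for illustration; they are not part of xrootd or qserv.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only (not part of xrootd or qserv).

# Print the value of one key from an '&'-separated key=value string,
# such as the contents of /tmp/xrootd.worker.env.
#   $1 = key name, $2 = the env string
env_get() {
    key=$1
    printf '%s\n' "$2" | tr '&' '\n' | sed -n "s/^${key}=//p"
}

# Print the working directory and the core size limit of a process.
# Assumes Linux, where /proc/<pid>/cwd is a symlink to the working
# directory and /proc/<pid>/limits lists the resource limits.
#   $1 = process id
core_target() {
    pid=$1
    echo "cwd:  $(readlink "/proc/${pid}/cwd")"
    grep 'Max core file size' "/proc/${pid}/limits"
}
```

With these defined, `env_get cwd "$(cat /tmp/xrootd.worker.env)"` should print the directory xrootd recorded at startup, and `core_target <pid>` run as a user allowed to read that process's /proc entries should show whether the core limit is really unlimited for the running daemon.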
If you stop the cluster and start with

    /opt/shmux/bin/shmux -c "sudo -u qserv sh -c 'cd /qserv/run-jgates; bin/qserv-start.sh'" ccqserv{100..124}

(which should change the CWD the daemons get, and hence the location in which core files are produced), do you get the core files you want (in /qserv/run-jgates/worker/)?

> On Aug 7, 2015, at 8:47 PM, Becla, Jacek <[log in to unmask]> wrote:
>
> So now I made xrootd fail on several machines
>
> The tail of the log on ccqserv 102, 108, 109, 124 is the same as previously reported, but on ccqserv03, 120, 122 I see:
>
> 0808 04:19:45.608 [0x7f37bd949700] INFO root (build/proto/ProtoHeaderWrap.cc:52) - msgBuf size=256 -> [[0]=40, [1]=13, [2]=2, [3]=0, [4]=0, ..., [251]=48, [252]=48, [253]=48, [254]=48, [255]=48]
> 0808 04:19:45.608 [0x7f37bd949700] INFO root (build/xrdsvc/SsiSession_ReplyChannel.cc:85) - sendStream, checking stream 0 len=256 last=0
> pure virtual method called
> terminate called without an active exception
>
> sudo -u qserv find /qserv/run-jgates/ | grep core
>
> does not find any core file.
>
> Serge, do you want to have a look at the cluster before I run things again?
>
>> On Aug 7, 2015, at 7:25 PM, Serge Monkewitz <[log in to unmask]> wrote:
>>
>> I’ve never heard of Linux treating daemons specially with regard to core dumping - if you set the ulimit in the appropriate init.d script, you should get a core dump as usual. Can you provide a link?
>>
>> Serge
>>
>>> On Aug 7, 2015, at 7:02 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>
>>> Oh yes, if the thing runs as a daemon, Linux will still suppress the core file. Does it?
>>>
>>> Andy
>>
>>> On Fri, 7 Aug 2015, Fritz Mueller wrote:
>>>
>>>> I'd vote yes on this, thanks.
>>>>
>>>> On 08/07/2015 06:52 PM, Becla, Jacek wrote:
>>>>> ok, I am running with unlimited now.
>>>>> The question is: do we want to add that to all our init scripts?
>>>>> I'll create a story and will do it
>>>>> Jacek
>>>>>> On Aug 7, 2015, at 4:25 PM, Serge Monkewitz <[log in to unmask]> wrote:
>>>>>> ulimit -c unlimited
>>>>> ------------------------------------------------------------------------
>>>>> Use REPLY-ALL to reply to list
>>>>> To unsubscribe from the QSERV-L list, click the following link:
>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1