Oh, I saw these “.xrd/=core” and I ignored them. Let me see… I’m restarting.

> On Aug 7, 2015, at 10:44 PM, Serge Monkewitz <[log in to unmask]> wrote:
>
> Arg. Notice that e.g. on ccqserv102:
>
> ls -l /qserv/run-jgates/tmp/worker/.xrd/=/core
>
> lrwxrwxrwx 1 qserv qserv 34 Aug 8 03:47 cmsd -> /afs/in2p3.fr/home/j/jgates/worker
> lrwxrwxrwx 1 qserv qserv 34 Aug 8 03:47 xrootd -> /afs/in2p3.fr/home/j/jgates/worker
>
> If I `ls` the worker subdir of John’s homedir, I see this:
>
> ls -l
> ls: cannot access core.37266: Permission denied
> ls: cannot access core.23446: Permission denied
> ls: cannot access core.23902: Permission denied
> ls: cannot access core.9968: Permission denied
> ls: cannot access core.34980: Permission denied
> ls: cannot access core.45368: Permission denied
> ls: cannot access core.34512: Permission denied
> ls: cannot access core.31811: Permission denied
> ls: cannot access core.29108: Permission denied
> ls: cannot access core.31489: Permission denied
> total 0
> -rw------- 1 jgates qserv 0 Aug 8 04:19 core.17330
> ?????????? ? ? ? ? ? core.23446
> ?????????? ? ? ? ? ? core.23902
> ?????????? ? ? ? ? ? core.29108
> ?????????? ? ? ? ? ? core.31489
> ?????????? ? ? ? ? ? core.31811
> ?????????? ? ? ? ? ? core.34512
> ?????????? ? ? ? ? ? core.34980
> ?????????? ? ? ? ? ? core.37266
> ?????????? ? ? ? ? ? core.45368
> ?????????? ? ? ? ? ? core.9968
>
> (Aside: I don’t understand how a process run as qserv was allowed to create an empty file in there, as the directory is not group writable.) Also, `cat /tmp/xrootd.worker.env` gives:
>
> pid=17330&host=ccqserv102.in2p3.fr&inst=worker&ver=xrdssi-1.0.5&cfgfn=/qserv/run-jgates/etc/lsp.cf&cwd=/afs/in2p3.fr/home/j/jgates/worker&apath=/qserv/run-jgates/tmp/worker/&logfn=/qserv/run-jgates/var/log/worker/xrootd.log
>
> Finally, if you look at a node with a running xrootd, you’ll see that /proc/<pid>/cwd is indeed a symlink to that directory. So my guess is there is some sort of problem writing to AFS. If you stop the cluster and start with
>
> /opt/shmux/bin/shmux -c "sudo -u qserv sh -c 'cd /qserv/run-jgates; bin/qserv-start.sh'" ccqserv{100..124}
>
> (which should change the CWD the daemons get, and hence the location in which core files are produced), do you get the core files you want (in /qserv/run-jgates/worker/)?
>
>> On Aug 7, 2015, at 8:47 PM, Becla, Jacek <[log in to unmask]> wrote:
>>
>> So now I made xrootd fail on several machines.
>>
>> The tail of the log on ccqserv 102, 108, 109, 124 is the same as previously reported, but on ccqserv03, 120, 122 I see:
>>
>> 0808 04:19:45.608 [0x7f37bd949700] INFO root (build/proto/ProtoHeaderWrap.cc:52) - msgBuf size=256 -> [[0]=40, [1]=13, [2]=2, [3]=0, [4]=0, ..., [251]=48, [252]=48, [253]=48, [254]=48, [255]=48]
>> 0808 04:19:45.608 [0x7f37bd949700] INFO root (build/xrdsvc/SsiSession_ReplyChannel.cc:85) - sendStream, checking stream 0 len=256 last=0
>> pure virtual method called
>> terminate called without an active exception
>>
>>
>> sudo -u qserv find /qserv/run-jgates/ | grep core
>>
>> does not find any core file.
>>
>> Serge, do you want to have a look at the cluster before I run things again?
>>
>>> On Aug 7, 2015, at 7:25 PM, Serge Monkewitz <[log in to unmask]> wrote:
>>>
>>> I’ve never heard of Linux treating daemons specially with regard to core dumping - if you set the ulimit in the appropriate init.d script, you should get a core dump as usual. Can you provide a link?
>>>
>>> Serge
>>>
>>>> On Aug 7, 2015, at 7:02 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>>
>>>> Oh yes, if the thing runs as a daemon, Linux will still suppress the core file. Does it?
>>>>
>>>> Andy
>>>>
>>>> On Fri, 7 Aug 2015, Fritz Mueller wrote:
>>>>
>>>>> I'd vote yes on this, thanks.
>>>>>
>>>>> On 08/07/2015 06:52 PM, Becla, Jacek wrote:
>>>>>> ok, I am running with unlimited now.
>>>>>> The question is: do we want to add that to all our init scripts?
>>>>>> I'll create a story and will do it
>>>>>> Jacek
>>>>>>> On Aug 7, 2015, at 4:25 PM, Serge Monkewitz <[log in to unmask]> wrote:
>>>>>>> ulimit -c unlimited
>>>>>> ------------------------------------------------------------------------
>>>>>> Use REPLY-ALL to reply to list
>>>>>> To unsubscribe from the QSERV-L list, click the following link:
>>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1
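
For anyone repeating this diagnosis, the checks discussed in the thread (the core size limit, the daemon's CWD, and whether that directory can be written) can be collected in one place. What follows is only a minimal sketch, assuming a Linux host with procfs mounted; the function name `core_dump_report` is hypothetical and is not part of Qserv or xrootd.

```shell
#!/bin/sh
# Hypothetical helper (not part of Qserv or xrootd): summarize the settings
# that decide whether and where a running process can write a core file.
core_dump_report() {
    pid="$1"

    # With a relative core_pattern, the kernel writes the core file into the
    # process's current working directory, which /proc/<pid>/cwd points at.
    cwd=$(readlink "/proc/$pid/cwd") || return 1

    # Row "Max core file size" of /proc/<pid>/limits; field 5 is the soft
    # limit (0 means no core file, "unlimited" is what `ulimit -c unlimited` sets).
    limit=$(awk '/^Max core file size/ {print $5}' "/proc/$pid/limits" 2>/dev/null)
    [ -n "$limit" ] || limit=unknown

    # An absolute path or a leading '|' here overrides the CWD behaviour.
    pattern=$(cat /proc/sys/kernel/core_pattern 2>/dev/null || echo unknown)

    # This tests writability for the user running the script, not for the
    # daemon's user. On AFS, access also depends on tokens and ACLs, so
    # treat the answer as a hint only.
    if [ -w "$cwd" ]; then writable=yes; else writable=no; fi

    echo "pid=$pid cwd=$cwd core_limit=$limit core_pattern=$pattern cwd_writable=$writable"
}
```

Calling `core_dump_report 17330` (the pid from the `xrootd.worker.env` output above) as the qserv user would be one way to see whether the CWD points into AFS. Since resource limits are inherited by child processes, a `ulimit -c unlimited` in an init script has to run in the same shell that then launches the daemon.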