Oh I saw these “.xrd/=core” and I ignored them.

Let me see... I’m restarting.



> On Aug 7, 2015, at 10:44 PM, Serge Monkewitz <[log in to unmask]> wrote:
> 
> Arg. Notice that, e.g., on ccqserv102:
> 
> ls -l /qserv/run-jgates/tmp/worker/.xrd/=/core
> 
> lrwxrwxrwx 1 qserv qserv 34 Aug  8 03:47 cmsd -> /afs/in2p3.fr/home/j/jgates/worker
> lrwxrwxrwx 1 qserv qserv 34 Aug  8 03:47 xrootd -> /afs/in2p3.fr/home/j/jgates/worker
> 
> If I `ls` the worker subdir of John’s homedir, I see this:
> 
> ls -l
> ls: cannot access core.37266: Permission denied
> ls: cannot access core.23446: Permission denied
> ls: cannot access core.23902: Permission denied
> ls: cannot access core.9968: Permission denied
> ls: cannot access core.34980: Permission denied
> ls: cannot access core.45368: Permission denied
> ls: cannot access core.34512: Permission denied
> ls: cannot access core.31811: Permission denied
> ls: cannot access core.29108: Permission denied
> ls: cannot access core.31489: Permission denied
> total 0
> -rw------- 1 jgates qserv 0 Aug  8 04:19 core.17330
> ?????????? ? ?      ?     ?            ? core.23446
> ?????????? ? ?      ?     ?            ? core.23902
> ?????????? ? ?      ?     ?            ? core.29108
> ?????????? ? ?      ?     ?            ? core.31489
> ?????????? ? ?      ?     ?            ? core.31811
> ?????????? ? ?      ?     ?            ? core.34512
> ?????????? ? ?      ?     ?            ? core.34980
> ?????????? ? ?      ?     ?            ? core.37266
> ?????????? ? ?      ?     ?            ? core.45368
> ?????????? ? ?      ?     ?            ? core.9968
> 
> (Aside: I don’t understand how a process run as qserv was allowed to create an empty file in there, as the directory is not group writeable.) Also, `cat /tmp/xrootd.worker.env` gives:
> 
> pid=17330&host=ccqserv102.in2p3.fr&inst=worker&ver=xrdssi-1.0.5&cfgfn=/qserv/run-jgates/etc/lsp.cf&cwd=/afs/in2p3.fr/home/j/jgates/worker&apath=/qserv/run-jgates/tmp/worker/&logfn=/qserv/run-jgates/var/log/worker/xrootd.log
> 
> Finally, if you look at a node with a running xrootd, you’ll see that /proc/<pid>/cwd is indeed a symlink to that directory. So my guess is there is some sort of problem writing to AFS. If you stop the cluster and start with
> 
> /opt/shmux/bin/shmux -c "sudo -u qserv sh -c 'cd /qserv/run-jgates; bin/qserv-start.sh'" ccqserv{100..124}
> 
> (which should change the CWD the daemons get, and hence the location in which core files are produced), do you get the core files you want (in /qserv/run-jgates/worker/)?
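
A minimal sketch of how the CWD theory could be checked on a single node, assuming a Linux /proc filesystem. The helper name `core_report` and its output format are made up for illustration and are not part of qserv or xrootd. It prints the working directory, the soft core-size limit, and the kernel core pattern for a given PID:

```shell
#!/bin/sh
# Hypothetical helper (not part of qserv/xrootd): show where a core dump
# from the given PID would be written, as far as the kernel is concerned.
core_report() {
    pid="$1"
    # Working directory of the running process; a relative core_pattern
    # is resolved against this directory.
    cwd=$(readlink "/proc/$pid/cwd") || return 1
    # Soft core-size limit of the process itself (5th field of the line),
    # which can differ from the limit in your login shell.
    limit=$(awk '/^Max core file size/ {print $5}' "/proc/$pid/limits")
    # A pattern starting with "/" or "|" takes the CWD out of the picture.
    pattern=$(cat /proc/sys/kernel/core_pattern)
    echo "pid=$pid cwd=$cwd core_limit=$limit core_pattern=$pattern"
}
```

Running `core_report 17330` on ccqserv102 (the pid from the env file above) should show the AFS directory as `cwd` if the guess is right, and a `core_limit` of 0 would point at the ulimit instead.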
> 
>> On Aug 7, 2015, at 8:47 PM, Becla, Jacek <[log in to unmask]> wrote:
>> 
>> So now I made xrootd fail on several machines
>> 
>> The tail of the log on ccqserv102, 108, 109, 124 is the same as previously reported, but on ccqserv03, 120, 122 I see:
>> 
>> 0808 04:19:45.608 [0x7f37bd949700] INFO  root (build/proto/ProtoHeaderWrap.cc:52) - msgBuf size=256 -> [[0]=40, [1]=13, [2]=2, [3]=0, [4]=0, ..., [251]=48, [252]=48, [253]=48, [254]=48, [255]=48]
>> 0808 04:19:45.608 [0x7f37bd949700] INFO  root (build/xrdsvc/SsiSession_ReplyChannel.cc:85) - sendStream, checking stream 0 len=256 last=0
>> pure virtual method called
>> terminate called without an active exception
>> 
>> 
>> sudo -u qserv find /qserv/run-jgates/ | grep core
>> 
>> does not find any core file.
>> 
>> Serge, do you want to have a look at the cluster before I run things again?
>> 
>> 
>>> On Aug 7, 2015, at 7:25 PM, Serge Monkewitz <[log in to unmask]> wrote:
>>> 
>>> I’ve never heard of Linux treating daemons specially with regard to core dumping - if you set the ulimit in the appropriate init.d script, you should get a core dump as usual. Can you provide a link?
>>> 
>>> Serge
>>> 
>>>> On Aug 7, 2015, at 7:02 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>>> 
>>>> Oh yes, if the thing runs as a daemon, Linux will still suppress the core file. Does it?
>>>> 
>>>> Andy
>>> 
>>>> On Fri, 7 Aug 2015, Fritz Mueller wrote:
>>>> 
>>>>> I'd vote yes on this, thanks.
>>>>> 
>>>>> On 08/07/2015 06:52 PM, Becla, Jacek wrote:
>>>>>> ok, I am running with unlimited now.
>>>>>> The question is: do we want to add that to all our init scripts?
>>>>>> I'll create a story and will do it
>>>>>> Jacek
>>>>>>> On Aug 7, 2015, at 4:25 PM, Serge Monkewitz <[log in to unmask]> wrote:
>>>>>>> ulimit -c unlimited
>>>>>> ------------------------------------------------------------------------
>>>>>> Use REPLY-ALL to reply to list
>>>>>> To unsubscribe from the QSERV-L list, click the following link:
>>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1
>>>>> 
>>>>> 
>>> 
>> 
>> 
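
On the earlier question of adding `ulimit -c unlimited` to the init scripts: a rough sketch of what such a fragment might look like, combined with the CWD point from above. The function name `prepare_core_env` is hypothetical and this is not taken from the actual qserv init scripts; it only assumes a POSIX-style shell with a `ulimit -c` builtin.

```shell
#!/bin/sh
# Hypothetical init-script fragment (not the actual qserv scripts):
# make core dumps possible before the daemon is launched.
prepare_core_env() {
    run_dir="$1"
    # With a relative core_pattern the core file lands in the CWD,
    # so refuse to continue if that directory is not writable.
    if [ ! -d "$run_dir" ] || [ ! -w "$run_dir" ]; then
        echo "not a writable directory: $run_dir" >&2
        return 1
    fi
    cd "$run_dir" || return 1
    # Raise the soft core limit; this fails if the hard limit is lower,
    # in which case we only warn and carry on.
    ulimit -c unlimited 2>/dev/null ||
        echo "warning: could not raise core limit" >&2
}
```

An init script would call something like `prepare_core_env /qserv/run-jgates` just before starting the daemon, so that both the limit and the working directory are inherited by it.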

