Arg. Notice that, e.g., on ccqserv102:

ls -l /qserv/run-jgates/tmp/worker/.xrd/=/core

lrwxrwxrwx 1 qserv qserv 34 Aug  8 03:47 cmsd -> /afs/in2p3.fr/home/j/jgates/worker
lrwxrwxrwx 1 qserv qserv 34 Aug  8 03:47 xrootd -> /afs/in2p3.fr/home/j/jgates/worker

If I `ls` the worker subdir of John’s homedir, I see this:

ls -l
ls: cannot access core.37266: Permission denied
ls: cannot access core.23446: Permission denied
ls: cannot access core.23902: Permission denied
ls: cannot access core.9968: Permission denied
ls: cannot access core.34980: Permission denied
ls: cannot access core.45368: Permission denied
ls: cannot access core.34512: Permission denied
ls: cannot access core.31811: Permission denied
ls: cannot access core.29108: Permission denied
ls: cannot access core.31489: Permission denied
total 0
-rw------- 1 jgates qserv 0 Aug  8 04:19 core.17330
?????????? ? ?      ?     ?            ? core.23446
?????????? ? ?      ?     ?            ? core.23902
?????????? ? ?      ?     ?            ? core.29108
?????????? ? ?      ?     ?            ? core.31489
?????????? ? ?      ?     ?            ? core.31811
?????????? ? ?      ?     ?            ? core.34512
?????????? ? ?      ?     ?            ? core.34980
?????????? ? ?      ?     ?            ? core.37266
?????????? ? ?      ?     ?            ? core.45368
?????????? ? ?      ?     ?            ? core.9968

(Aside: I don’t understand how a process running as qserv was allowed to create an empty file in there, as the directory is not group-writable.) Also, `cat /tmp/xrootd.worker.env` gives:

pid=17330&host=ccqserv102.in2p3.fr&inst=worker&ver=xrdssi-1.0.5&cfgfn=/qserv/run-jgates/etc/lsp.cf&cwd=/afs/in2p3.fr/home/j/jgates/worker&apath=/qserv/run-jgates/tmp/worker/&logfn=/qserv/run-jgates/var/log/worker/xrootd.log
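The cwd can also be pulled out of that env file mechanically rather than by eye. A minimal sketch, assuming the file keeps the `key=value&key=value` layout shown above; the helper name `env_field` is hypothetical:

```shell
# Hypothetical helper: print the value of one field from an
# '&'-separated key=value line such as /tmp/xrootd.worker.env.
env_field() {
    # $1 = field name, stdin = the env line
    tr '&' '\n' | sed -n "s/^$1=//p"
}

# Usage (path taken from the message above):
#   env_field cwd < /tmp/xrootd.worker.env
```

Splitting on `&` first keeps the match anchored to the start of a field, so asking for `cwd` cannot accidentally match inside another value.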

Finally, if you look at a node with a running xrootd, you’ll see that /proc/<pid>/cwd is indeed a symlink to that directory. So my guess is there is some sort of problem writing to AFS. If you stop the cluster and start with

/opt/shmux/bin/shmux -c "sudo -u qserv sh -c 'cd /qserv/run-jgates; bin/qserv-start.sh'" ccqserv{100..124}

(which should change the CWD the daemons get, and hence the location in which core files are produced), do you get the core files you want (in /qserv/run-jgates/worker/)?
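The same check can be scripted for a node with a running xrootd. This is only a sketch under stated assumptions: `check_core_dir` is a made-up name, it relies on Linux `/proc`, and it has to run as the daemon's user (or root) to read `/proc/<pid>/cwd`. Cores only land in the CWD if `kernel.core_pattern` is a relative pattern, so that is printed too. On AFS, where access depends on ACLs and tokens, the `-w` result should be treated as a hint rather than proof.

```shell
# Hypothetical helper: report where a process would drop a core file
# and whether the current user appears able to write there.
check_core_dir() {
    pid=$1
    cwd=$(readlink "/proc/$pid/cwd") || return 1
    echo "cwd=$cwd"
    # A pattern starting with '/' or '|' means cores do NOT go to the CWD.
    echo "core_pattern=$(cat /proc/sys/kernel/core_pattern)"
    # Soft limit on core size for that process ("0" means no cores).
    awk '/^Max core file size/ {print "core_soft_limit=" $5}' "/proc/$pid/limits"
    if [ -w "$cwd" ]; then
        echo "writable=yes"
    else
        echo "writable=no"
    fi
}

# Usage, e.g. against the oldest xrootd process on the node:
#   check_core_dir "$(pgrep -o xrootd)"
```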

> On Aug 7, 2015, at 8:47 PM, Becla, Jacek <[log in to unmask]> wrote:
> 
> So now I made xrootd fail on several machines
> 
> The tail of the log on ccqserv 102, 108, 109, 124 is the same as previously reported, but on ccqserv03, 120, 122 I see:
> 
> 0808 04:19:45.608 [0x7f37bd949700] INFO  root (build/proto/ProtoHeaderWrap.cc:52) - msgBuf size=256 -> [[0]=40, [1]=13, [2]=2, [3]=0, [4]=0, ..., [251]=48, [252]=48, [253]=48, [254]=48, [255]=48]
> 0808 04:19:45.608 [0x7f37bd949700] INFO  root (build/xrdsvc/SsiSession_ReplyChannel.cc:85) - sendStream, checking stream 0 len=256 last=0
> pure virtual method called
> terminate called without an active exception
> 
> 
> sudo -u qserv find /qserv/run-jgates/ | grep core
> 
> does not find any core file. 
> 
> Serge, do you want to have a look at the cluster before I run things again?
> 
> 
>> On Aug 7, 2015, at 7:25 PM, Serge Monkewitz <[log in to unmask]> wrote:
>> 
>> I’ve never heard of Linux treating daemons specially with regard to core dumping - if you set the ulimit in the appropriate init.d script, you should get a core dump as usual. Can you provide a link?
>> 
>> Serge
>> 
>>> On Aug 7, 2015, at 7:02 PM, Andrew Hanushevsky <[log in to unmask]> wrote:
>>> 
>>> Oh yes, if the thing runs as a daemon, Linux will still suppress the core file. Does it?
>>> 
>>> Andy
>> 
>>> On Fri, 7 Aug 2015, Fritz Mueller wrote:
>>> 
>>>> I'd vote yes on this, thanks.
>>>> 
>>>> On 08/07/2015 06:52 PM, Becla, Jacek wrote:
>>>>> ok, I am running with unlimited now.
>>>>> The question is: do we want to add that to all our init scripts?
>>>>> I'll create a story and will do it
>>>>> Jacek
>>>>>> On Aug 7, 2015, at 4:25 PM, Serge Monkewitz <[log in to unmask]> wrote:
>>>>>> ulimit -c unlimited
>>>>> ------------------------------------------------------------------------
>>>>> Use REPLY-ALL to reply to list
>>>>> To unsubscribe from the QSERV-L list, click the following link:
>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1 <https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=QSERV-L&A=1>
>>>> 
>>>> 
>> 
> 
> 
