Hello Pete and Fabrizio

I repeated the test with xrdcp and I still see some crashes.

I used xrdcp from the head, checked out on Friday.
The core files are in ~wilko/work/xrootd/core .
Look at:
  core.26531 and xrdcp_debugLog.26531
and
  core.612 and xrdcp_debugLog.612

The xrdcp_debugLog.NNNNN files contain the xrdcp debug output; I used debug
level 2.
The binaries and libs are in ~wilko/bbtest/xrootd/head/xrootd

The difference between the two is that in the first case the file read
from xrootd is 800010 bytes large and in the second case the file
is 100MB (100*1024*1024) bytes large.
The crashes are reproducible, but it can take a while (the longest
I had to wait was about 4 hours).
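
Since a crash can take hours to show up, a small harness that repeats the
transfer until it fails is handy. The following is only a sketch of such a
loop, not the script actually used for these tests; the function name
run_until_failure is made up for illustration, and the demo runs a harmless
stand-in command instead of xrdcp.

```python
import subprocess
import sys


def run_until_failure(cmd, max_runs):
    """Run cmd repeatedly and stop at the first failure.

    Returns (run_number, returncode) for the first non-zero exit,
    or None if all max_runs runs succeed. On POSIX a negative
    returncode -N means the process was killed by signal N
    (for example -11 for a segmentation fault).
    """
    for run in range(1, max_runs + 1):
        # Discard stdout, as in "xrdcp ... - > /dev/null".
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL)
        if result.returncode != 0:
            return run, result.returncode
    return None


# Stand-in command so the sketch runs anywhere. For a real test one would
# pass the xrdcp command line instead (host and path are site specific).
outcome = run_until_failure([sys.executable, "-c", "pass"], 3)
print("no failure" if outcome is None else f"failed on run {outcome[0]}")
```

The returned run number makes it easy to match a failure with the
corresponding core and debug log files.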

I also transferred files whose sizes are a multiple of 800000 bytes. I didn't
see any crash that produced a core, but in one case xrdcp
was hanging and using 100% CPU. I have to run these tests for a little bit
longer, but I have the impression that it makes a difference whether the
file size is a multiple of 800000 or not.
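
To make the size comparison concrete, here is a small sketch that checks the
file sizes mentioned in this message against 800000. Treating 800000 as the
relevant block size is an assumption based on the observations above, not a
confirmed xrdcp constant; the name remainder_after_chunks and the 3 * 800000
example size are made up for illustration.

```python
# Assumption: 800000 is the size that matters, as suggested by the tests above.
CHUNK = 800000


def remainder_after_chunks(size, chunk=CHUNK):
    """Return the number of bytes left over after whole chunks."""
    return size % chunk


# The two crashing sizes from this message, plus one hypothetical exact multiple.
sizes = [800010, 100 * 1024 * 1024, 3 * CHUNK]

for size in sizes:
    rem = remainder_after_chunks(size)
    kind = "multiple" if rem == 0 else "not a multiple"
    print(f"{size:>10} bytes: {kind} of {CHUNK} (remainder {rem})")
```

Both sizes that produced a core leave a non-zero remainder, which fits the
impression above but does not prove it.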

Here is the gdb output of the two different types of crashes I saw:

1)  core.26531   (file size was 800010)
(gdb) where
#0  0x001fecab in _int_malloc () from /lib/tls/libc.so.6
#1  0x001fde9d in malloc () from /lib/tls/libc.so.6
#2  0x00c1a89e in operator new(unsigned) () from /usr/lib/libstdc++.so.5
#3  0x08077d22 in XrdCpMthrQueue::PutBuffer(void*, int) (this=0x808eba0,
    buf=0x0, len=0) at XrdCpMthrQueue.cc:60
#4  0x08053c8f in ReaderThread_xrd(void*) () at Xrdcp.cc:78
#5  0x0807f8c2 in XrdOucThread_Xeq (myargs=0x84aba20) at
    XrdOucPthread.cc:80
#6  0x00181dec in start_thread () from /lib/tls/libpthread.so.0
#7  0x00269a2a in clone () from /lib/tls/libc.so.6


2) core.612  (large file 100MB)
(gdb) where
#0  0x001fe027 in _int_free () from /lib/tls/libc.so.6
#1  0x001fd018 in free () from /lib/tls/libc.so.6
#2  0x0806d0fc in ~XrdClientReadCacheItem (this=0x9a91db8) at
    XrdClientReadCache.cc:40
#3  0x0806dc77 in XrdClientReadCache::RemoveItems() (this=0x9a94ab8) at
    XrdClientReadCache.cc:218
#4  0x0806d610 in ~XrdClientReadCache (this=0x9a94ab8) at
    XrdClientReadCache.cc:100
#5  0x0805cd07 in ~XrdClientConn (this=0x9a93440) at XrdClientConn.cc:117
#6  0x08058411 in ~XrdClient (this=0x9a91ba8) at XrdClient.cc:67
#7  0x080551ef in doCp_xrd2loc(char const*, char const*) (
    src=0x9a97308 "root://datadevsol02:2094////prod/test/small2.test",
    dst=0x9a9a898 "-") at Xrdcp.cc:412
#8  0x08056261 in main (argc=5, argv=0xbfff7144) at Xrdcp.cc:618


Cheers,
   Wilko



On Thu, 3 Mar 2005, Peter Elmer wrote:

>   Hi Wilko,
>
>   Just for the record, Fabrizio just wrote (as part of a CVS commit):
>
> On Thu, Mar 03, 2005 at 07:33:42PM +0000, Fabrizio Furano wrote:
> > Hi again,
> <...>
> >  With this one I am no longer able to make xrdcp crash under heavy load
> > in the client/server machine. I am still investigating on the occasional
> > cpu eating, but it seems that that's more difficult, since in my tests,
> > the problem disappears when enabling the client side log, and for some
> > strange reason I am not able to spot it by attaching gdb to the process.
> >
> > Fabrizio
>
>                                    Pete
>
>
> On Mon, Feb 28, 2005 at 12:26:48AM -0800, Wilko Kroeger wrote:
> > Hello Fabrizio
> >
> > I ran the xrdcp test again and I can reproduce crashes in xrdcp
> > (sometimes it takes 30-60 mins).
> > I used the xrootd version 20050226-0825 and xrdcp is running on a RHEL3
> > machine. I read the same file over and over:
> >   xrdcp -DIDebugLevel 2 root://${xrdhost}:2094///prod/test/small.test - > /dev/null
> >
> > The size of the small.test file is:
> > > ls -l small.test
> > rw-r--r--   1 wilko  ec  31457280 Feb 27 18:09 /u1/wilko/kanga/prod/test/small.test
> > which is 30 MB (30*1024*1024)
> >
> > I used debugLevel 1 and 2.
> >
> > You can find the core file and the debug output files in:
> > ~wilko/bbdev/work/xrootd/core/20050227_2233_d1/
> > ~wilko/bbdev/work/xrootd/core/20050227_2302_d1/
> > ~wilko/bbdev/work/xrootd/core/20050227_2314_d2/
> > ~wilko/bbdev/work/xrootd/core/20050227_2350_d2/
> >
> > each directory contains a core file and the debug output file
> > (wk_log...). The ending d1 or d2 means debuglevel 1 or 2.
> >
> > With debug option = 1, gdb shows:
> > #0  0x0018b17c in memcpy () from /lib/tls/libc.so.6
> > #1  0x0806edbc in XrdClientReadCacheItem::GetPartialInterval(void const*,
> >     long long, long long) (this=0x9f107d0, buffer=0xb5750d08,
> >     begin_offs=31457280, end_offs=31714559) at XrdClientReadCache.hh:93
> >
> > whereas with debugLevel=2, gdb shows:
> >
> > #0  0x00a4e027 in _int_free () from /lib/tls/libc.so.6
> > #1  0x00a4d018 in free () from /lib/tls/libc.so.6
> > #2  0x0806d984 in ~XrdClientReadCacheItem (this=0x96b3db8) at
> >     XrdClientReadCache.cc:40
> >
> >
> > On the xrootd side I see the error:
> > 050227 23:54:39 064 XrdLink: Unable to receive from wilko.30110:17@tori0001;
> >        connection reset by peer
> > 050227 23:54:39 064 XrootdXeq: wilko.30110:17@tori0001 disc 1:02:03 (link
> >        read error)
> >
> > (the corresponding client crash was around 23:50)
> >
> >
> > Thanks for looking into this,
> >
> > Wilko
> >
>
>
>
> -------------------------------------------------------------------------
> Peter Elmer     E-mail: [log in to unmask]      Phone: +41 (22) 767-4644
> Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 23, Switzerland
> -------------------------------------------------------------------------
>