Hi Alvise,

Yes that would make sense.

Andy

On Tue, 8 Mar 2005, Alvise Dorigo wrote:

> Hi Andy,
> my client was not crashing at the time of those tests; I simply unplugged
> the ethernet cable (or did "ifdown eth0" as superuser).
> There's no way to "close" a connection if the ethernet cable has been
> unplugged. In fact, a "close" implies some kind of low-level TCP
> handshake that the kernel performs even when you press CTRL-C on the
> server or on the client. But without a physical connection no TCP byte
> can reach the server.
> That said, I understand that this problem could probably be handled on
> the client side: when the network is up again, make sure that the old
> TCP connection is closed, by closing it after the new one has been
> successfully opened *AND* *before* trying to open the same file the
> client was writing into before the net outage. Does that make sense to you?
>
> Alvise
>
> abh wrote:
>
> > Hi Fabrizio,
> >
> > Yes, I think you should not use the flag (of course you should provide
> > for an option to turn it on). I understand the problem Alvise had. His
> > client crashed and never closed the connection. So, the server thought
> > the file was still opened for write and refused to let him rewrite it.
> > It is also possible for this to happen if you close a connection and
> > immediately try to open a new one. I think having the open to force
> > the copy is probably sufficient.
> >
> > Andy
> >
> > ----- Original Message -----
> > From: "Fabrizio Furano" <[log in to unmask]>
> > To: "Wilko Kroeger" <[log in to unmask]>
> > Cc: "Peter Elmer" <[log in to unmask]>; <[log in to unmask]>;
> > "Andrew Hanushevsky" <[log in to unmask]>
> > Sent: Monday, March 07, 2005 2:20 AM
> > Subject: Re: crashes in xrdcp
> >
> >
> >> Hi,
> >>
> >> about the kXR_force, I remember that I put it into the flags to
> >> override some situations in which a write retry could not succeed
> >> because the former server had not yet understood that the
> >> previous connection was down. That was part of a bunch of little
> >> problems spotted by Alvise in his tests.
> >>
> >> Andy, do you agree that I should remove that flag?
> >>
> >> Fabrizio
> >>
> >> Wilko Kroeger wrote:
> >>
> >>> Hello Pete
> >>>
> >>> Ok, thanks. I will try out the head.
> >>> If possible we should also fix the problem that the kXR_force
> >>> is used. It seems quite dangerous to me that two clients
> >>> can write to the same file, or that one could overwrite pieces
> >>> of an existing file.
> >>>
> >>> Cheers,
> >>> Wilko
> >>>
> >>>
> >>> On Thu, 3 Mar 2005, Peter Elmer wrote:
> >>>
> >>>
> >>>> Hi Wilko,
> >>>>
> >>>> Just for the record, Fabrizio just wrote (as part of a CVS commit):
> >>>>
> >>>> On Thu, Mar 03, 2005 at 07:33:42PM +0000, Fabrizio Furano wrote:
> >>>>
> >>>>> Hi again,
> >>>>
> >>>>
> >>>> <...>
> >>>>
> >>>>> With this one I am no longer able to make xrdcp crash under heavy
> >>>>> load in the client/server machine. I am still investigating the
> >>>>> occasional cpu eating, but it seems that that's more difficult,
> >>>>> since in my tests the problem disappears when enabling the client
> >>>>> side log, and for some strange reason I am not able to spot it by
> >>>>> attaching gdb to the process.
> >>>>>
> >>>>> Fabrizio
> >>>>
> >>>>
> >>>> Pete
> >>>>
> >>>>
> >>>> On Mon, Feb 28, 2005 at 12:26:48AM -0800, Wilko Kroeger wrote:
> >>>>
> >>>>> Hello Fabrizio
> >>>>>
> >>>>> I ran the xrdcp test again and I can reproduce crashes in xrdcp
> >>>>> (sometimes it takes 30-60 mins).
> >>>>> I used the xrootd version 20050226-0825 and xrdcp is running on a
> >>>>> RHEL3 machine. I read the same file over and over:
> >>>>> xrdcp -DIDebugLevel 2 root://${xrdhost}:2094///prod/test/small.test - > /dev/null
> >>>>>
> >>>>> The size of the small.test file is:
> >>>>>
> >>>>>> ls -l small.test
> >>>>>
> >>>>>
> >>>>> rw-r--r-- 1 wilko ec 31457280 Feb 27 18:09 /u1/wilko/kanga/prod/test/small.test
> >>>>> which is 30 MB (30*1024*1024)
> >>>>>
> >>>>> I used debugLevel 1 and 2.
> >>>>>
> >>>>> You can find the core file and the debug output files in:
> >>>>> ~wilko/bbdev/work/xrootd/core/20050227_2233_d1/
> >>>>> ~wilko/bbdev/work/xrootd/core/20050227_2302_d1/
> >>>>> ~wilko/bbdev/work/xrootd/core/20050227_2314_d2/
> >>>>> ~wilko/bbdev/work/xrootd/core/20050227_2350_d2/
> >>>>>
> >>>>> Each directory contains a core file and the debug output file
> >>>>> (wk_log...). The ending d1 or d2 means debugLevel 1 or 2.
> >>>>>
> >>>>> With debug option = 1, gdb shows:
> >>>>> #0 0x0018b17c in memcpy () from /lib/tls/libc.so.6
> >>>>> #1 0x0806edbc in XrdClientReadCacheItem::GetPartialInterval(void const*, long long, long long) (this=0x9f107d0, buffer=0xb5750d08, begin_offs=31457280, end_offs=31714559) at XrdClientReadCache.hh:93
> >>>>>
> >>>>> whereas with debugLevel=2, gdb shows:
> >>>>>
> >>>>> #0 0x00a4e027 in _int_free () from /lib/tls/libc.so.6
> >>>>> #1 0x00a4d018 in free () from /lib/tls/libc.so.6
> >>>>> #2 0x0806d984 in ~XrdClientReadCacheItem (this=0x96b3db8) at XrdClientReadCache.cc:40
> >>>>>
> >>>>>
> >>>>> On the xrootd site I see the error:
> >>>>> 050227 23:54:39 064 XrdLink: Unable to receive from wilko.30110:17@tori0001; connection reset by peer
> >>>>> 050227 23:54:39 064 XrootdXeq: wilko.30110:17@tori0001 disc 1:02:03 (link read error)
> >>>>>
> >>>>> (the corresponding client crash was around 23:50)
> >>>>>
> >>>>>
> >>>>> Thanks for looking into this,
> >>>>>
> >>>>> Wilko
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> -------------------------------------------------------------------------
> >>>> Peter Elmer E-mail: [log in to unmask] Phone: +41 (22) 767-4644
> >>>> Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 23, Switzerland
> >>>> -------------------------------------------------------------------------
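
For readers of the archive, the client-side ordering proposed at the top of this thread (open the new connection, close the old one, and only then reopen the file for writing) can be illustrated with a toy model. This is only a sketch under stated assumptions: FakeServer, recover_after_outage and the one-writer-per-path rule are hypothetical names and simplifications invented for illustration. They are not the xrootd client or server API, and nothing here touches a real socket.

```python
class FakeServer:
    """Hypothetical stand-in for a server that allows one writer per path."""

    def __init__(self):
        self._next_id = 0
        self.live = set()     # connection ids the server believes are open
        self.writers = {}     # path -> connection id holding the write open

    def connect(self):
        """Register a new connection and return its id."""
        self._next_id += 1
        self.live.add(self._next_id)
        return self._next_id

    def close(self, conn_id):
        """A close that reaches the server releases that connection's write opens."""
        self.live.discard(conn_id)
        for path in [p for p, c in self.writers.items() if c == conn_id]:
            del self.writers[path]

    def open_for_write(self, conn_id, path):
        """Return True if the open is granted, False if another connection holds it."""
        if conn_id not in self.live:
            raise ConnectionError("unknown or closed connection")
        holder = self.writers.get(path)
        if holder is not None and holder != conn_id:
            return False      # server still thinks the old connection is writing
        self.writers[path] = conn_id
        return True


def recover_after_outage(server, old_conn, path):
    """Apply the ordering from the thread once the network is back."""
    new_conn = server.connect()                  # 1. open the new connection first
    server.close(old_conn)                       # 2. then close the stale one
    ok = server.open_for_write(new_conn, path)   # 3. only now reopen the file
    return new_conn, ok


if __name__ == "__main__":
    server = FakeServer()
    old = server.connect()
    server.open_for_write(old, "/prod/test/out.file")
    # Simulated outage: the client never managed to close `old`.
    new, ok = recover_after_outage(server, old, "/prod/test/out.file")
    print("reopen granted:", ok)
```

In this model, skipping step 2 makes the reopen fail, which mirrors the refusal described in the thread. The model assumes the close of the old connection actually reaches the server once the network is up again.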