Hi Alvise,

Yes that would make sense.

Andy

On Tue, 8 Mar 2005, Alvise Dorigo wrote:

> Hi Andy,
> my client was not crashing at the time of those tests; I simply unplugged
> the ethernet cable (or ran "ifdown eth0" as superuser).
> There's no way to "close" a connection if the ethernet cable has been
> unplugged: a "close" implies a low-level TCP handshake, which the
> kernel performs even when you press CTRL-C on the server or on the
> client. But without a physical connection no TCP byte can reach
> the server.
> That said, I understand that this problem could probably be handled on
> the client side: when the network is up again, make sure that the old
> TCP connection is closed by closing it after the new one has been
> successfully opened *AND* *before* trying to open the same file the client
> was writing into before the net outage. Does that make sense to you?
>
>     Alvise
>
> abh wrote:
>
> > Hi Fabrizio,
> >
> > Yes, I think you should not use the flag (of course you should provide
> > for an option to turn it on). I understand the problem Alvise had. His
> > client crashed and never closed the connection. So, the server thought
> > the file was still opened for write and refused to let him rewrite it.
> > It is also possible for this to happen if you close a connection and
> > immediately try to open a new one. I think having the open to force
> > the copy is probably sufficient.
> >
> > Andy
> >
> > ----- Original Message ----- From: "Fabrizio Furano"
> > <[log in to unmask]>
> > To: "Wilko Kroeger" <[log in to unmask]>
> > Cc: "Peter Elmer" <[log in to unmask]>; <[log in to unmask]>;
> > "Andrew Hanushevsky" <[log in to unmask]>
> > Sent: Monday, March 07, 2005 2:20 AM
> > Subject: Re: crashes in xrdcp
> >
> >
> >> Hi,
> >>
> >>  about the kXR_force, I remember that I put it into the flags to
> >> work around situations in which a write retry could not succeed
> >> because the former server had not yet realized that the
> >> previous connection was down. That was one of a bunch of little
> >> problems spotted by Alvise in his tests.
> >>
> >>  Andy, do you agree that I should remove that flag?
> >>
> >> Fabrizio
> >>
> >> Wilko Kroeger wrote:
> >>
> >>> Hello Pete
> >>>
> >>> Ok, thanks. I will try out the head.
> >>> If possible we should also fix the problem that kXR_force
> >>> is used. It seems quite dangerous to me that two clients
> >>> can write to the same file, or that one could overwrite pieces
> >>> of an existing file.
> >>>
> >>> Cheers,
> >>>   Wilko
> >>>
> >>>
> >>> On Thu, 3 Mar 2005, Peter Elmer wrote:
> >>>
> >>>
> >>>>  Hi Wilko,
> >>>>
> >>>>  Just for the record, Fabrizio just wrote (as part of a CVS commit):
> >>>>
> >>>> On Thu, Mar 03, 2005 at 07:33:42PM +0000, Fabrizio Furano wrote:
> >>>>
> >>>>> Hi again,
> >>>>
> >>>>
> >>>> <...>
> >>>>
> >>>>> With this one I am no longer able to make xrdcp crash under heavy load
> >>>>> in the client/server machine. I am still investigating the occasional
> >>>>> cpu eating, but that seems more difficult, since in my tests
> >>>>> the problem disappears when enabling the client side log, and for some
> >>>>> strange reason I am not able to spot it by attaching gdb to the process.
> >>>>>
> >>>>> Fabrizio
> >>>>
> >>>>
> >>>>                                   Pete
> >>>>
> >>>>
> >>>> On Mon, Feb 28, 2005 at 12:26:48AM -0800, Wilko Kroeger wrote:
> >>>>
> >>>>> Hello Fabrizio
> >>>>>
> >>>>> I ran the xrdcp test again and I can reproduce crashes in xrdcp
> >>>>> (sometimes it takes 30-60 mins).
> >>>>> I used the xrootd version 20050226-0825 and xrdcp is running on a RHEL3
> >>>>> machine. I read the same file over and over:
> >>>>>  xrdcp -DIDebugLevel 2 root://${xrdhost}:2094///prod/test/small.test -  > /dev/null
> >>>>>
> >>>>> The size of the small.test file is:
> >>>>>
> >>>>> > ls -l small.test
> >>>>> rw-r--r--   1 wilko  ec  31457280 Feb 27 18:09 /u1/wilko/kanga/prod/test/small.test
> >>>>> which is 30 MB (30*1024*1024)
> >>>>>
> >>>>> I used debugLevel 1 and 2.
> >>>>>
> >>>>> You can find the core file and the debug output files in:
> >>>>> ~wilko/bbdev/work/xrootd/core/20050227_2233_d1/
> >>>>> ~wilko/bbdev/work/xrootd/core/20050227_2302_d1/
> >>>>> ~wilko/bbdev/work/xrootd/core/20050227_2314_d2/
> >>>>> ~wilko/bbdev/work/xrootd/core/20050227_2350_d2/
> >>>>>
> >>>>> each directory contains a core file and the debug output file
> >>>>> (wk_log...). The ending d1 or d2 means debuglevel 1 or 2.
> >>>>>
> >>>>> With debug option = 1, gdb shows:
> >>>>> #0  0x0018b17c in memcpy () from /lib/tls/libc.so.6
> >>>>> #1  0x0806edbc in XrdClientReadCacheItem::GetPartialInterval(void const*,
> >>>>>    long long, long long) (this=0x9f107d0, buffer=0xb5750d08,
> >>>>>    begin_offs=31457280, end_offs=31714559) at XrdClientReadCache.hh:93
> >>>>>
> >>>>> whereas with debugLevel=2, gdb shows:
> >>>>>
> >>>>> #0  0x00a4e027 in _int_free () from /lib/tls/libc.so.6
> >>>>> #1  0x00a4d018 in free () from /lib/tls/libc.so.6
> >>>>> #2  0x0806d984 in ~XrdClientReadCacheItem (this=0x96b3db8) at
> >>>>>    XrdClientReadCache.cc:40
> >>>>>
> >>>>>
> >>>>> On the xrootd side I see the error:
> >>>>> 050227 23:54:39 064 XrdLink: Unable to receive from wilko.30110:17@tori0001;
> >>>>>       connection reset by peer
> >>>>> 050227 23:54:39 064 XrootdXeq: wilko.30110:17@tori0001 disc 1:02:03 (link
> >>>>>       read error)
> >>>>>
> >>>>> (the corresponding client crash was around 23:50)
> >>>>>
> >>>>>
> >>>>> Thanks for looking into this,
> >>>>>
> >>>>> Wilko
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> -------------------------------------------------------------------------
> >>>> Peter Elmer     E-mail: [log in to unmask]      Phone: +41 (22) 767-4644
> >>>> Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 23, Switzerland
> >>>> -------------------------------------------------------------------------
> >>>>
> >>>>
> >>
>
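
The client-side ordering Alvise proposes in his message (open the new connection, close the old one, and only then reopen the file) can be illustrated with a small toy model. This is only a sketch under stated assumptions: `FakeServer`, `recover` and the `writers` table are hypothetical names invented for the illustration and are not part of xrootd or xrdcp. The model simply assumes that the server refuses a second open-for-write while another connection still holds the file, which is the behaviour Andy describes.

```python
class FakeServer:
    """Toy stand-in for a data server (hypothetical, not the xrootd API).

    It remembers which connection has each file open for write and
    refuses a second writer, mimicking the situation in the thread.
    """

    def __init__(self):
        self.next_id = 0
        self.writers = {}  # path -> id of the connection holding it for write

    def connect(self):
        """Hand out a new connection id."""
        self.next_id += 1
        return self.next_id

    def open_for_write(self, conn, path):
        """Open path for write, failing if another connection holds it."""
        holder = self.writers.get(path)
        if holder is not None and holder != conn:
            raise PermissionError(f"{path} is already open for write")
        self.writers[path] = conn

    def close(self, conn):
        """Close a connection and release every file it had open for write."""
        self.writers = {p: c for p, c in self.writers.items() if c != conn}


def recover(server, old_conn, path):
    """Recovery order from the thread, applied once the network is back."""
    new_conn = server.connect()            # 1. open the new connection first
    server.close(old_conn)                 # 2. then close the stale connection
    server.open_for_write(new_conn, path)  # 3. only now reopen the file
    return new_conn
```

In this model, reopening the file from a fresh connection without step 2 raises `PermissionError`, because the server still believes the old connection is writing; with the three steps in order the reopen succeeds without any force flag.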