Yes, this is a recoverable error from the client side. It's very difficult
for the server to recover from this error. The problem is
that the server is not sure at this point where the outstanding reads
are in the pipeline. It needs to know that to be sure that the client
gets the data in the right order. That's why it simply flushes the
pipeline, but doing so invalidates the restart offset.
So, yes. When you get this error, you have the last offset of
the uncompleted read. You can simply re-issue the read at that point.
You probably want to establish a limit on how many times you will do that
in the case that you immediately get the same error (an even more
impossible situation). This, of course, doesn't mean I won't try to fix
the problem. So far my research has shown that it *may* be possible for
such a situation to occur under certain timing conditions, though it's not
at all clear that those actually can happen.
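To make the client-side recovery concrete, here is a minimal sketch of the
bounded re-issue described above. It is not the xrdcp client code: the names
RecoverableReadError, read_with_retry, read_fn and max_retries are all
hypothetical, and read_fn stands in for whatever call actually performs the
read. The only assumption taken from the discussion is that the client knows
the offset of the uncompleted read and can re-issue it there.

```python
class RecoverableReadError(Exception):
    """Hypothetical stand-in for the server's 'No buffer space available' error."""


def read_with_retry(read_fn, offset, length, max_retries=3):
    """Read `length` bytes starting at `offset`, re-issuing a failed read.

    `read_fn(offset, nbytes)` is an assumed callable that returns bytes or
    raises RecoverableReadError. A failed read is re-issued at the same
    offset, but at most `max_retries` times in a row.
    """
    data = bytearray()
    end = offset + length
    failures = 0  # consecutive failures at the current offset
    while offset < end:
        try:
            chunk = read_fn(offset, end - offset)
        except RecoverableReadError:
            failures += 1
            if failures > max_retries:
                raise  # give up: the same error keeps coming back
            continue  # re-issue the read at the uncompleted offset
        if not chunk:
            break  # end of file reached before `length` bytes
        failures = 0  # progress was made, so reset the limit
        data += chunk
        offset += len(chunk)
    return bytes(data)
```

The failure counter is reset after every successful read, so the limit only
applies to getting the same error back to back, which is the case the limit
is meant to guard against.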
On Thu, 13 Oct 2005, Fabrizio Furano wrote:
> Hi Andy,
> I quite understand, but this is a little cryptic to me.
> The thing that I dislike in principle is that a server-side
> resource problem is treated by the whole system as a fatal
> situation. And the result is a truncated file copy, which is really bad.
> I wonder if the garble can be solved by making the client treat that
> kind of error as a recoverable one instead of a fatal one. If you think
> that this policy won't make bad things worse, I'll have a look into it.
> Andrew Hanushevsky wrote:
> > Hi Fabrizio,
> > I looked into this some more. Apparently, it is not a kernel issue. What
> > xrootd is complaining about is that it should have had an available buffer
> > to do the I/O request but there was not one to be found. I'm not sure what
> > this indicates exactly since it's one of those conditions that should
> > never happen. As xrootd found itself in an untenable situation it threw
> > up its hands and terminated your request.
> > Andy
> > On Wed, 12 Oct 2005, Fabrizio Furano wrote:
> >>Hi Andy,
> >> copying files with xrdcp I got this error from the kan cluster:
> >>051012 09:04:16 14869 Xrd: ReadBuffer: Server
> >>[kan007.slac.stanford.edu:1094] did not return OK message for last
> >>051012 09:04:16 14869 Xrd: SendGenCommand: Server declared error
> >>3008:XrdXrootdAio: Unable to
> >>read /store/PRskims/R14/14.4.2a/BToDlnu/04/BToDlnu_0416.03HUBCA.root; No
> >>buffer space available
> >> To me it sounds ugly. It happened approaching the end of this copy. What do you think?