I saw something similar at SLAC. I put a file into the kan xrootd cluster
but not into HPSS. When I accessed the file with xrdcp I saw failures from
time to time because xrootd redirected the client to a server that didn't
have the file. The client then went back to the redirector, but it was
again redirected to a different server that didn't have the file, and at
some point the client gave up and failed (note that the server with the
file was up the whole time). AFAIK if the client asks for a file again
(with the refresh bit turned on), xrootd will locate the file again and
should find that it is on a particular server. However, xrootd might not
select that server if it thinks the server is heavily loaded. In all of the
old xrootd versions the load values are totally off, so it could be that
xrootd is not selecting a server even though it is not really busy.
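To make the guess concrete, here is a toy model of what I think might be
happening. It is not the olbd selection code: the MAX_LOAD cutoff, the
Server fields and the fallback rule are all assumptions made up for
illustration. It only shows how a misreported load value could cause the
one server holding the file to be skipped.

```python
# Toy model only: NOT the real olbd selection logic.
# MAX_LOAD, the Server fields and the fallback rule are assumptions.
from dataclasses import dataclass

MAX_LOAD = 80  # hypothetical cutoff above which a server is never selected


@dataclass
class Server:
    name: str
    has_file: bool
    reported_load: int  # load as seen by the redirector (0-100)


def select(servers):
    """Pick the least loaded usable server, preferring ones that hold the file.

    If no usable server holds the file, fall back to any usable server
    (which would then have to stage the file or fail).
    """
    usable = [s for s in servers if s.reported_load <= MAX_LOAD]
    holders = [s for s in usable if s.has_file]
    pool = holders or usable
    return min(pool, key=lambda s: s.reported_load) if pool else None


servers = [
    Server("data01", has_file=True, reported_load=95),   # idle, but load misreported
    Server("data02", has_file=False, reported_load=10),
    Server("data03", has_file=False, reported_load=20),
]

chosen = select(servers)
print(chosen.name)  # a server without the file is chosen
```

In this model the client never reaches data01 until its reported load
drops below the cutoff, no matter how often it goes back to the redirector.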
Do you have the olbd log file (from the redirector) from when you saw this
problem? Maybe the load value for the data server was very high and the
machine didn't get selected.
This is just a guess but I will see if I can test it.
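For the test, something like the following could be used to see how often
the copy fails. It is only a sketch: it runs a given command repeatedly and
counts non-zero exit codes. The host name and file path in the example
comment are placeholders, not real ones.

```python
import subprocess


def count_failures(cmd, attempts):
    """Run cmd `attempts` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(attempts):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures


# Example (hypothetical redirector host and file path):
# count_failures(
#     ["xrdcp", "root://redirector.example.org//store/test/file.root", "/dev/null"],
#     50,
# )
```

Comparing the failure count with the olbd log for the same period should
show whether the failed attempts were the ones redirected to a server
without the file.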
On Thu, 27 Oct 2005, Brew, CAJ (Chris) wrote:
> I still don't really understand why this is a client issue not a server
> issue, but whatever it is it still seems to be present in the latest SP
> A quick recap of the problem as it presented in this case.
> The disk containing one of the BkgTrigger files was taken offline for
> checks - This was a run 5 collection that had just been imported and so
> hadn't yet been written to tape.
> When a client job requested the file the olbd redirected it to one of
> the stage servers to get it from tape, which obviously failed, as did the
> subsequent jobs.
> However when the disk was brought back online and the file was once
> again available 50% of the jobs continued to fail.
> Looking at the olbd logs I see it redirecting these jobs to the stage
> server irrespective of the fact that it FAILED to import the file.
> As far as I can tell the only way to fix this is to stop and start the
> olbds on both redirectors.