Hi Wilko,
Wilko Kroeger wrote:
>
> Hello
>
> I did some tests using xrdcp to write to xrootd, to see what happens if a
> data server goes down. I am using xrootd version 20050525-0946 for the data
> servers and redirector and 20050509-2006 for xrdcp.
>
> The main question is:
> Should xrdcp after a data server goes down ever be redirected to a new
> data server or should it just exit ?
The same behavior applies to all kinds of requests: if an error
occurs, the client goes back to the redirector and asks again what to
do. So, xrdcp is not supposed to exit.
> That's what I assumed but as described in the two tests below the current
> behavior is different.
>
>
> First Test:
> ===========
>
> For the first test writing is done via the redirector and once xrdcp
> is redirected to the data server and starts transferring data the
> xrootd on the data server is stopped.
>
> I observed the following:
> 1) xrdcp goes back to the redirector after the data server is stopped.
> 2) the redirector tells the client to wait, which adds up to about 9 mins
> 3) After waiting for 9 mins the client tries to open the file on the
> redirector and
> a) crashes with a core
> or
> b) tries to write, doesn't succeed and stops (no core)
>
> The log files are available in:
> /nfs/objyserv01/objy/databases/wilko/xrootd/problems/xrdcp_writeServerGoesDown
>
> xrdcp.log : output of xrdcp
> rdr_xrdlog.log : xrootd log file of the redirector (datadevsol12)
> datadevsol02_xrdlog.log : xrootd log file of the data server
> core.3067 : core file from xrdcp (the binary is
> ~wilko/bbtest/xrootd/20050509-2006/bin/xrdcp on RHEL3)
>
>
> In the case that xrdcp doesn't create a core the log files are:
> xrdcp_noCore.log
> rdr_xrdlog_noCore.log
> datadevsol02_xrdlog_noCore.log
>
>
> In the case of writing, should xrdcp go back to the redirector, or should
> it just wait for a certain time, and if the data server isn't available
> just quit ?
>
>
I am going to have a look at those. Unfortunately, I have lessons today
until late afternoon.
> Second Test:
> ===========
>
> For the second test the first test is run twice with a restart of
> one of the data servers in between.
> Assuming there are two data servers, DS0 and DS1, the following is done:
>
> 1) xrdcp a new file via the redirector
> 2) xrdcp is redirected to DS0, and starts to transfer the file
> 3) xrootd on DS0 is stopped, xrdcp waits for 9 mins and then exits
> leaving an incomplete copy of the file on DS0
> 4) xrdcp is restarted again with the same file name
> 5) xrdcp is redirected to DS1 (DS0 is still down), and starts to transfer
> the file
> 6) xrootd on DS1 is stopped. xrdcp goes back to the redirector
> 7) xrootd on DS0 is restarted
> 8) xrdcp is redirected to DS0 and continues to transfer the file.
> It doesn't start over writing the file but it continues where it
> stopped in step 6, which means the file could be corrupted (e.g.:
> in step 2 the first 100MB are transferred and in step 6 it starts
> at an offset of 150MB).
>
>
> The question again is:
> Shouldn't xrdcp just give up after it failed to write the file?
>
> If it gets redirected after a data server goes down, shouldn't
> it start to write the file from the beginning?
>
> I couldn't test the latest version of xrdcp (session id problem) and I
> don't know if the latest version behaves the same way.
>
>
The latest version is supposed to behave the same way, except for the
session id problem, which hopefully I will be able to fix quickly.
The problem is that the system exposes something like a "relaxed"
filesystem, where you cannot expect total adherence to what is commonly
considered "file system semantics".
I will have a look and fix the problems I find asap.
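
To illustrate the offset problem from the second test, here is a small toy
simulation. It only models files as byte buffers in a dict; write_at and the
sizes are made up for illustration and say nothing about how xrootd stores
data. It compares resuming on DS0 at the offset reached on DS1 with
restarting from offset 0 after discarding the stale partial copy.

```python
def write_at(store, server, offset, data):
    """Write data at offset into the server's buffer, zero-filling any gap."""
    buf = store.setdefault(server, bytearray())
    if len(buf) < offset:
        buf.extend(b"\0" * (offset - len(buf)))
    buf[offset:offset + len(data)] = data


source = bytes(range(1, 31))  # a 30-byte "file" with no zero bytes

store = {}
# First run: 10 bytes reach DS0 before it goes down.
write_at(store, "DS0", 0, source[:10])
# Second run: 15 bytes reach DS1 before it goes down.
write_at(store, "DS1", 0, source[:15])

# Case A: resume on DS0 at the offset reached on DS1.
resumed = {name: bytearray(buf) for name, buf in store.items()}
write_at(resumed, "DS0", 15, source[15:])
resume_ok = bytes(resumed["DS0"]) == source  # bytes 10..14 were never written

# Case B: drop the stale partial copy on DS0 and restart from offset 0.
restarted = {name: bytearray(buf) for name, buf in store.items()}
restarted["DS0"] = bytearray()
write_at(restarted, "DS0", 0, source)
restart_ok = bytes(restarted["DS0"]) == source
```

In case A the copy on DS0 has the right length but a hole between the two
partial writes; in case B it matches the source.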
Fabrizio