Hi Wilko,

Wilko Kroeger wrote:
> 
> Hello
> 
> I did some tests using xrdcp to write to xrootd, to see what happens if a
> data server goes down. I am using xrootd version 20050525-0946 for the data
> servers and redirector and 20050509-2006 for xrdcp.
> 
> The main question is:
> Should xrdcp after a data server goes down ever be redirected to a new
> data server or should it just exit ?

  The same behavior applies to all kinds of requests: if an error 
occurs, the client goes back to the redirector and asks again what to 
do. So xrdcp is not supposed to exit.
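
  To make that recovery rule concrete, here is a minimal sketch in plain 
Python. It is a toy model, not xrootd client code: ServerDown, 
copy_with_redirect, ask_redirector, write_chunk and demo are hypothetical 
names invented for illustration, and the limit of 5 attempts is an 
arbitrary assumption.

```python
class ServerDown(Exception):
    """Raised by the toy data server when it has been stopped."""


def copy_with_redirect(ask_redirector, write_chunk, chunks, max_attempts=5):
    """Write chunks in order; on an error, go back to the redirector.

    ask_redirector() returns a server name (hypothetical callback).
    write_chunk(server, index, data) may raise ServerDown (hypothetical).
    Returns the list of (server, index) pairs that were written.
    """
    written = []
    server = ask_redirector()
    attempts = 0
    index = 0
    while index < len(chunks):
        try:
            write_chunk(server, index, chunks[index])
        except ServerDown:
            attempts += 1
            if attempts >= max_attempts:
                raise
            # Ask the redirector again what to do, then retry this chunk.
            server = ask_redirector()
            continue
        written.append((server, index))
        index += 1
    return written


def demo():
    """Two toy servers; DS0 is stopped while chunk 2 is being written."""
    up = {"DS0": True, "DS1": True}

    def ask_redirector():
        # Return the first server that is currently up.
        return next(name for name, ok in up.items() if ok)

    def write_chunk(server, index, data):
        if server == "DS0" and index == 2:
            up["DS0"] = False  # simulate stopping xrootd on DS0
        if not up[server]:
            raise ServerDown(server)

    return copy_with_redirect(ask_redirector, write_chunk,
                              [b"a", b"b", b"c", b"d"])
```

  Note that this sketch retries the failed chunk on the new server and 
carries on from the current position. It does not restart from the 
beginning, which is exactly the point you raise in your second test.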

> That's what I assumed but as described in the two tests below the current
> behavior is different.
> 
> 
> First Test:
> ===========
> 
> For the first test writing is done via the redirector and once xrdcp
> is redirected to the data server and starts transferring data the
> xrootd on the data server is stopped.
> 
> I observed the following:
> 1) xrdcp goes back to the redirector after the data server is stopped.
> 2) the redirector tells the client to wait, which adds up to about 9 mins
> 3) After waiting for 9 mins the client tries to open the file on the
>    redirector and
>    a) crashes with a core
>      or
>    b) tries to write, doesn't succeed and stops (no core)
> 
> The log files are available in:
>  /nfs/objyserv01/objy/databases/wilko/xrootd/problems/xrdcp_writeServerGoesDown
> 
> xrdcp.log : output of xrdcp
> rdr_xrdlog.log : xrootd log file of the redirector (datadevsol12)
> datadevsol02_xrdlog.log : xrootd log file of the data server
> core.3067 : core file from xrootd (the binary is
> ~wilko/bbtest/xrootd/20050509-2006/bin/xrdcp  on RHEL3)
> 
> 
> In the case that xrdcp doesn't create a core the log files are:
> xrdcp_noCore.log
> rdr_xrdlog_noCore.log
> datadevsol02_xrdlog_noCore.log
> 
> 
> In the case of writing, should xrdcp go back to the redirector, or should
> it just wait for a certain time, and if the data server isn't available
> just quit ?
> 
> 

  I am going to have a look at those. Unfortunately, today I have lessons 
until late afternoon.



> Second Test:
> ===========
> 
> For the second test the first test is run twice with a restart of
> one of the data servers in between.
> Assuming there are two data servers, DS0 and DS1, the following is done:
> 
> 1) xrdcp a new file via the redirector
> 2) xrdcp is redirected to DS0, and starts to transfer the file
> 3) xrootd on DS0 is stopped, xrdcp waits for 9 mins and then exits
>    leaving an incomplete copy of the file on DS0
> 4) xrdcp is restarted again with the same file name
> 5) xrdcp is redirected to DS1 (DS0 is still down), and starts to transfer
>    the file
> 6) xrootd on DS1 is stopped. xrdcp goes back to the redirector
> 7) xrootd on DS0 is restarted
> 8) xrdcp is redirected to DS0 and continues to transfer the file.
>    It doesn't start over writing the file but it continues where it
>    stopped in step 6, which means the file could be corrupted (e.g.:
>    in step 2 the first 100MB are transferred and in step 6 it starts
>    at an offset of 150MB).
> 
> 
> The question again is:
> Shouldn't xrdcp just give up after it failed to write the file?
> 
> If it gets redirected after a data server goes down, shouldn't
> it start to write the file from the beginning?
> 
> I couldn't test the latest version of xrdcp (session id problem) and I
> don't know if the latest version behaves the same way.
> 
> 

  The latest version is supposed to behave the same way, except for the 
session id problem, which hopefully I will be able to fix quickly.
  The problem is that the system exposes something like a "relaxed" 
filesystem, where you cannot expect total adherence to what is commonly 
considered "file system semantics".
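
  For reference, the scenario of your second test can be reproduced with 
a small toy model. This is plain Python with no xrootd calls: write_at and 
simulate_second_test are made-up names, bytes stand in for megabytes, and 
I am assuming the data server zero-fills the unwritten gap the way a 
sparse file would, which may not match what the real server does.

```python
def write_at(store, offset, data):
    """Write data at offset into a bytearray, zero-filling any gap."""
    if len(store) < offset:
        store.extend(b"\x00" * (offset - len(store)))
    store[offset:offset + len(data)] = data


def simulate_second_test(source, first_stop, second_stop):
    """Return the file contents left on DS0 and DS1 (toy model)."""
    ds0, ds1 = bytearray(), bytearray()
    # Steps 1-3: the first xrdcp run writes to DS0 until DS0 is stopped.
    write_at(ds0, 0, source[:first_stop])
    # Steps 4-6: the second run goes to DS1 and writes until DS1 is stopped.
    write_at(ds1, 0, source[:second_stop])
    # Steps 7-8: the client is sent back to DS0 and continues at its
    # current offset instead of starting over.
    write_at(ds0, second_stop, source[second_stop:])
    return bytes(ds0), bytes(ds1)


# 200 non-zero bytes stand in for the source file.
source = bytes(range(1, 201))
ds0, ds1 = simulate_second_test(source, first_stop=100, second_stop=150)
print(ds0 == source)                    # False: DS0 holds a damaged copy
print(ds0[100:150] == b"\x00" * 50)     # True: the gap was never written
```

  In this model the copy on DS0 has the right length but a hole between 
the two offsets, and DS1 keeps an incomplete copy as well.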

  I will have a look and fix the problems I find asap.

  Fabrizio