Hello

I ran some tests using xrdcp to write to xrootd, to see what happens if a
data server goes down. I am using xrootd version 20050525-0946 for the data
servers and redirector and 20050509-2006 for xrdcp.

The main question is:
After a data server goes down, should xrdcp ever be redirected to a new
data server, or should it just exit?
I assumed it would just exit, but as described in the two tests below the
current behavior is different.


First Test:
===========

For the first test writing is done via the redirector and once xrdcp
is redirected to the data server and starts transferring data the
xrootd on the data server is stopped.

I observed the following:
1) xrdcp goes back to the redirector after the data server is stopped.
2) the redirector tells the client to wait, which adds up to about 9 mins
3) After waiting for 9 mins the client tries to open the file on the
   redirector and
   a) crashes with a core
     or
   b) tries to write, doesn't succeed and stops (no core)

The log files are available in:
 /nfs/objyserv01/objy/databases/wilko/xrootd/problems/xrdcp_writeServerGoesDown

xrdcp.log : output of xrdcp
rdr_xrdlog.log : xrootd log file of the redirector (datadevsol12)
datadevsol02_xrdlog.log : xrootd log file of the data server
core.3067 : core file from xrootd (the binary is
~wilko/bbtest/xrootd/20050509-2006/bin/xrdcp  on RHEL3)


In the case that xrdcp doesn't create a core the log files are:
xrdcp_noCore.log
rdr_xrdlog_noCore.log
datadevsol02_xrdlog_noCore.log


In the case of writing, should xrdcp go back to the redirector, or should
it just wait for a certain time, and if the data server isn't available
just quit ?
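
To make the second alternative concrete, here is a minimal Python sketch
of a "wait for a bounded time, then quit" policy. This is not xrdcp code
and says nothing about how xrdcp is implemented; wait_for_server, is_up
and the timeout values are hypothetical names used only for illustration.

```python
import time


def wait_for_server(is_up, timeout_s, poll_s=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll is_up() until it returns True or timeout_s seconds elapse.

    Hypothetical helper: returns True if the server came back in time,
    False if the caller should give up and exit with an error.
    clock and sleep are injectable so the policy can be tested quickly.
    """
    deadline = clock() + timeout_s
    while True:
        if is_up():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

With this policy the client never goes back to the redirector during a
write; it either resumes on the same data server or quits.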


Second Test:
============

For the second test the first test is run twice with a restart of
one of the data servers in between.
Assuming there are two data servers, DS0 and DS1, the following is done:

1) xrdcp a new file via the redirector
2) xrdcp is redirected to DS0, and starts to transfer the file
3) xrootd on DS0 is stopped, xrdcp waits for 9 mins and then exits
   leaving an incomplete copy of the file on DS0
4) xrdcp is restarted again with the same file name
5) xrdcp is redirected to DS1 (DS0 is still down), and starts to transfer
   the file
6) xrootd on DS1 is stopped. xrdcp goes back to the redirector
7) xrootd on DS0 is restarted
8) xrdcp is redirected to DS0 and continues to transfer the file.
   It doesn't start over writing the file but it continues where it
   stopped in step 6, which means the file could be corrupted (e.g.:
   in step 2 the first 100MB are transferred and in step 8 it resumes
   at the offset of 150MB reached in step 6).
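
The effect of the steps above can be reproduced locally with a small
Python sketch. It only simulates the write pattern with two ordinary
files standing in for the copies on DS0 and DS1; it does not talk to
xrootd, and the sizes are scaled down from MB to bytes. The names
write_at and simulate are hypothetical.

```python
import os
import tempfile


def write_at(path, offset, data):
    """Write data at the given offset without truncating an existing file."""
    mode = "r+b" if os.path.exists(path) else "w+b"
    with open(path, mode) as f:
        f.seek(offset)
        f.write(data)


def simulate(source, first_stop, second_stop):
    """Replay the write pattern of the second test; return DS0's final copy."""
    with tempfile.TemporaryDirectory() as d:
        ds0 = os.path.join(d, "ds0_copy")
        ds1 = os.path.join(d, "ds1_copy")
        # Steps 1-3: first run writes to DS0 until DS0 is stopped.
        write_at(ds0, 0, source[:first_stop])
        # Steps 4-6: second run writes to DS1 until DS1 is stopped.
        write_at(ds1, 0, source[:second_stop])
        # Steps 7-8: client is redirected to DS0 and resumes at the
        # offset it had reached on DS1 instead of starting over.
        write_at(ds0, second_stop, source[second_stop:])
        with open(ds0, "rb") as f:
            return f.read()


source = bytes(range(1, 201))           # 200 non-zero bytes
result = simulate(source, 100, 150)
hole = result[100:150]                  # range never written on DS0
```

In this simulation the file on DS0 has the right length, but the range
between the two stop offsets was never written there, so it differs from
the source.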


The question again is:
Shouldn't xrdcp just give up after it failed to write the file?

If it gets redirected after a data server goes down, shouldn't
it start to write the file from the beginning?

I couldn't test the latest version of xrdcp (session id problem) and I
don't know if the latest version behaves the same way.


Cheers,
   wilko