Hi Wilko,

Wilko Kroeger wrote:
>
> Hello
>
> I did some tests using xrdcp to write to xrootd and what happens if a data
> server goes down. I am using xrootd version 20050525-0946 for the data
> servers and redirector and 20050509-2006 for xrdcp.
>
> The main question is:
> Should xrdcp after a data server goes down ever be redirected to a new
> data server or should it just exit ?

The same behavior applies to all kinds of requests: if an error occurs, the client goes back to the redirector and asks again what to do. So, xrdcp is not supposed to exit.

> That's what I assumed but as described in the two tests below the current
> behavior is different.
>
>
> First Test:
> ===========
>
> For the first test writing is done via the redirector and once xrdcp
> is redirected to the data server and starts transferring data the
> xrootd on the data server is stopped.
>
> I observed the following:
> 1) xrdcp goes back to the redirector after the data server is stopped.
> 2) the redirector tells the client to wait, which adds up to about 9 mins
> 3) After waiting for 9 mins the client tries to open the file on the
>    redirector and
>    a) crashes with a core
>       or
>    b) tries to write, doesn't succeed and stops (no core)
>
> The log files are available in:
> /nfs/objyserv01/objy/databases/wilko/xrootd/problems/xrdcp_writeServerGoesDown
>
>   xrdcp.log               : output of xrdcp
>   rdr_xrdlog.log          : xrootd log file of the redirector (datadevsol12)
>   datadevsol02_xrdlog.log : xrootd log file of the data server
>   core.3067               : core file from xrootd (the binary is
>                             ~wilko/bbtest/xrootd/20050509-2006/bin/xrdcp on RHEL3)
>
>
> In the case that xrdcp doesn't create a core the log files are:
>   xrdcp_noCore.log
>   rdr_xrdlog_noCore.log
>   datadevsol02_xrdlog_noCore.log
>
>
> In the case of writing, should xrdcp go back to the redirector, or should
> it just wait for a certain time, and if the data server isn't available
> just quit ?
>
>

I am going to have a look at those.
Unfortunately today I have lessons up to late afternoon.

> Second Test:
> ===========
>
> For the second test the first test is run twice with a restart of
> one of the data servers in between.
> Assuming there are two data servers, DS0 and DS1, the following is done:
>
> 1) xrdcp a new file via the redirector
> 2) xrdcp is redirected to DS0, and starts to transfer the file
> 3) xrootd on DS0 is stopped, xrdcp waits for 9 mins and then exits
>    leaving an incomplete copy of the file on DS0
> 4) xrdcp is restarted again with the same file name
> 5) xrdcp is redirected to DS1 (DS0 is still down), and starts to transfer
>    the file
> 6) xrootd on DS1 is stopped. xrdcp goes back to the redirector
> 7) xrootd on DS0 is restarted
> 8) xrdcp is redirected to DS0 and continues to transfer the file.
>    It doesn't start over writing the file but it continues where it
>    stopped in step 6, which means the file could be corrupted (e.g.:
>    in step 2 the first 100MB are transferred and in step 6 it starts
>    at an offset of 150MB).
>
>
> The question again is:
> Shouldn't xrdcp just give up after it failed to write the file?
>
> If it gets redirected after a data server goes down, shouldn't
> it start to write the file from the beginning?
>
> I couldn't test the latest version of xrdcp (session id problem) and I
> don't know if the latest version behaves the same way.
>
>

The latest version is supposed to behave the same way, except for the session id problem, which I hope to be able to fix quickly. The problem is that the system exposes something like a "relaxed" filesystem, where you cannot expect total adherence to what is commonly considered "file system semantics".

I will have a look and fix the problems I find as soon as possible.

Fabrizio
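
The corruption risk described in step 8 of the second test can be illustrated with a small simulation. This is only a sketch in Python, not xrootd or xrdcp code: the helper names `write_at` and `resume_copy`, the sizes (scaled down from MB to bytes), and the zero-filled gap are all assumptions made for illustration.

```python
def write_at(stored: bytearray, offset: int, data: bytes) -> None:
    """Write data at the given offset, zero-filling any gap before it
    (assumed here to mimic a write past the end of a partial file)."""
    if offset > len(stored):
        stored.extend(b"\0" * (offset - len(stored)))
    stored[offset:offset + len(data)] = data


def resume_copy(source: bytes, stored: bytearray, start: int, chunk: int = 10) -> None:
    """Copy source into stored in fixed-size chunks, beginning at offset start."""
    for off in range(start, len(source), chunk):
        write_at(stored, off, source[off:off + chunk])


# 200 non-zero bytes stand in for the file being copied.
source = bytes(range(1, 201))

# Steps 2-3: DS0 received the first 100 bytes before it was stopped.
ds0_resumed = bytearray(source[:100])
# Step 8: the client continues on DS0 at the offset it had reached on DS1 (150).
resume_copy(source, ds0_resumed, start=150)

# Alternative: the client starts over from offset 0 after the redirect.
ds0_restarted = bytearray(source[:100])
resume_copy(source, ds0_restarted, start=0)

resumed_is_intact = bytes(ds0_resumed) == source
restarted_is_intact = bytes(ds0_restarted) == source
gap = bytes(ds0_resumed[100:150])
```

In this model the resumed copy has the right length but bytes 100 to 149 were never written on DS0, so the file differs from the source, while the copy restarted from offset 0 matches it.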