Hi Wilko,
all this does make sense. Since that version I had to introduce the
concept of "transaction timeout", different from the "request timeout".
The historical reason had to do with some deployments which were simply
delaying the client forever, with continuous kxr_wait responses in
smaller steps.
The meaning of the transaction timeout is that there is an overall
maximum time that the client is given to finish the activity, no matter
how many protocol requests are involved.
Probably at the time (~1 year ago) I chose an unfortunate default
value (300s), good for disk access but not very good for tapes. It's
just sufficient to override it with the value you prefer, e.g. 1 hour.
With xrdcp you can just use -DITransactionTimeout 3600
Hope this helps
Fabrizio
Wilko Kroeger ha scritto:
>
> Hello
>
> If the xrd client opens a file that has to be staged to disk and the
> staging takes a long time (> ~300s) the client aborts after the open
> request timed out (~300s). It does not wait for the server requested
> wait and a retry of the open request fails. I see this behavior with
> xrdcp or any of the recent (5.26/27 root releases).
>
> The debug output using xrdcp (build from xrootd head) looks like:
>
> 101006 14:39:43 22811 Xrd: SendGenCommand: Sending command Open
> 101006 14:39:43 22811 Xrd: ReadPartialAnswer: Reading a XrdClientMessage
> from the server [sysdev4500.slac.stanford.edu:2299]...
> 101006 14:39:43 22811 Xrd: ReadPartialAnswer: Server
> [sysdev4500.slac.stanford.edu:2299] answered [kXR_waitresp] (4006)
> 101006 14:39:43 22811 Xrd: WaitResp: Waiting response for 1800 secs.
> 101006 14:44:53 22811 Xrd: WaitResp: Timeout elapsed.
> 101006 14:44:53 22811 Xrd: SendGenCommand: Sending command Open
> 101006 14:44:53 22811 Xrd: ReadPartialAnswer: Reading a XrdClientMessage
> from the server [sysdev4500.slac.stanford.edu:2299]...
> 101006 14:44:53 22811 Xrd: ReadPartialAnswer: Server
> [sysdev4500.slac.stanford.edu:2299] answered [kXR_waitresp] (4006)
> 101006 14:44:53 22811 Xrd: SendGenCommand: Max time limit elapsed for
> request kXR_open. Aborting command.
>
> The client sends an open request and receives a wait response (1800s).
> The client waits but after 300s (RequestTimeout) it times out. This
> causes the client to re-issue the open command and the server responds
> with another wait but the client aborts because it exceeded the time limit.
>
> It looks to me that there are two issues in
> XrdClientConn::SendGenCommand().
>
> 1) Once the client entered XrdClientConn::SendGenCommand() it will issue
> the open request and if it receives the kXR_waitresp it will call
>
> if (!WaitResp(mw)) { (line ~582)
> in this case mw is 1800.
> WaitResp() will wait as long as the timeouts from the kXR_waitresp
> (1800s) and the timeout of the open request (300s) haven't expired (in
> WaitResp():
>
> if ((timenow < timelimit) && !IsOpTimeLimitElapsed(timenow)) {
> time_t tt = xrdmin(timelimit - timenow, 10);
> fREQWaitResp->Wait(tt);
> So even if the server requests a 1800s wait the client will wait only
> 300s (RequestTimeout) from IsOpTimeLimitElapsed(timenow).
>
> 2) If the client didn't get any response from the server while waiting,
> WaitResp(mw) will return true and in XrdClientConn::SendGenCommand()
> it will go back to the beginning of the loop:
> while (!abortcmd && !resp) { (line ~435)
>
> and issues another open request. It will then get to
> if (IsOpTimeLimitElapsed(time(0))) { (line 485)
> Error("SendGenCommand", "Max time limit elapsed for request " <<
> convertRequestIdToChar(req->header.requestid) << ".
> Aborting command.");
>
> which already returned true earlier in the WaitResp(mw) and cause the
> client to abort.
> IsOpTimeLimitElapsed(timenow) just returns:
> return (timenow > fOpTimeLimit);
>
> and fOpTimeLimit is set to the time limit from the first open request.
> No new timeout limit was set when the second open request was issued.
>
> I hope this makes sense.
>
> Cheers,
> Wilko
>
>
>
>
>
>
>
>
>
>
>
>
|