Hello
If the xrd client opens a file that has to be staged to disk and the
staging takes a long time (> ~300s) the client aborts after the open
request timed out (~300s). It does not wait for the server requested wait
and a retry of the open request fails. I see this behavior with xrdcp or
any of the recent (5.26/27 root releases).
The debug output using xrdcp (build from xrootd head) looks like:
101006 14:39:43 22811 Xrd: SendGenCommand: Sending command Open
101006 14:39:43 22811 Xrd: ReadPartialAnswer: Reading a XrdClientMessage from the server [sysdev4500.slac.stanford.edu:2299]...
101006 14:39:43 22811 Xrd: ReadPartialAnswer: Server [sysdev4500.slac.stanford.edu:2299] answered [kXR_waitresp] (4006)
101006 14:39:43 22811 Xrd: WaitResp: Waiting response for 1800 secs.
101006 14:44:53 22811 Xrd: WaitResp: Timeout elapsed.
101006 14:44:53 22811 Xrd: SendGenCommand: Sending command Open
101006 14:44:53 22811 Xrd: ReadPartialAnswer: Reading a XrdClientMessage from the server [sysdev4500.slac.stanford.edu:2299]...
101006 14:44:53 22811 Xrd: ReadPartialAnswer: Server [sysdev4500.slac.stanford.edu:2299] answered [kXR_waitresp] (4006)
101006 14:44:53 22811 Xrd: SendGenCommand: Max time limit elapsed for request kXR_open. Aborting command.
The client sends an open request and receives a wait response (1800s). The
client waits but after 300s (RequestTimeout) it times out. This causes the
client to re-issue the open command and the server responds with another
wait but the client aborts because it exceeded the time limit.
It looks to me that there are two issues in
XrdClientConn::SendGenCommand().
1) Once the client entered XrdClientConn::SendGenCommand() it will issue
the open request and if it receives the kXR_waitresp it will call
if (!WaitResp(mw)) { (line ~582)
in this case mw is 1800.
WaitResp() will wait as long as the timeouts from the kXR_waitresp
(1800s) and the timeout of the open request (300s) haven't expired (in
WaitResp():
if ((timenow < timelimit) && !IsOpTimeLimitElapsed(timenow)) {
time_t tt = xrdmin(timelimit - timenow, 10);
fREQWaitResp->Wait(tt);
So even if the server requests a 1800s wait the client will wait only 300s
(RequestTimeout) from IsOpTimeLimitElapsed(timenow).
2) If the client didn't get any response from the server while waiting,
WaitResp(mw) will return true and in XrdClientConn::SendGenCommand()
it will go back to the beginning of the loop:
while (!abortcmd && !resp) { (line ~435)
and issues another open request. It will then get to
if (IsOpTimeLimitElapsed(time(0))) { (line 485)
Error("SendGenCommand", "Max time limit elapsed for request " <<
convertRequestIdToChar(req->header.requestid) << ". Aborting command.");
which already returned true earlier in the WaitResp(mw) and cause the
client to abort.
IsOpTimeLimitElapsed(timenow) just returns:
return (timenow > fOpTimeLimit);
and fOpTimeLimit is set to the time limit from the first open request. No
new timeout limit was set when the second open request was issued.
I hope this makes sense.
Cheers,
Wilko
|