Hi Elvin,

Indeed, what you discovered is the problem. I wonder when that code changed as the open retry has been used for the longest time in sites that do staging. It may be that it was always wrong but the old client was more forgiving than the new one. Anyway, please push your fix.

Andy

From: Elvin Sindrilaru
Sent: Tuesday, October 06, 2015 1:50 AM
To: xrootd/xrootd
Cc: Andrew Hanushevsky
Subject: Re: [xrootd] XrdOfsTPCAuth deadlock (#290)

Hi Andy,

Indeed, with the current HEAD of master the Invalid message problem goes away. At this point the client receives a kXR_wait response, but for 1 second rather than the 0 seconds specified here: https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdCallBack.cc#L141

The client logs look like this:

[2015-10-06 09:45:29.953368 +0200][Dump ][XRootDTransport ] [msg: 0xb8000b70] Expecting 20 bytes of message body
[2015-10-06 09:45:29.953429 +0200][Dump ][AsyncSock ] [lxc2dev6d1.cern.ch:1095 #0.0] Received message header for 0xb8000b70 size: 8
[2015-10-06 09:45:29.953458 +0200][Dump ][AsyncSock ] [lxc2dev6d1.cern.ch:1095 #0.0] Received message 0xb8000b70 of 28 bytes
[2015-10-06 09:45:29.953483 +0200][Dump ][PostMaster ] [lxc2dev6d1.cern.ch:1095 #0] Queuing received message: 0xb8000b70.
[2015-10-06 09:45:29.953572 +0200][Dump ][XRootD ] [lxc2dev6d1.cern.ch:1095] Got an async response to message kXR_open (file: /castor/cern.ch/dev/e/esindril/dir_default/test2G_1.dat?tpc.key=000e6309712f6faf56137c15&tpc.or
[log in to unmask], mode: 00, flags: kXR_open_read kXR_async kXR_retstat ), processing it
[2015-10-06 09:45:29.953646 +0200][Dump ][XRootD ] [lxc2dev6d1.cern.ch:1095] Got kXR_wait response of 1 seconds to message kXR_open (file: /castor/cern.ch/dev/e/esindril/dir_default/test2G_1.dat?tpc.key=000e6309712f6faf56137c15&[log in to unmask], mode: 00, flags: kXR_open_read kXR_async kXR_retstat )

I've looked deeper into this and the problem actually comes from the latest commit. Therefore, instead of using commit f8ec5c6, I used the following patch:

diff --git a/src/XrdOfs/XrdOfsTPCAuth.cc b/src/XrdOfs/XrdOfsTPCAuth.cc
index b1f7f62..2367717 100644
--- a/src/XrdOfs/XrdOfsTPCAuth.cc
+++ b/src/XrdOfs/XrdOfsTPCAuth.cc
@@ -87,7 +87,7 @@ int XrdOfsTPCAuth::Add(XrdOfsTPC::Facts &Args)
{if (aP->Info.cbP)
{aP->expT = expT;
aP->Next = authQ; authQ = aP;
- aP->Info.Reply(SFS_STALL, 0, "", &authMutex);
+ aP->Info.Reply(SFS_OK, 0, "", &authMutex);
return 1;
} else {
authMutex.UnLock();
diff --git a/src/XrdXrootd/XrdXrootdCallBack.cc b/src/XrdXrootd/XrdXrootdCallBack.cc
index aaf300f..6d37572 100644
--- a/src/XrdXrootd/XrdXrootdCallBack.cc
+++ b/src/XrdXrootd/XrdXrootdCallBack.cc
@@ -143,7 +143,7 @@ void XrdXrootdCBJob::DoIt()
// the client to wait zero seconds. Protocol demands a client retry.
//
if (SFS_OK == Result)
- {if (*(cbFunc->Func()) == 'o') cbFunc->sendResp(eInfo, kXR_wait, 0);
+ {if (*(cbFunc->Func()) == 'o') {int rc = 0; cbFunc->sendResp(eInfo, kXR_wait, &rc);}
else {if (*(cbFunc->Func()) == 'x') DoStatx(eInfo);
cbFunc->sendResp(eInfo, kXR_ok, 0, eInfo->getErrText(),
eInfo->getErrTextLen());

I believe this fixes the underlying problem: XrdXrootdCBJob::DoIt has a special code path for async responses to open, and that path is not used if XrdOfsTPCAuth::Add returns SFS_STALL. The Invalid message came from the fact that the XrdXrootdCBJob::sendResp call above was not building the message properly.
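
To illustrate the two hunks of the patch, here is a minimal standalone sketch. It is not the xrootd code: Result, Response, Reply, sendResp, doIt and selfCheck are hypothetical stand-ins, and the sketch assumes that the third argument of the real sendResp is a pointer whose absence (a literal 0) leaves the wait value out of the reply. Under that assumption, passing a pointer to an int holding 0 produces a kXR_wait reply that actually carries "0 seconds", and only an OK result for an open reaches that branch at all.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-ins for SFS_OK / SFS_STALL and for the response codes.
enum Result { RES_OK, RES_STALL };
enum Response { RESP_OK, RESP_WAIT, RESP_OTHER };

struct Reply {
    Response code;
    std::vector<unsigned char> body;  // payload that would follow the header
};

// Assumption: a null `data` pointer means no integer is appended to the body.
Reply sendResp(Response code, const int* data) {
    Reply r{code, {}};
    if (data) {
        std::int32_t v = *data;
        unsigned char buf[sizeof v];
        std::memcpy(buf, &v, sizeof v);
        r.body.assign(buf, buf + sizeof v);
    }
    return r;
}

// Models the branch in DoIt(): only an OK result for an open ('o') takes the
// "wait zero seconds, then retry" path. `passPointer` selects between the
// old call (literal 0) and the patched call (&rc with rc == 0).
Reply doIt(Result result, char func, bool passPointer) {
    if (result == RES_OK) {
        if (func == 'o') {
            int rc = 0;
            return sendResp(RESP_WAIT, passPointer ? &rc : nullptr);
        }
        return sendResp(RESP_OK, nullptr);
    }
    // Any other result (e.g. a stall) bypasses the open-specific path.
    // What happens here in the real server is not modelled.
    return sendResp(RESP_OTHER, nullptr);
}

// Small usage example exercising the cases discussed in the thread.
bool selfCheck() {
    assert(doIt(RES_OK, 'o', false).body.empty());      // old call: no wait value
    assert(doIt(RES_OK, 'o', true).body.size() == 4);   // patched call: 4-byte zero
    assert(doIt(RES_STALL, 'o', true).code != RESP_WAIT);
    return true;
}
```

In this model the first hunk (returning an OK result instead of a stall) is what routes the reply into the open branch, and the second hunk is what gives the wait reply a well-formed body.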

Let me know your thoughts on this and, if it makes sense, I can push it to master.

Thanks,
Elvin


Reply to this email directly or view it on GitHub.





Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-DEV list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1