Hi Dan,
You're right, the potential problem is exceedingly rare. Thanks for the fix!
It will be in the next release.
Andy
----- Original Message -----
From: "Dan Riley" <[log in to unmask]>
To: <[log in to unmask]>
Sent: Monday, July 27, 2009 10:19 AM
Subject: bug in XrdCnsDaemon::getEvents()
> In src/XrdCns/XrdCnsDaemon.cc, XrdCnsDaemon::getEvents() has a rather
> glaring bug that I guess must be of the "almost never happens" sort
> since it is very obvious when it happens. If parsing fails for a CNS
> event, the "Miss" error flag is never cleared, so every subsequent
> event fails. The fix is trivial,
>
> --- XrdCnsDaemon.cc.orig 2008-08-19 22:36:57.000000000 -0400
> +++ XrdCnsDaemon.cc 2009-07-27 11:13:35.000000000 -0400
> @@ -128,6 +128,7 @@
>
>     if (Miss) {XrdLog.Emsg("doEvents", Miss, "missing in event", eP);
>                evP->Recycle();
> +              Miss = 0;
>                continue;
>               }
>
>
> We did manage to trip this while load testing a simple setup with two
> data servers and one cluster master (with xrootdfs and BestMan).
> While under heavy load, an SRM transfer failed and the partial
> transfer was deleted via SRM. This apparently hit a race condition
> that resulted in a malformed CNS closew message:
>
> 090721 18:07:28 7152 XrdCnsdoEvents: Event: 'uscms01.17671:104@osg-se
> create 644
> /xrootd/path/cms/store/mc/Summer08/WJets-madgraph/USER/IDEAL_V9_PAT_v4/0007/FC20368F-B7EA-DD11-912A-001BFCDBD1BA.root?'
> [...]
> 090721 18:58:52 7152 XrdCnsdoEvents: Event: 'uscms01.17671:104@osg-se
> closew 68464080'
> 090721 18:58:52 7152 XrdCnsdoEvents: size missing in event closew
>
> (note the empty filename in the closew event). Due to the error flag
> bug, every subsequent CNS message to that node also failed with a
> "size missing" error.
>
> After restarting the affected xrootd, we saw some inconsistencies
> affecting all the files with failed events--the affected files weren't
> listable, didn't exist, couldn't be created or deleted via SRM, but
> were still associated with the particular data server. Attempting to
> write to one of the affected paths would result in a CNS "trunc" to
> the data server, which would fail because the file wasn't actually
> there. Deleting the files via the rootd protocol cleaned this up.
>
> -dan
>
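
For readers following the thread, the failure mode described above (an error flag set inside a loop and never cleared, so one bad event poisons every later one) can be reproduced in a few lines. This is only a minimal sketch and not the XrdCns code: the Event struct, the countAccepted() function, and the resetFlag parameter are hypothetical names made up for illustration, and only the "Miss"-style flag and the clear-before-continue fix come from the patch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed CNS event; not the real XrdCns type.
struct Event {
    std::string type;
    std::string path;  // empty when the field was missing from the message
};

// Counts how many events get past validation. With resetFlag == false the
// loop mimics the reported bug: once 'miss' is set it is never cleared, so
// every later event is skipped too. With resetFlag == true it mimics the
// patched behavior.
std::size_t countAccepted(const std::vector<Event>& events, bool resetFlag) {
    const char* miss = nullptr;  // names the missing field, like "Miss" in the patch
    std::size_t accepted = 0;
    for (const Event& ev : events) {
        if (ev.path.empty()) miss = "path";
        if (miss) {
            if (resetFlag) miss = nullptr;  // the one-line fix: clear before continue
            continue;
        }
        ++accepted;
    }
    return accepted;
}
```

Given one malformed event followed by well-formed ones, the unpatched variant rejects everything after the malformed event, while the patched variant drops only that one event. Declaring the flag inside the loop body would be another way to get the same effect, assuming nothing outside the loop reads it.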