Hi Dan,

You're right, the potential problem is exceedingly rare. Thanks for the fix! 
It will be in the next release.

Andy

----- Original Message ----- 
From: "Dan Riley" <[log in to unmask]>
To: <[log in to unmask]>
Sent: Monday, July 27, 2009 10:19 AM
Subject: bug in XrdCnsDaemon::getEvents()


> In src/XrdCns/XrdCnsDaemon.cc, XrdCnsDaemon::getEvents() has a rather
> glaring bug that I guess must be of the "almost never happens" sort
> since it is very obvious when it happens.  If parsing fails for a CNS
> event, the "Miss" error flag is never cleared, so every subsequent
> event fails.  The fix is trivial,
>
> --- XrdCnsDaemon.cc.orig        2008-08-19 22:36:57.000000000 -0400
> +++ XrdCnsDaemon.cc     2009-07-27 11:13:35.000000000 -0400
> @@ -128,6 +128,7 @@
>
>          if (Miss) {XrdLog.Emsg("doEvents", Miss, "missing in event", eP);
>                     evP->Recycle();
> +                    Miss = 0;
>                     continue;
>                    }
>
>
> We did manage to trip this while load testing a simple setup with two
> data servers and one cluster master (with xrootdfs and BestMan).
> While under heavy load, an SRM transfer failed and the partial
> transfer was deleted via SRM.  This apparently hit a race condition
> that resulted in a malformed CNS closew message:
>
> 090721 18:07:28 7152 XrdCnsdoEvents: Event: 'uscms01.17671:104@osg-se create 644 /xrootd/path/cms/store/mc/Summer08/WJets-madgraph/USER/IDEAL_V9_PAT_v4/0007/FC20368F-B7EA-DD11-912A-001BFCDBD1BA.root?'
> [...]
> 090721 18:58:52 7152 XrdCnsdoEvents: Event: 'uscms01.17671:104@osg-se closew  68464080'
> 090721 18:58:52 7152 XrdCnsdoEvents: size missing in event closew
>
> (note the empty filename in the closew event).  Due to the error flag
> bug, every subsequent CNS message to that node also failed with a
> "size missing" error.
>
> After restarting the affected xrootd, we saw some inconsistencies
> affecting all the files with failed events--the affected files weren't
> listable, didn't exist, couldn't be created or deleted via SRM, but
> were still associated with the particular data server.  Attempting to
> write to one of the affected paths would result in a CNS "trunc" to
> the data server, which would fail because the file wasn't actually
> there.  Deleting the files via the rootd protocol cleaned this up.
>
> -dan
>
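
For readers following the thread, the sticky-flag failure described above can be reproduced in a minimal sketch. This is not the XrdCns code: processEvents(), the whitespace tokenizer, and the three-field "type path size" event layout are assumptions made only for illustration. The variable miss plays the role of "Miss" in the patch, and resetMiss toggles the one-line fix.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for an event loop with an error flag declared
// outside the loop. Returns the number of events that parsed cleanly.
int processEvents(const std::vector<std::string>& events, bool resetMiss) {
    const char* miss = nullptr;  // plays the role of "Miss" in the patch
    int accepted = 0;

    for (const std::string& ev : events) {
        std::istringstream in(ev);
        std::string type, path, size;
        // Whitespace tokenizing: if the path is empty, the size token is
        // read into 'path' and 'size' stays empty.
        in >> type >> path >> size;

        if (type.empty())      miss = "type";
        else if (path.empty()) miss = "path";
        else if (size.empty()) miss = "size";

        if (miss) {
            std::cerr << miss << " missing in event " << type << '\n';
            if (resetMiss) miss = nullptr;  // the one-line fix
            continue;                       // without the reset, miss stays set
        }
        ++accepted;
    }
    return accepted;
}
```

Given the events "closew /a/b 100", "closew  68464080", "closew /c/d 200" and "closew /e/f 300", the call with resetMiss set to false accepts only the first event, because the flag set by the malformed second event is never cleared. With resetMiss set to true, the three well-formed events are accepted.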