In src/XrdCns/XrdCnsDaemon.cc, XrdCnsDaemon::getEvents() has a rather
glaring bug that I guess must be of the "almost never happens" sort,
since it is very obvious when it happens.  If parsing fails for a CNS
event, the "Miss" error flag is never cleared, so every subsequent
event fails.  The fix is trivial:

--- XrdCnsDaemon.cc.orig        2008-08-19 22:36:57.000000000 -0400
+++ XrdCnsDaemon.cc     2009-07-27 11:13:35.000000000 -0400
@@ -128,6 +128,7 @@
 
          if (Miss) {XrdLog.Emsg("doEvents", Miss, "missing in event", eP);
                     evP->Recycle();
+                    Miss = 0;
                     continue;
                    }
 

We did manage to trip this while load testing a simple setup with two
data servers and one cluster master (with xrootdfs and BestMan).
While under heavy load, an SRM transfer failed and the partial
transfer was deleted via SRM.  This apparently hit a race condition
that resulted in a malformed CNS closew message:

090721 18:07:28 7152 XrdCnsdoEvents: Event: 'uscms01.17671:104@osg-se create 644 /xrootd/path/cms/store/mc/Summer08/WJets-madgraph/USER/IDEAL_V9_PAT_v4/0007/FC20368F-B7EA-DD11-912A-001BFCDBD1BA.root?'
[...]
090721 18:58:52 7152 XrdCnsdoEvents: Event: 'uscms01.17671:104@osg-se closew  68464080'
090721 18:58:52 7152 XrdCnsdoEvents: size missing in event closew

(note the empty filename in the closew event).  Due to the error flag
bug, every subsequent CNS message to that node also failed with a
"size missing" error.
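
For anyone who wants to see the failure mode in isolation, here is a
minimal standalone sketch of a sticky error flag.  It is not the XrdCns
code: Event, missingField() and countAccepted() are hypothetical names,
and the "<client> <type> <path> <size>" layout is a simplifying
assumption modeled on the closew line above.  With the path empty, the
size gets read as the path and the size field comes up missing, and
without the reset every later event is rejected as well.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed event; not the real XrdCns type.
struct Event {
    std::string client, type, path, size;
};

// Returns the name of the first missing field, or nullptr if the event
// is complete.  Assumed layout: "<client> <type> <path> <size>".
const char *missingField(const std::string &line, Event &ev) {
    std::istringstream in(line);
    ev = Event{};
    in >> ev.client >> ev.type >> ev.path >> ev.size;
    if (ev.type.empty()) return "type";
    if (ev.path.empty()) return "path";
    if (ev.size.empty()) return "size";  // an empty path shifts size into path
    return nullptr;
}

// Counts the events accepted by a loop whose error flag lives outside
// the loop body.  With resetFlag == false the flag is never cleared, so
// one malformed event causes every later event to be skipped too.
int countAccepted(const std::vector<std::string> &lines, bool resetFlag) {
    const char *miss = nullptr;  // sticky flag, declared outside the loop
    int accepted = 0;
    for (const std::string &line : lines) {
        Event ev;
        if (const char *m = missingField(line, ev)) miss = m;
        if (miss) {
            if (resetFlag) miss = nullptr;  // analogue of the "Miss = 0;" fix
            continue;
        }
        ++accepted;
    }
    return accepted;
}
```

Feeding this one good event, one event with an empty path, and another
good event accepts only the first event when the flag is left alone,
and both good events once the flag is reset before the continue.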

After restarting the affected xrootd, we saw some inconsistencies
affecting all the files with failed events--the affected files weren't
listable, didn't exist, couldn't be created or deleted via SRM, but
were still associated with the particular data server.  Attempting to
write to one of the affected paths would result in a CNS "trunc" to
the data server, which would fail because the file wasn't actually
there.  Deleting the files via the rootd protocol cleaned this up.

-dan