Hi Dan,

You're right, the potential problem is exceedingly rare. Thanks for the fix!
It will be in the next release.

Andy

----- Original Message -----
From: "Dan Riley" <[log in to unmask]>
To: <[log in to unmask]>
Sent: Monday, July 27, 2009 10:19 AM
Subject: bug in XrdCnsDaemon::getEvents()

> In src/XrdCns/XrdCnsDaemon.cc, XrdCnsDaemon::getEvents() has a rather
> glaring bug that I guess must be of the "almost never happens" sort
> since it is very obvious when it happens.  If parsing fails for a CNS
> event, the "Miss" error flag is never cleared, so every subsequent
> event fails.  The fix is trivial,
>
> --- XrdCnsDaemon.cc.orig 2008-08-19 22:36:57.000000000 -0400
> +++ XrdCnsDaemon.cc 2009-07-27 11:13:35.000000000 -0400
> @@ -128,6 +128,7 @@
>
>        if (Miss) {XrdLog.Emsg("doEvents", Miss, "missing in event", eP);
>                   evP->Recycle();
> +                 Miss = 0;
>                   continue;
>                  }
>
>
> We did manage to trip this while load testing a simple setup with two
> data servers and one cluster master (with xrootdfs and BestMan).
> While under heavy load, an SRM transfer failed and the partial
> transfer was deleted via SRM.  This apparently hit a race condition
> that resulted in a malformed CNS closew message:
>
> 090721 18:07:28 7152 XrdCnsdoEvents: Event: 'uscms01.17671:104@osg-se
> create 644
> /xrootd/path/cms/store/mc/Summer08/WJets-madgraph/USER/IDEAL_V9_PAT_v4/0007/FC20368F-B7EA-DD11-912A-001BFCDBD1BA.root?'
> [...]
> 090721 18:58:52 7152 XrdCnsdoEvents: Event: 'uscms01.17671:104@osg-se
> closew 68464080'
> 090721 18:58:52 7152 XrdCnsdoEvents: size missing in event closew
>
> (note the empty filename in the closew event).  Due to the error flag
> bug, every subsequent CNS message to that node also failed with a
> "size missing" error.
>
> After restarting the affected xrootd, we saw some inconsistencies
> affecting all the files with failed events--the affected files weren't
> listable, didn't exist, couldn't be created or deleted via SRM, but
> were still associated with the particular data server.  Attempting to
> write to one of the affected paths would result in a CNS "trunc" to
> the data server, which would fail because the file wasn't actually
> there.  Deleting the files via the rootd protocol cleaned this up.
>
> -dan
>
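
For anyone reading this thread later, the failure mode described above (an
error flag that is set on a parse failure and never cleared inside the event
loop) can be illustrated with a small standalone sketch. This is not the
XrdCns code: parseEvent(), countRejected() and the simplified
"<type> <path> [<size>]" event layout are hypothetical names and assumptions
made up for the example. Only the idea of resetting the flag before
"continue" is taken from the patch.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical parser for a simplified event line "<type> <path> [<size>]".
// Like the flag in the thread, 'miss' is only ever SET here on failure;
// a successful parse leaves it untouched.
void parseEvent(const std::string &event, const char *&miss)
{
    std::istringstream in(event);
    std::string type, path, size;
    if (!(in >> type)) { miss = "type"; return; }
    if (!(in >> path)) { miss = "path"; return; }
    // Assumption for this sketch: only closew events carry a size field.
    if (type == "closew" && !(in >> size)) miss = "size";
}

// Returns how many events are rejected. With clearFlag == false the flag
// stays set after the first bad event, so every later event is rejected too.
int countRejected(const std::vector<std::string> &events, bool clearFlag)
{
    const char *miss = nullptr;   // plays the role of "Miss"
    int rejected = 0;
    for (const std::string &ev : events) {
        parseEvent(ev, miss);
        if (miss) {
            ++rejected;
            if (clearFlag) miss = nullptr;   // the one-line fix
            continue;
        }
        // A real daemon would apply the well-formed event here.
    }
    return rejected;
}
```

Feeding this the sequence "create /a", "closew 68464080", "create /b",
"closew /b 100" rejects three events when the flag is not cleared and only
the malformed closew when it is, which matches the behaviour reported above.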