Hi Pavel,

Then you are lucky. I suppose it is possible for the server to recover,
depending on the server/redirector combination you are running. But do
look for this pattern. Anyway, I am still looking for a gcore taken when
the server thinks it's connected but is not (we couldn't find an actual
case here at INFN). The only problem found was the one I described, and
they resolved it by upgrading.
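
If it helps with spotting the pattern, here is a minimal sketch that counts
the offending lines per host. It assumes every such line looks exactly like
the single sample quoted below; the function name and the regular expression
are my own and not part of olbd.

```python
import re
import sys
from collections import Counter

# Assumption: the line shape is inferred from the one sample line in this
# thread; adjust the expression if your log lines differ.
PATTERN = re.compile(
    r"XrdProtocol: \S*@(\S+) terminated matching protocol not found"
)


def hosts_with_rejected_logins(lines):
    """Return a Counter mapping host name -> number of matching log lines."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Usage: python scan_log.py <path to the redirector log file>
    with open(sys.argv[1], errors="replace") as log:
        for host, n in hosts_with_rejected_logins(log).most_common():
            print(host, n)
```

A host that keeps showing up here long after the cluster has come up is the
one to check.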


On Thu, 22 Mar 2007, Pavel Jakl wrote:

> Hi Andy,
> if I got your question correctly, for BNL/STAR it is preferable to
> upgrade the redirector rather than the data servers on the production
> cluster. We have many things that we would need to test on the new
> production version (Name2Name, prepare etc.) before putting it into
> production.
> From my observation, I don't see any downward spiral there. I see those
> messages when the cluster is coming up, but after a while they disappear
> and the redirector gets normal login requests.
> All data servers run version 20060920 and the redirector runs 20070130.
> If you want, I can try to spot the pattern and make sure that the servers
> that first showed up in the redirector log with an invalid login request
> are then really connected somewhere.
> Pavel
> > Several people have been noticing that at times data servers can no longer
> > communicate with the redirector. After a lot of sleuthing through INFN log
> > files and discussions about which versions of what they are running, I
> > have finally tracked down one (if not *the*) problem. Should you notice
> > the following line in your olbd redirector log file
> >
> > XrdProtocol: ?:15@bbr-datamove30 terminated matching protocol not found
> >
> > (substitute bbr-datamove30 with your own favorite host name) then you've
> > just tripped over the "bug" and it's unlikely that the indicated host will
> > be able to connect to the redirector.
> >
> > The problem was actually introduced several months ago when the olbd was
> > switched to use the same plugin architecture that xrootd uses. This was
> > done in preparation for DPM and Castor2 integration (plus reducing the
> > need to maintain separate but equal classes). The problem is triggered
> > when the redirector becomes unavailable for about 10 minutes from the data
> > server's point of view. When that occurs, the data server makes use of an
> > obscure protocol element that causes the redirector to think that it is
> > getting an invalid login request. So, the redirector terminates the
> > connection. This, of course, makes the redirector even more unavailable
> > and the data server wants to use that protocol element even more. You
> > should already see the downward spiral here.
> >
> > The problem was finally resolved in version 1.42 of (or
> > v20070305-1056 -- March 5th). If you are running a data server created
> > prior to March 5th and a redirector created *after* 2006/04/05 02:28:03
> > then you are potentially sitting on this problem.
> >
> > Currently, I am thinking that the best solution is to upgrade to the
> > soon-to-be production release (based on the forthcoming development
> > release -- in a day or so). At a minimum, all data servers have to change.
> >
> > The alternative is to accommodate the odd protocol element used by the
> > data server. But even that would require upgrading at least the
> > redirector to the latest release. And, frankly, that solution would in
> > a way be a hack (though an understandable one, since it maintains
> > backward compatibility).
> >
> > Does anybody have a preference here?
> >
> > Andy
> >
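
The version rule quoted above (data server created prior to March 5th,
redirector created after 2006/04/05 02:28:03) can be sketched as a small
check. This is only an illustration: the function names are hypothetical, and
it assumes build stamps of the form YYYYMMDD or vYYYYMMDD-HHMM, with a
date-only stamp treated as midnight of that day.

```python
from datetime import datetime

# Cutoffs taken from the message above.
DATA_SERVER_FIXED = datetime(2007, 3, 5)             # "prior to March 5th"
REDIRECTOR_CHANGED = datetime(2006, 4, 5, 2, 28, 3)  # 2006/04/05 02:28:03


def parse_build(stamp):
    """Parse 'YYYYMMDD' or 'YYYYMMDD-HHMM', with an optional leading 'v'.

    Assumption: a date-only stamp is treated as midnight of that day.
    """
    stamp = stamp.lstrip("v")
    fmt = "%Y%m%d-%H%M" if "-" in stamp else "%Y%m%d"
    return datetime.strptime(stamp, fmt)


def potentially_affected(data_server, redirector):
    """True when the pair falls inside the window described above."""
    return (
        parse_build(data_server) < DATA_SERVER_FIXED
        and parse_build(redirector) > REDIRECTOR_CHANGED
    )


if __name__ == "__main__":
    # The combination Pavel reported: data servers 20060920, redirector 20070130.
    print(potentially_affected("20060920", "20070130"))
```

Being inside the window only means the combination can trip over the problem,
not that it has.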