If I understood your question correctly, for BNL/STAR it is preferable to
upgrade the redirector rather than the dataservers on the production cluster.
We have many things that we would need to test on a new production version
(Name2Name, prepare, etc.) before putting it into production.
From my observation, I don't see any downward spiral there. I see those
messages when the cluster is coming up, but after a while they disappear and
the redirector gets normal login requests.
All dataservers run version 20060920 and the redirector runs 20070130.
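
To make the comparison with the dates in the quoted message explicit, here is
a minimal sketch in Python. It assumes version tags are plain YYYYMMDD build
dates, and the function name is hypothetical, not part of xrootd or olbd. The
two cutoff dates are the ones given in the quoted message.

```python
from datetime import date, datetime

# Cutoffs taken from the quoted message; everything else here is illustrative.
FIX_DATE = date(2007, 3, 5)          # data servers built before this lack the fix
BUG_INTRODUCED = date(2006, 4, 5)    # redirectors built after 2006/04/05 02:28:03


def potentially_affected(dataserver_tag: str, redirector_tag: str) -> bool:
    """Return True if the version pair matches the risky combination.

    Tags are assumed to be YYYYMMDD strings. A redirector tagged exactly
    20060405 is treated as affected, since a date-only tag cannot tell us
    whether the build predates 02:28:03 on that day.
    """
    ds = datetime.strptime(dataserver_tag, "%Y%m%d").date()
    rd = datetime.strptime(redirector_tag, "%Y%m%d").date()
    return ds < FIX_DATE and rd >= BUG_INTRODUCED


if __name__ == "__main__":
    # The BNL/STAR combination mentioned above.
    print(potentially_affected("20060920", "20070130"))
```

By that criterion our combination falls in the potentially affected range,
even though I do not observe the spiral in practice.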
If you want, I can try to spot the pattern and verify that servers that
first showed up in the redirector log with an invalid login request do in
fact end up connected somewhere.
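
A first step for spotting the pattern could look like the sketch below. It
only assumes the message text shown in the quoted log line; the function name
is hypothetical, and any timestamp or other prefix on real log lines is simply
ignored by the search.

```python
import re
from collections import Counter

# Matches the message text from the quoted example, e.g.
#   XrdProtocol: ?:15@bbr-datamove30 terminated matching protocol not found
# and captures the host name after the '@'.
REJECT_RE = re.compile(
    r"XrdProtocol: \S*@(?P<host>\S+) terminated matching protocol not found"
)


def rejected_hosts(lines):
    """Count, per host, how many rejected-login lines appear in the log."""
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python rejected_hosts.py <redirector log file>
    with open(sys.argv[1], errors="replace") as log:
        for host, n in rejected_hosts(log).most_common():
            print(host, n)
```

The resulting host list can then be compared against the servers that the
redirector later reports as logged in.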
> Several people have been noticing that at times data servers can no longer
> communicate with the redirector. After a lot of sleuthing through INFN log
> files and discussions about which versions of what they are running, I
> have finally tracked down one (if not *the*) problem. Should you notice
> the following line in your olbd redirector log file
> XrdProtocol: ?:15@bbr-datamove30 terminated matching protocol not found
> (substitute bbr-datamove30 with your own favorite host name) then you've
> just tripped over the "bug" and it's unlikely that the indicated host will
> be able to connect to the redirector.
> The problem was actually introduced several months ago when the olbd was
> switched to use the same plugin architecture that xrootd uses. This was
> done in preparation for DPM and Castor2 integration (plus reducing the
> need to maintain separate but equal classes). The problem is triggered
> when the redirector becomes unavailable for about 10 minutes from the data
> server's point of view. When that occurs, the data server makes use of an
> obscure protocol element that causes the redirector to think that it is
> getting an invalid login request. So, the redirector terminates the
> connection. This, of course, makes the redirector even more unavailable
> and the dataserver wants to use that protocol element even more. You
> should already see the downward spiral here.
> The problem was finally resolved in version 1.42 of XrdOlbServer.cc (or
> v20070305-1056 -- March 5th). If you are running a data server created
> prior to March 5th and a redirector created *after* 2006/04/05 02:28:03
> then you are potentially sitting on this problem.
> Currently, I am thinking that the best solution is to upgrade to the
> soon-to-be production release (based on the forthcoming development release
> -- in a day or so). At a minimum, all data servers have to change.
> The alternative is to accommodate the odd protocol element used by the
> data server. But even that would require that at least the redirector
> would have to get upgraded to the latest release. And, frankly, that
> solution would in a way be a hack (though understandably maintaining
> backward compatibility).
> Does anybody have a preference here?