Several people have been noticing that at times data servers can no longer
communicate with the redirector. After a lot of sleuthing through INFN log
files and discussions about which versions of what they are running, I
have finally tracked down one (if not *the*) problem. Should you notice
the following line in your olbd redirector log file

XrdProtocol: ?:15@bbr-datamove30 terminated matching protocol not found

(substitute bbr-datamove30 with your own favorite host name) then you've
just tripped over the "bug" and it's unlikely that the indicated host will
be able to connect to the redirector.
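
A quick way to check for this is to scan the redirector log for that
message and collect the host names involved. The following is only a
sketch in Python: the regular expression is modeled on the single sample
line above, and the function name rejected_hosts is made up for
illustration, not something shipped with olbd.

```python
import re
from collections import Counter

# Pattern modeled on the one sample line shown above. Real log lines may
# carry a timestamp or other prefix, so we search rather than match.
PATTERN = re.compile(
    r"XrdProtocol: \S*@(?P<host>\S+) terminated matching protocol not found"
)

def rejected_hosts(lines):
    """Count, per host, the log lines that show the rejected-login message."""
    counts = Counter()
    for line in lines:
        found = PATTERN.search(line)
        if found:
            counts[found.group("host")] += 1
    return counts
```

Passing an open log file to rejected_hosts() yields a per-host count, so
a host that keeps reappearing is a likely candidate for the spiral
described below.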

The problem was actually introduced several months ago when the olbd was
switched to use the same plugin architecture that xrootd uses. This was
done in preparation for DPM and Castor2 integration (plus reducing the
need to maintain separate but equal classes). The problem is triggered
when the redirector becomes unavailable for about 10 minutes from the data
server's point of view. When that occurs, the data server makes use of an
obscure protocol element that causes the redirector to think that it is
getting an invalid login request. So, the redirector terminates the
connection. This, of course, makes the redirector even more unavailable
and the data server wants to use that protocol element even more. You
should already see the downward spiral here.

The problem was finally resolved in version 1.42 of XrdOlbServer.cc (or
v20070305-1056 -- March 5th). If you are running a data server created
prior to March 5th, 2007 and a redirector created *after* 2006/04/05 02:28:03
then you are potentially sitting on this problem.
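
The condition above can be written down as a small check. This is a
sketch in Python under stated assumptions: how you obtain the build
timestamps of your daemons is left open, the function name
potentially_affected is made up for illustration, and the fix time is my
reading of the tag v20070305-1056 as 2007/03/05 10:56.

```python
from datetime import datetime

# Redirectors created after this timestamp are the ones at risk
# (timestamp taken from the message above).
REDIRECTOR_CUTOFF = datetime(2006, 4, 5, 2, 28, 3)

# Assumption: the tag v20070305-1056 is read as 2007-03-05 10:56.
FIX_TIME = datetime(2007, 3, 5, 10, 56)

def potentially_affected(data_server_built, redirector_built):
    """True if this data server / redirector pairing can hit the problem."""
    return data_server_built < FIX_TIME and redirector_built > REDIRECTOR_CUTOFF
```

For example, a data server built in January 2007 talking to a redirector
built in June 2006 is flagged, while the same redirector paired with a
data server built after the fix is not.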

Currently, I am thinking that the best solution is to upgrade to the
soon-to-be production release (based on the forthcoming development
release, due in a day or so). At a minimum, all data servers have to change.

The alternative is to accommodate the odd protocol element used by the
data server. But even that would require upgrading at least the
redirector to the latest release. And, frankly, that solution would in a
way be a hack (though understandably one that maintains backward
compatibility).

Does anybody have a preference here?

Andy