Hi Pavel,

Then you are lucky. I suppose it is possible for the server to recover,
depending on the server/redirector combination you are running. But do look
for this pattern. Anyway, I am still looking for a gcore taken when the server
thinks it's connected but it is not (we couldn't find an actual case here at
INFN). The only problem found was the one I described, and they have resolved
it by upgrading.

Andy

On Thu, 22 Mar 2007, Pavel Jakl wrote:

> Hi Andy,
>
> if I got your question correctly, for BNL/STAR it is preferable to
> upgrade the redirector rather than the dataservers on the production
> cluster. We have many things which we would need to test on the new
> production version (Name2Name, prepare etc.) before putting it into
> production.
>
> From my observation, I don't see any downward spiral there. I see those
> messages when the cluster is coming up, but after a while they disappear and
> the redirector gets normal login requests.
> All dataservers have the 20060920 version and the redirector has 20070130.
>
> If you want, I can try to spot the pattern and make sure that servers that
> first showed up in the redirector log with an invalid login request are then
> really connected somewhere.
>
> Pavel
>
> > Several people have been noticing that at times data servers can no longer
> > communicate with the redirector. After a lot of sleuthing through INFN log
> > files and discussions about which versions of what they are running, I
> > have finally tracked down one (if not *the*) problem. Should you notice
> > the following line in your olbd redirector log file
> >
> > XrdProtocol: ?:15@bbr-datamove30 terminated matching protocol not found
> >
> > (substitute bbr-datamove30 with your own favorite host name) then you've
> > just tripped over the "bug" and it's unlikely that the indicated host will
> > be able to connect to the redirector.
> >
> > The problem was actually introduced several months ago when the olbd was
> > switched to use the same plugin architecture that xrootd uses.
> > This was done in preparation for DPM and Castor2 integration (plus
> > reducing the need to maintain separate but equal classes). The problem
> > is triggered when the redirector becomes unavailable for about 10 minutes
> > from the data server's point of view. When that occurs, the data server
> > makes use of an obscure protocol element that causes the redirector to
> > think that it is getting an invalid login request. So, the redirector
> > terminates the connection. This, of course, makes the redirector even
> > more unavailable and the data server wants to use that protocol element
> > even more. You should already see the downward spiral here.
> >
> > The problem was finally resolved in version 1.42 of XrdOlbServer.cc (or
> > v20070305-1056 -- March 5th). If you are running a data server created
> > prior to March 5th and a redirector created *after* 2006/04/05 02:28:03
> > then you are potentially sitting on this problem.
> >
> > Currently, I am thinking that the best solution is to upgrade to the
> > soon-to-be production release (based on the forthcoming development
> > release -- in a day or so). At a minimum, all data servers have to change.
> >
> > The alternative is to accommodate the odd protocol element used by the
> > data server. But even that would require that at least the redirector
> > be upgraded to the latest release. And, frankly, that solution would in
> > a way be a hack (though understandably maintaining backward
> > compatibility).
> >
> > Does anybody have a preference here?
> >
> > Andy
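
For anyone who wants to check their own site against the two criteria in the original message, here is a rough sketch in Python. It is not part of xrootd or olbd. The function names are hypothetical, the regular expression is modelled only on the single log line quoted above (the layout of real olbd log lines is assumed), and version stamps are assumed to start with a YYYYMMDD date, as in "20060920" or "v20070305-1056".

```python
import re
from datetime import datetime

# Modelled on the one log line quoted in the thread; the exact field layout
# of real olbd redirector logs is an assumption.
REJECT_RE = re.compile(
    r"XrdProtocol:\s+\S*@(?P<host>\S+)\s+terminated matching protocol not found"
)

# Dates taken from the message: the fix landed on March 5th 2007, and the
# plugin switch that introduced the problem is dated 2006/04/05 02:28:03.
FIX_DATE = datetime(2007, 3, 5)
PLUGIN_SWITCH = datetime(2006, 4, 5, 2, 28, 3)


def rejected_hosts(log_lines):
    """Count, per host, how often the redirector logged the rejection line."""
    counts = {}
    for line in log_lines:
        match = REJECT_RE.search(line)
        if match:
            host = match.group("host")
            counts[host] = counts.get(host, 0) + 1
    return counts


def build_date(version):
    """Parse the leading YYYYMMDD of a stamp (assumed format); time is midnight."""
    return datetime.strptime(version.lstrip("v")[:8], "%Y%m%d")


def potentially_affected(dataserver_version, redirector_version):
    """True if the pair falls in the window described in the message:
    data server built before the fix, redirector built after the switch."""
    return (build_date(dataserver_version) < FIX_DATE
            and build_date(redirector_version) > PLUGIN_SWITCH)
```

Feeding the redirector log through rejected_hosts(open(path)) gives a per-host count of the rejection line, which is one way to "look for this pattern". Because a date-only stamp is treated as midnight, a redirector built on 2006/04/05 itself is reported as not affected; check that boundary case by hand.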