Hi Andy,

So, I have spotted some servers that have a problem when the xrootd process
is trying to connect to the olbd process.

The lsof output gave this:

olbd      14643 starlib   12u  unix 0xe029a880      283617437
olbd      14643 starlib   16u  unix 0xcd52b800      283617493
xrootd    31054 starlib   13u  unix 0xedd77480      139733486

As you can see, the socket file for the xrootd process is still located in
/tmp, which is periodically purged. I have also found this on all other nodes
having this problem. I guess these servers were not restarted correctly and
didn't pick up the new configuration for /var/xrootd ....

I will restart those nodes and continue monitoring this issue.
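
For finding the remaining nodes, a small filter over the lsof output could help. This is only a sketch: it assumes the socket path is the last column of each line, and the two paths in the sample are made up for illustration (they are not the real socket names).

```python
def unix_sockets_under(lsof_lines, prefix="/tmp/"):
    """Return (command, pid, path) for unix sockets whose path starts with prefix."""
    hits = []
    for line in lsof_lines:
        fields = line.split()
        # Expected layout: COMMAND PID USER FD TYPE DEVICE ... NAME
        if len(fields) < 6 or fields[4] != "unix":
            continue
        name = fields[-1]
        if name.startswith(prefix):
            hits.append((fields[0], int(fields[1]), name))
    return hits


# Hypothetical sample: columns as in the listing above, paths invented.
sample = [
    "olbd      14643 starlib   12u  unix 0xe029a880      283617437 /var/xrootd/example.sock",
    "xrootd    31054 starlib   13u  unix 0xedd77480      139733486 /tmp/example.sock",
]
hits = unix_sockets_under(sample)
print(hits)
```

Any process reported here still holds a socket in the purged directory and is a candidate for a restart.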


> Hi Pavel,
> Then you are lucky. I suppose it is possible for the server to recover,
> depending on the server/redirector combination you are running. But do
> look for this pattern. Anyway, still looking for a gcore of when the
> server thinks it's connected but it is not (we couldn't find an actual
> case here at INFN). The only problem found was the one I described and
> they have resolved it by upgrading.
> Andy
> On Thu, 22 Mar 2007, Pavel Jakl wrote:
>> Hi Andy,
>> if I got your question correctly, for BNL/STAR it is preferable to
>> upgrade the redirector rather than the dataservers on the production
>> cluster. We have many things which we would need to test on the new
>> production version (Name2Name, prepare etc.) before putting it into
>> production.
>> From my observation, I don't see any downward spiral there. I see those
>> messages when the cluster is coming up, but after a while they disappear
>> and the redirector gets normal login requests.
>> All dataservers run version 20060920 and the redirector runs 20070130.
>> If you want, I can try to spot the pattern and make sure that servers
>> that first showed up in the redirector log with an invalid login request
>> are then really connected somewhere.
>> Pavel
>> > Several people have been noticing that at times data servers can no
>> > longer communicate with the redirector. After a lot of sleuthing
>> > through INFN log files and discussions about which versions of what
>> > they are running, I have finally tracked down one (if not *the*)
>> > problem. Should you notice the following line in your olbd redirector
>> > log file
>> >
>> > XrdProtocol: ?:15@bbr-datamove30 terminated matching protocol not found
>> >
>> > (substitute bbr-datamove30 with your own favorite host name) then
>> > you've just tripped over the "bug" and it's unlikely that the indicated
>> > host will be able to connect to the redirector.
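
To look for this pattern across a whole redirector log, something like the following might do. It is a sketch only: the regular expression is built from the single line quoted above, and it assumes that other hosts appear in the same "...@host" position.

```python
import re

# Built from the one quoted log line; the "<something>@<host>" layout is an
# assumption for hosts other than bbr-datamove30.
PATTERN = re.compile(
    r"XrdProtocol: \S*@(\S+) terminated matching protocol not found"
)


def hosts_with_rejected_login(log_lines):
    """Count, per host, how often the redirector logged the rejected login."""
    counts = {}
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            host = match.group(1)
            counts[host] = counts.get(host, 0) + 1
    return counts


sample = [
    "XrdProtocol: ?:15@bbr-datamove30 terminated matching protocol not found",
    "some unrelated line",
    "XrdProtocol: ?:15@bbr-datamove30 terminated matching protocol not found",
]
counts = hosts_with_rejected_login(sample)
print(counts)
```

Hosts whose count keeps growing would be the ones to check for an actual connection.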
>> >
>> > The problem was actually introduced several months ago when the olbd
>> > was switched to use the same plugin architecture that xrootd uses.
>> > This was done in preparation for DPM and Castor2 integration (plus
>> > reducing the need to maintain separate but equal classes). The problem
>> > is triggered when the redirector becomes unavailable for about 10
>> > minutes from the data server's point of view. When that occurs, the
>> > data server makes use of an obscure protocol element that causes the
>> > redirector to think that it is getting an invalid login request. So,
>> > the redirector terminates the connection. This, of course, makes the
>> > redirector even more unavailable and the dataserver wants to use that
>> > protocol element even more. You should already see the downward spiral
>> > here.
>> >
>> > The problem was finally resolved in version 1.42 of
>> > (or v20070305-1056 -- March 5th). If you are running a data server
>> > created prior to March 5th and a redirector created *after*
>> > 2006/04/05 02:28:03 then you are potentially sitting on this problem.
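
The rule above can be written down as a quick check. This is a sketch under assumptions: build stamps are taken to look like "20060920" or "v20070305-1056", "March 5th" is read as 2007-03-05 00:00, and the function names are made up for illustration.

```python
from datetime import datetime

# Assumed cut-offs, taken from the dates quoted above.
FIX_DATE = datetime(2007, 3, 5)                # data servers before this lack the fix
REGRESSION_DATE = datetime(2006, 4, 5, 2, 28, 3)  # redirectors after this are strict


def build_date(version):
    """Parse a build stamp such as '20060920' or 'v20070305-1056'."""
    stamp = version.lstrip("v")
    fmt = "%Y%m%d-%H%M" if "-" in stamp else "%Y%m%d"
    return datetime.strptime(stamp, fmt)


def potentially_affected(dataserver_version, redirector_version):
    """True if the pair matches the 'old data server, new redirector' rule."""
    return (build_date(dataserver_version) < FIX_DATE
            and build_date(redirector_version) > REGRESSION_DATE)


# Versions reported for BNL/STAR in this thread.
star = potentially_affected("20060920", "20070130")
print(star)
```

With the BNL/STAR versions from this thread (dataservers 20060920, redirector 20070130) the check comes out true, so that combination falls into the potentially affected range.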
>> >
>> > Currently, I am thinking that the best solution is to upgrade to the
>> > soon to be production release (based on the forthcoming development
>> > release -- in a day or so). At a minimum, all data servers have to
>> > change.
>> >
>> > The alternative is to accommodate the odd protocol element used by the
>> > data server. But even that would require at least the redirector to be
>> > upgraded to the latest release. And, frankly, that solution would in a
>> > way be a hack (though understandably maintaining backward
>> > compatibility).
>> >
>> > Does anybody have a preference here?
>> >
>> > Andy
>> >