Hi Andy,

I have spotted some servers that have a problem when the xrootd process
tries to connect to the olbd process.

The lsof output gave this:

olbd      14643 starlib   12u  unix 0xe029a880      283617437 /var/xrootd/slave/.olb/olbd.admin
olbd      14643 starlib   16u  unix 0xcd52b800      283617493 /var/xrootd/slave/.olb/olbd.notes
xrootd    31054 starlib   13u  unix 0xedd77480      139733486 /tmp/XROOTD_ADMIN/slave//.xrootd/admin

As you can see, the socket file for the xrootd process is still located in /tmp,
which is periodically purged. I have found the same on all the other nodes
that have this problem. I guess these servers were not restarted correctly and
didn't pick up the new configuration for /var/xrootd ....
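
The check described above could be automated roughly as follows. This is only a sketch: it assumes each lsof record sits on a single line with eight whitespace-separated columns (as in the output above once the path is joined to its record), and the function name `find_tmp_sockets` and the `/tmp/` prefix argument are illustrative choices, not part of xrootd or lsof.

```python
def find_tmp_sockets(lsof_text, purged_prefix="/tmp/"):
    """Return (command, pid, path) for unix sockets whose path is under purged_prefix.

    Assumes one lsof record per line with the columns
    COMMAND PID USER FD TYPE DEVICE NODE NAME (an assumption about the layout).
    """
    flagged = []
    for line in lsof_text.splitlines():
        # Split into at most 8 fields so the path stays in one piece.
        fields = line.strip().split(None, 7)
        if len(fields) < 8 or fields[4] != "unix":
            continue  # not a complete unix-socket record
        command, pid, path = fields[0], fields[1], fields[7]
        if path.startswith(purged_prefix):
            flagged.append((command, int(pid), path))
    return flagged


# The three records from the lsof output above, one per line.
SAMPLE = """\
olbd      14643 starlib   12u  unix 0xe029a880      283617437 /var/xrootd/slave/.olb/olbd.admin
olbd      14643 starlib   16u  unix 0xcd52b800      283617493 /var/xrootd/slave/.olb/olbd.notes
xrootd    31054 starlib   13u  unix 0xedd77480      139733486 /tmp/XROOTD_ADMIN/slave//.xrootd/admin
"""

if __name__ == "__main__":
    for command, pid, path in find_tmp_sockets(SAMPLE):
        print(f"{command} (pid {pid}) still uses a socket under /tmp: {path}")
```

Run against the sample, only the xrootd record is reported, which matches the nodes that had not picked up the /var/xrootd configuration.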

I will restart those nodes and continue monitoring this issue.
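
For the monitoring part, one possible approach is to scan the olbd redirector log for the "terminated matching protocol not found" message that Andy quotes further down in this thread and count how often each host appears. The sketch below assumes the message looks exactly like the quoted example (possibly preceded by a timestamp); the function name and the second host in the sample are hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "XrdProtocol: ?:15@bbr-datamove30 terminated matching protocol not found".
# Anything before "XrdProtocol:" (such as a timestamp) is ignored; that prefix is an assumption.
PATTERN = re.compile(
    r"XrdProtocol:\s+\S*@(?P<host>\S+)\s+terminated matching protocol not found"
)


def hosts_with_rejected_logins(log_lines):
    """Count, per host, the log lines that show the rejected-login message."""
    counts = Counter()
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts


if __name__ == "__main__":
    sample_log = [
        "XrdProtocol: ?:15@bbr-datamove30 terminated matching protocol not found",
        "XrdProtocol: ?:15@bbr-datamove30 terminated matching protocol not found",
        # Hypothetical second host, for illustration only.
        "XrdProtocol: ?:22@examplehost01 terminated matching protocol not found",
        "some unrelated log line",
    ]
    for host, count in hosts_with_rejected_logins(sample_log).most_common():
        print(f"{host}: {count} rejected login attempt(s)")
```

A host whose count keeps growing after the cluster has come up would be a candidate for the problem; a host that shows up only during startup and then stops would fit the recovering case.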

Thanks
Pavel

> Hi Pavel,
>
> Then you are lucky. I suppose it is possible for the server to recover,
> depending on the server/redirector combination you are running. But do
> look for this pattern. Anyway, still looking for a gcore of when the
> server thinks it's connected but it is not (we couldn't find an actual
> case here at INFN). The only problem found was the one I described and
> they have resolved it by upgrading.
>
> Andy
>
> On Thu, 22 Mar 2007, Pavel Jakl wrote:
>
>> Hi Andy,
>>
>> if I understood your question correctly, for BNL/STAR it is preferable to
>> upgrade the redirector rather than the dataservers on the production cluster.
>> We have many things that we would need to test on a new production version
>> (Name2Name, prepare etc.) before putting it into production.
>>
>> From my observation, I don't see any downward spiral there. I see those
>> messages when the cluster is coming up, but after a while they disappear
>> and the redirector gets normal login requests.
>> All dataservers have version 20060920 and the redirector has 20070130.
>>
>> If you want, I can try to spot the pattern and make sure that servers that
>> first showed up in the redirector log with an invalid login request are
>> then really connected somewhere.
>>
>> Pavel
>>
>> > Several people have been noticing that at times data servers can no longer
>> > communicate with the redirector. After a lot of sleuthing through INFN log
>> > files and discussions about which versions of what they are running, I
>> > have finally tracked down one (if not *the*) problem. Should you notice
>> > the following line in your olbd redirector log file
>> >
>> > XrdProtocol: ?:15@bbr-datamove30 terminated matching protocol not found
>> >
>> > (substitute bbr-datamove30 with your own favorite host name) then you've
>> > just tripped over the "bug" and it's unlikely that the indicated host will
>> > be able to connect to the redirector.
>> >
>> > The problem was actually introduced several months ago when the olbd was
>> > switched to use the same plugin architecture that xrootd uses. This was
>> > done in preparation for DPM and Castor2 integration (plus reducing the
>> > need to maintain separate but equal classes). The problem is triggered
>> > when the redirector becomes unavailable for about 10 minutes from the data
>> > server's point of view. When that occurs, the data server makes use of an
>> > obscure protocol element that causes the redirector to think that it is
>> > getting an invalid login request. So, the redirector terminates the
>> > connection. This, of course, makes the redirector even more unavailable
>> > and the data server wants to use that protocol element even more. You
>> > should already see the downward spiral here.
>> >
>> > The problem was finally resolved in version 1.42 of XrdOlbServer.cc (or
>> > v20070305-1056 -- March 5th). If you are running a data server created
>> > prior to March 5th and a redirector created *after* 2006/04/05 02:28:03
>> > then you are potentially sitting on this problem.
>> >
>> > Currently, I am thinking that the best solution is to upgrade to the
>> > soon-to-be production release (based on the forthcoming development
>> > release -- in a day or so). At a minimum, all data servers have to change.
>> >
>> > The alternative is to accommodate the odd protocol element used by the
>> > data server. But even that would require at least the redirector to be
>> > upgraded to the latest release. And, frankly, that solution would in a
>> > way be a hack (though understandably maintaining backward compatibility).
>> >
>> > Does anybody have a preference here?
>> >
>> > Andy
>> >
>>
>>
>
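
The exposure rule from Andy's original message (a data server built before March 5th combined with a redirector built after 2006/04/05 02:28:03) can be written as a small check. This is a sketch under stated assumptions: build tags are taken to be plain YYYYMMDD strings like the ones quoted in the thread, "March 5th" is read as 2007-03-05 based on the v20070305-1056 tag, and the function names are made up for illustration.

```python
from datetime import datetime

# Cut-offs taken from the thread; the year of "March 5th" is inferred from v20070305-1056.
FIX_DATE = datetime(2007, 3, 5)                 # data servers built before this lack the fix
PLUGIN_SWITCH = datetime(2006, 4, 5, 2, 28, 3)  # redirectors built after this can reject the login


def parse_build(tag):
    """Parse a YYYYMMDD build tag such as '20060920' (assumed tag format)."""
    return datetime.strptime(tag, "%Y%m%d")


def potentially_affected(data_server_tag, redirector_tag):
    """True if the combination falls in the window described in the thread.

    Note: a date-only tag is treated as midnight, so a redirector tagged
    20060405 counts as built *before* the 02:28:03 cut-off.
    """
    return (
        parse_build(data_server_tag) < FIX_DATE
        and parse_build(redirector_tag) > PLUGIN_SWITCH
    )


if __name__ == "__main__":
    # Versions reported for BNL/STAR in the thread.
    print(potentially_affected("20060920", "20070130"))
```

With the versions reported for BNL/STAR (data servers 20060920, redirector 20070130) the check flags the combination as potentially affected, in line with Andy's rule.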