Hi Andy,

I have spotted some servers that have a problem when the xrootd process
tries to connect to the olbd process. The lsof output shows this:

olbd   14643 starlib 12u unix 0xe029a880 283617437 /var/xrootd/slave/.olb/olbd.admin
olbd   14643 starlib 16u unix 0xcd52b800 283617493 /var/xrootd/slave/.olb/olbd.notes
xrootd 31054 starlib 13u unix 0xedd77480 139733486 /tmp/XROOTD_ADMIN/slave//.xrootd/admin

As you can see, the socket file for the xrootd process is still located
in /tmp, which is periodically purged. I have found the same thing on all
other nodes that have this problem. I guess these servers were not
restarted correctly and didn't pick up the new configuration for
/var/xrootd ....

I will restart those nodes and continue monitoring this issue.

Thanks
Pavel

> Hi Pavel,
>
> Then you are lucky. I suppose it is possible for the server to recover,
> depending on the server/redirector combination you are running. But do
> look for this pattern. Anyway, I am still looking for a gcore taken when
> the server thinks it's connected but it is not (we couldn't find an
> actual case here at INFN). The only problem found was the one I
> described, and they have resolved it by upgrading.
>
> Andy
>
> On Thu, 22 Mar 2007, Pavel Jakl wrote:
>
>> Hi Andy,
>>
>> If I understood your question correctly, for BNL/STAR it is preferable
>> to upgrade the redirector rather than the dataservers on the production
>> cluster. We have many things that we would need to test on the new
>> production version (Name2Name, prepare etc.) before putting it into
>> production.
>>
>> From my observation, I don't see any downward spiral there. I see those
>> messages when the cluster is coming up, but after a while they
>> disappear and the redirector gets normal login requests.
>> All dataservers have the 20060920 version and the redirector has
>> 20070130.
>>
>> If you want, I can try to spot the pattern and make sure that servers
>> that first showed up in the redirector log with an invalid login
>> request are then really connected somewhere.
>>
>> Pavel
>>
>> > Several people have been noticing that at times data servers can no
>> > longer communicate with the redirector. After a lot of sleuthing
>> > through INFN log files and discussions about which versions of what
>> > they are running, I have finally tracked down one (if not *the*)
>> > problem. Should you notice the following line in your olbd
>> > redirector log file
>> >
>> > XrdProtocol: ?:15@bbr-datamove30 terminated matching protocol not found
>> >
>> > (substitute bbr-datamove30 with your own favorite host name) then
>> > you've just tripped over the "bug" and it's unlikely that the
>> > indicated host will be able to connect to the redirector.
>> >
>> > The problem was actually introduced several months ago when the olbd
>> > was switched to use the same plugin architecture that xrootd uses.
>> > This was done in preparation for DPM and Castor2 integration (plus
>> > reducing the need to maintain separate but equal classes). The
>> > problem is triggered when the redirector becomes unavailable for
>> > about 10 minutes from the data server's point of view. When that
>> > occurs, the data server makes use of an obscure protocol element
>> > that causes the redirector to think that it is getting an invalid
>> > login request. So, the redirector terminates the connection. This,
>> > of course, makes the redirector even more unavailable and the data
>> > server wants to use that protocol element even more. You should
>> > already see the downward spiral here.
>> >
>> > The problem was finally resolved in version 1.42 of XrdOlbServer.cc
>> > (or v20070305-1056 -- March 5th). If you are running a data server
>> > created prior to March 5th and a redirector created *after*
>> > 2006/04/05 02:28:03 then you are potentially sitting on this problem.
>> >
>> > Currently, I am thinking that the best solution is to upgrade to the
>> > soon-to-be production release (based on the forthcoming development
>> > release -- in a day or so). At a minimum, all data servers have to
>> > change.
>> >
>> > The alternative is to accommodate the odd protocol element used by
>> > the data server. But even that would require that at least the
>> > redirector be upgraded to the latest release. And, frankly, that
>> > solution would in a way be a hack (though understandably maintaining
>> > backward compatibility).
>> >
>> > Does anybody have a preference here?
>> >
>> > Andy
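
The two checks discussed in this thread (looking for the "matching protocol not found" line in the redirector log, and looking for xrootd/olbd unix sockets that still live under /tmp) could be scripted roughly as follows. This is only a minimal sketch: the function names are hypothetical, the regular expression is built from the single sample log line quoted above, and the column layout is assumed to match the lsof lines quoted above. Real log lines may carry timestamps or other prefixes, which is why the pattern is searched for anywhere in the line.

```python
import re

# Built from the one sample line quoted in the thread; the layout of real
# olbd log lines (timestamps, prefixes) is an assumption.
INVALID_LOGIN = re.compile(
    r"XrdProtocol:\s+\S*@(?P<host>\S+)\s+terminated matching protocol not found"
)


def hosts_with_invalid_login(log_lines):
    """Return the set of host names named in 'matching protocol not found' lines."""
    hosts = set()
    for line in log_lines:
        match = INVALID_LOGIN.search(line)
        if match:
            hosts.add(match.group("host"))
    return hosts


def sockets_under_tmp(lsof_lines):
    """Return (command, pid, path) for unix sockets whose path starts with /tmp/.

    Assumed columns, as in the lsof output quoted in the thread:
    COMMAND PID USER FD TYPE DEVICE NODE NAME
    """
    found = []
    for line in lsof_lines:
        fields = line.split()
        if len(fields) >= 8 and fields[4] == "unix" and fields[-1].startswith("/tmp/"):
            found.append((fields[0], int(fields[1]), fields[-1]))
    return found


if __name__ == "__main__":
    log = ["XrdProtocol: ?:15@bbr-datamove30 terminated matching protocol not found"]
    lsof = [
        "olbd 14643 starlib 12u unix 0xe029a880 283617437 /var/xrootd/slave/.olb/olbd.admin",
        "xrootd 31054 starlib 13u unix 0xedd77480 139733486 /tmp/XROOTD_ADMIN/slave//.xrootd/admin",
    ]
    print(sorted(hosts_with_invalid_login(log)))
    print(sockets_under_tmp(lsof))
```

Feeding the first function the redirector log and the second the lsof output collected from each data server would give a list of hosts to compare, which is one way to check whether a host that logged an invalid login request later ended up connected.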