On Thu, 17 Feb 2005, Gregory Schott wrote:
> 1) There are two network interfaces on the redirector. The issue is that
> using the wrong one in the KanAccess file (the one that is not the same as
> in the xrootd config files) causes the xrootd (and not olbd) process on
> the redirector to crash.
> The core file reveals:
> > gdb /opt/xrootd/bin/xrootd core.9418
> #0 0x0806349c in XrdNetwork::getHostName ()
Yes, this is a known problem. The xrootd will crash if it cannot do a
reverse translation of IP address to DNS name for its own machine. I
should be more graceful and issue an error message and exit. That's on the
list of corrections.
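As a rough illustration of the graceful behavior described above (report the
failed reverse lookup and exit instead of crashing), here is a minimal Python
sketch. It is not xrootd code: `own_host_name`, its `lookup` parameter and the
exit status are hypothetical names chosen for illustration; only
`socket.gethostbyaddr` is a real standard-library call.

```python
import socket
import sys


def own_host_name(ip, lookup=socket.gethostbyaddr):
    """Return the DNS name for ip, or None if the reverse lookup fails.

    `lookup` is injectable so the failure path can be exercised without
    a real resolver (hypothetical helper, not part of any xrootd API).
    """
    try:
        name, _aliases, _addrs = lookup(ip)
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return None
    return name


def main(ip):
    name = own_host_name(ip)
    if name is None:
        # Fail with a message instead of crashing later on a missing name.
        print(f"cannot translate {ip} to a DNS name; exiting", file=sys.stderr)
        return 1
    print(f"serving as {name}")
    return 0
```

The point of the sketch is only that the lookup result is checked before it is
used, so a host with no reverse DNS entry produces an error message and a
non-zero exit status.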
> 2) Strange disk size in the olb redirector log file. I'm using the sep 04
> prod version.
> 050216 18:49:32 9419 olb_a2i: tot dsk value -1463812096 is too small
> 050216 18:49:32 9419 olb_Server: invalid response from f01-001-118.gridka.de:1094
What release are you using? This was a problem in some early releases
dealing with the way parameters were being passed to a function and how
the compiler treated them. It worked using Sun CC but not in g++.
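For what it is worth, the logged number is consistent with a value being read
as a signed 32-bit integer. The Python sketch below only demonstrates that
arithmetic. It is an assumption that this is what happened here, and the
helper name `as_int32` is made up for the example.

```python
def as_int32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    low = value & 0xFFFFFFFF
    return low - 0x100000000 if low >= 0x80000000 else low


# 2831155200 is the unsigned 32-bit value with the same bit pattern as the
# number in the log line; whether the real quantity was this is an assumption.
print(as_int32(2831155200))  # -1463812096
```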
> 3) It's probably a result of the computing center's instability
> last week, but most of the dataservers are not registering with olbd on the
> redirector. Only the dataservers that I restarted are now registering. The
> olb logfiles of the ones not registering show:
> 050211 00:00:00 20383 olb_Config: (c) 2004 SLAC olbd version 20040907-0403 executing as Server
> 050211 14:33:02 20388 olb_Manager: Manager babar2 appears to be dead.
> 050211 14:35:15 20388 olb_Manager: Manager h^RBh^RB appears to be dead.
> 050211 15:03:05 1143 olb_Config: (c) 2004 SLAC olbd version 20040907-0403 initializing as Server
> 050211 15:03:05 1143 olb_Config: Server initialization completed.
> 050211 15:03:05 1155 olb_Start: Waiting for primary server to login.
> 050211 15:03:06 1157 olb_Admin_Login: Primary server 1142 logged in
> 050211 15:03:06 1143 olb_Server: Logged into babar2
> 050211 16:14:01 1149 olb_Manager: Manager babar2 appears to be dead.
> 050211 16:16:14 1149 olb_Manager: Manager h^RBh^RB appears to be dead.
> 050211 16:18:27 1149 olb_Manager: Manager p^p^ appears to be dead.
> 050211 16:20:40 1149 olb_Manager: Manager p^p^ appears to be dead.
> 050211 16:22:54 1149 olb_Manager: Manager p^p^ appears to be dead.
> 050211 16:25:07 1149 olb_Manager: Manager p^p^ appears to be dead.
> Indeed, babar2 was down that day (11/02/2005) around 16:12. This message
> kept appearing in the log file until I restarted all the dataserver
> processes yesterday.
Other people have complained about this and it appears that the 20040907
release is definitely bad. Please switch to the 200408 release we are
using for BaBar analysis (or try the 200502 development release if you are
adventurous). Also, it appears that the DNS name is getting screwed up (at
least in the messages). Please do a gcore and send me the executable and
core file (or place it in an accessible area).
All of this points to a packaging problem we have. The only way we
really test releases is to create what we call a development release. That,
unfortunately, makes it available to everyone else -- even before we can
certify it as being materially correct. I do know we've had some
development releases that should have never seen the light of day, but
unfortunately the process lets them out. We are trying to get a new
process into place that will *never* cut a release unless we know that it
will actually work on a reasonably sized system.