Hello Andrew,

>> 1) There are two network interfaces on the redirector. The issue is that
>> using the wrong one in the KanAccess file (the one that is not the same as
>> in the xrootd config files) crashes the xrootd (but not the olbd)
>> process on the redirector.
>>
>> The core file reveals:
>>> gdb /opt/xrootd/bin/xrootd core.9418
>> #0  0x0806349c in XrdNetwork::getHostName ()
> Yes, this is a known problem. The xrootd will crash if it cannot do a
> reverse translation of IP address to DNS name for its own machine. It
> should be more graceful and issue an error message and exit. That's on the
> list of corrections.
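
To see in advance which of the two interfaces lacks a reverse translation, something like the following could be run on the redirector. This is only a minimal sketch, not xrootd's own code: the names reverse_lookup, check_interfaces and the resolver parameter are hypothetical, and the only real call it relies on is Python's socket.gethostbyaddr.

```python
import socket


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the DNS name for `ip`, or None if no reverse translation exists."""
    try:
        hostname, _aliases, _addresses = resolver(ip)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError,
        # so a failed lookup is reported here instead of aborting the program.
        return None
    return hostname


def check_interfaces(ips, resolver=socket.gethostbyaddr):
    """Map each IP address to its DNS name (None when the lookup fails)."""
    return {ip: reverse_lookup(ip, resolver) for ip in ips}
```

Calling check_interfaces with the addresses of both interfaces would show None for any address that has no reverse DNS entry, which is the case that should be kept out of the KanAccess file for now.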
>>
>> 2) Strange disk size in the olb redirector log file. I'm using the sep 04
>> prod version.
>>
>> 050216 18:49:32 9419 olb_a2i: tot dsk value -1463812096 is too small
>> 050216 18:49:32 9419 olb_Server: invalid response from f01-001-118.gridka.de:1094
> What release are you using? This was a problem in some early releases
> dealing with the way parameters were being passed to a function and how
> the compiler treated them. It worked using Sun CC but not in g++.


As I said, it's the September 2004 production version that I'm running. I
think I remember Jean-Yves telling me that with the same release he's not
seeing this problem at IN2P3. He's not running on Linux there, though.
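
As a side note on the number in the log: the value -1463812096 is what a positive quantity looks like when its low 32 bits are read as a signed integer. The sketch below only illustrates that arithmetic. It is an assumption on my part that this is related; it does not reproduce olbd's code or the parameter-passing problem described above, and the helper names as_int32 and as_uint32 are hypothetical.

```python
def as_uint32(value):
    """Return the low 32 bits of `value` as an unsigned integer."""
    return value & 0xFFFFFFFF


def as_int32(value):
    """Reinterpret the low 32 bits of `value` as a signed two's-complement integer."""
    low = value & 0xFFFFFFFF
    return low - 0x100000000 if low >= 0x80000000 else low


# The value reported by olb_a2i in the redirector log.
logged = -1463812096

# The unsigned number that shares the same 32-bit pattern.
unsigned = as_uint32(logged)

# Reading that unsigned number back as a signed 32-bit integer
# gives the logged value again.
round_trip = as_int32(unsigned)
```

Here unsigned comes out as 2831155200, so a disk total of that size would show up as the negative number in the log if it passed through a signed 32-bit integer somewhere along the way.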


>> 3) Probably as a result of the computing center's instability
>> last week, most of the dataservers are not registering with olbd on the
>> redirector. Only the dataservers that I restarted are now registering. The
>> olb logfiles of the ones not registering show:
>>
>> 050211 00:00:00 20383 olb_Config: (c) 2004 SLAC olbd version 20040907-0403 executing as  Server
>> 050211 14:33:02 20388 olb_Manager: Manager babar2 appears to be dead.
>> 050211 14:35:15 20388 olb_Manager: Manager h^RBh^RB appears to be dead.
>> 050211 15:03:05 1143 olb_Config: (c) 2004 SLAC olbd version 20040907-0403 initializing as Server
>> 050211 15:03:05 1143 olb_Config: Server initialization completed.
>> 050211 15:03:05 1155 olb_Start: Waiting for primary server to login.
>> 050211 15:03:06 1157 olb_Admin_Login: Primary server 1142 logged in
>> 050211 15:03:06 1143 olb_Server: Logged into babar2
>> 050211 16:14:01 1149 olb_Manager: Manager babar2 appears to be dead.
>> 050211 16:16:14 1149 olb_Manager: Manager h^RBh^RB appears to be dead.
>> 050211 16:18:27 1149 olb_Manager: Manager p^p^ appears to be dead.
>> 050211 16:20:40 1149 olb_Manager: Manager p^p^ appears to be dead.
>> 050211 16:22:54 1149 olb_Manager: Manager p^p^ appears to be dead.
>> 050211 16:25:07 1149 olb_Manager: Manager p^p^ appears to be dead.
>>
>> Indeed, babar2 was rebooted that day (11/02/2005) around 16:12. This message
>> kept appearing in the log file until I restarted all the dataserver
>> processes yesterday.
> Other people have complained about this and it appears that the 20040907
> release is definitely bad. Please switch to the 200408 release we are
> using for BaBar analysis (or try the 200502 development release if you are
> adventurous). Also, it appears that the DNS name is getting screwed up (at
> least in the messages). Please do a gcore and send me the executable and
> core file (or place them in an accessible area).


One problem with the xrootd page is that I can't find any link to that
version and cannot list the download directory. I'm not sure I want to
install this version, but where can I find it?

How and where should I do this gcore? (On the redirector or dataserver?)
Please also note that since the babar2 reboot I've killed and restarted all
xrootd/olbd processes, so I won't have processes in the state described
above until the next babar2 reboot.


-- Gregory


> All of this points to a packaging problem we have. The only way we
> really test releases is to create what we call a development release. That,
> unfortunately, makes it available to everyone else -- even before we can
> certify it as being materially correct. I do know we've had some
> development releases that should have never seen the light of day, but
> unfortunately the process lets them out. We are trying to get a new
> process in place that will *never* cut a release unless we know that it
> will actually work on a reasonably sized system.
>
> Andy

-------------- Dr. Gregory Schott --------------
  Institut fuer Experimentelle Kernphysik (IEKP)
      Universitaet Karlsruhe - Postfach 3640
            76021 Karlsruhe  (Germany)
             tel.: +49-(0)724782-3537
             fax.: +49-(0)724782-3414
            e-mail: [log in to unmask]
-----------------------------------------------