Hi Andreas,

Don't know. Fabrizio?

Andy

----- Original Message ----- 
From: "Andreas Petzold" <[log in to unmask]>
To: "Andy Hanushevsky" <[log in to unmask]>
Cc: "Gregory Schott" <[log in to unmask]>; <[log in to unmask]>; "xrootd 
mailing list" <[log in to unmask]>
Sent: Tuesday, October 11, 2005 4:09 PM
Subject: Re: your mail


> Hi,
>
> Andy Hanushevsky wrote:
>> Hi Gregory,
>>
>> This is a client problem. You are right, you should have been able to 
>> restart the server with no problems. Fabrizio, do you see what happened 
>> here? The file was opened, the server was restarted, the connection was 
>> remade to that server, but the file was not re-opened. Instead, the 
>> original file handle was used for the read. Apparently, there is a small 
>> timing window where that could happen and that causes the job to crash. 
>> Two solutions: a) (the better one) close the timing window, b) (the 
>> sloppier one) re-open the file if you get that particular error.
>
> hmm, does that mean we don't have a chance of fixing this for the current 
> BABAR sw releases?
>
> Cheers,
>
> Andreas
>
>>
>> Andy
>>
>> ----- Original Message -----
>> From: "Gregory Schott" <[log in to unmask]>
>> To: "Miriam Fritsch" <[log in to unmask]>
>> Cc: "xrootd mailing list" <[log in to unmask]>; "SkimSOS" 
>> <[log in to unmask]>
>> Sent: Tuesday, October 11, 2005 11:14 AM
>> Subject: Re: your mail
>>
>>
>>> Hello Miriam,
>>>
>>> OK. This was at the time one of the servers was restarted (it went 
>>> offline for just a second or two). Andreas thought that in this case the 
>>> currently reading processes would reconnect to the redirector for 
>>> re-assignment of a dataserver. Apparently it crashes instead.
>>>
>>> I am forwarding this to the xrootd experts to ask for their opinion. We 
>>> are using the latest (July) production version and the config files 
>>> look like this:
>>>
>>> $ cat config/redirector.cf
>>> olb.allow host babar2.gridka.de
>>> olb.allow host f01-014-108.gridka.de
>>> olb.allow host f01-016-102.gridka.de
>>> olb.allow host f01-016-101.gridka.de
>>> olb.allow host f01-014-106.gridka.de
>>> olb.allow host f01-016-108.gridka.de
>>> olb.allow host f01-016-109.gridka.de
>>> olb.allow host f01-016-106.gridka.de
>>> olb.allow host f01-016-107.gridka.de
>>> olb.allow host f01-014-103.gridka.de
>>> olb.allow host f01-014-107.gridka.de
>>> olb.allow host f01-005-151.gridka.de
>>> olb.allow host f01-010-110.gridka.de
>>> olb.allow host f01-005-115.gridka.de
>>> olb.allow host f01-010-107.gridka.de
>>> olb.allow host l01-001-122.gridka.de
>>> olb.port 3121
>>>
>>> odc.manager l01-001-122.gridka.de 3121
>>>
>>> xrootd.fslib /home/xrootd/software/current/lib/libXrdOfs.so
>>> xrootd.export /prod
>>> xrootd.export /store
>>>
>>> odc.trace redirect
>>> ---
>>> $ cat config/dataserver.cfg
>>> odc.manager l01-001-122.gridka.de 3121
>>>
>>> olb.allow host babar2.gridka.de
>>> olb.allow host f01-014-108.gridka.de
>>> olb.allow host f01-016-102.gridka.de
>>> olb.allow host f01-016-101.gridka.de
>>> olb.allow host f01-014-106.gridka.de
>>> olb.allow host f01-016-108.gridka.de
>>> olb.allow host f01-016-109.gridka.de
>>> olb.allow host f01-016-106.gridka.de
>>> olb.allow host f01-016-107.gridka.de
>>> olb.allow host f01-014-103.gridka.de
>>> olb.allow host f01-014-107.gridka.de
>>> olb.allow host f01-005-151.gridka.de
>>> olb.allow host 10.65.10.110
>>> olb.allow host f01-010-110.gridka.de
>>> olb.allow host 10.65.5.115
>>> olb.allow host f01-005-115.gridka.de
>>> olb.allow host f01-010-107.gridka.de
>>> olb.allow host l01-001-122.gridka.de
>>>
>>> olb.path r /store
>>> olb.path w /prod
>>> olb.port 3121
>>> olb.sched cpu 100
>>> olb.subscribe l01-001-122.gridka.de
>>> olb.wait
>>>
>>> ofs.redirect remote if l01-001-122.gridka.de
>>> ofs.redirect target
>>>
>>> oss.alloc * * 80
>>> oss.fdlimit * max
>>> oss.localroot /home/xrootd/disk/kanga-export/EventStore/
>>>
>>> xrd.protocol xrootd *
>>> xrootd.async off
>>> xrootd.export /prod
>>> xrootd.export /store
>>> xrootd.fslib /home/xrootd/software/current/lib/libXrdOfs.so
>>> xrootd.chksum crc32 /home/xrootd/bin/getCRC32.sh
>>>
>>> odc.trace redirect
>>> ---
>>>
>>> Did anything also happen at 18:33 or 18:45 when the redirector got 
>>> reset? In principle nothing happened from your point of view.
>>>
>>> Cheers,
>>>
>>> -- Gregory
>>>
>>>
>>>
>>> On Tue, 11 Oct 2005, Miriam Fritsch wrote:
>>>
>>>>
>>>> Hi Gregory,
>>>>
>>>> some jobs crash with the following error message:
>>>>
>>>> ---------------------------------------------------------------------------
>>>>
>>>> 18:21:37.524 EvtCounter: processing event # 12085 [ 1d:ffffffff:04ee72/3f73bb1d:V ]
>>>> 2005-10-11 18:21:37 19228 Err : TXMessage::ReadRaw             - Error reading 8 bytes
>>>> 2005-10-11 18:21:37 19228 Err : ReadPartialAnswer              - Error reading msg from connmgr (server [f01-010-107.gridka.de:1094]).
>>>> 18:21:44.575 EvtCounter: processing event # 12086 [ 1d:ffffffff:04ee72/3f73be86:J ]
>>>> 2005-10-11 18:21:44 19228 Err : TXNetFile::ReadBuffer          - Server [f01-010-107.gridka.de:1094] did not return OK message for last request.
>>>> 2005-10-11 18:21:44 19228 Err : SendGenCommand                 - Server declared error 3004: 'read does not refer to an open file'
>>>> -- JOB DONE --------------------------------------------------------------
>>>>
>>>> Cheers,
>>>>
>>>> Miriam
>>>>
>>>>
>>>> *************************************************************************
>>>>
>>>> Dr. Miriam Fritsch
>>>>
>>>> Institut fuer Experimentalphysik I
>>>> Ruhr-Universitaet Bochum, Germany               email: 
>>>> [log in to unmask]
>>>> c/o SLAC                                        tel:  +1 (650) 926-3565
>>>> 2575 Sand Hill Road #34                         fax:  +1 (650) 926-3882
>>>> Menlo Park, CA 94025, USA                       home: +1 (650) 324-2813
>>>>
>>>> *************************************************************************
>>>>
>>>>
>>>
>
>
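
For illustration only, option (b) from Andy's reply (re-open the file when the server answers with error 3004, 'read does not refer to an open file') could look roughly like the sketch below. This is not the xrootd client code and does not use any xrootd API: FakeServer, ServerError and ReopeningFile are hypothetical names made up for this example, and the path /store/a is invented. Only the error code and message are taken from the job log quoted above.

```python
# Hypothetical sketch: none of these classes belong to xrootd or its client.
FILE_NOT_OPEN = 3004  # error code reported in the job log quoted above


class ServerError(Exception):
    """Error returned by the (simulated) data server."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


class FakeServer:
    """Stand-in for a data server; a restart forgets all open handles."""

    def __init__(self, files):
        self.files = files      # path -> file contents (bytes)
        self.handles = {}       # handle -> path
        self.next_handle = 0

    def open(self, path):
        handle = self.next_handle
        self.next_handle += 1
        self.handles[handle] = path
        return handle

    def read(self, handle, offset, length):
        if handle not in self.handles:
            raise ServerError(FILE_NOT_OPEN,
                              "read does not refer to an open file")
        data = self.files[self.handles[handle]]
        return data[offset:offset + length]

    def restart(self):
        # After a restart the server no longer knows any old handle.
        self.handles.clear()


class ReopeningFile:
    """Client-side wrapper that re-opens the file on a stale handle."""

    def __init__(self, server, path, max_reopens=1):
        self.server = server
        self.path = path
        self.max_reopens = max_reopens
        self.handle = server.open(path)
        self.reopens = 0

    def read(self, offset, length):
        for attempt in range(self.max_reopens + 1):
            try:
                return self.server.read(self.handle, offset, length)
            except ServerError as err:
                # Only the "not an open file" error is retried, and only
                # a bounded number of times; anything else is passed on.
                if err.code != FILE_NOT_OPEN or attempt == self.max_reopens:
                    raise
                self.handle = self.server.open(self.path)
                self.reopens += 1


server = FakeServer({"/store/a": b"0123456789"})
f = ReopeningFile(server, "/store/a")
first = f.read(0, 4)    # normal read with the original handle
server.restart()        # the old handle is now stale
second = f.read(4, 3)   # stale handle, re-open, then retry the read
```

The retry is bounded so that a server which keeps rejecting the handle still surfaces the error to the job instead of looping. As Andy notes, this is the sloppier of the two fixes: it hides the timing window rather than closing it.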