	Hi,

Andy Hanushevsky wrote:
> Hi Gregory,
> 
> This is a client problem. You are right, you should have been able to 
> restart the server with no problems. Fabrizio, do you see what happened 
> here? The file was opened, the server was restarted, the connection was 
> remade to that server, but the file was not re-opened. Instead, the 
> original file handle was used for the read. Apparently, there is a small 
> timing window where that could happen and that causes the job to crash. 
> Two solutions: a) (the better one) close the timing window, b) (the 
> sloppier one) re-open the file if you get that particular error.
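
A minimal sketch of what option (b) could look like, written as a self-contained Python simulation rather than against the real client. Everything here is hypothetical and for illustration only: `FakeServer`, `ReopeningFile`, `ServerError` and `max_reopens` are invented names, not part of xrootd or the ROOT client. The only detail taken from the thread is error 3004, 'read does not refer to an open file'. The idea is that a read which fails with that specific error triggers one re-open followed by a retry, instead of aborting the job.

```python
# Hypothetical sketch of option (b): re-open the file when the server reports
# that the handle is not open. None of these names come from the real client.

ERR_NOT_OPEN = 3004  # code from the log: 'read does not refer to an open file'


class ServerError(Exception):
    """Error reported by the (simulated) server, carrying a numeric code."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


class FakeServer:
    """Stand-in for a data server; a restart invalidates every file handle."""

    def __init__(self, files):
        self.files = files          # path -> bytes
        self.handles = {}           # handle -> path
        self.next_handle = 0

    def open(self, path):
        self.next_handle += 1
        self.handles[self.next_handle] = path
        return self.next_handle

    def read(self, handle, offset, length):
        if handle not in self.handles:
            raise ServerError(ERR_NOT_OPEN, "read does not refer to an open file")
        return self.files[self.handles[handle]][offset:offset + length]

    def restart(self):
        # All open handles are lost, as after a real server restart.
        self.handles.clear()


class ReopeningFile:
    """Client-side file that re-opens itself on ERR_NOT_OPEN and retries."""

    def __init__(self, server, path, max_reopens=1):
        self.server = server
        self.path = path
        self.max_reopens = max_reopens
        self.handle = server.open(path)

    def read(self, offset, length):
        for attempt in range(self.max_reopens + 1):
            try:
                return self.server.read(self.handle, offset, length)
            except ServerError as err:
                # Only the stale-handle error is recoverable this way;
                # anything else, or too many retries, is passed on.
                if err.code != ERR_NOT_OPEN or attempt == self.max_reopens:
                    raise
                self.handle = self.server.open(self.path)


if __name__ == "__main__":
    server = FakeServer({"/store/example": b"0123456789"})
    f = ReopeningFile(server, "/store/example")
    print(f.read(0, 4))
    server.restart()        # the old handle is now stale
    print(f.read(4, 4))     # succeeds after a transparent re-open
```

This only papers over the symptom. Closing the timing window, as in option (a), would avoid sending the stale handle in the first place.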

Hmm, does that mean we don't have a chance of fixing this for the 
current BABAR sw releases?

	Cheers,

		Andreas

> 
> Andy
> 
> ----- Original Message ----- From: "Gregory Schott" <[log in to unmask]>
> To: "Miriam Fritsch" <[log in to unmask]>
> Cc: "xrootd mailing list" <[log in to unmask]>; "SkimSOS" 
> <[log in to unmask]>
> Sent: Tuesday, October 11, 2005 11:14 AM
> Subject: Re: your mail
> 
> 
>> Hello Miriam,
>>
>> OK. This was at the time one of the servers was restarted (it was 
>> offline for just a second or two). Andreas thought that in this case the 
>> currently reading processes would reconnect to the redirector for 
>> re-assignment of a dataserver. Apparently the job crashes instead.
>>
>> I am forwarding to the xrootd experts to ask them for their opinion. 
>> We are using the latest (July) production version and the config files 
>> look like this:
>>
>> $ cat config/redirector.cf
>> olb.allow host babar2.gridka.de
>> olb.allow host f01-014-108.gridka.de
>> olb.allow host f01-016-102.gridka.de
>> olb.allow host f01-016-101.gridka.de
>> olb.allow host f01-014-106.gridka.de
>> olb.allow host f01-016-108.gridka.de
>> olb.allow host f01-016-109.gridka.de
>> olb.allow host f01-016-106.gridka.de
>> olb.allow host f01-016-107.gridka.de
>> olb.allow host f01-014-103.gridka.de
>> olb.allow host f01-014-107.gridka.de
>> olb.allow host f01-005-151.gridka.de
>> olb.allow host f01-010-110.gridka.de
>> olb.allow host f01-005-115.gridka.de
>> olb.allow host f01-010-107.gridka.de
>> olb.allow host l01-001-122.gridka.de
>> olb.port 3121
>>
>> odc.manager l01-001-122.gridka.de 3121
>>
>> xrootd.fslib /home/xrootd/software/current/lib/libXrdOfs.so
>> xrootd.export /prod
>> xrootd.export /store
>>
>> odc.trace redirect
>> ---
>> $ cat config/dataserver.cfg
>> odc.manager l01-001-122.gridka.de 3121
>>
>> olb.allow host babar2.gridka.de
>> olb.allow host f01-014-108.gridka.de
>> olb.allow host f01-016-102.gridka.de
>> olb.allow host f01-016-101.gridka.de
>> olb.allow host f01-014-106.gridka.de
>> olb.allow host f01-016-108.gridka.de
>> olb.allow host f01-016-109.gridka.de
>> olb.allow host f01-016-106.gridka.de
>> olb.allow host f01-016-107.gridka.de
>> olb.allow host f01-014-103.gridka.de
>> olb.allow host f01-014-107.gridka.de
>> olb.allow host f01-005-151.gridka.de
>> olb.allow host 10.65.10.110
>> olb.allow host f01-010-110.gridka.de
>> olb.allow host 10.65.5.115
>> olb.allow host f01-005-115.gridka.de
>> olb.allow host f01-010-107.gridka.de
>> olb.allow host l01-001-122.gridka.de
>>
>> olb.path r /store
>> olb.path w /prod
>> olb.port 3121
>> olb.sched cpu 100
>> olb.subscribe l01-001-122.gridka.de
>> olb.wait
>>
>> ofs.redirect remote if l01-001-122.gridka.de
>> ofs.redirect target
>>
>> oss.alloc * * 80
>> oss.fdlimit * max
>> oss.localroot /home/xrootd/disk/kanga-export/EventStore/
>>
>> xrd.protocol xrootd *
>> xrootd.async off
>> xrootd.export /prod
>> xrootd.export /store
>> xrootd.fslib /home/xrootd/software/current/lib/libXrdOfs.so
>> xrootd.chksum crc32 /home/xrootd/bin/getCRC32.sh
>>
>> odc.trace redirect
>> ---
>>
>> Did anything also happen at 18:33 or 18:45 when the redirector got 
>> reset? In principle nothing should have happened from your point of view.
>>
>> Cheers,
>>
>> -- Gregory
>>
>>
>>
>> On Tue, 11 Oct 2005, Miriam Fritsch wrote:
>>
>>>
>>> Hi Gregory,
>>>
>>> some jobs crash with the following error message:
>>>
>>> --------------------------------------------------------------------------- 
>>>
>>> 18:21:37.524 EvtCounter: processing event # 12085 [
>>> 1d:ffffffff:04ee72/3f73bb1d:V ]
>>> 2005-10-11 18:21:37 19228 Err : TXMessage::ReadRaw             - Error
>>> reading 8 bytes
>>> 2005-10-11 18:21:37 19228 Err : ReadPartialAnswer              - Error
>>> reading msg from connmgr (server [f01-010-107.gridka.de:1094]).
>>> 18:21:44.575 EvtCounter: processing event # 12086 [
>>> 1d:ffffffff:04ee72/3f73be86:J ]
>>> 2005-10-11 18:21:44 19228 Err : TXNetFile::ReadBuffer          - Server
>>> [f01-010-107.gridka.de:1094] did not return OK message for last request.
>>> 2005-10-11 18:21:44 19228 Err : SendGenCommand                 - Server
>>> declared error 3004: 'read does not refer to an open file'
>>> -- JOB DONE 
>>> --------------------------------------------------------------
>>>
>>> Cheers,
>>>
>>> Miriam
>>>
>>>
>>> ************************************************************************* 
>>>
>>>
>>> Dr. Miriam Fritsch
>>>
>>> Institut fuer Experimentalphysik I
>>> Ruhr-Universitaet Bochum, Germany               email: [log in to unmask]
>>> c/o SLAC                                        tel:  +1 (650) 926-3565
>>> 2575 Sand Hill Road #34                         fax:  +1 (650) 926-3882
>>> Menlo Park, CA 94025, USA                       home: +1 (650) 324-2813
>>>
>>> ************************************************************************* 
>>>
>>>
>>>
>>