[I've taken this off the skimsos-hn]

	Hi Fabrizio,

Fabrizio Furano wrote:
> Hi,
> 
>  in principle I agree. Maybe a bug was introduced, maybe not.
> 
>  Which client are you using, Gregory? Are you using the one that
> comes with ROOT 4?

We are using the 18-series (18.2.1b) BABAR releases, so I guess that 
means we are using the client from ROOT 4.03-02.

	Cheers,

		Andreas

> 
> Fabrizio
> 
> On Tue, 2005-10-11 at 14:48 -0700, Andy Hanushevsky wrote:
> 
>>Hi Gregory,
>>
>>This is a client problem. You are right, you should have been able to 
>>restart the server with no problems. Fabrizio, do you see what happened 
>>here? The file was opened, the server was restarted, the connection was 
>>remade to that server, but the file was not re-opened. Instead, the original 
>>file handle was used for the read. Apparently, there is a small timing 
>>window where that could happen, and that causes the job to crash. Two 
>>solutions: a) (the better one) close the timing window, b) (the sloppier one) 
>>re-open the file if you get that particular error.
>>
>>Andy
>>
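
To make option (b) concrete, here is a minimal sketch of "re-open the file if
you get that particular error". It is a toy model, not the xrootd or ROOT
client code: FakeServer, ServerError and ReopeningFile are hypothetical names
invented for illustration, and the only detail taken from the job log is the
error code 3004 ('read does not refer to an open file'). It assumes a server
restart invalidates every open file handle.

```python
# Toy model only: FakeServer, ServerError and ReopeningFile are hypothetical
# names for illustration, not part of the xrootd or ROOT client.

ERR_NOT_OPEN = 3004  # code from the job log: 'read does not refer to an open file'


class ServerError(Exception):
    """Error reported by the (simulated) server, carrying a numeric code."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


class FakeServer:
    """Stand-in for a data server that forgets all file handles on restart."""

    def __init__(self, files):
        self._files = files      # path -> bytes
        self._handles = {}       # handle -> path
        self._next_handle = 0

    def open(self, path):
        handle = self._next_handle
        self._next_handle += 1
        self._handles[handle] = path
        return handle

    def read(self, handle, offset, length):
        if handle not in self._handles:
            raise ServerError(ERR_NOT_OPEN, "read does not refer to an open file")
        data = self._files[self._handles[handle]]
        return data[offset:offset + length]

    def restart(self):
        # A restart drops every open handle; the file contents survive.
        self._handles.clear()


class ReopeningFile:
    """Client-side wrapper: re-open once and retry when the handle is stale."""

    def __init__(self, server, path):
        self._server = server
        self._path = path
        self._handle = server.open(path)
        self.reopen_count = 0

    def read(self, offset, length):
        try:
            return self._server.read(self._handle, offset, length)
        except ServerError as err:
            if err.code != ERR_NOT_OPEN:
                raise  # any other error is passed on unchanged
            # Stale handle: open the file again and retry the read once.
            self._handle = self._server.open(self._path)
            self.reopen_count += 1
            return self._server.read(self._handle, offset, length)
```

With this wrapper, a read issued after `server.restart()` re-opens the file and
succeeds instead of failing with error 3004. It only hides the symptom, though;
the timing window from option (a) is still there.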
>>----- Original Message ----- 
>>From: "Gregory Schott" <[log in to unmask]>
>>To: "Miriam Fritsch" <[log in to unmask]>
>>Cc: "xrootd mailing list" <[log in to unmask]>; "SkimSOS" 
>><[log in to unmask]>
>>Sent: Tuesday, October 11, 2005 11:14 AM
>>Subject: Re: your mail
>>
>>
>>
>>>Hello Miriam,
>>>
>>>OK. This was at the time one of the servers was restarted (it went offline 
>>>for just a second or two). Andreas thought that in this case the currently 
>>>reading processes would reconnect to the redirector for re-assignment of 
>>>a dataserver. Apparently it crashes instead.
>>>
>>>I am forwarding to the xrootd experts to ask them for their opinion. We 
>>>are using the latest (July) production version and the config files look 
>>>like:
>>>
>>>$ cat config/redirector.cf
>>>olb.allow host babar2.gridka.de
>>>olb.allow host f01-014-108.gridka.de
>>>olb.allow host f01-016-102.gridka.de
>>>olb.allow host f01-016-101.gridka.de
>>>olb.allow host f01-014-106.gridka.de
>>>olb.allow host f01-016-108.gridka.de
>>>olb.allow host f01-016-109.gridka.de
>>>olb.allow host f01-016-106.gridka.de
>>>olb.allow host f01-016-107.gridka.de
>>>olb.allow host f01-014-103.gridka.de
>>>olb.allow host f01-014-107.gridka.de
>>>olb.allow host f01-005-151.gridka.de
>>>olb.allow host f01-010-110.gridka.de
>>>olb.allow host f01-005-115.gridka.de
>>>olb.allow host f01-010-107.gridka.de
>>>olb.allow host l01-001-122.gridka.de
>>>olb.port 3121
>>>
>>>odc.manager l01-001-122.gridka.de 3121
>>>
>>>xrootd.fslib /home/xrootd/software/current/lib/libXrdOfs.so
>>>xrootd.export /prod
>>>xrootd.export /store
>>>
>>>odc.trace redirect
>>>---
>>>$ cat config/dataserver.cfg
>>>odc.manager l01-001-122.gridka.de 3121
>>>
>>>olb.allow host babar2.gridka.de
>>>olb.allow host f01-014-108.gridka.de
>>>olb.allow host f01-016-102.gridka.de
>>>olb.allow host f01-016-101.gridka.de
>>>olb.allow host f01-014-106.gridka.de
>>>olb.allow host f01-016-108.gridka.de
>>>olb.allow host f01-016-109.gridka.de
>>>olb.allow host f01-016-106.gridka.de
>>>olb.allow host f01-016-107.gridka.de
>>>olb.allow host f01-014-103.gridka.de
>>>olb.allow host f01-014-107.gridka.de
>>>olb.allow host f01-005-151.gridka.de
>>>olb.allow host 10.65.10.110
>>>olb.allow host f01-010-110.gridka.de
>>>olb.allow host 10.65.5.115
>>>olb.allow host f01-005-115.gridka.de
>>>olb.allow host f01-010-107.gridka.de
>>>olb.allow host l01-001-122.gridka.de
>>>
>>>olb.path r /store
>>>olb.path w /prod
>>>olb.port 3121
>>>olb.sched cpu 100
>>>olb.subscribe l01-001-122.gridka.de
>>>olb.wait
>>>
>>>ofs.redirect remote if l01-001-122.gridka.de
>>>ofs.redirect target
>>>
>>>oss.alloc * * 80
>>>oss.fdlimit * max
>>>oss.localroot /home/xrootd/disk/kanga-export/EventStore/
>>>
>>>xrd.protocol xrootd *
>>>xrootd.async off
>>>xrootd.export /prod
>>>xrootd.export /store
>>>xrootd.fslib /home/xrootd/software/current/lib/libXrdOfs.so
>>>xrootd.chksum crc32 /home/xrootd/bin/getCRC32.sh
>>>
>>>odc.trace redirect
>>>---
>>>
>>>Did anything also happen at 18:33 or 18:45 when the redirector got reset? 
>>>In principle nothing happened from your point of view.
>>>
>>>Cheers,
>>>
>>>-- Gregory
>>>
>>>
>>>
>>>On Tue, 11 Oct 2005, Miriam Fritsch wrote:
>>>
>>>
>>>>Hi Gregory,
>>>>
>>>>some jobs crash with the following error message:
>>>>
>>>>---------------------------------------------------------------------------
>>>>18:21:37.524 EvtCounter: processing event # 12085 [ 1d:ffffffff:04ee72/3f73bb1d:V ]
>>>>2005-10-11 18:21:37 19228 Err : TXMessage::ReadRaw             - Error reading 8 bytes
>>>>2005-10-11 18:21:37 19228 Err : ReadPartialAnswer              - Error reading msg from connmgr (server [f01-010-107.gridka.de:1094]).
>>>>18:21:44.575 EvtCounter: processing event # 12086 [ 1d:ffffffff:04ee72/3f73be86:J ]
>>>>2005-10-11 18:21:44 19228 Err : TXNetFile::ReadBuffer          - Server [f01-010-107.gridka.de:1094] did not return OK message for last request.
>>>>2005-10-11 18:21:44 19228 Err : SendGenCommand                 - Server declared error 3004: 'read does not refer to an open file'
>>>>-- JOB DONE --------------------------------------------------------------
>>>>
>>>>Cheers,
>>>>
>>>>Miriam
>>>>
>>>>
>>>>*************************************************************************
>>>>
>>>>Dr. Miriam Fritsch
>>>>
>>>>Institut fuer Experimentalphysik I
>>>>Ruhr-Universitaet Bochum, Germany               email: [log in to unmask]
>>>>c/o SLAC                                        tel:  +1 (650) 926-3565
>>>>2575 Sand Hill Road #34                         fax:  +1 (650) 926-3882
>>>>Menlo Park, CA 94025, USA                       home: +1 (650) 324-2813
>>>>
>>>>*************************************************************************
>>>>
>>>>
>>>