Hi Gregory,

This is a client problem. You are right, you should have been able to 
restart the server with no problems. Fabrizio, do you see what happened 
here? The file was opened, the server was restarted, and the connection was 
remade to that server, but the file was not re-opened. Instead, the original 
file handle was used for the read. Apparently, there is a small timing 
window where that can happen, and that causes the job to crash. Two 
solutions: a) (the better one) close the timing window, or b) (the sloppier 
one) re-open the file if you get that particular error.
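
For illustration, option b) could look roughly like the sketch below. It is 
Python against a made-up stand-in server, not the real xrootd client code: 
FakeServer, ReopeningFile, ServerError and max_reopens are hypothetical 
names. Only error code 3004 and its message are taken from the log in this 
thread.

```python
# Hypothetical sketch of option b): re-open the file when the server reports
# that the handle is no longer valid. None of this is the real xrootd API.

ERR_FILE_NOT_OPEN = 3004  # from the log: 'read does not refer to an open file'


class ServerError(Exception):
    """Error returned by the (simulated) data server."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


class FakeServer:
    """Stand-in data server that forgets all file handles when restarted."""

    def __init__(self, files):
        self.files = files      # path -> file contents (bytes)
        self.handles = {}       # handle -> path
        self.next_handle = 0

    def open(self, path):
        handle = self.next_handle
        self.next_handle += 1
        self.handles[handle] = path
        return handle

    def read(self, handle, offset, length):
        if handle not in self.handles:
            raise ServerError(ERR_FILE_NOT_OPEN,
                              "read does not refer to an open file")
        return self.files[self.handles[handle]][offset:offset + length]

    def restart(self):
        # A restart drops every open handle, as in the incident above.
        self.handles.clear()


class ReopeningFile:
    """Client-side file that re-opens itself on error 3004 and retries."""

    def __init__(self, server, path):
        self.server = server
        self.path = path
        self.handle = server.open(path)

    def read(self, offset, length, max_reopens=1):
        for attempt in range(max_reopens + 1):
            try:
                return self.server.read(self.handle, offset, length)
            except ServerError as err:
                # Only the stale-handle error is handled; anything else,
                # or running out of re-open attempts, is passed on.
                if err.code != ERR_FILE_NOT_OPEN or attempt == max_reopens:
                    raise
                self.handle = self.server.open(self.path)
```

With this wrapper, a read issued right after a server restart gets a fresh 
handle and is retried once instead of crashing the job. It only hides the 
timing window, which is why a) remains the better fix.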

Andy

----- Original Message ----- 
From: "Gregory Schott" <[log in to unmask]>
To: "Miriam Fritsch" <[log in to unmask]>
Cc: "xrootd mailing list" <[log in to unmask]>; "SkimSOS" 
<[log in to unmask]>
Sent: Tuesday, October 11, 2005 11:14 AM
Subject: Re: your mail


> Hello Miriam,
>
> OK. This was at the time one of the servers was restarted (it went offline 
> for just a second or two). Andreas thought that in this case the currently 
> reading processes would reconnect to the redirector for reassignment of 
> a data server. Apparently it crashes instead.
>
> I am forwarding this to the xrootd experts to ask for their opinion. We 
> are using the latest (July) production version, and the config files look 
> like this:
>
> $ cat config/redirector.cf
> olb.allow host babar2.gridka.de
> olb.allow host f01-014-108.gridka.de
> olb.allow host f01-016-102.gridka.de
> olb.allow host f01-016-101.gridka.de
> olb.allow host f01-014-106.gridka.de
> olb.allow host f01-016-108.gridka.de
> olb.allow host f01-016-109.gridka.de
> olb.allow host f01-016-106.gridka.de
> olb.allow host f01-016-107.gridka.de
> olb.allow host f01-014-103.gridka.de
> olb.allow host f01-014-107.gridka.de
> olb.allow host f01-005-151.gridka.de
> olb.allow host f01-010-110.gridka.de
> olb.allow host f01-005-115.gridka.de
> olb.allow host f01-010-107.gridka.de
> olb.allow host l01-001-122.gridka.de
> olb.port 3121
>
> odc.manager l01-001-122.gridka.de 3121
>
> xrootd.fslib /home/xrootd/software/current/lib/libXrdOfs.so
> xrootd.export /prod
> xrootd.export /store
>
> odc.trace redirect
> ---
> $ cat config/dataserver.cfg
> odc.manager l01-001-122.gridka.de 3121
>
> olb.allow host babar2.gridka.de
> olb.allow host f01-014-108.gridka.de
> olb.allow host f01-016-102.gridka.de
> olb.allow host f01-016-101.gridka.de
> olb.allow host f01-014-106.gridka.de
> olb.allow host f01-016-108.gridka.de
> olb.allow host f01-016-109.gridka.de
> olb.allow host f01-016-106.gridka.de
> olb.allow host f01-016-107.gridka.de
> olb.allow host f01-014-103.gridka.de
> olb.allow host f01-014-107.gridka.de
> olb.allow host f01-005-151.gridka.de
> olb.allow host 10.65.10.110
> olb.allow host f01-010-110.gridka.de
> olb.allow host 10.65.5.115
> olb.allow host f01-005-115.gridka.de
> olb.allow host f01-010-107.gridka.de
> olb.allow host l01-001-122.gridka.de
>
> olb.path r /store
> olb.path w /prod
> olb.port 3121
> olb.sched cpu 100
> olb.subscribe l01-001-122.gridka.de
> olb.wait
>
> ofs.redirect remote if l01-001-122.gridka.de
> ofs.redirect target
>
> oss.alloc * * 80
> oss.fdlimit * max
> oss.localroot /home/xrootd/disk/kanga-export/EventStore/
>
> xrd.protocol xrootd *
> xrootd.async off
> xrootd.export /prod
> xrootd.export /store
> xrootd.fslib /home/xrootd/software/current/lib/libXrdOfs.so
> xrootd.chksum crc32 /home/xrootd/bin/getCRC32.sh
>
> odc.trace redirect
> ---
>
> Did anything also happen at 18:33 or 18:45, when the redirector was reset? 
> In principle, nothing should have happened from your point of view.
>
> Cheers,
>
> -- Gregory
>
>
>
> On Tue, 11 Oct 2005, Miriam Fritsch wrote:
>
>>
>> Hi Gregory,
>>
>> some jobs crash with the following error message:
>>
>> ---------------------------------------------------------------------------
>> 18:21:37.524 EvtCounter: processing event # 12085 [ 1d:ffffffff:04ee72/3f73bb1d:V ]
>> 2005-10-11 18:21:37 19228 Err : TXMessage::ReadRaw             - Error reading 8 bytes
>> 2005-10-11 18:21:37 19228 Err : ReadPartialAnswer              - Error reading msg from connmgr (server [f01-010-107.gridka.de:1094]).
>> 18:21:44.575 EvtCounter: processing event # 12086 [ 1d:ffffffff:04ee72/3f73be86:J ]
>> 2005-10-11 18:21:44 19228 Err : TXNetFile::ReadBuffer          - Server [f01-010-107.gridka.de:1094] did not return OK message for last request.
>> 2005-10-11 18:21:44 19228 Err : SendGenCommand                 - Server declared error 3004: 'read does not refer to an open file'
>> -- JOB DONE --------------------------------------------------------------
>>
>> Cheers,
>>
>> Miriam
>>
>>
>> *************************************************************************
>>
>> Dr. Miriam Fritsch
>>
>> Institut fuer Experimentalphysik I
>> Ruhr-Universitaet Bochum, Germany               email: [log in to unmask]
>> c/o SLAC                                        tel:  +1 (650) 926-3565
>> 2575 Sand Hill Road #34                         fax:  +1 (650) 926-3882
>> Menlo Park, CA 94025, USA                       home: +1 (650) 324-2813
>>
>> *************************************************************************
>>
>>
>