Hi,

 In principle I agree. It may be a recently introduced bug, or it may not.

 Which client are you using, Gregory? Is it the one that comes
with ROOT 4?

Fabrizio

On Tue, 2005-10-11 at 14:48 -0700, Andy Hanushevsky wrote:
> Hi Gregory,
> 
> This is a client problem. You are right, you should have been able to 
> restart the server with no problems. Fabrizio, do you see what happened 
> here? The file was opened, the server was restarted, and the connection was 
> remade to that server, but the file was not re-opened. Instead, the original 
> file handle was used for the read. Apparently, there is a small timing 
> window where that can happen, and that causes the job to crash. There are two 
> solutions: a) (the better one) close the timing window, or b) (the sloppier 
> one) re-open the file if you get that particular error.
> 
> Andy
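
Andy's option b) can be sketched roughly as follows. This is only a minimal
illustration in Python, not the actual client code: FakeServer, ReopeningFile
and ServerError are hypothetical names made up for the example, and the only
details taken from this thread are the error code 3004 and its message from
the log further down. The sketch assumes a restarted server forgets all open
handles and that a fresh open on the same path is enough to recover.

```python
# Error code and message as they appear in the log quoted in this thread.
ERR_FILE_NOT_OPEN = 3004


class ServerError(Exception):
    """Error reported by the simulated data server (hypothetical type)."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


class FakeServer:
    """Hypothetical stand-in for a data server; this is not the xrootd API."""

    def __init__(self, files):
        self._files = files      # path -> file contents as bytes
        self._handles = {}       # open handle -> path
        self._next_handle = 0

    def open(self, path):
        self._next_handle += 1
        self._handles[self._next_handle] = path
        return self._next_handle

    def read(self, handle, offset, length):
        if handle not in self._handles:
            raise ServerError(ERR_FILE_NOT_OPEN,
                              "read does not refer to an open file")
        data = self._files[self._handles[handle]]
        return data[offset:offset + length]

    def restart(self):
        # Assumption: a restart drops every handle the server knew about.
        self._handles.clear()


class ReopeningFile:
    """Client-side file that re-opens itself when its handle has gone stale."""

    def __init__(self, server, path, max_reopens=1):
        self._server = server
        self._path = path
        self._max_reopens = max_reopens
        self._handle = server.open(path)
        self.reopen_count = 0

    def read(self, offset, length):
        attempts = 0
        while True:
            try:
                return self._server.read(self._handle, offset, length)
            except ServerError as err:
                # Only the "not an open file" error is recoverable here, and
                # only a bounded number of times per read.
                if err.code != ERR_FILE_NOT_OPEN or attempts >= self._max_reopens:
                    raise
                self._handle = self._server.open(self._path)
                attempts += 1
                self.reopen_count += 1


server = FakeServer({"/store/demo": b"0123456789"})
f = ReopeningFile(server, "/store/demo")
first = f.read(0, 4)      # served through the original handle
server.restart()          # the handle held by f is now stale
second = f.read(4, 4)     # triggers one re-open, then the read succeeds
```

The bound on re-opens is there so that a server which keeps rejecting the
handle produces an error instead of an endless loop. This only papers over
the timing window; option a) would avoid the stale handle in the first place.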
> 
> ----- Original Message ----- 
> From: "Gregory Schott" <[log in to unmask]>
> To: "Miriam Fritsch" <[log in to unmask]>
> Cc: "xrootd mailing list" <[log in to unmask]>; "SkimSOS" 
> <[log in to unmask]>
> Sent: Tuesday, October 11, 2005 11:14 AM
> Subject: Re: your mail
> 
> 
> > Hello Miriam,
> >
> > OK. This was at the time one of the servers was restarted (it went offline 
> > for just a second or two). Andreas thought that in this case the processes 
> > currently reading would reconnect to the redirector for re-assignment of 
> > a dataserver. Apparently they crash instead.
> >
> > I am forwarding this to the xrootd experts to ask for their opinion. We 
> > are using the latest (July) production version, and the config files look 
> > like this:
> >
> > $ cat config/redirector.cf
> > olb.allow host babar2.gridka.de
> > olb.allow host f01-014-108.gridka.de
> > olb.allow host f01-016-102.gridka.de
> > olb.allow host f01-016-101.gridka.de
> > olb.allow host f01-014-106.gridka.de
> > olb.allow host f01-016-108.gridka.de
> > olb.allow host f01-016-109.gridka.de
> > olb.allow host f01-016-106.gridka.de
> > olb.allow host f01-016-107.gridka.de
> > olb.allow host f01-014-103.gridka.de
> > olb.allow host f01-014-107.gridka.de
> > olb.allow host f01-005-151.gridka.de
> > olb.allow host f01-010-110.gridka.de
> > olb.allow host f01-005-115.gridka.de
> > olb.allow host f01-010-107.gridka.de
> > olb.allow host l01-001-122.gridka.de
> > olb.port 3121
> >
> > odc.manager l01-001-122.gridka.de 3121
> >
> > xrootd.fslib /home/xrootd/software/current/lib/libXrdOfs.so
> > xrootd.export /prod
> > xrootd.export /store
> >
> > odc.trace redirect
> > ---
> > $ cat config/dataserver.cfg
> > odc.manager l01-001-122.gridka.de 3121
> >
> > olb.allow host babar2.gridka.de
> > olb.allow host f01-014-108.gridka.de
> > olb.allow host f01-016-102.gridka.de
> > olb.allow host f01-016-101.gridka.de
> > olb.allow host f01-014-106.gridka.de
> > olb.allow host f01-016-108.gridka.de
> > olb.allow host f01-016-109.gridka.de
> > olb.allow host f01-016-106.gridka.de
> > olb.allow host f01-016-107.gridka.de
> > olb.allow host f01-014-103.gridka.de
> > olb.allow host f01-014-107.gridka.de
> > olb.allow host f01-005-151.gridka.de
> > olb.allow host 10.65.10.110
> > olb.allow host f01-010-110.gridka.de
> > olb.allow host 10.65.5.115
> > olb.allow host f01-005-115.gridka.de
> > olb.allow host f01-010-107.gridka.de
> > olb.allow host l01-001-122.gridka.de
> >
> > olb.path r /store
> > olb.path w /prod
> > olb.port 3121
> > olb.sched cpu 100
> > olb.subscribe l01-001-122.gridka.de
> > olb.wait
> >
> > ofs.redirect remote if l01-001-122.gridka.de
> > ofs.redirect target
> >
> > oss.alloc * * 80
> > oss.fdlimit * max
> > oss.localroot /home/xrootd/disk/kanga-export/EventStore/
> >
> > xrd.protocol xrootd *
> > xrootd.async off
> > xrootd.export /prod
> > xrootd.export /store
> > xrootd.fslib /home/xrootd/software/current/lib/libXrdOfs.so
> > xrootd.chksum crc32 /home/xrootd/bin/getCRC32.sh
> >
> > odc.trace redirect
> > ---
> >
> > Did anything also happen at 18:33 or 18:45, when the redirector was reset? 
> > In principle, nothing should have happened from your point of view.
> >
> > Cheers,
> >
> > -- Gregory
> >
> >
> >
> > On Tue, 11 Oct 2005, Miriam Fritsch wrote:
> >
> >>
> >> Hi Gregory,
> >>
> >> some jobs crash with the following error message:
> >>
> >> ---------------------------------------------------------------------------
> >> 18:21:37.524 EvtCounter: processing event # 12085 [
> >> 1d:ffffffff:04ee72/3f73bb1d:V ]
> >> 2005-10-11 18:21:37 19228 Err : TXMessage::ReadRaw             - Error
> >> reading 8 bytes
> >> 2005-10-11 18:21:37 19228 Err : ReadPartialAnswer              - Error
> >> reading msg from connmgr (server [f01-010-107.gridka.de:1094]).
> >> 18:21:44.575 EvtCounter: processing event # 12086 [
> >> 1d:ffffffff:04ee72/3f73be86:J ]
> >> 2005-10-11 18:21:44 19228 Err : TXNetFile::ReadBuffer          - Server
> >> [f01-010-107.gridka.de:1094] did not return OK message for last request.
> >> 2005-10-11 18:21:44 19228 Err : SendGenCommand                 - Server
> >> declared error 3004: 'read does not refer to an open file'
> >> -- JOB DONE --------------------------------------------------------------
> >>
> >> Cheers,
> >>
> >> Miriam
> >>
> >>
> >> *************************************************************************
> >>
> >> Dr. Miriam Fritsch
> >>
> >> Institut fuer Experimentalphysik I
> >> Ruhr-Universitaet Bochum, Germany               email: [log in to unmask]
> >> c/o SLAC                                        tel:  +1 (650) 926-3565
> >> 2575 Sand Hill Road #34                         fax:  +1 (650) 926-3882
> >> Menlo Park, CA 94025, USA                       home: +1 (650) 324-2813
> >>
> >> *************************************************************************
> >>
> >>
> >