Hi,

 Well, I don't know this. From this it seems that the problem comes from
the root-native client. I'll try to reproduce it anyway, even though the
log shown includes only the error messages and not the activity. If I
find something wrong I'll try to fix it, but I have no idea about the
best way to propagate a ROOT fix to the BABAR sw release.

 But, since that client version is quite old, I am a bit puzzled about
not having seen other similar reports.

Fabrizio

On Tue, 2005-10-11 at 17:42 -0700, Andy Hanushevsky wrote:
> Hi Andreas,
> 
> Don't know. Fabrizio?
> 
> Andy
> 
> ----- Original Message ----- 
> From: "Andreas Petzold" <[log in to unmask]>
> To: "Andy Hanushevsky" <[log in to unmask]>
> Cc: "Gregory Schott" <[log in to unmask]>; <[log in to unmask]>; "xrootd 
> mailing list" <[log in to unmask]>
> Sent: Tuesday, October 11, 2005 4:09 PM
> Subject: Re: your mail
> 
> 
> > Hi,
> >
> > Andy Hanushevsky wrote:
> >> Hi Gregory,
> >>
> >> This is a client problem. You are right, you should have been able to 
> >> restart the server with no problems. Fabrizio, do you see what happened 
> >> here? The file was opened, the server was restarted, the connection was 
> >> remade to that server, but the file was not re-opened. Instead, the 
> >> original file handle was used for the read. Apparently, there is a small 
> >> timing window where that could happen and that causes the job to crash. 
> >> Two solutions: a) (the better one) close the timing window, b) (the 
> >> sloppier one) re-open the file if you get that particular error.
> >
> > Hmm, does that mean we don't have a chance of fixing this for the current 
> > BABAR sw releases?
> >
> > Cheers,
> >
> > Andreas
> >
> >>
> >> Andy
> >>
> >> ----- Original Message ----- From: "Gregory Schott" <[log in to unmask]>
> >> To: "Miriam Fritsch" <[log in to unmask]>
> >> Cc: "xrootd mailing list" <[log in to unmask]>; "SkimSOS" 
> >> <[log in to unmask]>
> >> Sent: Tuesday, October 11, 2005 11:14 AM
> >> Subject: Re: your mail
> >>
> >>
> >>> Hello Miriam,
> >>>
> >>> OK. This was at the time one of the servers was restarted (it was offline 
> >>> for just a second or two). Andreas thought that in this case the processes 
> >>> currently reading would reconnect to the redirector for re-assignment 
> >>> of a dataserver. Apparently the job crashes instead.
> >>>
> >>> I am forwarding this to the xrootd experts to ask for their opinion. We 
> >>> are using the latest (July) production version and the config files 
> >>> look like:
> >>>
> >>> $ cat config/redirector.cf
> >>> olb.allow host babar2.gridka.de
> >>> olb.allow host f01-014-108.gridka.de
> >>> olb.allow host f01-016-102.gridka.de
> >>> olb.allow host f01-016-101.gridka.de
> >>> olb.allow host f01-014-106.gridka.de
> >>> olb.allow host f01-016-108.gridka.de
> >>> olb.allow host f01-016-109.gridka.de
> >>> olb.allow host f01-016-106.gridka.de
> >>> olb.allow host f01-016-107.gridka.de
> >>> olb.allow host f01-014-103.gridka.de
> >>> olb.allow host f01-014-107.gridka.de
> >>> olb.allow host f01-005-151.gridka.de
> >>> olb.allow host f01-010-110.gridka.de
> >>> olb.allow host f01-005-115.gridka.de
> >>> olb.allow host f01-010-107.gridka.de
> >>> olb.allow host l01-001-122.gridka.de
> >>> olb.port 3121
> >>>
> >>> odc.manager l01-001-122.gridka.de 3121
> >>>
> >>> xrootd.fslib /home/xrootd/software/current/lib/libXrdOfs.so
> >>> xrootd.export /prod
> >>> xrootd.export /store
> >>>
> >>> odc.trace redirect
> >>> ---
> >>> $ cat config/dataserver.cfg
> >>> odc.manager l01-001-122.gridka.de 3121
> >>>
> >>> olb.allow host babar2.gridka.de
> >>> olb.allow host f01-014-108.gridka.de
> >>> olb.allow host f01-016-102.gridka.de
> >>> olb.allow host f01-016-101.gridka.de
> >>> olb.allow host f01-014-106.gridka.de
> >>> olb.allow host f01-016-108.gridka.de
> >>> olb.allow host f01-016-109.gridka.de
> >>> olb.allow host f01-016-106.gridka.de
> >>> olb.allow host f01-016-107.gridka.de
> >>> olb.allow host f01-014-103.gridka.de
> >>> olb.allow host f01-014-107.gridka.de
> >>> olb.allow host f01-005-151.gridka.de
> >>> olb.allow host 10.65.10.110
> >>> olb.allow host f01-010-110.gridka.de
> >>> olb.allow host 10.65.5.115
> >>> olb.allow host f01-005-115.gridka.de
> >>> olb.allow host f01-010-107.gridka.de
> >>> olb.allow host l01-001-122.gridka.de
> >>>
> >>> olb.path r /store
> >>> olb.path w /prod
> >>> olb.port 3121
> >>> olb.sched cpu 100
> >>> olb.subscribe l01-001-122.gridka.de
> >>> olb.wait
> >>>
> >>> ofs.redirect remote if l01-001-122.gridka.de
> >>> ofs.redirect target
> >>>
> >>> oss.alloc * * 80
> >>> oss.fdlimit * max
> >>> oss.localroot /home/xrootd/disk/kanga-export/EventStore/
> >>>
> >>> xrd.protocol xrootd *
> >>> xrootd.async off
> >>> xrootd.export /prod
> >>> xrootd.export /store
> >>> xrootd.fslib /home/xrootd/software/current/lib/libXrdOfs.so
> >>> xrootd.chksum crc32 /home/xrootd/bin/getCRC32.sh
> >>>
> >>> odc.trace redirect
> >>> ---
> >>>
> >>> Did anything also happen at 18:33 or 18:45 when the redirector got 
> >>> reset? In principle nothing happened from your point of view.
> >>>
> >>> Cheers,
> >>>
> >>> -- Gregory
> >>>
> >>>
> >>>
> >>> On Tue, 11 Oct 2005, Miriam Fritsch wrote:
> >>>
> >>>>
> >>>> Hi Gregory,
> >>>>
> >>>> some jobs crash with the following error message:
> >>>>
> >>>> ---------------------------------------------------------------------------
> >>>>
> >>>> 18:21:37.524 EvtCounter: processing event # 12085 [
> >>>> 1d:ffffffff:04ee72/3f73bb1d:V ]
> >>>> 2005-10-11 18:21:37 19228 Err : TXMessage::ReadRaw             - Error
> >>>> reading 8 bytes
> >>>> 2005-10-11 18:21:37 19228 Err : ReadPartialAnswer              - Error
> >>>> reading msg from connmgr (server [f01-010-107.gridka.de:1094]).
> >>>> 18:21:44.575 EvtCounter: processing event # 12086 [
> >>>> 1d:ffffffff:04ee72/3f73be86:J ]
> >>>> 2005-10-11 18:21:44 19228 Err : TXNetFile::ReadBuffer          - Server
> >>>> [f01-010-107.gridka.de:1094] did not return OK message for last request.
> >>>> 2005-10-11 18:21:44 19228 Err : SendGenCommand                 - Server
> >>>> declared error 3004: 'read does not refer to an open file'
> >>>> -- JOB DONE --------------------------------------------------------------
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Miriam
> >>>>
> >>>>
> >>>> *************************************************************************
> >>>>
> >>>> Dr. Miriam Fritsch
> >>>>
> >>>> Institut fuer Experimentalphysik I
> >>>> Ruhr-Universitaet Bochum, Germany               email: 
> >>>> [log in to unmask]
> >>>> c/o SLAC                                        tel:  +1 (650) 926-3565
> >>>> 2575 Sand Hill Road #34                         fax:  +1 (650) 926-3882
> >>>> Menlo Park, CA 94025, USA                       home: +1 (650) 324-2813
> >>>>
> >>>> *************************************************************************
> >>>>
> >>>>
> >>>
> >
> >
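
For illustration, Andy's option (b) above (re-open the file when the server reports that the read does not refer to an open file) could be sketched roughly as follows. This is only a toy model in Python, not the ROOT or xrootd client code: `FakeServer`, `ReopeningFile`, and `ServerError` are hypothetical names invented for this sketch, and the only detail taken from the thread is the error code 3004 seen in Miriam's log.

```python
ERR_FILE_NOT_OPEN = 3004  # error code as it appears in the job log in this thread


class ServerError(Exception):
    """Error reported by the simulated server, carrying a numeric code."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


class FakeServer:
    """Hypothetical stand-in for a data server; NOT the real xrootd API."""

    def __init__(self, files):
        self._files = files      # path -> bytes
        self._handles = {}       # handle -> path
        self._next_handle = 0

    def open(self, path):
        self._next_handle += 1
        self._handles[self._next_handle] = path
        return self._next_handle

    def read(self, handle, offset, length):
        if handle not in self._handles:
            raise ServerError(ERR_FILE_NOT_OPEN,
                              "read does not refer to an open file")
        data = self._files[self._handles[handle]]
        return data[offset:offset + length]

    def restart(self):
        # A restarted server has forgotten every handle it gave out.
        self._handles.clear()


class ReopeningFile:
    """Client-side wrapper: on error 3004, re-open once and retry the read."""

    def __init__(self, server, path):
        self._server = server
        self._path = path
        self._handle = server.open(path)
        self.reopens = 0

    def read(self, offset, length):
        try:
            return self._server.read(self._handle, offset, length)
        except ServerError as err:
            if err.code != ERR_FILE_NOT_OPEN:
                raise  # any other error is passed through unchanged
            # The stale handle was rejected: get a fresh one and retry once.
            self._handle = self._server.open(self._path)
            self.reopens += 1
            return self._server.read(self._handle, offset, length)


if __name__ == "__main__":
    server = FakeServer({"/store/example": b"0123456789"})
    f = ReopeningFile(server, "/store/example")
    print(f.read(0, 4))
    server.restart()
    print(f.read(4, 3), "reopens:", f.reopens)
```

The retry is deliberately limited to a single attempt, so a file that really cannot be re-opened still fails instead of looping. This only masks the timing window; option (a), closing the window in the client, would remain the cleaner fix.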