Hi Andreas,

Don't know. Fabrizio?

Andy

----- Original Message -----
From: "Andreas Petzold" <[log in to unmask]>
To: "Andy Hanushevsky" <[log in to unmask]>
Cc: "Gregory Schott" <[log in to unmask]>; <[log in to unmask]>; "xrootd mailing list" <[log in to unmask]>
Sent: Tuesday, October 11, 2005 4:09 PM
Subject: Re: your mail

> Hi,
>
> Andy Hanushevsky wrote:
>> Hi Gregory,
>>
>> This is a client problem. You are right, you should have been able to
>> restart the server with no problems. Fabrizio, do you see what happened
>> here? The file was opened, the server was restarted, the connection was
>> remade to that server, but the file was not re-opened. Instead, the
>> original file handle was used for the read. Apparently, there is a small
>> timing window where that could happen and that causes the job to crash.
>> Two solutions: a) (the better one) close the timing window, b) (the
>> sloppier one) re-open the file if you get that particular error.
>
> Hmm, does that mean we don't have a chance of fixing this for the current
> BABAR sw releases?
>
> Cheers,
>
> Andreas
>
>>
>> Andy
>>
>> ----- Original Message -----
>> From: "Gregory Schott" <[log in to unmask]>
>> To: "Miriam Fritsch" <[log in to unmask]>
>> Cc: "xrootd mailing list" <[log in to unmask]>; "SkimSOS" <[log in to unmask]>
>> Sent: Tuesday, October 11, 2005 11:14 AM
>> Subject: Re: your mail
>>
>>
>>> Hello Miriam,
>>>
>>> OK. This was at the time one of the servers was restarted (it went offline
>>> for just a second or two). Andreas thought that in this case the currently
>>> reading processes would reconnect to the redirector for re-assignment
>>> of a dataserver. Apparently the job crashes instead.
>>>
>>> I am forwarding to the xrootd experts to ask them for their opinion. We
>>> are using the latest (July) production version and the config files
>>> look like:
>>>
>>> $ cat config/redirector.cf
>>> olb.allow host babar2.gridka.de
>>> olb.allow host f01-014-108.gridka.de
>>> olb.allow host f01-016-102.gridka.de
>>> olb.allow host f01-016-101.gridka.de
>>> olb.allow host f01-014-106.gridka.de
>>> olb.allow host f01-016-108.gridka.de
>>> olb.allow host f01-016-109.gridka.de
>>> olb.allow host f01-016-106.gridka.de
>>> olb.allow host f01-016-107.gridka.de
>>> olb.allow host f01-014-103.gridka.de
>>> olb.allow host f01-014-107.gridka.de
>>> olb.allow host f01-005-151.gridka.de
>>> olb.allow host f01-010-110.gridka.de
>>> olb.allow host f01-005-115.gridka.de
>>> olb.allow host f01-010-107.gridka.de
>>> olb.allow host l01-001-122.gridka.de
>>> olb.port 3121
>>>
>>> odc.manager l01-001-122.gridka.de 3121
>>>
>>> xrootd.fslib /home/xrootd/software/current/lib/libXrdOfs.so
>>> xrootd.export /prod
>>> xrootd.export /store
>>>
>>> odc.trace redirect
>>> ---
>>> $ cat config/dataserver.cfg
>>> odc.manager l01-001-122.gridka.de 3121
>>>
>>> olb.allow host babar2.gridka.de
>>> olb.allow host f01-014-108.gridka.de
>>> olb.allow host f01-016-102.gridka.de
>>> olb.allow host f01-016-101.gridka.de
>>> olb.allow host f01-014-106.gridka.de
>>> olb.allow host f01-016-108.gridka.de
>>> olb.allow host f01-016-109.gridka.de
>>> olb.allow host f01-016-106.gridka.de
>>> olb.allow host f01-016-107.gridka.de
>>> olb.allow host f01-014-103.gridka.de
>>> olb.allow host f01-014-107.gridka.de
>>> olb.allow host f01-005-151.gridka.de
>>> olb.allow host 10.65.10.110
>>> olb.allow host f01-010-110.gridka.de
>>> olb.allow host 10.65.5.115
>>> olb.allow host f01-005-115.gridka.de
>>> olb.allow host f01-010-107.gridka.de
>>> olb.allow host l01-001-122.gridka.de
>>>
>>> olb.path r /store
>>> olb.path w /prod
>>> olb.port 3121
>>> olb.sched cpu 100
>>> olb.subscribe l01-001-122.gridka.de
>>> olb.wait
>>>
>>> ofs.redirect remote if l01-001-122.gridka.de
>>> ofs.redirect target
>>>
>>> oss.alloc * * 80
>>> oss.fdlimit * max
>>> oss.localroot /home/xrootd/disk/kanga-export/EventStore/
>>>
>>> xrd.protocol xrootd *
>>> xrootd.async off
>>> xrootd.export /prod
>>> xrootd.export /store
>>> xrootd.fslib /home/xrootd/software/current/lib/libXrdOfs.so
>>> xrootd.chksum crc32 /home/xrootd/bin/getCRC32.sh
>>>
>>> odc.trace redirect
>>> ---
>>>
>>> Did anything also happen at 18:33 or 18:45 when the redirector got
>>> reset? In principle nothing happened from your point of view.
>>>
>>> Cheers,
>>>
>>> -- Gregory
>>>
>>>
>>>
>>> On Tue, 11 Oct 2005, Miriam Fritsch wrote:
>>>
>>>>
>>>> Hi Gregory,
>>>>
>>>> some jobs crash with the following error message:
>>>>
>>>> ---------------------------------------------------------------------------
>>>>
>>>> 18:21:37.524 EvtCounter: processing event # 12085 [ 1d:ffffffff:04ee72/3f73bb1d:V ]
>>>> 2005-10-11 18:21:37 19228 Err : TXMessage::ReadRaw - Error reading 8 bytes
>>>> 2005-10-11 18:21:37 19228 Err : ReadPartialAnswer - Error reading msg from connmgr (server [f01-010-107.gridka.de:1094]).
>>>> 18:21:44.575 EvtCounter: processing event # 12086 [ 1d:ffffffff:04ee72/3f73be86:J ]
>>>> 2005-10-11 18:21:44 19228 Err : TXNetFile::ReadBuffer - Server [f01-010-107.gridka.de:1094] did not return OK message for last request.
>>>> 2005-10-11 18:21:44 19228 Err : SendGenCommand - Server declared error 3004: 'read does not refer to an open file'
>>>> -- JOB DONE --------------------------------------------------------------
>>>>
>>>> Cheers,
>>>>
>>>> Miriam
>>>>
>>>>
>>>> *************************************************************************
>>>>
>>>> Dr. Miriam Fritsch
>>>>
>>>> Institut fuer Experimentalphysik I
>>>> Ruhr-Universitaet Bochum, Germany email: [log in to unmask]
>>>> c/o SLAC tel: +1 (650) 926-3565
>>>> 2575 Sand Hill Road #34 fax: +1 (650) 926-3882
>>>> Menlo Park, CA 94025, USA home: +1 (650) 324-2813
>>>>
>>>> *************************************************************************
>>>>
>>>>
>>>
>
>
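
For reference, solution (b) from Andy's message (re-open the file when the server reports that the read does not refer to an open file) might look roughly like the sketch below. This is only an illustration under stated assumptions, not the actual xrootd client code: FakeServer, ServerError and ReopeningFile are hypothetical names made up for this example, and the only detail taken from the thread is error code 3004 from Miriam's log.

```python
# Hypothetical sketch of "re-open on error 3004"; none of these classes
# belong to the real xrootd client library.

ERR_FILE_NOT_OPEN = 3004  # code seen in the job log quoted in this thread


class ServerError(Exception):
    """Error reported by the (simulated) data server."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


class FakeServer:
    """Toy stand-in for a data server that forgets open handles on restart."""

    def __init__(self, files):
        self._files = files      # path -> bytes
        self._handles = {}       # handle -> path
        self._next_handle = 0

    def open(self, path):
        self._next_handle += 1
        self._handles[self._next_handle] = path
        return self._next_handle

    def read(self, handle, offset, length):
        if handle not in self._handles:
            raise ServerError(ERR_FILE_NOT_OPEN,
                              "read does not refer to an open file")
        data = self._files[self._handles[handle]]
        return data[offset:offset + length]

    def restart(self):
        # A restart drops all server-side file handles.
        self._handles.clear()


class ReopeningFile:
    """Client-side wrapper that re-opens the file once on a stale handle."""

    def __init__(self, server, path):
        self._server = server
        self._path = path
        self._handle = server.open(path)
        self.reopens = 0

    def read(self, offset, length):
        try:
            return self._server.read(self._handle, offset, length)
        except ServerError as err:
            if err.code != ERR_FILE_NOT_OPEN:
                raise  # unrelated errors are not masked
            # Stale handle: open again and retry the same read exactly once.
            self._handle = self._server.open(self._path)
            self.reopens += 1
            return self._server.read(self._handle, offset, length)


if __name__ == "__main__":
    server = FakeServer({"/store/example": b"0123456789"})
    f = ReopeningFile(server, "/store/example")
    print(f.read(0, 4))
    server.restart()
    print(f.read(4, 3), "reopens:", f.reopens)
```

The retry is limited to a single attempt, so a server that keeps rejecting the handle still surfaces the error to the job. As Andy notes, this only papers over the problem; closing the timing window in the client (solution a) is the cleaner fix.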