Hi,

Andy Hanushevsky wrote:
> Hi Gregory,
>
> This is a client problem. You are right, you should have been able to
> restart the server with no problems. Fabrizio, do you see what happened
> here? The file was opened, the server was restarted, the connection was
> remade to that server, but the file was not re-opened. Instead, the
> original file handle was used for the read. Apparently, there is a small
> timing window where that could happen and that causes the job to crash.
> Two solutions: a) (the better one) close the timing window, b) (the
> sloppier one) re-open the file if you get that particular error.

Hmm, does that mean we don't have a chance of fixing this for the current
BABAR sw releases?

Cheers,
Andreas

>
> Andy
>
> ----- Original Message -----
> From: "Gregory Schott" <[log in to unmask]>
> To: "Miriam Fritsch" <[log in to unmask]>
> Cc: "xrootd mailing list" <[log in to unmask]>; "SkimSOS"
> <[log in to unmask]>
> Sent: Tuesday, October 11, 2005 11:14 AM
> Subject: Re: your mail
>
>
>> Hello Miriam,
>>
>> OK. This was at the time one of the servers was restarted (it went
>> offline for just a second or two). Andreas thought that in this case the
>> currently reading processes would reconnect to the redirector for
>> re-assignment of a dataserver. Apparently it crashes instead.
>>
>> I am forwarding to the xrootd experts to ask them for their opinion.
>> We are using the latest (July) production version and the config files
>> look like:
>>
>> $ cat config/redirector.cf
>> olb.allow host babar2.gridka.de
>> olb.allow host f01-014-108.gridka.de
>> olb.allow host f01-016-102.gridka.de
>> olb.allow host f01-016-101.gridka.de
>> olb.allow host f01-014-106.gridka.de
>> olb.allow host f01-016-108.gridka.de
>> olb.allow host f01-016-109.gridka.de
>> olb.allow host f01-016-106.gridka.de
>> olb.allow host f01-016-107.gridka.de
>> olb.allow host f01-014-103.gridka.de
>> olb.allow host f01-014-107.gridka.de
>> olb.allow host f01-005-151.gridka.de
>> olb.allow host f01-010-110.gridka.de
>> olb.allow host f01-005-115.gridka.de
>> olb.allow host f01-010-107.gridka.de
>> olb.allow host l01-001-122.gridka.de
>> olb.port 3121
>>
>> odc.manager l01-001-122.gridka.de 3121
>>
>> xrootd.fslib /home/xrootd/software/current/lib/libXrdOfs.so
>> xrootd.export /prod
>> xrootd.export /store
>>
>> odc.trace redirect
>> ---
>> $ cat config/dataserver.cfg
>> odc.manager l01-001-122.gridka.de 3121
>>
>> olb.allow host babar2.gridka.de
>> olb.allow host f01-014-108.gridka.de
>> olb.allow host f01-016-102.gridka.de
>> olb.allow host f01-016-101.gridka.de
>> olb.allow host f01-014-106.gridka.de
>> olb.allow host f01-016-108.gridka.de
>> olb.allow host f01-016-109.gridka.de
>> olb.allow host f01-016-106.gridka.de
>> olb.allow host f01-016-107.gridka.de
>> olb.allow host f01-014-103.gridka.de
>> olb.allow host f01-014-107.gridka.de
>> olb.allow host f01-005-151.gridka.de
>> olb.allow host 10.65.10.110
>> olb.allow host f01-010-110.gridka.de
>> olb.allow host 10.65.5.115
>> olb.allow host f01-005-115.gridka.de
>> olb.allow host f01-010-107.gridka.de
>> olb.allow host l01-001-122.gridka.de
>>
>> olb.path r /store
>> olb.path w /prod
>> olb.port 3121
>> olb.sched cpu 100
>> olb.subscribe l01-001-122.gridka.de
>> olb.wait
>>
>> ofs.redirect remote if l01-001-122.gridka.de
>> ofs.redirect target
>>
>> oss.alloc * * 80
>> oss.fdlimit * max
>> oss.localroot /home/xrootd/disk/kanga-export/EventStore/
>>
>> xrd.protocol xrootd *
>> xrootd.async off
>> xrootd.export /prod
>> xrootd.export /store
>> xrootd.fslib /home/xrootd/software/current/lib/libXrdOfs.so
>> xrootd.chksum crc32 /home/xrootd/bin/getCRC32.sh
>>
>> odc.trace redirect
>> ---
>>
>> Did anything also happen at 18:33 or 18:45 when the redirector got
>> reset? In principle nothing happened from your point of view.
>>
>> Cheers,
>>
>> -- Gregory
>>
>>
>>
>> On Tue, 11 Oct 2005, Miriam Fritsch wrote:
>>
>>>
>>> Hi Gregory,
>>>
>>> some jobs crash with the following error message:
>>>
>>> ---------------------------------------------------------------------------
>>>
>>> 18:21:37.524 EvtCounter: processing event # 12085 [
>>> 1d:ffffffff:04ee72/3f73bb1d:V ]
>>> 2005-10-11 18:21:37 19228 Err : TXMessage::ReadRaw - Error
>>> reading 8 bytes
>>> 2005-10-11 18:21:37 19228 Err : ReadPartialAnswer - Error
>>> reading msg from connmgr (server [f01-010-107.gridka.de:1094]).
>>> 18:21:44.575 EvtCounter: processing event # 12086 [
>>> 1d:ffffffff:04ee72/3f73be86:J ]
>>> 2005-10-11 18:21:44 19228 Err : TXNetFile::ReadBuffer - Server
>>> [f01-010-107.gridka.de:1094] did not return OK message for last request.
>>> 2005-10-11 18:21:44 19228 Err : SendGenCommand - Server
>>> declared error 3004: 'read does not refer to an open file'
>>> -- JOB DONE
>>> --------------------------------------------------------------
>>>
>>> Cheers,
>>>
>>> Miriam
>>>
>>>
>>> *************************************************************************
>>>
>>>
>>> Dr. Miriam Fritsch
>>>
>>> Institut fuer Experimentalphysik I
>>> Ruhr-Universitaet Bochum, Germany      email: [log in to unmask]
>>> c/o SLAC                               tel:   +1 (650) 926-3565
>>> 2575 Sand Hill Road #34                fax:   +1 (650) 926-3882
>>> Menlo Park, CA 94025, USA              home:  +1 (650) 324-2813
>>>
>>> *************************************************************************
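
For illustration, here is a minimal sketch of what option (b) from Andy's
reply could look like: if a read fails because the server no longer knows
the file handle, re-open the file once and retry the read. This is a toy
model, not the xrootd or ROOT client code. FakeDataServer, ReopeningFile,
ServerError and reopen_count are hypothetical names invented for the
example; only the error code 3004 and its message come from the log quoted
in this thread.

```python
STALE_HANDLE_ERROR = 3004  # 'read does not refer to an open file' (from the log)


class ServerError(Exception):
    """Hypothetical error type carrying a server-declared error code."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


class FakeDataServer:
    """Toy in-memory data server; a restart forgets every open handle."""

    def __init__(self, files):
        self._files = files          # path -> bytes
        self._handles = {}           # handle -> path
        self._next_handle = 0

    def open(self, path):
        handle = self._next_handle
        self._next_handle += 1
        self._handles[handle] = path
        return handle

    def read(self, handle, offset, length):
        if handle not in self._handles:
            raise ServerError(STALE_HANDLE_ERROR,
                              "read does not refer to an open file")
        data = self._files[self._handles[handle]]
        return data[offset:offset + length]

    def restart(self):
        # Simulates the server restart: all previously issued handles are lost.
        self._handles.clear()


class ReopeningFile:
    """Client-side wrapper that re-opens the file on a stale-handle error."""

    def __init__(self, server, path):
        self._server = server
        self._path = path
        self._handle = server.open(path)
        self.reopen_count = 0

    def read(self, offset, length):
        try:
            return self._server.read(self._handle, offset, length)
        except ServerError as err:
            if err.code != STALE_HANDLE_ERROR:
                raise  # any other error is not handled here
            # The handle predates the restart: open again and retry once.
            self._handle = self._server.open(self._path)
            self.reopen_count += 1
            return self._server.read(self._handle, offset, length)


if __name__ == "__main__":
    server = FakeDataServer({"/store/example": b"0123456789"})
    f = ReopeningFile(server, "/store/example")
    print(f.read(0, 4))
    server.restart()
    print(f.read(4, 3), "reopens:", f.reopen_count)
```

The retry is deliberately limited to a single attempt, so a file that
really cannot be opened still surfaces as an error rather than looping.
This only papers over the symptom; closing the timing window in the client
(option a) would still be the proper fix.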