Hi Manny,

Good. Hold on to the "non-working" mps_PreStage because we can use this to
find out why the client is doing what it is doing in the error condition
that gets returned.

Andy

On Thu, 9 Jun 2005, Emmanuel Olaiya wrote:

> Hi Andy
>
> Andrew Hanushevsky wrote:
> > Hi Manny,
> >
> > I guess we will completely sort this out on Monday. Distilling all of the
> > below, there is only one salient issue:
> >
> > a) Why is the file *not* getting basedir prepended to it? We can figure
> > this out by doing a diff on what you installed and what is in utils to see
> > why mps_PreStage is not prefixing the path.
> >
>
> I was using mps_PreStage from the xrootd package as opposed to a RAL
> version Chris modified. Staging works now!
>
> > The "continuing to hang" problem is a client problem. Here the client is
> > always asking for a cache refresh. So, either an old client is being used
> > (old clients had this bug and it was fixed about 6 months ago) or the bug
> > has returned under this new scenario (I suspect that the latter is true).
> >
>
> With the last test I didn't check the "continuing to hang" problem.
> Though now that staging works this has got rid of one of our biggest
> problems. If I now ask for a file that is not in the MSS or on disk,
> this information is now passed on and my job does not hang.
>
> I'll be happy to do some tests with you on Monday.
>
> cheers
>
> Manny
>
> > So, Fabrizio, do you see anywhere in the client where the code may get
> > caught in a cache refresh loop?
> > Andy
> >
> > On Thu, 9 Jun 2005, Bill Weeks wrote:
> >
> >>Hi,
> >>I hope I can help sort out what's going on here, but it is confusing.
> >>First off, mps_PreStage and mps_Stage never really handled "mssdir" and
> >>"basedir" correctly. This was never a problem for us because these have
> >>always been the same. For RAL, this is not the case. So RAL (Chris?) changed
> >>mps_PreStage to add $basedir to the target filename, e.g.
> >>
> >>    $cmd = "$pstgcmd $rflag $Lflag $file $basedir/$file 2>&1";
> >>
> >>Once this was done, mps_Stage failed for a file whose path did not
> >>previously exist because $basedir/$file created a filepath with a "//"
> >>in it and the MakePath subroutine didn't handle this properly. The change
> >>I made in version 1.9 of mps_Stage removed the double //'s so MakePath
> >>would work properly.
> >>
> >>The problem you are now reporting seems to indicate that you have either
> >>removed your mod to mps_PreStage or have redefined basedir in your config
> >>file because mps_Stage is trying to write into /store instead of /basedir/store,
> >>e.g. /stage/bdata-data50/kanga/store. Is this what happened?
> >>
> >>I think once the file is correctly staged in, the waiting jobs that are
> >>polling for the file will continue.
> >>
> >>We still have some work to do to correctly handle the situation where mssdir
> >>and basedir are different.
> >>--Bill Weeks, SLAC, (650) 926-2909
> >>
> >>>Date: Tue, 07 Jun 2005 14:30:56 -0700
> >>>From: Emmanuel Olaiya <[log in to unmask]>
> >>>User-Agent: Mozilla Thunderbird 0.9 (X11/20041103)
> >>>X-Accept-Language: en-us, en
> >>>MIME-Version: 1.0
> >>>To: Andrew Hanushevsky <[log in to unmask]>
> >>>CC: "Adye, TJ (Tim)" <[log in to unmask]>, "Brew, CAJ (Chris)"
> >>>    <[log in to unmask]>, [log in to unmask], Bill Weeks
> >>>    <[log in to unmask]>
> >>>Subject: Re: PreStage Problems
> >>>Content-Transfer-Encoding: 7bit
> >>>
> >>>Hi Andy, Bill
> >>>
> >>>I took the versions of mps_Stage and mps_prep from
> >>>/afs/slac/package/xrd/xrootd/utils. These are mps_Stage and mps_prep
> >>>versions 1.9 and 1.8 respectively.
> >>>
> >>>I still see the problem Chris reported.
> >>>Restarting the directors and the
> >>>server (with prestaging on the server) I get the following message in
> >>>the prestage log when asking for a file that doesn't exist at RAL
> >>>
> >>>Starting new cycle, pstg proc = 0
> >>>21:17:41 [ 17543] getlock: locking file >>/opt/xrootd/stageQ/PreStageQ.0.lock, flags 2
> >>>21:17:41 [ 17543] getlock: locking file +</opt/xrootd/stageQ/PreStageQ.0.old, flags 2
> >>>21:17:41 [ 17543] unlock: unlocking file /opt/xrootd/stageQ/PreStageQ.0.old
> >>>21:17:41 [ 17543] unlock: unlocking file /opt/xrootd/stageQ/PreStageQ.0.lock
> >>>21:17:41 [ 17543] getlock: locking file >>/opt/xrootd/stageQ/PreStageQ.1.lock, flags 2
> >>>21:17:41 [ 17543] unlock: unlocking file /opt/xrootd/stageQ/PreStageQ.1.lock
> >>>21:21:29 [ 17772] mps_Stage: cannot create 'store' in
> >>>'/store/PRskims/R14/16.1.1b/BToPPP/58/'; Permission denied
> >>>21:21:29 [ 17772] mps_Stage: Invalid file system path,
> >>>'/store/PRskims/R14/16.1.1b/BToPPP/58/'.
> >>>21:21:29 [ 17772] do_stagein: xfr failed for
> >>>/store/PRskims/R14/16.1.1b/BToPPP/58/BToPPP_5831.01.root, rc=4, retry=1
> >>>
> >>>Whilst my job just hangs. If I take the log file literally, it is trying
> >>>to write to /store when it should be trying to write to
> >>>/base_directory/store.
> >>>
> >>>Doing further tests I can reproduce the problem I reported earlier.
> >>>Whilst still asking for the above file I turn off staging, restart the
> >>>directors and servers and the request for the file continues to hang (is
> >>>told to wait).
> >>>Then I make another request for the same file and this
> >>>request is also continually told to wait:
> >>>
> >>>050607 21:55:13 2915 odc_Locate: olaiya.8042:[log in to unmask] asked to
> >>>wait 5 by xrootd107
> >>>path=/store/PRskims/R14/16.1.1b/BToPPP/58/BToPPP_5831.01.root
> >>>050607 21:55:14 2915 odc_Locate: olaiya.23507:[log in to unmask] asked to
> >>>wait 5 by xrootd107
> >>>path=/store/PRskims/R14/16.1.1b/BToPPP/58/BToPPP_5831.01.root
> >>>050607 21:55:18 2915 odc_Locate: olaiya.8042:[log in to unmask] asked to
> >>>wait 5 by xrootd107
> >>>path=/store/PRskims/R14/16.1.1b/BToPPP/58/BToPPP_5831.01.root
> >>>...
> >>>
> >>>It is only after I kill the first request that any more requests for this
> >>>file return correctly with a message indicating that the file cannot be
> >>>found.
> >>>
> >>>cheers
> >>>
> >>>Manny
> >>>
> >>>Andrew Hanushevsky wrote:
> >>>
> >>>>Hi Tim,
> >>>>
> >>>>Bill Weeks should have the fix available. You can also find the fixed mps
> >>>>scripts in /afs/slac/package/xrd/xrootd/utils (I think you just need an
> >>>>update for mps_Stage and mps_prep).
> >>>>
> >>>>Otherwise, the earliest time I can get together with Manny is Monday. How
> >>>>about the afternoon, say 1:30pm?
> >>>>
> >>>>Andy
> >>>>
> >>>>On Tue, 7 Jun 2005, Adye, TJ (Tim) wrote:
> >>>>
> >>>>>Hi Guys,
> >>>>>
> >>>>>Did you manage to sort something out, despite the cancellation of the
> >>>>>meeting? These are serious problems for us.
> >>>>>
> >>>>>Tim.
> >>>>>
> >>>>>>-----Original Message-----
> >>>>>>From: [log in to unmask]
> >>>>>>[mailto:[log in to unmask]] On Behalf Of
> >>>>>>Emmanuel Olaiya
> >>>>>>Sent: 06 June 2005 22:57
> >>>>>>To: Andy Hanushevsky
> >>>>>>Cc: Brew, CAJ (Chris); [log in to unmask]; Bill Weeks
> >>>>>>Subject: Re: PreStage Problems
> >>>>>>
> >>>>>>Hi Andy
> >>>>>>
> >>>>>>Yes, it would be good if you could have a look at this with
> >>>>>>me.
> >>>>>>We can
> >>>>>>arrange a time in the xrootd meeting tomorrow.
> >>>>>>
> >>>>>>cheers
> >>>>>>
> >>>>>>Manny
> >>>>>>
> >>>>>>Andy Hanushevsky wrote:
> >>>>>>
> >>>>>>>Hi Manny,
> >>>>>>>
> >>>>>>>I find this quite mysterious as this should never be the case and,
> >>>>>>>frankly, appears to violate causality. I suspect something else is going
> >>>>>>>on. If this is reproducible then why don't we run a test with all
> >>>>>>>debugging turned on. Yes?
> >>>>>>>
> >>>>>>>Andy
> >>>>>>>
> >>>>>>>----- Original Message -----
> >>>>>>>From: "Emmanuel Olaiya" <[log in to unmask]>
> >>>>>>>To: "Andrew Hanushevsky" <[log in to unmask]>
> >>>>>>>Cc: "Brew, CAJ (Chris)" <[log in to unmask]>;
> >>>>>>><[log in to unmask]>; "Bill Weeks" <[log in to unmask]>
> >>>>>>>Sent: Monday, June 06, 2005 1:41 PM
> >>>>>>>Subject: Re: PreStage Problems
> >>>>>>>
> >>>>>>>>Hi Andy
> >>>>>>>>
> >>>>>>>>I should have mentioned that we also removed the prestage queue and
> >>>>>>>>restarted both the server and redirector. However the old request to
> >>>>>>>>wait did not change. Moreover, any similar new requests were also told
> >>>>>>>>to wait until the old request was terminated.
> >>>>>>>>
> >>>>>>>>cheers
> >>>>>>>>
> >>>>>>>>Manny
> >>>>>>>>
> >>>>>>>>Andrew Hanushevsky wrote:
> >>>>>>>>
> >>>>>>>>>Hi Manny,
> >>>>>>>>>
> >>>>>>>>>Yes, but who is telling the client to wait? The redirector, or the
> >>>>>>>>>server that originally wanted to stage the file in? When you restart
> >>>>>>>>>the redirector it loses all its memory but the data server does not.
> >>>>>>>>>So, it will happily
> >>>>>>>>>tell the redirector that it has the file even though the file is
> >>>>>>>>>merely in the pre-stage queue. As long as the file is in the
> >>>>>>>>>prestage queue and not on disk, the only option is to direct clients
> >>>>>>>>>to where the file will be staged in and then the clients simply wait
> >>>>>>>>>for the file (which in this case will never appear). So, if you
> >>>>>>>>>remove staging you also need to remove the prestage queue and
> >>>>>>>>>restart the data server.
> >>>>>>>>>
> >>>>>>>>>Andy
> >>>>>>>>>
> >>>>>>>>>On Fri, 3 Jun 2005, Emmanuel Olaiya wrote:
> >>>>>>>>>
> >>>>>>>>>>Hi Andy
> >>>>>>>>>>
> >>>>>>>>>>One other issue we have spotted at RAL. We removed the staging
> >>>>>>>>>>capabilities and restarted the director and server. However we found
> >>>>>>>>>>previous requests for a file that were told to wait continued being
> >>>>>>>>>>told to wait. We also found that if somebody else asked for this
> >>>>>>>>>>same file that was not on disk they were also told to wait rather
> >>>>>>>>>>than being told the file could not be found. We needed to kill the
> >>>>>>>>>>previous request and restart the server and director for xrootd to
> >>>>>>>>>>know the file was not on disk.
> >>>>>>>>>>
> >>>>>>>>>>cheers
> >>>>>>>>>>
> >>>>>>>>>>Manny
> >>>>>>>>>>
> >>>>>>>>>>Andrew Hanushevsky wrote:
> >>>>>>>>>>
> >>>>>>>>>>>Hi Chris,
> >>>>>>>>>>>
> >>>>>>>>>>>Oh yeah, different problem. I think that Bill Weeks fixed that.
> >>>>>>>>>>>Bill, did you fix that problem?
> >>>>>>>>>>>
> >>>>>>>>>>>Andy
> >>>>>>>>>>>
> >>>>>>>>>>>On Mon, 30 May 2005, Brew, CAJ (Chris) wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>>Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>>I might be being stupid but I don't see how this relates to the
> >>>>>>>>>>>>problem.
> >>>>>>>>>>>>The files I wanted were on a different disk server which then went
> >>>>>>>>>>>>down.
> >>>>>>>>>>>>The server in question was registered with the OLB as being able to
> >>>>>>>>>>>>stage in the name space so the request was redirected to it. If
> >>>>>>>>>>>>mps_Stage is used without the PreStage queuing system everything
> >>>>>>>>>>>>works as expected. If we try to go through the PreStage queue to
> >>>>>>>>>>>>limit the number of concurrent accesses to the tapestore the stage
> >>>>>>>>>>>>in fails.
> >>>>>>>>>>>>Apparently because the DIR_LOCK file does not exist (which it
> >>>>>>>>>>>>doesn't, since the file, and its directory structure, has never
> >>>>>>>>>>>>existed on this server).
> >>>>>>>>>>>>
> >>>>>>>>>>>>Yours,
> >>>>>>>>>>>>Chris.
> >>>>>>>>>>>>
> >>>>>>>>>>>>>-----Original Message-----
> >>>>>>>>>>>>>From: Andrew Hanushevsky [mailto:[log in to unmask]]
> >>>>>>>>>>>>>Sent: 28 May 2005 07:39
> >>>>>>>>>>>>>To: Brew, CAJ (Chris)
> >>>>>>>>>>>>>Cc: [log in to unmask]; abh; Olaiya, EO (Emmanuel)
> >>>>>>>>>>>>>Subject: RE: PreStage Problems
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>Hi Chris,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>This was traced to overzealous testing.
> >>>>>>>>>>>>>The system does not put in a new
> >>>>>>>>>>>>>entry in the pre-stage queue until after about 10-20 minutes
> >>>>>>>>>>>>>have elapsed since the last time the entry was added. So, this
> >>>>>>>>>>>>>is not a bug but a test case that was not "real". Generally,
> >>>>>>>>>>>>>files live in the disk cache for at least 10-20 minutes.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>Andy
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>On Fri, 27 May 2005, Brew, CAJ (Chris) wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>Hi,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>At the meeting a couple of weeks ago, it was said that someone was
> >>>>>>>>>>>>>>looking into this but I haven't heard anything back. Is there any
> >>>>>>>>>>>>>>news?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>Thanks,
> >>>>>>>>>>>>>>Chris.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>-----Original Message-----
> >>>>>>>>>>>>>>>From: Brew, CAJ (Chris)
> >>>>>>>>>>>>>>>Sent: 17 May 2005 13:50
> >>>>>>>>>>>>>>>To: [log in to unmask]; abh
> >>>>>>>>>>>>>>>Cc: Olaiya, EO (Emmanuel)
> >>>>>>>>>>>>>>>Subject: PreStage Problems
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>Hi,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>I've been running some more tests of the staging at RAL and
> >>>>>>>>>>>>>>>have run into a problem somewhere in the
> >>>>>>>>>>>>>>>mps_Stage/PreStage/prep system.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>Everything works fine when staging a file that was on the system
> >>>>>>>>>>>>>>>and has been deleted, but if I try to stage in a file that was on
> >>>>>>>>>>>>>>>a different server, so that the directory structure for the
> >>>>>>>>>>>>>>>file does not exist on the staging server, it fails and I see
> >>>>>>>>>>>>>>>the following error in the PreStage log file:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>12:45:43 [ 10859] mps_Stage: Open '/stage/bdata-data50/kanga//store/SPskims/R12/16.0.2e/BtoKKKL/001005/200002/DIR_LOCK' r/w failed; No such file or directory.
> >>>>>>>>>>>>>>>12:45:43 [ 10859] do_stagein: xfr failed for /store/SPskims/R12/16.0.2e/BtoKKKL/001005/200002/BtoKKKL_001005_3247.01.root, rc=4, retry=1
> >>>>>>>>>>>>>>>12:45:45 [ 3255] file=/store/SPskims/R12/16.0.2e/BtoKKKL/001005/200002/BtoKKKL_0010053247.01.root, rc=1024, reqid=ef000001:1cd2.425d27e1:3762
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>If I create the directories and the DIR_LOCK file before
> >>>>>>>>>>>>>>>running the import, everything works.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>The config file I'm using on the server is below.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>Is there some setting I'm missing which is needed to create
> >>>>>>>>>>>>>>>the directories/DIR_LOCK file or does the code need fixing?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>Thanks,
> >>>>>>>>>>>>>>>Chris
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>--
> >>>>>>>>>>>>>>>Chris Brew ([log in to unmask]) +44 1235 446326
> >>>>>>>>>>>>>>>Particle Physics Department
> >>>>>>>>>>>>>>>Rutherford Appleton Laboratory
> >>>>>>>>>>>>>>>Chilton, Didcot. Oxfordshire.
> >>>>>>>>>>>>>>>OX11 0QX. United Kingdom.
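For reference, the path handling Bill describes in the thread (prefix basedir to the file name, avoid the "//" that tripped up MakePath, and create the missing directory tree before staging) can be sketched roughly as follows. This is only an illustration in Python, not the Perl used by the mps scripts, and not their actual code: the function names staged_path and ensure_parent_dirs are hypothetical, and the sketch assumes basedir and the file name are plain POSIX paths.

```python
import os
import posixpath


def staged_path(basedir: str, lfn: str) -> str:
    """Return basedir + lfn as one clean path (hypothetical helper).

    Assumes POSIX-style paths. Refuses an empty basedir so that a file
    can never silently land under /store instead of <basedir>/store.
    """
    if not basedir:
        raise ValueError("basedir must not be empty")
    # Strip the leading '/' from the file name: posixpath.join would
    # otherwise discard basedir when the second argument is absolute.
    joined = posixpath.join(basedir, lfn.lstrip("/"))
    # normpath collapses any repeated '/' left inside the path.
    return posixpath.normpath(joined)


def ensure_parent_dirs(path: str) -> str:
    """Create the directory tree that will hold `path`, if it is missing."""
    parent = posixpath.dirname(path)
    os.makedirs(parent, exist_ok=True)
    return parent


if __name__ == "__main__":
    target = staged_path(
        "/stage/bdata-data50/kanga/",
        "/store/PRskims/R14/16.1.1b/BToPPP/58/BToPPP_5831.01.root",
    )
    print(target)
```

With the basedir from the thread, the printed target sits under /stage/bdata-data50/kanga/store/... with single slashes throughout. Creating the DIR_LOCK file itself is left out, since the thread does not say how the scripts manage it.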