Hi Manny,

Good. Hold on to the "non-working" mps_PreStage because we can use it to
find out why the client is doing what it is doing in the error condition
that gets returned.

Andy

On Thu, 9 Jun 2005, Emmanuel Olaiya wrote:

> Hi Andy
>
> Andrew Hanushevsky wrote:
> > Hi Manny,
> >
> > I guess we will completely sort this out on Monday. Distilling all of the
> > below, there is only one salient issue:
> >
> > a) Why is the file *not* getting basedir prepended to it? We can figure
> > this out by doing a diff on what you installed and what is in utils to see
> > why mps_PreStage is not prefixing the path.
> >
>
> I was using mps_Prestage from the xrootd package as opposed to a RAL
> version Chris modified. Staging works now!
>
> > The "continuing to hang" problem is a client problem. Here the client is
> > always asking for a cache refresh. So, either an old client is being used
> > (old clients had this bug and it was fixed about 6 months ago) or the bug
> > has returned under this new scenario (I suspect that the latter is true).
> >
>
> With the last test I didn't check the "continuing to hang" problem.
> Though now that staging works this has got rid of one of our biggest
> problems. If I now ask for a file that is not in the MSS or on disk,
> this information is now passed on and my job does not hang.
>
> I'll be happy to do some tests with you on Monday.
>
> cheers
>
> Manny
>
> > So, Fabrizio, do you see anywhere in the client where the code may get
> > caught in a cache refresh loop?
> > Andy
> >
> > On Thu, 9 Jun 2005, Bill Weeks wrote:
> >
> >
> >>Hi,
> >>I hope I can help sort out what's going on here, but it is confusing.
> >>First off, mps_PreStage and mps_Stage never really handled "mssdir" and
> >>"basedir" correctly. This was never a problem for us because these have
> >>always been the same. For RAL, this is not the case. So RAL (Chris?) changed
> >>mps_PreStage to add $basedir to the target filename, e.g.
> >>
> >>   $cmd = "$pstgcmd $rflag $Lflag $file $basedir/$file 2>&1";
> >>
> >>Once this was done, mps_Stage failed for a file whose path did not
> >>previously exist because $basedir/$file created a filepath with a "//"
> >>in it and the MakePath subroutine didn't handle this properly. The change
> >>I made in version 1.9 of mps_Stage removed the double //'s so MakePath
> >>would work properly.
> >>
> >>The problem you are now reporting seems to indicate that you have either
> >>removed your mod to mps_PreStage or have redefined basedir in your config
> >>file because mps_Stage is trying to write into /store instead of /basedir/store,
> >>e.g. /stage/bdata-data50/kanga/store. Is this what happened?
> >>
> >>I think once the file is correctly staged in, the waiting jobs that are
> >>polling for the file will continue.
> >>
> >>We still have some work to do to correctly handle the situation where mssdir
> >>and basedir are different.
> >>--Bill Weeks, SLAC, (650) 926-2909
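
The "//" problem Bill describes above can be illustrated with a minimal sketch. It is written in Python rather than the Perl used by the mps scripts, and the function names `join_stage_path` and `make_parent_dirs` are hypothetical; this is not the code from mps_Stage 1.9, only one way to prefix basedir to a file name without producing a doubled slash before creating the directory chain.

```python
import os
import re
import tempfile


def join_stage_path(basedir: str, lfn: str) -> str:
    """Prefix basedir to a file name, collapsing any run of slashes to one."""
    return re.sub(r"/{2,}", "/", f"{basedir}/{lfn}")


def make_parent_dirs(path: str) -> str:
    """Create the parent directory chain for path if it does not exist yet."""
    parent = os.path.dirname(path)
    os.makedirs(parent, exist_ok=True)
    return parent


if __name__ == "__main__":
    # A temporary directory stands in for basedir (an assumption for the demo).
    with tempfile.TemporaryDirectory() as basedir:
        target = join_stage_path(basedir + "/", "/store/PRskims/example.root")
        print(target)
        print(os.path.isdir(make_parent_dirs(target)))
```

Collapsing the slashes at the point where the two parts are joined means that whatever creates the directories afterwards never sees an empty path component.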
> >>
> >>
> >>
> >>>Date: Tue, 07 Jun 2005 14:30:56 -0700
> >>>From: Emmanuel Olaiya <[log in to unmask]>
> >>>User-Agent: Mozilla Thunderbird 0.9 (X11/20041103)
> >>>X-Accept-Language: en-us, en
> >>>MIME-Version: 1.0
> >>>To: Andrew Hanushevsky <[log in to unmask]>
> >>>CC: "Adye, TJ (Tim)" <[log in to unmask]>, "Brew, CAJ (Chris)"
> >>> <[log in to unmask]>, [log in to unmask], Bill Weeks <[log in to unmask]>
> >>>Subject: Re: PreStage Problems
> >>>Content-Transfer-Encoding: 7bit
> >>>
> >>>Hi Andy, Bill
> >>>
> >>>I took the versions of mps_Stage and mps_prep from
> >>>/afs/slac/package/xrd/xrootd/utils. These are mps_Stage and mps_prep
> >>>versions 1.9 and 1.8 respectively.
> >>>
> >>>I still see the problem Chris reported. Restarting the directors and the
> >>>server (with prestaging on the server) I get the following message in
> >>>the prestage log when asking for a file that doesn't exist at RAL
> >>>
> >>>Starting new cycle, pstg proc = 0
> >>>21:17:41 [ 17543] getlock: locking file >>/opt/xrootd/stageQ/PreStageQ.0.lock, flags 2
> >>>21:17:41 [ 17543] getlock: locking file +</opt/xrootd/stageQ/PreStageQ.0.old, flags 2
> >>>21:17:41 [ 17543] unlock: unlocking file /opt/xrootd/stageQ/PreStageQ.0.old
> >>>21:17:41 [ 17543] unlock: unlocking file /opt/xrootd/stageQ/PreStageQ.0.lock
> >>>21:17:41 [ 17543] getlock: locking file >>/opt/xrootd/stageQ/PreStageQ.1.lock, flags 2
> >>>21:17:41 [ 17543] unlock: unlocking file /opt/xrootd/stageQ/PreStageQ.1.lock
> >>>21:21:29 [ 17772] mps_Stage: cannot create 'store' in '/store/PRskims/R14/16.1.1b/BToPPP/58/'; Permission denied
> >>>21:21:29 [ 17772] mps_Stage: Invalid file system path, '/store/PRskims/R14/16.1.1b/BToPPP/58/'.
> >>>21:21:29 [ 17772] do_stagein: xfr failed for /store/PRskims/R14/16.1.1b/BToPPP/58/BToPPP_5831.01.root, rc=4, retry=1
> >>>
> >>>Whilst my job just hangs. If I take the log file literally, it is trying
> >>>to write to /store when it should be trying to write to
> >>>/base_directory/store.
> >>>
> >>>Doing further tests I can reproduce the problem I reported earlier.
> >>>Whilst still asking for the above file I turn off staging, restart the
> >>>directors and servers and the request for the file continues to hang (is
> >>>told to wait). Then I make another request for the same file and this
> >>>request is also continually told to wait:
> >>>
> >>>050607 21:55:13 2915 odc_Locate: olaiya.8042:[log in to unmask] asked to wait 5 by xrootd107 path=/store/PRskims/R14/16.1.1b/BToPPP/58/BToPPP_5831.01.root
> >>>050607 21:55:14 2915 odc_Locate: olaiya.23507:[log in to unmask] asked to wait 5 by xrootd107 path=/store/PRskims/R14/16.1.1b/BToPPP/58/BToPPP_5831.01.root
> >>>050607 21:55:18 2915 odc_Locate: olaiya.8042:[log in to unmask] asked to wait 5 by xrootd107 path=/store/PRskims/R14/16.1.1b/BToPPP/58/BToPPP_5831.01.root
> >>>...
> >>>
> >>>
> >>>It is only after I kill the first request that any more requests for this
> >>>file return correctly with a message indicating that the file cannot be
> >>>found.
> >>>
> >>>cheers
> >>>
> >>>Manny
> >>>
> >>>Andrew Hanushevsky wrote:
> >>>
> >>>>Hi Tim,
> >>>>
> >>>>Bill Weeks should have the fix available. You can also find the fixed mps
> >>>>scripts in /afs/slac/package/xrd/xrootd/utils (I think you just need an
> >>>>update for mps_Stage and mps_prep).
> >>>>
> >>>>Otherwise, the earliest time I can get together with Manny is Monday. How
> >>>>about the afternoon, say 1:30pm?
> >>>>
> >>>>Andy
> >>>>
> >>>>On Tue, 7 Jun 2005, Adye, TJ (Tim) wrote:
> >>>>
> >>>>
> >>>>
> >>>>>Hi Guys,
> >>>>>
> >>>>>Did you manage to sort something out, despite the cancellation of the
> >>>>>meeting? These are serious problems for us.
> >>>>>
> >>>>>Tim.
> >>>>>
> >>>>>
> >>>>>
> >>>>>>-----Original Message-----
> >>>>>>From: [log in to unmask]
> >>>>>>[mailto:[log in to unmask]] On Behalf Of
> >>>>>>Emmanuel Olaiya
> >>>>>>Sent: 06 June 2005 22:57
> >>>>>>To: Andy Hanushevsky
> >>>>>>Cc: Brew, CAJ (Chris); [log in to unmask]; Bill Weeks
> >>>>>>Subject: Re: PreStage Problems
> >>>>>>
> >>>>>>Hi Andy
> >>>>>>
> >>>>>>Yes, it would be good if you could have a look at this with me. We can
> >>>>>>arrange a time in the xrootd meeting tomorrow.
> >>>>>>
> >>>>>>cheers
> >>>>>>
> >>>>>>Manny
> >>>>>>
> >>>>>>Andy Hanushevsky wrote:
> >>>>>>
> >>>>>>
> >>>>>>>Hi Manny,
> >>>>>>>
> >>>>>>>I find this is quite mysterious as this should never be the case and,
> >>>>>>>frankly, appears to violate causality. I suspect something else is going
> >>>>>>>on. If this is reproducible then why don't we run a test with all
> >>>>>>>debugging turned on. Yes?
> >>>>>>>
> >>>>>>>Andy
> >>>>>>>
> >>>>>>>----- Original Message ----- From: "Emmanuel Olaiya" <[log in to unmask]>
> >>>>>>>To: "Andrew Hanushevsky" <[log in to unmask]>
> >>>>>>>Cc: "Brew, CAJ (Chris)" <[log in to unmask]>;
> >>>>>>><[log in to unmask]>; "Bill Weeks" <[log in to unmask]>
> >>>>>>>Sent: Monday, June 06, 2005 1:41 PM
> >>>>>>>Subject: Re: PreStage Problems
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>Hi Andy
> >>>>>>>>
> >>>>>>>>I should have mentioned that we also removed the prestage queue and
> >>>>>>>>restarted both the server and redirector. However the old request to
> >>>>>>>>wait did not change. Moreover, any similar new requests were also told
> >>>>>>>>to wait until the old request was terminated.
> >>>>>>>>
> >>>>>>>>cheers
> >>>>>>>>
> >>>>>>>>Manny
> >>>>>>>>
> >>>>>>>>Andrew Hanushevsky wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>Hi Manny,
> >>>>>>>>>
> >>>>>>>>>Yes, but who is telling the client to wait? The redirector or the server
> >>>>>>>>>that wanted to originally stage the file in? When you restart the
> >>>>>>>>>redirector it loses all its memory but the data server does not. So, it
> >>>>>>>>>will happily tell the redirector that it has the file even though the
> >>>>>>>>>file is merely in the pre-stage queue. As long as the file is in the
> >>>>>>>>>prestage queue and not on disk, the only option is to direct clients to
> >>>>>>>>>where the file will be staged in and then the clients simply wait for
> >>>>>>>>>the file (which in this case will never appear). So, if you remove
> >>>>>>>>>staging you also need to remove the prestage queue and restart the data
> >>>>>>>>>server.
> >>>>>>>>>
> >>>>>>>>>Andy
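
The behaviour Andy describes above can be pictured with a toy model. This is a sketch in Python, not xrootd code: the names `Reply` and `data_server_reply` are hypothetical, and the model only covers the case where staging has been switched off, so a file that is neither on disk nor queued is simply reported as missing.

```python
from enum import Enum


class Reply(Enum):
    SERVE = "serve file"
    WAIT = "wait and retry"
    NOT_FOUND = "file not found"


def data_server_reply(path, on_disk, prestage_queue):
    """Toy decision: disk wins, then a queued entry means 'wait', else missing."""
    if path in on_disk:
        return Reply.SERVE
    if path in prestage_queue:
        # A stale queue entry keeps clients waiting for a file that never arrives.
        return Reply.WAIT
    return Reply.NOT_FOUND


if __name__ == "__main__":
    path = "/store/example.root"  # illustrative path only
    queue = {path}
    print(data_server_reply(path, set(), queue))
    queue.clear()  # corresponds to removing the prestage queue
    print(data_server_reply(path, set(), queue))
```

In this model the only way to turn a "wait" into a "file not found" without staging is to drop the queue entry, which matches the advice to remove the prestage queue and restart the data server.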
> >>>>>>>>>
> >>>>>>>>>On Fri, 3 Jun 2005, Emmanuel Olaiya wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>Hi Andy
> >>>>>>>>>>
> >>>>>>>>>>One other issue we have spotted at RAL. We removed the staging
> >>>>>>>>>>capabilities and restarted the director and server. However we found
> >>>>>>>>>>previous requests for a file that were told to wait continued being told
> >>>>>>>>>>to wait. We also found that if somebody else asked for this same file
> >>>>>>>>>>that was not on disk they were also told to wait rather than being told
> >>>>>>>>>>the file could not be found. We needed to kill the previous request and
> >>>>>>>>>>restart the server and director for xrootd to know the file was not on
> >>>>>>>>>>disk.
> >>>>>>>>>>
> >>>>>>>>>>cheers
> >>>>>>>>>>
> >>>>>>>>>>Manny
> >>>>>>>>>>
> >>>>>>>>>>Andrew Hanushevsky wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>Hi Chris,
> >>>>>>>>>>>
> >>>>>>>>>>>Oh yeah, different problem. I think that Bill Weeks fixed that.
> >>>>>>>>>>>Bill, did you fix that problem?
> >>>>>>>>>>>
> >>>>>>>>>>>Andy
> >>>>>>>>>>>
> >>>>>>>>>>>On Mon, 30 May 2005, Brew, CAJ (Chris) wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>>I might be being stupid but I don't see how this relates to the
> >>>>>>>>>>>>problem. The files I wanted were on a different disk server which then
> >>>>>>>>>>>>went down. The server in question was registered with the OLB as being
> >>>>>>>>>>>>able to stage in the name space so the request was redirected to it. If
> >>>>>>>>>>>>mps_Stage is used without the PreStage queuing system everything works
> >>>>>>>>>>>>as expected. If we try to go through the PreStage queue to limit the
> >>>>>>>>>>>>number of concurrent accesses to the tapestore the stage in fails,
> >>>>>>>>>>>>apparently because the DIR_LOCK file does not exist (which it doesn't,
> >>>>>>>>>>>>since the file, and its directory structure, has never existed on this
> >>>>>>>>>>>>server).
> >>>>>>>>>>>>
> >>>>>>>>>>>>Yours,
> >>>>>>>>>>>>Chris.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>>-----Original Message-----
> >>>>>>>>>>>>>From: Andrew Hanushevsky [mailto:[log in to unmask]]
> >>>>>>>>>>>>>Sent: 28 May 2005 07:39
> >>>>>>>>>>>>>To: Brew, CAJ (Chris)
> >>>>>>>>>>>>>Cc: [log in to unmask]; abh; Olaiya, EO (Emmanuel)
> >>>>>>>>>>>>>Subject: RE: PreStage Problems
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>Hi Chris,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>This was traced to overzealous testing. The system does not put in a new
> >>>>>>>>>>>>>entry in the pre-stage queue until after about 10-20 minutes have elapsed
> >>>>>>>>>>>>>since the last time the entry was added. So, this is not a bug but a test
> >>>>>>>>>>>>>case that was not "real". Generally, files live in the disk cache for at
> >>>>>>>>>>>>>least 10-20 minutes.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>Andy
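
The hold-off Andy describes above, where a repeated request is not re-queued until some time has passed since the entry was last added, can be sketched as follows. This is an illustration in Python with hypothetical names (`PreStageQueue`, `holdoff_seconds`); the 600-second default is an assumption picked from the low end of the "10-20 minutes" quoted above and is not a value taken from the mps scripts.

```python
class PreStageQueue:
    """Toy queue that ignores re-adds of a path within a hold-off window."""

    def __init__(self, holdoff_seconds=600.0):
        self.holdoff = holdoff_seconds  # assumed value, see lead-in
        self._last_added = {}           # path -> time the entry was last queued
        self.entries = []               # queued paths, in arrival order

    def add(self, path, now):
        """Queue path unless it was queued less than holdoff seconds ago."""
        last = self._last_added.get(path)
        if last is not None and now - last < self.holdoff:
            return False  # too soon: the request is silently dropped
        self._last_added[path] = now
        self.entries.append(path)
        return True


if __name__ == "__main__":
    q = PreStageQueue()
    print(q.add("/store/example.root", now=0.0))
    print(q.add("/store/example.root", now=120.0))
    print(q.add("/store/example.root", now=900.0))
```

A test that stages a file, deletes it and asks for it again within the window would see the second request ignored, which is the "not real" test case mentioned above.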
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>On Fri, 27 May 2005, Brew, CAJ (Chris) wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>Hi,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>At the meeting a couple of weeks ago, it was said that someone was
> >>>>>>>>>>>>>>looking into this but I haven't heard anything back. Is there any news?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>Thanks,
> >>>>>>>>>>>>>>Chris.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>-----Original Message-----
> >>>>>>>>>>>>>>>From: Brew, CAJ (Chris)
> >>>>>>>>>>>>>>>Sent: 17 May 2005 13:50
> >>>>>>>>>>>>>>>To: [log in to unmask]; abh
> >>>>>>>>>>>>>>>Cc: Olaiya, EO (Emmanuel)
> >>>>>>>>>>>>>>>Subject: PreStage Problems
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>Hi,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>I've been running some more tests of the staging at RAL and
> >>>>>>>>>>>>>>>have run into a problem somewhere in the
> >>>>>>>>>>>>>>>mps_Stage/PreStage/prep system.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>Everything works fine staging a file that was on the system and
> >>>>>>>>>>>>>>>has been deleted, but if I try to stage in a file that was on
> >>>>>>>>>>>>>>>a different server, so that the directory structure for the
> >>>>>>>>>>>>>>>file does not exist on the staging server, it fails and I see
> >>>>>>>>>>>>>>>the following error in the PreStage log file:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>12:45:43 [ 10859] mps_Stage: Open '/stage/bdata-data50/kanga//store/SPskims/R12/16.0.2e/BtoKKKL/001005/200002/DIR_LOCK' r/w failed; No such file or directory.
> >>>>>>>>>>>>>>>12:45:43 [ 10859] do_stagein: xfr failed for /store/SPskims/R12/16.0.2e/BtoKKKL/001005/200002/BtoKKKL_001005_3247.01.root, rc=4, retry=1
> >>>>>>>>>>>>>>>12:45:45 [  3255] file=/store/SPskims/R12/16.0.2e/BtoKKKL/001005/200002/BtoKKKL_0010053247.01.root, rc=1024, reqid=ef000001:1cd2.425d27e1:3762
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>If I create the directories and the DIR_LOCK file before
> >>>>>>>>>>>>>>>running the import, everything works.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>The config file I'm using on the server is below.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>Is there some setting I'm missing which is needed to create
> >>>>>>>>>>>>>>>the directories/DIR_LOCK file or does the code need fixing?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>Thanks,
> >>>>>>>>>>>>>>>Chris
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>--
> >>>>>>>>>>>>>>>Chris Brew  ([log in to unmask])  +44 1235 446326
> >>>>>>>>>>>>>>>Particle Physics Department
> >>>>>>>>>>>>>>>Rutherford Appleton Laboratory
> >>>>>>>>>>>>>>>Chilton, Didcot. Oxfordshire.
> >>>>>>>>>>>>>>>OX11 0QX. United Kingdom.
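
The workaround Chris mentions above, creating the directories and the DIR_LOCK file before running the import, might look like the following sketch. It is written in Python with a hypothetical helper name, `ensure_dir_lock`, and it assumes DIR_LOCK only needs to exist as a regular file; whatever mps_Stage expects of its contents or permissions is not covered here.

```python
import os
import tempfile


def ensure_dir_lock(directory, lock_name="DIR_LOCK"):
    """Create directory (and parents) plus an empty lock file if missing."""
    os.makedirs(directory, exist_ok=True)
    lock_path = os.path.join(directory, lock_name)
    # Append mode creates the file when absent and leaves an existing one intact.
    with open(lock_path, "a"):
        pass
    return lock_path


if __name__ == "__main__":
    # A temporary directory stands in for the staging area (demo assumption).
    with tempfile.TemporaryDirectory() as base:
        target_dir = os.path.join(base, "store", "SPskims", "example")
        print(ensure_dir_lock(target_dir))
```

Calling the helper twice is harmless, so it could run before every stage-in rather than only for paths that are new to the server.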
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>
>