Hi Andy

Andrew Hanushevsky wrote:
> Hi Manny,
> 
> I guess we will completely sort this out on Monday. Distilling all of the
> below, there is only one salient issue:
> 
> a) Why is the file *not* getting basedir prepended to it? We can figure
> this out by doing a diff on what you installed and what is in utils to see
> why mps_PreStage is not prefixing the path.
> 

I was using mps_PreStage from the xrootd package rather than the RAL
version that Chris modified. Staging works now!

> The "continuing to hang" problem is a client problem. Here the client is
> always asking for a cache refresh. So, either an old client is being used
> (old clients had this bug and it was fixed about 6 months ago) or the bug
> has returned under this new scenario (I suspect that the latter is true).
> 

With the last test I didn't check the "continuing to hang" problem.
However, now that staging works, one of our biggest problems is gone:
if I ask for a file that is neither in the MSS nor on disk, this
information is passed on and my job does not hang.

I'll be happy to do some tests with you on Monday.

cheers

Manny

> So, Fabrizio, do you see anywhere in the client where the code may get
> caught in a cache refresh loop?
> Andy
> 
> On Thu, 9 Jun 2005, Bill Weeks wrote:
> 
> 
>>Hi,
>>I hope I can help sort out what's going on here, but it is confusing.
>>First off, mps_PreStage and mps_Stage never really handled "mssdir" and
>>"basedir" correctly. This was never a problem for us because these have
>>always been the same. For RAL, this is not the case. So RAL (Chris?) changed
>>mps_PreStage to add $basedir to the target filename, e.g.
>>
>>   $cmd = "$pstgcmd $rflag $Lflag $file $basedir/$file 2>&1";
>>
>>Once this was done, mps_Stage failed for a file whose path did not
>>previously exist because $basedir/$file created a filepath with a "//"
>>in it and the MakePath subroutine didn't handle this properly. The change
>>I made in version 1.9 of mps_Stage removed the double //'s so MakePath
>>would work properly.
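The "//" problem described above arises when $basedir and $file are concatenated and the file path already begins with a slash. Below is a minimal sketch of slash-safe joining, written in Python purely for illustration. The mps scripts themselves are Perl, and the helper name join_base is hypothetical, not part of xrootd or of the version 1.9 fix.

```python
import re


def join_base(basedir: str, file_path: str) -> str:
    """Join basedir and a logical file path, collapsing repeated slashes.

    Hypothetical helper for illustration only; not taken from mps_Stage.
    """
    joined = f"{basedir}/{file_path}"
    # Collapse any run of '/' into one, so "kanga//store" becomes "kanga/store".
    return re.sub(r"/{2,}", "/", joined)


if __name__ == "__main__":
    # Paths taken from the log excerpts in this thread.
    print(join_base("/stage/bdata-data50/kanga",
                    "/store/PRskims/R14/16.1.1b/BToPPP/58/BToPPP_5831.01.root"))
```

Normalising the joined string, rather than stripping one slash from either side, also covers the case where basedir is configured with a trailing slash.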
>>
>>The problem you are now reporting seems to indicate that you have either
>>removed your mod to mps_PreStage or have redefined basedir in your config
>>file because mps_Stage is trying to write into /store instead of /basedir/store,
>>e.g. /stage/bdata-data50/kanga/store. Is this what happened?
>>
>>I think once the file is correctly staged in, the waiting jobs that are
>>polling for the file will continue.
>>
>>We still have some work to do to correctly handle the situation where mssdir
>>and basedir are different.
>>--Bill Weeks, SLAC, (650) 926-2909
>>
>>
>>
>>>Date: Tue, 07 Jun 2005 14:30:56 -0700
>>>From: Emmanuel Olaiya <[log in to unmask]>
>>>User-Agent: Mozilla Thunderbird 0.9 (X11/20041103)
>>>X-Accept-Language: en-us, en
>>>MIME-Version: 1.0
>>>To: Andrew Hanushevsky <[log in to unmask]>
>>>CC: "Adye, TJ (Tim)" <[log in to unmask]>, "Brew, CAJ (Chris)"
>>> <[log in to unmask]>, [log in to unmask], Bill Weeks
>>> <[log in to unmask]>
>>>Subject: Re: PreStage Problems
>>>Content-Transfer-Encoding: 7bit
>>>
>>>Hi Andy, Bill
>>>
>>>I took the versions of mps_Stage and mps_prep from
>>>/afs/slac/package/xrd/xrootd/utils. These are mps_Stage and mps_prep
>>>versions 1.9 and 1.8 respectively.
>>>
>>>I still see the problem Chris reported. Restarting the directors and the
>>>server (with prestaging on the server) I get the following message in
>>>the prestage log when asking for a file that doesn't exist at RAL
>>>
>>>Starting new cycle, pstg proc = 0
>>>21:17:41 [ 17543] getlock: locking file >>/opt/xrootd/stageQ/PreStageQ.0.lock, flags 2
>>>21:17:41 [ 17543] getlock: locking file +</opt/xrootd/stageQ/PreStageQ.0.old, flags 2
>>>21:17:41 [ 17543] unlock: unlocking file /opt/xrootd/stageQ/PreStageQ.0.old
>>>21:17:41 [ 17543] unlock: unlocking file /opt/xrootd/stageQ/PreStageQ.0.lock
>>>21:17:41 [ 17543] getlock: locking file >>/opt/xrootd/stageQ/PreStageQ.1.lock, flags 2
>>>21:17:41 [ 17543] unlock: unlocking file /opt/xrootd/stageQ/PreStageQ.1.lock
>>>21:21:29 [ 17772] mps_Stage: cannot create 'store' in
>>>'/store/PRskims/R14/16.1.1b/BToPPP/58/'; Permission denied
>>>21:21:29 [ 17772] mps_Stage: Invalid file system path,
>>>'/store/PRskims/R14/16.1.1b/BToPPP/58/'.
>>>21:21:29 [ 17772] do_stagein: xfr failed for
>>>/store/PRskims/R14/16.1.1b/BToPPP/58/BToPPP_5831.01.root, rc=4, retry=1
>>>
>>>Whilst my job just hangs. If I take the log file literally, it is trying
>>>to write to /store when it should be trying to write to
>>>/base_directory/store.
>>>
>>>Doing further tests I can reproduce the problem I reported earlier.
>>>Whilst still asking for the above file I turn off staging, restart the
>>>directors and servers and the request for the file continues to hang (is
>>>told to wait). Then I make another request for the same file and this
>>>request is also continually told to wait:
>>>
>>>050607 21:55:13 2915 odc_Locate: olaiya.8042:[log in to unmask] asked to
>>>wait 5 by xrootd107
>>>path=/store/PRskims/R14/16.1.1b/BToPPP/58/BToPPP_5831.01.root
>>>050607 21:55:14 2915 odc_Locate: olaiya.23507:[log in to unmask] asked to
>>>wait 5 by xrootd107
>>>path=/store/PRskims/R14/16.1.1b/BToPPP/58/BToPPP_5831.01.root
>>>050607 21:55:18 2915 odc_Locate: olaiya.8042:[log in to unmask] asked to
>>>wait 5 by xrootd107
>>>path=/store/PRskims/R14/16.1.1b/BToPPP/58/BToPPP_5831.01.root
>>>...
>>>
>>>
>>>It is only after I kill the first request that any further requests for
>>>this file return correctly with a message indicating that the file cannot
>>>be found.
>>>
>>>cheers
>>>
>>>Manny
>>>
>>>Andrew Hanushevsky wrote:
>>>
>>>>Hi Tim,
>>>>
>>>>Bill Weeks should have the fix available. You can also find the fixed mps
>>>>scripts in /afs/slac/package/xrd/xrootd/utils (I think you just need an
>>>>update for mps_Stage and mps_prep).
>>>>
>>>>Otherwise, the earliest time I can get together with Manny is Monday. How
>>>>about the afternoon, say 1:30pm?
>>>>
>>>>Andy
>>>>
>>>>On Tue, 7 Jun 2005, Adye, TJ (Tim) wrote:
>>>>
>>>>
>>>>
>>>>>Hi Guys,
>>>>>
>>>>>Did you manage to sort something out, despite the cancellation of the
>>>>>meeting? These are serious problems for us.
>>>>>
>>>>>Tim.
>>>>>
>>>>>
>>>>>
>>>>>>-----Original Message-----
>>>>>>From: [log in to unmask]
>>>>>>[mailto:[log in to unmask]] On Behalf Of
>>>>>>Emmanuel Olaiya
>>>>>>Sent: 06 June 2005 22:57
>>>>>>To: Andy Hanushevsky
>>>>>>Cc: Brew, CAJ (Chris); [log in to unmask]; Bill Weeks
>>>>>>Subject: Re: PreStage Problems
>>>>>>
>>>>>>Hi Andy
>>>>>>
>>>>>>Yes, it would be good if you could have a look at this with me. We can
>>>>>>arrange a time in the xrootd meeting tomorrow.
>>>>>>
>>>>>>cheers
>>>>>>
>>>>>>Manny
>>>>>>
>>>>>>Andy Hanushevsky wrote:
>>>>>>
>>>>>>
>>>>>>>Hi Manny,
>>>>>>>
>>>>>>>I find this quite mysterious, as this should never be the case and,
>>>>>>>frankly, appears to violate causality. I suspect something else is going
>>>>>>>on. If this is reproducible then why don't we run a test with all
>>>>>>>debugging turned on. Yes?
>>>>>>>
>>>>>>>Andy
>>>>>>>
>>>>>>>----- Original Message ----- From: "Emmanuel Olaiya" <[log in to unmask]>
>>>>>>>To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>>>>Cc: "Brew, CAJ (Chris)" <[log in to unmask]>;
>>>>>>><[log in to unmask]>; "Bill Weeks" <[log in to unmask]>
>>>>>>>Sent: Monday, June 06, 2005 1:41 PM
>>>>>>>Subject: Re: PreStage Problems
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>Hi Andy
>>>>>>>>
>>>>>>>>I should have mentioned that we also removed the prestage queue and
>>>>>>>>restarted both the server and redirector. However, the old request to
>>>>>>>>wait did not change. Moreover, any similar new requests were also told
>>>>>>>>to wait until the old request was terminated.
>>>>>>>>
>>>>>>>>cheers
>>>>>>>>
>>>>>>>>Manny
>>>>>>>>
>>>>>>>>Andrew Hanushevsky wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>Hi Manny,
>>>>>>>>>
>>>>>>>>>Yes, but who is telling the client to wait? The redirector or the server
>>>>>>>>>that originally wanted to stage the file in? When you restart the
>>>>>>>>>redirector it loses all its memory but the data server does not. So, it
>>>>>>>>>will happily tell the redirector that it has the file even though the
>>>>>>>>>file is merely in the pre-stage queue. As long as the file is in the
>>>>>>>>>prestage queue and not on disk, the only option is to direct clients to
>>>>>>>>>where the file will be staged in, and then the clients simply wait for
>>>>>>>>>the file (which in this case will never appear). So, if you remove
>>>>>>>>>staging you also need to remove the prestage queue and restart the data
>>>>>>>>>server.
>>>>>>>>>
>>>>>>>>>Andy
>>>>>>>>>
>>>>>>>>>On Fri, 3 Jun 2005, Emmanuel Olaiya wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>Hi Andy
>>>>>>>>>>
>>>>>>>>>>One other issue we have spotted at RAL. We removed the staging
>>>>>>>>>>capabilities and restarted the director and server. However, we found
>>>>>>>>>>that previous requests for a file that were told to wait continued
>>>>>>>>>>being told to wait. We also found that if somebody else asked for this
>>>>>>>>>>same file, which was not on disk, they were also told to wait rather
>>>>>>>>>>than being told the file could not be found. We needed to kill the
>>>>>>>>>>previous request and restart the server and director for xrootd to know
>>>>>>>>>>the file was not on disk.
>>>>>>>>>>
>>>>>>>>>>cheers
>>>>>>>>>>
>>>>>>>>>>Manny
>>>>>>>>>>
>>>>>>>>>>Andrew Hanushevsky wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>Hi Chris,
>>>>>>>>>>>
>>>>>>>>>>>Oh yeah, different problem. I think that Bill Weeks fixed that.
>>>>>>>>>>>Bill, did you fix that problem?
>>>>>>>>>>>
>>>>>>>>>>>Andy
>>>>>>>>>>>
>>>>>>>>>>>On Mon, 30 May 2005, Brew, CAJ (Chris) wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>Hi,
>>>>>>>>>>>>
>>>>>>>>>>>>I might be being stupid but I don't see how this relates to the
>>>>>>>>>>>>problem. The files I wanted were on a different disk server which then
>>>>>>>>>>>>went down. The server in question was registered with the OLB as being
>>>>>>>>>>>>able to stage in the name space, so the request was redirected to it.
>>>>>>>>>>>>If mps_Stage is used without the PreStage queuing system everything
>>>>>>>>>>>>works as expected. If we try to go through the PreStage queue to limit
>>>>>>>>>>>>the number of concurrent accesses to the tapestore, the stage in fails,
>>>>>>>>>>>>apparently because the DIR_LOCK file does not exist (which it doesn't,
>>>>>>>>>>>>since the file, and its directory structure, has never existed on this
>>>>>>>>>>>>server).
>>>>>>>>>>>>
>>>>>>>>>>>>Yours,
>>>>>>>>>>>>Chris.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>-----Original Message-----
>>>>>>>>>>>>>From: Andrew Hanushevsky [mailto:[log in to unmask]]
>>>>>>>>>>>>>Sent: 28 May 2005 07:39
>>>>>>>>>>>>>To: Brew, CAJ (Chris)
>>>>>>>>>>>>>Cc: [log in to unmask]; abh; Olaiya, EO (Emmanuel)
>>>>>>>>>>>>>Subject: RE: PreStage Problems
>>>>>>>>>>>>>
>>>>>>>>>>>>>Hi Chris,
>>>>>>>>>>>>>
>>>>>>>>>>>>>This was traced to overzealous testing. The system does not put a new
>>>>>>>>>>>>>entry in the pre-stage queue until about 10-20 minutes have elapsed
>>>>>>>>>>>>>since the last time the entry was added. So, this is not a bug but a
>>>>>>>>>>>>>test case that was not "real". Generally, files live in the disk cache
>>>>>>>>>>>>>for at least 10-20 minutes.
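The behaviour described above, where a repeated request for the same file is not re-queued until some interval has passed, can be modelled roughly as follows. This is an illustrative Python sketch only, not the xrootd implementation: the class name, the 15-minute default (picked from the "10-20 minutes" quoted above), and the explicit clock argument are all assumptions.

```python
class PreStageQueueModel:
    """Toy model of a queue that ignores repeat requests inside a time window."""

    def __init__(self, window_seconds: float = 15 * 60):
        self.window_seconds = window_seconds  # assumed value, not from xrootd
        self.last_added = {}                  # path -> time of last accepted entry
        self.entries = []                     # accepted requests, in arrival order

    def request(self, path: str, now: float) -> bool:
        """Queue path unless it was queued less than window_seconds ago."""
        last = self.last_added.get(path)
        if last is not None and now - last < self.window_seconds:
            return False                      # too soon: no new queue entry
        self.last_added[path] = now
        self.entries.append(path)
        return True
```

Under this model, a test that deletes a staged file and asks for it again within the window produces no new queue entry, which is consistent with the "overzealous testing" explanation.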
>>>>>>>>>>>>>
>>>>>>>>>>>>>Andy
>>>>>>>>>>>>>
>>>>>>>>>>>>>On Fri, 27 May 2005, Brew, CAJ (Chris) wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>At the meeting a couple of weeks ago, it was said that someone was
>>>>>>>>>>>>>>looking into this but I haven't heard anything back. Is there any news?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>Thanks,
>>>>>>>>>>>>>>Chris.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>-----Original Message-----
>>>>>>>>>>>>>>>From: Brew, CAJ (Chris)
>>>>>>>>>>>>>>>Sent: 17 May 2005 13:50
>>>>>>>>>>>>>>>To: [log in to unmask]; abh
>>>>>>>>>>>>>>>Cc: Olaiya, EO (Emmanuel)
>>>>>>>>>>>>>>>Subject: PreStage Problems
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>I've been running some more tests of the staging at RAL and
>>>>>>>>>>>>>>>have run into a problem somewhere in the
>>>>>>>>>>>>>>>mps_Stage/PreStage/prep system.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>Everything works fine when staging a file that was on the system and
>>>>>>>>>>>>>>>has been deleted, but if I try to stage in a file that was on
>>>>>>>>>>>>>>>a different server, so that the directory structure for the
>>>>>>>>>>>>>>>file does not exist on the staging server, it fails and I see
>>>>>>>>>>>>>>>the following error in the PreStage log file:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>12:45:43 [ 10859] mps_Stage: Open
>>>>>>>>>>>>>>>'/stage/bdata-data50/kanga//store/SPskims/R12/16.0.2e/BtoKKKL/001005/200002/DIR_LOCK'
>>>>>>>>>>>>>>>r/w failed; No such file or directory.
>>>>>>>>>>>>>>>12:45:43 [ 10859] do_stagein: xfr failed for
>>>>>>>>>>>>>>>/store/SPskims/R12/16.0.2e/BtoKKKL/001005/200002/BtoKKKL_001005_3247.01.root,
>>>>>>>>>>>>>>>rc=4, retry=1
>>>>>>>>>>>>>>>12:45:45 [  3255]
>>>>>>>>>>>>>>>file=/store/SPskims/R12/16.0.2e/BtoKKKL/001005/200002/BtoKKKL_0010053247.01.root,
>>>>>>>>>>>>>>>rc=1024, reqid=ef000001:1cd2.425d27e1:3762
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>If I create the directories and the DIR_LOCK file before
>>>>>>>>>>>>>>>running the import, everything works.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>The config file I'm using on the server is below.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>Is there some setting I'm missing which is needed to create
>>>>>>>>>>>>>>>the directories/DIR_LOCK file or does the code need fixing?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>Thanks,
>>>>>>>>>>>>>>>Chris
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>--
>>>>>>>>>>>>>>>Chris Brew  ([log in to unmask])  +44 1235 446326
>>>>>>>>>>>>>>>Particle Physics Department
>>>>>>>>>>>>>>>Rutherford Appleton Laboratory
>>>>>>>>>>>>>>>Chilton, Didcot. Oxfordshire.
>>>>>>>>>>>>>>>OX11 0QX. United Kingdom.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>