Hi Andy, Bill

I took the versions of mps_Stage and mps_prep from 
/afs/slac/package/xrd/xrootd/utils. These are mps_Stage and mps_prep 
versions 1.9 and 1.8 respectively.

I still see the problem Chris reported. Restarting the directors and the 
server (with prestaging on the server) I get the following message in 
the prestage log when asking for a file that doesn't exist at RAL:

Starting new cycle, pstg proc = 0
21:17:41 [ 17543] getlock: locking file 
 >>/opt/xrootd/stageQ/PreStageQ.0.lock, flags 2
21:17:41 [ 17543] getlock: locking file 
+</opt/xrootd/stageQ/PreStageQ.0.old, flags 2
21:17:41 [ 17543] unlock: unlocking file /opt/xrootd/stageQ/PreStageQ.0.old
21:17:41 [ 17543] unlock: unlocking file /opt/xrootd/stageQ/PreStageQ.0.lock
21:17:41 [ 17543] getlock: locking file 
 >>/opt/xrootd/stageQ/PreStageQ.1.lock, flags 2
21:17:41 [ 17543] unlock: unlocking file /opt/xrootd/stageQ/PreStageQ.1.lock
21:21:29 [ 17772] mps_Stage: cannot create 'store' in 
'/store/PRskims/R14/16.1.1b/BToPPP/58/'; Permission denied
21:21:29 [ 17772] mps_Stage: Invalid file system path, 
'/store/PRskims/R14/16.1.1b/BToPPP/58/'.
21:21:29 [ 17772] do_stagein: xfr failed for 
/store/PRskims/R14/16.1.1b/BToPPP/58/BToPPP_5831.01.root, rc=4, retry=1

Meanwhile my job just hangs. If I take the log file literally, it is trying 
to write to /store when it should be trying to write to 
/base_directory/store.
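
For illustration only, here is a minimal Python sketch of the mapping I would expect: the logical name /store/... gets prefixed with the base directory before anything is created on disk. This is not the mps_Stage code, and the function name physical_path and the /base_directory value are hypothetical. The last line shows one way a prefix can silently go missing: a plain path join drops the base when the second argument is absolute. Whether anything like that happens inside mps_Stage is an assumption on my part, not something the log proves.

```python
import posixpath


def physical_path(base_dir: str, lfn: str) -> str:
    """Map a logical file name (e.g. /store/...) under a local base directory.

    Hypothetical helper for illustration; not part of the mps scripts.
    """
    # Strip the leading slash first: posixpath.join discards base_dir
    # entirely when the following component is an absolute path.
    return posixpath.join(base_dir, lfn.lstrip("/"))


lfn = "/store/PRskims/R14/16.1.1b/BToPPP/58/BToPPP_5831.01.root"

# Expected target: the logical name placed under the base directory.
print(physical_path("/base_directory", lfn))

# Naive join for comparison: the base directory is lost, leaving /store/...
print(posixpath.join("/base_directory", lfn))
```

The first line printed is the path under /base_directory; the second is the bare /store/... path, which matches what the prestage log above appears to be trying to create.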

Doing further tests I can reproduce the problem I reported earlier. 
Whilst still asking for the above file I turn off staging, restart the 
directors and servers and the request for the file continues to hang (is 
told to wait). Then I make another request for the same file and this 
request is also continually told to wait:

050607 21:55:13 2915 odc_Locate: olaiya.8042:[log in to unmask] asked to 
wait 5 by xrootd107 
path=/store/PRskims/R14/16.1.1b/BToPPP/58/BToPPP_5831.01.root
050607 21:55:14 2915 odc_Locate: olaiya.23507:[log in to unmask] asked to 
wait 5 by xrootd107 
path=/store/PRskims/R14/16.1.1b/BToPPP/58/BToPPP_5831.01.root
050607 21:55:18 2915 odc_Locate: olaiya.8042:[log in to unmask] asked to 
wait 5 by xrootd107 
path=/store/PRskims/R14/16.1.1b/BToPPP/58/BToPPP_5831.01.root
...


It is only after I kill the first request that any more requests for this 
file return correctly with a message indicating that the file cannot be 
found.

cheers

Manny

Andrew Hanushevsky wrote:
> Hi Tim,
> 
> Bill Weeks should have the fix available. You can also find the fixed mps
> scripts in /afs/slac/package/xrd/xrootd/utils (I think you just need an
> update for mps_Stage and mps_prep).
> 
> Otherwise, the earliest time I can get together with Manny is Monday. How
> about the afternoon, say 1:30pm?
> 
> Andy
> 
> On Tue, 7 Jun 2005, Adye, TJ (Tim) wrote:
> 
> 
>>Hi Guys,
>>
>>Did you manage to sort something out, despite the cancellation of the
>>meeting? These are serious problems for us.
>>
>>Tim.
>>
>>
>>>-----Original Message-----
>>>From: [log in to unmask]
>>>[mailto:[log in to unmask]] On Behalf Of
>>>Emmanuel Olaiya
>>>Sent: 06 June 2005 22:57
>>>To: Andy Hanushevsky
>>>Cc: Brew, CAJ (Chris); [log in to unmask]; Bill Weeks
>>>Subject: Re: PreStage Problems
>>>
>>>Hi Andy
>>>
>>>Yes, it would be good if you could have a look at this with
>>>me. We can
>>>arrange a time in the xrootd meeting tomorrow.
>>>
>>>cheers
>>>
>>>Manny
>>>
>>>Andy Hanushevsky wrote:
>>>
>>>>Hi Manny,
>>>>
>>>>I find this quite mysterious as this should never be the case and,
>>>>frankly, appears to violate causality. I suspect something else is going
>>>>on. If this is reproducible then why don't we run a test with all
>>>>debugging turned on. Yes?
>>>>
>>>>Andy
>>>>
>>>>----- Original Message ----- From: "Emmanuel Olaiya" <[log in to unmask]>
>>>>To: "Andrew Hanushevsky" <[log in to unmask]>
>>>>Cc: "Brew, CAJ (Chris)" <[log in to unmask]>;
>>>><[log in to unmask]>; "Bill Weeks" <[log in to unmask]>
>>>>Sent: Monday, June 06, 2005 1:41 PM
>>>>Subject: Re: PreStage Problems
>>>>
>>>>
>>>>
>>>>>Hi Andy
>>>>>
>>>>>I should have mentioned that we also removed the prestage queue and
>>>>>restarted both the server and redirector. However the old request to
>>>>>wait did not change. Moreover, any similar new requests were also told
>>>>>to wait until the old request was terminated.
>>>>>
>>>>>cheers
>>>>>
>>>>>Manny
>>>>>
>>>>>Andrew Hanushevsky wrote:
>>>>>
>>>>>
>>>>>>Hi Manny,
>>>>>>
>>>>>>Yes, but who is telling the client to wait? The redirector or the server
>>>>>>that wanted to originally stage the file in? When you restart the
>>>>>>redirector it loses all its memory but the data server does not. So, it
>>>>>>will happily tell the redirector that it has the file even though the
>>>>>>file is merely in the pre-stage queue. As long as the file is in the
>>>>>>prestage queue and not on disk, the only option is to direct clients to
>>>>>>where the file will be staged in and then the clients simply wait for the
>>>>>>file (which in this case will never appear). So, if you remove staging you
>>>>>>also need to remove the prestage queue and restart the data server.
>>>>>>
>>>>>>Andy
>>>>>>
>>>>>>On Fri, 3 Jun 2005, Emmanuel Olaiya wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>>Hi Andy
>>>>>>>
>>>>>>>One other issue we have spotted at RAL. We removed the staging
>>>>>>>capabilities and restarted the director and server. However we found
>>>>>>>previous requests for a file that were told to wait continued being told
>>>>>>>to wait. We also found that if somebody else asked for this same file
>>>>>>>that was not on disk they were also told to wait rather than being told
>>>>>>>the file could not be found. We needed to kill the previous request and
>>>>>>>restart the server and director for xrootd to know the file was not on
>>>>>>>disk.
>>>>>>>
>>>>>>>cheers
>>>>>>>
>>>>>>>Manny
>>>>>>>
>>>>>>>Andrew Hanushevsky wrote:
>>>>>>>
>>>>>>>
>>>>>>>>Hi Chris,
>>>>>>>>
>>>>>>>>Oh yeah, different problem. I think that Bill Weeks fixed that.
>>>>>>>>Bill, did you fix that problem?
>>>>>>>>
>>>>>>>>Andy
>>>>>>>>
>>>>>>>>On Mon, 30 May 2005, Brew, CAJ (Chris) wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>Hi,
>>>>>>>>>
>>>>>>>>>I might be being stupid but I don't see how this relates to the
>>>>>>>>>problem. The files I wanted were on a different disk server which then
>>>>>>>>>went down. The server in question was registered with the OLB as being
>>>>>>>>>able to stage in the name space so the request was redirected to it. If
>>>>>>>>>mps_Stage is used without the PreStage queuing system everything works
>>>>>>>>>as expected. If we try to go through the PreStage queue to limit the
>>>>>>>>>number of concurrent accesses to the tapestore the stage in fails,
>>>>>>>>>apparently because the DIR_LOCK file does not exist (which it doesn't,
>>>>>>>>>since the file, and its directory structure, has never existed on this
>>>>>>>>>server).
>>>>>>>>>
>>>>>>>>>Yours,
>>>>>>>>>Chris.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>-----Original Message-----
>>>>>>>>>>From: Andrew Hanushevsky [mailto:[log in to unmask]]
>>>>>>>>>>Sent: 28 May 2005 07:39
>>>>>>>>>>To: Brew, CAJ (Chris)
>>>>>>>>>>Cc: [log in to unmask]; abh; Olaiya, EO (Emmanuel)
>>>>>>>>>>Subject: RE: PreStage Problems
>>>>>>>>>>
>>>>>>>>>>Hi Chris,
>>>>>>>>>>
>>>>>>>>>>This was traced to overzealous testing. The system does not put a new
>>>>>>>>>>entry in the pre-stage queue until after about 10-20 minutes have elapsed
>>>>>>>>>>since the last time the entry was added. So, this is not a bug but a test
>>>>>>>>>>case that was not "real". Generally, files live in the disk cache for at
>>>>>>>>>>least 10-20 minutes.
>>>>>>>>>>
>>>>>>>>>>Andy
>>>>>>>>>>
>>>>>>>>>>On Fri, 27 May 2005, Brew, CAJ (Chris) wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>Hi,
>>>>>>>>>>>
>>>>>>>>>>>At the meeting a couple of weeks ago, it was said that someone was
>>>>>>>>>>>looking into this but I haven't heard anything back. Is there any news?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>Thanks,
>>>>>>>>>>>Chris.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>-----Original Message-----
>>>>>>>>>>>>From: Brew, CAJ (Chris)
>>>>>>>>>>>>Sent: 17 May 2005 13:50
>>>>>>>>>>>>To: [log in to unmask]; abh
>>>>>>>>>>>>Cc: Olaiya, EO (Emmanuel)
>>>>>>>>>>>>Subject: PreStage Problems
>>>>>>>>>>>>
>>>>>>>>>>>>Hi,
>>>>>>>>>>>>
>>>>>>>>>>>>I've been running some more tests of the staging at RAL and
>>>>>>>>>>>>have run into a problem somewhere in the
>>>>>>>>>>>>mps_Stage/PreStage/prep system.
>>>>>>>>>>>>
>>>>>>>>>>>>Everything works fine staging a file that was on the system and
>>>>>>>>>>>>has been deleted, but if I try to stage in a file that was on
>>>>>>>>>>>>a different server, hence the directory structure for the
>>>>>>>>>>>>file does not exist on the staging server, it fails and I see
>>>>>>>>>>>>the following error in the PreStage log file:
>>>>>>>>>>>>
>>>>>>>>>>>>12:45:43 [ 10859] mps_Stage: Open
>>>>>>>>>>>>'/stage/bdata-data50/kanga//store/SPskims/R12/16.0.2e/BtoKKKL/001005/200002/DIR_LOCK'
>>>>>>>>>>>>r/w failed; No such file or directory.
>>>>>>>>>>>>12:45:43 [ 10859] do_stagein: xfr failed for
>>>>>>>>>>>>/store/SPskims/R12/16.0.2e/BtoKKKL/001005/200002/BtoKKKL_001005_3247.01.root, rc=4, retry=1
>>>>>>>>>>>>12:45:45 [  3255]
>>>>>>>>>>>>file=/store/SPskims/R12/16.0.2e/BtoKKKL/001005/200002/BtoKKKL_0010053247.01.root, rc=1024, reqid=ef000001:1cd2.425d27e1:3762
>>>>>>>>>>>>
>>>>>>>>>>>>If I create the directories and the DIR_LOCK file before
>>>>>>>>>>>>running the import, everything works.
>>>>>>>>>>>>
>>>>>>>>>>>>The config file I'm using on the server is below.
>>>>>>>>>>>>
>>>>>>>>>>>>Is there some setting I'm missing which is needed to create
>>>>>>>>>>>>the directories/DIR_LOCK file or does the code need fixing?
>>>>>>>>>>>>
>>>>>>>>>>>>Thanks,
>>>>>>>>>>>>Chris
>>>>>>>>>>>>
>>>>>>>>>>>>--
>>>>>>>>>>>>Chris Brew  ([log in to unmask])  +44 1235 446326
>>>>>>>>>>>>Particle Physics Department
>>>>>>>>>>>>Rutherford Appleton Laboratory
>>>>>>>>>>>>Chilton, Didcot. Oxfordshire.
>>>>>>>>>>>>OX11 0QX. United Kingdom.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>