LISTSERV mailing list manager LISTSERV 16.5

ATLAS-SCCS-PLANNING-L Archives

ATLAS-SCCS-PLANNING-L@LISTSERV.SLAC.STANFORD.EDU
ATLAS-SCCS-PLANNING-L, April 2007

Subject: RE: Atlas AFS volume layout
From: "Young, Charles C." <[log in to unmask]>
Date: Fri, 27 Apr 2007 01:49:20 -0700
Content-Type: text/plain
Parts/Attachments: text/plain (138 lines)

Hi Renata, 

Thanks! I have a stupid question: which part is AFS and which is NFS in this context? Our work space is

	/afs/slac/g/atlas/work/<something>

Where <something> can take you to one of several NFS servers, e.g. 

lrwxr-xr-x    1 YANGW    root           27 Oct 18  2006 a -> /nfs/sulky51/atlaswork.u1/a/
lrwxr-xr-x    1 gowdy    root           25 Feb 13  2006 n -> /nfs/surrey14/atlaswork/n/
lrwxr-xr-x    1 gowdy    root           25 Feb 13  2006 c -> /nfs/surrey13/atlaswork/c/

Is it a worry if my batch job writes directly to one of these areas (rather than writing to local disk and copying at the end of the job)? Cheers.

					Charlie
--
Charles C. Young
M.S. 43, Stanford Linear Accelerator Center       
P.O. Box 20450                                         
Stanford, CA 94309                                      
[log in to unmask]                                
voice  (650) 926 2669                         
fax    (650) 926 2923                       
CERN GSM +41 76 487 2069 

> -----Original Message-----
> From: Renata Maria Dart [mailto:[log in to unmask]] 
> Sent: Friday, April 27, 2007 12:25 AM
> To: Young, Charles C.
> Cc: Yang, Wei; Moss, Leonard J.; atlas-sccs-planning-l; 
> Zachary Marshall; David W. Miller
> Subject: RE: Atlas AFS volume layout
> 
> On Thu, 26 Apr 2007, Young, Charles C. wrote:
> 
> >Hi Wei,
> >
> >It could be helpful to get the BaBar input beforehand so we 
> could all think about it, in order to make the meeting more 
> productive. Cheers.
> >
> >					Charlie
> >--
> 
> Hi Charlie, here is a not-so-brief summary of our recent AFS 
> experiences with BaBar:
> 
> 
> First some AFS background.  An AFS fileserver keeps track of 
> client requests with callbacks.  A callback is a promise by 
> the fileserver to tell the client when a change is made
> to any of the data being delivered.  This can have an impact 
> on server performance in the following ways:
> 
> 
> 1.  The performance of an AFS server can become seriously 
> impaired when many clients are all accessing the same 
> readwrite file/directory and that file/directory is being 
> updated frequently.  Every time an update is made, the
> fileserver needs to notify each client.
> So, a large number of clients can be a problem even if the 
> number of updates is relatively small.
> 
> 2. The problem outlined above can be further exacerbated if a 
> large number of requests for status are made on the 
> file/directory as soon as the callbacks are broken.  A broken 
> callback will tell the client to refetch information, so the 
> larger the number of machines, the larger the number of 
> status requests that will occur as a result of the broken 
> callback.  And then any additional status requests that may 
> be going on will cause further grief.
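The scaling in points 1 and 2 can be put in back-of-envelope form. This is an illustrative model only, not how the AFS fileserver is actually implemented, and the numbers are made up for the example: each update to a shared file/directory breaks the callback held by every client caching it, so notification traffic grows as clients × updates.

```python
# Illustrative back-of-envelope model of callback-break traffic.
def callback_breaks(n_clients: int, n_updates: int) -> int:
    """Messages the fileserver must send: one callback break per
    caching client, per update to the shared file/directory."""
    return n_clients * n_updates

# Hypothetical numbers: 2000 batch clients all caching one directory
# that is updated 50 times over the run.
print(callback_breaks(2000, 50))  # 100000 break messages, before any
                                  # of the refetch/status traffic that
                                  # follows each broken callback
```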
> 
> The way to avoid callback problems is to avoid writing to the 
> same file/directory in AFS from many clients.  The 
> recommended procedure in batch is to write locally and copy 
> once to AFS at the end of the job.
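The write-locally-then-copy pattern can be sketched as follows. This is a minimal Python sketch with hypothetical paths; a real batch job would use whatever node-local scratch area the batch system provides.

```python
import shutil
import tempfile
from pathlib import Path

def run_job(shared_dest: str) -> Path:
    """Write all job output to node-local scratch, then copy it to the
    shared (AFS) destination exactly once at the end of the job, so the
    fileserver sees one burst of updates instead of many small ones."""
    scratch = Path(tempfile.mkdtemp(prefix="job-"))
    # ... the real job would write all of its output under `scratch` ...
    (scratch / "out.txt").write_text("result\n")

    dest = Path(shared_dest) / scratch.name
    shutil.copytree(scratch, dest)   # the single copy to shared space
    shutil.rmtree(scratch)           # clean up local scratch
    return dest

# e.g. run_job("/afs/slac/g/atlas/work/a/myjob")  # hypothetical path
```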
> 
> 
> The problems that we saw with BaBar:
> 
> First I should say that the problems we saw with BaBar came 
> after they started increasing the number of jobs being run as 
> part of their skimming.  Before that, the problems were still 
> there, but at a low enough level that they didn't have the 
> same impact.
> 
> 1.  There was a problem with our TRS utility that was causing 
> multiple updates to a file in one of their AFS directories.  
> This was causing the problem described above.  We have since 
> changed the TRS utility to avoid making that update.
> 
> 2.  The BaBar folks were launching 1000s of batch jobs at 
> once which were accessing the file(s) on one server in such a 
> way that it caused a plunge in availability.  They have since 
> changed the way they run by keeping the level of batch jobs 
> up so that 1000s don't hit all at the same time, but are 
> spread out.  We are still trying to figure out what the jobs 
> are doing at startup that cause the problem (writing to 
> AFS?), but the bypass has been working.  I have our AFS 
> support people looking into it.
> 
> 3.  The BaBar folks also fixed a problem in their code that 
> was launching 10s of 1000s of 1 minute batch jobs.  This was 
> putting a heavy load on the batch system because it had to 
> spend much/all of its time scheduling, in addition to the 
> impact on AFS.
> 
> 4.  The BaBar code does huge numbers of accesses to files 
> under /afs/slac/g/babar.  They suspect that their tcl files 
> are part of the problem and they are going to move those 
> files to readonly volumes.
> This will spread the load across multiple machines.  
> Unfortunately the BaBar group space has grown over time so 
> that setting it up to be readonly now is a daunting task.  At 
> the moment they have a parallel readonly volume that they 
> will be using for the tcl space.  A little AFS background on 
> readonly volumes: the readonly path through AFS requires
> that all volumes (mountpoints) along the way be readonly.
> So, in the case of the atlas volume 
> /afs/slac/g/atlas/AtlasSimulation for example, 
> /afs/slac/g/atlas would have to be set up with readonlies in 
> order for AtlasSimulation to be set up with readonlies.  So 
> if you think some of your code would benefit from having the 
> load spread across multiple fileservers in readonly volumes, 
> it would be best to set up time to switch /afs/slac/g/atlas 
> to be readonly now, before things get any more complicated.
> 
> -Renata
> 
> 
> 
> 
> 
> 


