Hi Renata,
Thanks! I have a stupid question: what is AFS and what is NFS in this context? Our work space is
/afs/slac/g/atlas/work/<something>
Where <something> can take you to one of several NFS servers, e.g.
lrwxr-xr-x 1 YANGW root 27 Oct 18 2006 a -> /nfs/sulky51/atlaswork.u1/a/
lrwxr-xr-x 1 gowdy root 25 Feb 13 2006 n -> /nfs/surrey14/atlaswork/n/
lrwxr-xr-x 1 gowdy root 25 Feb 13 2006 c -> /nfs/surrey13/atlaswork/c/
Is it a worry if my batch job writes directly to one of these areas (rather than writing to local disk and copying at the end of the job)? Cheers.
Charlie
--
Charles C. Young
M.S. 43, Stanford Linear Accelerator Center
P.O. Box 20450
Stanford, CA 94309
[log in to unmask]
voice (650) 926 2669
fax (650) 926 2923
CERN GSM +41 76 487 2069
> -----Original Message-----
> From: Renata Maria Dart [mailto:[log in to unmask]]
> Sent: Friday, April 27, 2007 12:25 AM
> To: Young, Charles C.
> Cc: Yang, Wei; Moss, Leonard J.; atlas-sccs-planning-l;
> Zachary Marshall; David W. Miller
> Subject: RE: Atlas AFS volume layout
>
> On Thu, 26 Apr 2007, Young, Charles C. wrote:
>
> >Hi Wei,
> >
> >It could be helpful to get the BaBar input beforehand so we
> could all think about it, in order to make the meeting more
> productive. Cheers.
> >
> > Charlie
> >--
>
> Hi Charlie, here is a not-so-brief summary of our recent AFS
> experiences with BaBar:
>
>
> First some AFS background. An AFS fileserver keeps track of
> client requests with callbacks. A callback is a promise by
> the fileserver to tell the client when a change is made
> to any of the data being delivered. This can have an impact
> on server performance in the following ways:
>
>
> 1. The performance of an AFS server can become seriously
> impaired when many clients are all accessing the same
> read-write file/directory and that file/directory is being
> updated frequently. Every time an update is made, the
> fileserver needs to notify each client.
> So, a large number of clients can be a problem even if the
> number of updates is relatively small.
>
> 2. The problem outlined above can be further exacerbated if a
> large number of requests for status are made on the
> file/directory as soon as the callbacks are broken. A broken
> callback will tell the client to refetch information, so the
> larger the number of machines, the larger the number of
> status requests that will occur as a result of the broken
> callback. And then any additional status requests that may
> be going on will cause further grief.
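The scaling described in points 1 and 2 can be sketched with a toy model (this is illustrative arithmetic only, not actual AFS bookkeeping; the function name and the one-refetch-per-break assumption are mine):

```python
# Toy model of AFS callback traffic on one shared file/directory.
# Assumption: each write breaks one callback per client holding the file,
# and each broken callback triggers one status refetch from that client.
def callback_traffic(n_clients: int, n_updates: int) -> dict:
    breaks = n_updates * n_clients   # server notifies every client per update
    refetches = breaks               # each client re-stats after its callback breaks
    return {
        "callback_breaks": breaks,
        "status_refetches": refetches,
        "server_messages": breaks + refetches,
    }

# Even a modest update rate multiplies by the client count:
print(callback_traffic(n_clients=1000, n_updates=10))
# → {'callback_breaks': 10000, 'status_refetches': 10000, 'server_messages': 20000}
```

The point of the model is that the message count is a product, not a sum: a handful of updates is harmless until thousands of batch clients hold callbacks on the same file.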
>
> The way to avoid callback problems is to avoid writing to the
> same file/directory in AFS from many clients. The
> recommended procedure in batch is to write locally and copy
> once to AFS at the end of the job.
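A minimal batch-job skeleton following that recommendation might look like this (the paths and file names are placeholders, not the actual SLAC setup — `AFS_DEST` here stands in for a real /afs work area):

```shell
#!/bin/sh
# Write all job output to node-local scratch; touch AFS exactly once at the end.
SCRATCH=$(mktemp -d "${TMPDIR:-/tmp}/job.XXXXXX")
AFS_DEST=$(mktemp -d)               # placeholder for /afs/slac/g/atlas/work/...
trap 'rm -rf "$SCRATCH"' EXIT       # clean up local scratch when the job exits

# ... the job writes all intermediate and final output locally ...
echo "event data" > "$SCRATCH/output.dat"

# One copy per job, not one write per event: the fileserver breaks each
# client's callback once, instead of continuously throughout the job.
cp "$SCRATCH/output.dat" "$AFS_DEST/output.dat"
```

The design point is simply that AFS sees a single create/close per job instead of a stream of small writes, so callback breaks on the shared directory happen once per job.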
>
>
> The problems that we saw with BaBar:
>
> First I should say that the problems we saw with BaBar came
> after they started increasing the number of jobs being run as
> part of their skimming. Before that, the problems were still
> there, but at a low enough level that they didn't have the
> same impact.
>
> 1. There was a problem with our TRS utility that was causing
> multiple updates to a file in one of their AFS directories.
> This was causing the problem described above. We have since
> changed the TRS utility to avoid making that update.
>
> 2. The BaBar folks were launching 1000s of batch jobs at
> once which were accessing the file(s) on one server in such a
> way that it caused a plunge in availability. They have since
> changed the way they run by keeping the level of batch jobs
> up so that 1000s don't hit all at the same time, but are
> spread out. We are still trying to figure out what the jobs
> are doing at startup that causes the problem (writing to
> AFS?), but the bypass has been working. I have our AFS
> support people looking into it.
>
> 3. The BaBar folks also fixed a problem in their code that
> was launching 10s of 1000s of 1-minute batch jobs. This was
> putting a heavy load on the batch system because it had to
> spend much/all of its time scheduling, in addition to the
> impact on AFS.
>
> 4. The BaBar code makes huge numbers of accesses to files
> under /afs/slac/g/babar. They suspect that their tcl files
> are part of the problem and they are going to move those
> files to readonly volumes.
> This will spread the load across multiple machines.
> Unfortunately the BaBar group space has grown over time so
> that setting it up to be readonly now is a daunting task. At
> the moment they have a parallel readonly volume that they
> will be using for the tcl space. A little AFS background on
> readonly volumes....the readonly path through AFS requires
> that all volumes (mountpoints) along the way be readonly.
> So, in the case of the atlas volume
> /afs/slac/g/atlas/AtlasSimulation for example,
> /afs/slac/g/atlas would have to be set up with readonlies in
> order for AtlasSimulation to be set up with readonlies. So
> if you think some of your code would benefit from having the
> load spread across multiple fileservers in readonly volumes,
> it would be best to set up time to switch /afs/slac/g/atlas
> to be readonly now, before things get any more complicated.
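For reference, setting up readonly replicas of a volume uses the standard OpenAFS `vos` commands; a hedged sketch, with hypothetical server, partition, and volume names (the real layout would be chosen by the AFS admins):

```shell
# Replicate a volume readonly (names "fs1", "fs2", "/vicepa", "g.atlas" are
# hypothetical examples). Every parent volume on the mountpoint path must
# itself have readonly replicas, or clients fall back to the read-write path.
vos addsite fs1 /vicepa g.atlas   # define a readonly replication site
vos addsite fs2 /vicepa g.atlas   # a second site spreads the read load
vos release g.atlas               # push current RW contents to the RO sites
fs checkvolumes                   # have the local client re-resolve mountpoints
```

Note that `vos release` must be rerun after each change to the read-write volume, which is why readonly replication suits stable content like the tcl files rather than actively written work areas.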
>
> -Renata