On Thu, 26 Apr 2007, Young, Charles C. wrote:

>Hi Wei,
>
>It could be helpful to get the BaBar input beforehand so we could all think about it, in order to make the meeting more productive. Cheers.
>
>					Charlie
>--

Hi Charlie, here is a not-so-brief summary of our recent
AFS experiences with BaBar:


First some AFS background.  An AFS fileserver keeps track of client
requests with callbacks.  A callback is a promise by the fileserver to
tell the client when a change is made to any of the data it has
delivered.  This can have an impact on server performance in the
following ways:


1.  The performance of an AFS server can become seriously impaired
when many clients are all accessing the same readwrite file/directory
and that file/directory is being updated frequently.  Every time
an update is made, the fileserver needs to notify each client.
So, a large number of clients can be a problem even if the
number of updates is relatively small.

2. The problem outlined above can be further exacerbated if a large
number of requests for status are made on the file/directory as soon
as the callbacks are broken.  A broken callback will tell the client
to refetch information, so the larger the number of machines, the
larger the number of status requests that will occur as a result of
the broken callback.  And then any additional status requests that may
be going on will cause further grief.
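
(As an aside, if you ever want to see this callback state for
yourself, OpenAFS's cmdebug tool will dump the cache entries,
callbacks included, that a given client currently holds.  The
hostname below is just a placeholder.)

    # show all cache entries (and their callbacks) on a client machine
    cmdebug client01.slac.stanford.edu -long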

The way to avoid callback problems is to avoid writing to the same
file/directory in AFS from many clients.  The recommended procedure in
batch is to write locally and copy once to AFS at the end of the job.
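
For a concrete (made-up) example, a job script following that pattern
would look something like this; the scratch path, job command, and
AFS destination are all placeholders:

    # do all the writing on node-local disk while the job runs
    workdir=/scratch/$USER/job_$$
    mkdir -p "$workdir"
    run_skim --output "$workdir/output.root"   # placeholder for the real job

    # one write to AFS at the very end, then clean up
    cp "$workdir/output.root" /afs/slac/g/babar/work/$USER/
    rm -rf "$workdir"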


The problems that we saw with BaBar:

First I should say that the problems we saw with BaBar came after
they started increasing the number of jobs being run as part
of their skimming.  Before that, the problems were still there,
but at a low enough level that they didn't have the same impact.

1.  There was a problem with our TRS utility that was making multiple
updates to a file in one of their AFS directories, which triggered
exactly the callback problem described above.  We have since changed
the TRS utility to avoid making that update.

2.  The BaBar folks were launching 1000s of batch jobs at once, which
were accessing the file(s) on one server in such a way that it caused
a plunge in availability.  They have since changed the way they run:
the level of batch jobs is kept topped up so that 1000s don't all hit
at the same time, but are spread out (there is a rough sketch of this
kind of throttled submission at the end of this note).  We are still
trying to figure out what the jobs are doing at startup that causes
the problem (writing to AFS?), but the bypass has been working.  I
have our AFS support people looking into it.

3.  The BaBar folks also fixed a problem in their code that was
launching 10s of 1000s of 1-minute batch jobs.  This was putting a
heavy load on the batch system because it had to spend much/all of its
time scheduling, in addition to the impact on AFS.

4.  The BaBar code does huge numbers of accesses to files under
/afs/slac/g/babar.  They suspect that their tcl files are part of the
problem, and they are going to move those files to readonly volumes.
This will spread the load across multiple machines.  Unfortunately the
BaBar group space has grown over time, so setting it up to be readonly
now is a daunting task.  At the moment they have a parallel readonly
volume that they will be using for the tcl space.  A little AFS
background on readonly volumes: the readonly path through AFS requires
that all volumes (mountpoints) along the way be readonly.  So, in the
case of the atlas volume /afs/slac/g/atlas/AtlasSimulation, for
example, /afs/slac/g/atlas would have to be set up with readonlies in
order for AtlasSimulation to be set up with readonlies.  So if you
think some of your code would benefit from having the load spread
across multiple fileservers in readonly volumes, it would be best to
set up time to switch /afs/slac/g/atlas to be readonly now, before
things get any more complicated.
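
For reference, the AFS mechanics behind item 4 come down to the
vos addsite / vos release pair.  The server, partition, and volume
names below are invented; fs lsmount on the real path will show the
actual volume name.

    # find the volume mounted at the path
    fs lsmount /afs/slac/g/atlas

    # add readonly sites on a couple of fileservers, then push the
    # current readwrite contents out to those copies (needs AFS
    # admin privileges)
    vos addsite server1 /vicepa g.atlas
    vos addsite server2 /vicepa g.atlas
    vos release g.atlas

    # after any later change to the readwrite volume, another
    # 'vos release' is needed before the readonly copies pick it up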
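
Going back to item 2 for a moment: the spread-out submission bypass
amounts to something like the following throttled loop.  This is only
a sketch; it assumes LSF's bsub/bjobs, and the script name and the
numbers are made up.

    # keep submitting, but never let too many jobs sit pending at once
    for i in $(seq 1 5000); do
        while [ "$(bjobs -p 2>/dev/null | wc -l)" -gt 200 ]; do
            sleep 60    # let the pending queue drain a bit
        done
        bsub < skim_job.sh
    done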

-Renata