On Thu, 26 Apr 2007, Young, Charles C. wrote:
> Hi Wei,
>
> It could be helpful to get the BaBar input beforehand so we could all think about it, in order to make the meeting more productive. Cheers.
>
>    Charlie
> --

Hi Charlie,

Here is a not-so-brief summary of our recent AFS experiences with BaBar.

First, some AFS background. An AFS fileserver keeps track of client requests with callbacks. A callback is a promise by the fileserver to tell the client when a change is made to any of the data it has delivered. This can affect server performance in the following ways:

1. The performance of an AFS server can become seriously impaired when many clients are all accessing the same read/write file or directory and that file or directory is being updated frequently. Every time an update is made, the fileserver must notify each client holding a callback. So a large number of clients can be a problem even if the number of updates is relatively small.

2. The problem above is further exacerbated if a large number of status requests are made on the file or directory as soon as the callbacks are broken. A broken callback tells the client to refetch information, so the more machines there are, the more status requests follow each broken callback. Any additional status requests going on at the same time cause further grief.

The way to avoid callback problems is to avoid writing to the same file or directory in AFS from many clients. The recommended procedure in batch is to write locally and copy once to AFS at the end of the job (see the sketch in the P.S. below).

The problems that we saw with BaBar:

First I should say that the problems came after BaBar started increasing the number of jobs being run as part of their skimming. Before that, the problems were still there, but at a low enough level that they didn't have the same impact.

1. There was a problem with our TRS utility that was causing multiple updates to a file in one of their AFS directories, which produced exactly the callback problem described above. We have since changed the TRS utility to avoid making that update.

2. The BaBar folks were launching thousands of batch jobs at once, which accessed the file(s) on one server in such a way that it caused a plunge in availability. They have since changed the way they run: they keep the level of batch jobs up so that thousands don't hit the server all at the same time, but are spread out. We are still trying to figure out what the jobs are doing at startup that causes the problem (writing to AFS?), but the bypass has been working. I have our AFS support people looking into it.

3. The BaBar folks also fixed a problem in their code that was launching tens of thousands of one-minute batch jobs. This put a heavy load on the batch system, which had to spend much or all of its time scheduling, in addition to the impact on AFS.

4. The BaBar code does huge numbers of accesses to files under /afs/slac/g/babar. They suspect that their tcl files are part of the problem, and they are going to move those files to readonly volumes, which will spread the load across multiple machines. Unfortunately the BaBar group space has grown over time, so setting it all up to be readonly now is a daunting task. For the moment they have a parallel readonly volume that they will be using for the tcl space.

A little AFS background on readonly volumes: the readonly path through AFS requires that all volumes (mountpoints) along the way be readonly.
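To make that concrete, here is roughly what replicating one volume looks like (a sketch only; the server, partition, and volume names below are placeholders, not our actual layout):

    # Define readonly sites on two fileservers, then push the
    # read/write contents out to them:
    vos addsite serverA /vicepa g.atlas
    vos addsite serverB /vicepa g.atlas
    vos release g.atlas

Once released, clients reach the readonly copies through the normal path only if every volume mounted along that path has readonly replicas as well; if any volume along the way has none, the client falls through to the read/write path from that point on.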
So in the case of the atlas volume /afs/slac/g/atlas/AtlasSimulation, for example, /afs/slac/g/atlas would have to be set up with readonlies in order for AtlasSimulation to be set up with readonlies. So if you think some of your code would benefit from having the load spread across multiple fileservers in readonly volumes, it would be best to set up time to switch /afs/slac/g/atlas to readonly now, before things get any more complicated.

-Renata
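P.S. Since it came up above, here is a minimal sketch of the "write locally, copy once" batch pattern; the scratch and output paths are placeholders that each job would adapt:

    #!/bin/sh
    # Do all of the job's I/O on node-local scratch space...
    SCRATCH=/scratch/$USER/job.$$
    mkdir -p "$SCRATCH"
    cd "$SCRATCH"

    # ... run the job here, writing output only under $SCRATCH ...

    # ...then make one copy to AFS at the end, so the fileserver sees a
    # single burst of updates per job instead of many small writes
    # breaking callbacks throughout the job's lifetime:
    OUTDIR=/afs/slac/g/babar/work/output    # placeholder destination
    mkdir -p "$OUTDIR"
    cp -r "$SCRATCH"/. "$OUTDIR"/
    rm -rf "$SCRATCH"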