Hi JY, et. al. Clearly we have some issues here, and I want to address as many (if not all) of them as I can in one writing. So, this post will be longer than usual. Before I start into the details, let me go into some of the philosophical questions here on the xrootd design point and how that interacts with some of the r/w issues brought forth. The design point is to provide high performance fault-tolerant data access for primarily read-only data. This design point was chosen because this is 80% of the problem in HEP applications. That solution requires the design of a highly distributed loosely coupled server cluster technology where clients can be quickly dispersed across the disk server nodes. We believe that our recent tests have shown we have accomplished this goal. That leaves 20% of the problem (i.e., r/w access) which is probably the hardest to solve to everyone's satisfaction. Of course, that's a canonical ratio and is extremely liberal. In HEP, there is usually an order of magnitude difference between r/o and r/w data. Even that is misleading since most of that data is processed in write-once read-many mode. So, the real problem is effectively much smaller, though not necessarily less vexing. The difficulties in r/w access are not in the areas of high performance (in fact, xrootd does quite well writing files as well) or even fault-tolerance, the problem is in file management in such an architecture. The scalability in xrootd comes from the fact that there is no over-all concentration of knowledge or power in any particular part of the system. Before one starts pointing at the top-level olbd (initial point of contact) one has to remember that even that olbd is only aware of up to 64 of its immediate neighbors. This allows nodes to come and go, be added/removed easily, and scale linearly as the system size increases. However, this makes r/w file management, without severely impacting the scalability and performance, relatively complex. The complexity drove the architecture to allow for the implementation of an SRM to handle disk cache management. xrootd was never meant to eliminate the need for a disk cache manager especially for sites that have significant r/w requirements. However, to provide for scalability and high performance, the architecture necessitates that the disk cache be made available for direct access. This does put certain constraints on the SRM; which may or may not be possible for some implementations. The central issue here are the semantics of file access in the presence of multiple copies of a file, either in the disk server nodes or underlying support systems (e.g., MSS). This problem has been tackled, in varying degrees of success, in many systems. I can't think of one where it has been completely solved without sacrifices in performance, scalability, fault tolerance, or file semantics. Systems that have come closest to solving this problem rely on complete knowledge of where every file resides in the system (i.e., a comprehensive catalog). This has shown to be generally not a scalable solution because the management overhead is significant. Invariably, systems reduce the overhead by increasing the granularity of what is meant by "replica". For instance, AFS groups files into volumes and file location is determined on a volume basis. This significantly reduces the size of the location database. Furthermore, AFS only allows r/o files to reside on multiple servers, further simplifying recovery after a failure. The problem becomes even more severe in the presence of multiple catalogs. The classic case being the addition of an MSS. Solutions vary from trying to synchronize multiple catalogs or, effectively, providing a logical single catalog. Neither approach eases the management overhead. The overhead is simply pushed into some other part of the system. In short, there is no "free lunch" in providing a consistent view of a file in the presence of multiple copies with unconstrained writes. This leads to another type of solution -- constrain the writes to get a better handle on the problem. Typically, this means adopting a publish type of model. Here a user can write files using the appropriate system to do so and then publishes the file for read access. Once a file is published it can only be deleted, never replaced with a different but identically named file. This corresponds quite well to the way most scientific data is handled but puts a large, for some unacceptable, constraint on users who simply want a r/w file system. The key here is to remember that such a system is not a general file system replacement but an experimental data access system; and probably more similar to what the xrootd/olbd combination provides. Now, on to the specifics. In the particular case the JY saw, the servers were arranged to provide for a r/o area and a r/w area in the same logical space. Nothing prevents files from appearing in the r/o area. When they need to be modified, they need to appear in the r/w area. Of course, nothing really prevents the file from appearing in multiple places in the r/w area. The reason is that a server that hosted a r/w file may be out of service and the file was assigned to another server. So, while mixing r/o and r/w space in one cluster exacerbates the problem, the problem exists even if the whole space was r/w. So, central issue is a) what to do about alternate copies files, even in inaccessible places, and b) (the harder part) when to do if one or more those files are being used. One suggestion is to check in the MSS (i.e., amortize the cost of catalog synchronization at open time). This can be easily done but will severely limit the open rate and scale badly as either more clients use the system or more servers are added. For instance, we see an open rate of about 15/sec on a server. Apparently not very high until one considers that we are actually talking about 6 servers yielding an open rate of 90/second. Add more, and things simply get worse. I don't know of any MSS that can sustain a very high query rate against its catalog. I certainly know that HPSS would have a problem and that system is probably one of the best out there. The apparent solution is to simply check only when the file be opened in r/w mode. A similar option exists in the oss layer (i.e., oss.check) which was put in to specifically prohibit the creation of duplicate files (it only checks upon file create but can be extended to the generic r/w case). This reduces the overhead to a tolerable level but even at that level BaBar chose not to use it. Not because of the overhead but because the system became unavailable when the MSS went down; creating a massive single point of failure. Unfortunately, given all the mechanics of an MSS, it did go down often enough to make life intolerable. Even if one accepts this situation, it does not solve the problem of temporarily inaccessible files. So, we're back to having to either synchronize the server's cache after a bounce (intolerably slow) or checking at every open; likely overwhelming the MSS and bring the whole system to a standstill. The alternate solution that Pete proposes is to allow the MSS to inject a message into the system indicating that a file has been changed and this forces the removal of all copies. This is easy in some systems and quite difficult in others, and still leaves open the question of what to do about currently in-use and inaccessible copies. So, while this is a significant improvement over constantly checking, it's not a complete solution. It does, however, address the "backdoor" problem; which is apparently what happened in JY's case. Of course, some people would point out that allowing back doors is simply asking for trouble and should not be allowed; case closed. Given the frequency at which this happens, I'm inclined to agree. That doesn't mean that there shouldn't be an option to gaurd aganst this problem, if a site elects to suffer the overhead of doing so. Are there other reasonable solution? Yes and no. There are "more" reasonable solutions. However, none will completely solve the problem. Even designing a full-fledged distributed file system is insufficient as one can trivially draw time-dependent scenarios that will produce inconsistent views of a file. So, what's the proposal here for dealing with the r/w case. a)When a file is opened in r/w mode, the redirector can inject a message into the system to remove all known alternate copies except for the chosen r/w copy. Should there be two r/w copies, the system will prohibit access to the file. This leaves open the question of in-use copies, inaccessible copies, and system administrators manually creating copies (which does happen). b) Providing a check option for MSS files opened in r/w mode that the file is consistent with the MSS copy. This does not address the issue of files being modified in the MSS through some alternate means. c) Aggressively working to finalize an official SRM interface that handles the appropriate disk cache management for a site. This leaves the site to determine the tolerable level of overhead. Andy