LISTSERV 16.5 - XROOTD-L Archives

Hi JY, et al.

Clearly we have some issues here, and I want to address as many (if not
all) of them as I can in one writing. So, this post will be longer than
usual.

Before I start into the details, let me go into some of the philosophical
questions here on the xrootd design point and how that interacts with some
of the r/w issues brought forth.

The design point is to provide high performance fault-tolerant data access
for primarily read-only data. This design point was chosen because this is
80% of the problem in HEP applications. That solution requires the
design of a highly distributed loosely coupled server cluster technology
where clients can be quickly dispersed across the disk server nodes. We
believe that our recent tests have shown we have accomplished this goal.

That leaves 20% of the problem (i.e., r/w access) which is probably the
hardest to solve to everyone's satisfaction. Of course, that's a canonical
ratio and is extremely liberal. In HEP, there is usually an order of
magnitude difference between r/o and r/w data.  Even that is misleading
since most of that data is processed in write-once read-many mode. So, the
real problem is effectively much smaller, though not necessarily less
vexing.

The difficulties in r/w access are not in the areas of high performance
(in fact, xrootd does quite well at writing files too) or even
fault-tolerance; the problem is in file management in such an
architecture. The scalability in xrootd comes from the fact that there is
no over-all concentration of knowledge or power in any particular part of
the system. Before one starts pointing at the top-level olbd (initial
point of contact) one has to remember that even that olbd is only aware of
up to 64 of its immediate neighbors. This allows nodes to come and go, be
added/removed easily, and scale linearly as the system size increases.
However, this makes r/w file management, without severely impacting the
scalability and performance, relatively complex.
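To put a rough number on the 64-neighbor limit, here is a small Python sketch. It assumes the nodes are arranged as a tree in which each node tracks at most 64 immediate neighbors below it; the tree arrangement and the function name are assumptions made for illustration, not a description of the actual olbd code.

```python
def levels_needed(num_servers: int, fanout: int = 64) -> int:
    """Return how many tree levels are needed to reach `num_servers` leaf
    servers when every node knows at most `fanout` immediate neighbors."""
    if num_servers < 1:
        raise ValueError("num_servers must be at least 1")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    levels = 1
    capacity = fanout  # servers reachable with the current number of levels
    while capacity < num_servers:
        capacity *= fanout
        levels += 1
    return levels


if __name__ == "__main__":
    for n in (6, 64, 65, 4096, 100_000):
        print(n, "servers ->", levels_needed(n), "level(s)")
```

Under that assumption, no single node ever has to hold more than 64 entries, however many servers are added.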

The complexity drove the architecture to allow for the implementation of
an SRM to handle disk cache management. xrootd was never meant to
eliminate the need for a disk cache manager especially for sites that have
significant r/w requirements. However, to provide for scalability and high
performance, the architecture necessitates that the disk cache be made
available for direct access. This does put certain constraints on the SRM,
which some implementations may or may not be able to meet.

The central issue here is the semantics of file access in the presence of
multiple copies of a file, either in the disk server nodes or underlying
support systems (e.g., MSS). This problem has been tackled, in varying
degrees of success, in many systems. I can't think of one where it has
been completely solved without sacrifices in performance, scalability,
fault tolerance, or file semantics.

Systems that have come closest to solving this problem rely on complete
knowledge of where every file resides in the system (i.e., a comprehensive
catalog). This has been shown to be generally not a scalable solution because
the management overhead is significant. Invariably, systems reduce the
overhead by increasing the granularity of what is meant by "replica".  For
instance, AFS groups files into volumes and file location is determined on
a volume basis. This significantly reduces the size of the location
database.  Furthermore, AFS only allows r/o files to reside on multiple
servers, further simplifying recovery after a failure.

The problem becomes even more severe in the presence of multiple
catalogs. The classic case is the addition of an MSS. Solutions vary
from trying to synchronize multiple catalogs to, effectively, providing a
logical single catalog. Neither approach eases the management overhead.
The overhead is simply pushed into some other part of the system. In
short, there is no "free lunch" in providing a consistent view of a file
in the presence of multiple copies with unconstrained writes.

This leads to another type of solution -- constrain the writes to get a
better handle on the problem. Typically, this means adopting a publish
type of model. Here a user can write files using the appropriate system to
do so and then publishes the file for read access. Once a file is
published it can only be deleted, never replaced with a different but
identically named file. This corresponds quite well to the way most
scientific data is handled but puts a large, for some unacceptable,
constraint on users who simply want a r/w file system. The key here is to
remember that such a system is not a general file system replacement but
an experimental data access system; and probably more similar to what the
xrootd/olbd combination provides.
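The publish rule can be illustrated with a toy Python sketch. The class and method names are hypothetical. The sketch also takes the strict reading that a deleted name is retired and cannot be reused, which is an interpretation of "never replaced", not something any particular system is known to enforce.

```python
class PublishError(Exception):
    """Raised when an operation would violate the publish rules."""


class PublishedNamespace:
    """Toy model of a publish-type namespace: publish once, read many,
    delete allowed, replace never."""

    def __init__(self):
        self._files = {}       # name -> immutable content
        self._retired = set()  # names that were published and later deleted

    def publish(self, name, data):
        # A name may be published only once, even after it has been deleted.
        if name in self._files or name in self._retired:
            raise PublishError(f"{name!r} was already published")
        self._files[name] = bytes(data)

    def read(self, name):
        return self._files[name]

    def delete(self, name):
        del self._files[name]
        self._retired.add(name)
```

Because published content never changes, any number of read-only copies can exist without a consistency problem; the only event that has to be propagated is a delete.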

Now, on to the specifics.

In the particular case that JY saw, the servers were arranged to provide
for a r/o area and a r/w area in the same logical space. Nothing prevents
files from appearing in the r/o area. When they need to be modified, they
need to appear in the r/w area. Of course, nothing really prevents the
file from appearing in multiple places in the r/w area. The reason is that
a server that hosted a r/w file may be out of service and the file was
assigned to another server. So, while mixing r/o and r/w space in one
cluster exacerbates the problem, the problem exists even if the whole
space was r/w.  So, the central issue is

a) what to do about alternate copies of files, even in inaccessible places,
and
b) (the harder part) what to do if one or more of those files are being used.

One suggestion is to check in the MSS (i.e., amortize the cost of catalog
synchronization at open time). This can be easily done but will severely
limit the open rate and scale badly as either more clients use the system
or more servers are added. For instance, we see an open rate of about
15/sec on a server. That does not seem very high until one considers that we
are actually talking about 6 servers yielding an open rate of 90/second.
Add more, and things simply get worse. I don't know of any MSS that can
sustain a very high query rate against its catalog. I certainly know that
HPSS would have a problem and that system is probably one of the best out
there.
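The arithmetic behind that concern, as a short Python sketch. Only the 15 opens/sec and the 6 servers come from the numbers above; the MSS query capacity used in the second function is a purely hypothetical figure for illustration.

```python
def aggregate_open_rate(opens_per_server_per_sec, num_servers):
    """Catalog queries per second the MSS would see if every open is checked."""
    return opens_per_server_per_sec * num_servers


def servers_supported(mss_queries_per_sec, opens_per_server_per_sec):
    """Largest number of servers a catalog with the given (hypothetical)
    query capacity could serve when every open triggers one query."""
    return int(mss_queries_per_sec // opens_per_server_per_sec)


if __name__ == "__main__":
    print(aggregate_open_rate(15, 6))  # 90, the figure quoted above
    # Hypothetical capacity of 200 catalog queries/sec:
    print(servers_supported(200, 15))
```

The load on the catalog grows with every server added, while the catalog's capacity stays fixed.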

The apparent solution is to simply check only when the file is opened in
r/w mode. A similar option exists in the oss layer (i.e., oss.check) which
was put in to specifically prohibit the creation of duplicate files (it
only checks upon file create but can be extended to the generic r/w case).
This reduces the overhead to a tolerable level but even at that level
BaBar chose not to use it. Not because of the overhead but because the
system became unavailable when the MSS went down; creating a massive
single point of failure. Unfortunately, given all the mechanics of an MSS,
it did go down often enough to make life intolerable. Even if one accepts
this situation, it does not solve the problem of temporarily inaccessible
files. So, we're back to having to either synchronize the server's cache
after a bounce (intolerably slow) or check at every open, likely
overwhelming the MSS and bringing the whole system to a standstill.
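The trade-off can be seen in a toy Python sketch in which only r/w opens consult the MSS. Everything here is hypothetical (the function names, the exception, the callable standing in for the MSS catalog); it is not how oss.check is implemented, only an illustration of where the dependency on the MSS ends up.

```python
class MSSUnavailable(Exception):
    """Raised by the stand-in catalog when the MSS cannot be reached."""


def open_file(path, mode, mss_has_conflict):
    """Open `path`, consulting the MSS only for r/w opens.

    `mss_has_conflict` is a hypothetical callable that asks the MSS catalog
    whether a conflicting copy exists. It may raise MSSUnavailable.
    """
    if mode == "rw":
        # The only place the MSS is contacted; an outage surfaces here.
        if mss_has_conflict(path):
            raise PermissionError(f"{path}: conflicting copy known to the MSS")
    return (path, mode)


def mss_down(path):
    """Stand-in for an MSS that is out of service."""
    raise MSSUnavailable("MSS is not responding")
```

With `mss_down` as the catalog, `open_file(path, "r", mss_down)` still succeeds, but every r/w open raises `MSSUnavailable`: the overhead is gone from the read path, yet the r/w path is only as available as the MSS.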

The alternate solution that Pete proposes is to allow the MSS to inject a
message into the system indicating that a file has been changed and this
forces the removal of all copies. This is easy in some systems and quite
difficult in others, and still leaves open the question of what to do
about currently in-use and inaccessible copies. So, while this is a
significant improvement over constantly checking, it's not a complete
solution. It does, however, address the "backdoor" problem; which is
apparently what happened in JY's case. Of course, some people would point
out that allowing back doors is simply asking for trouble and should not
be allowed; case closed. Given the frequency at which this happens, I'm
inclined to agree. That doesn't mean that there shouldn't be an option to
guard against this problem, if a site elects to suffer the overhead of
doing so.
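A toy Python model of such an injected "file changed" message shows which copies it can and cannot deal with. The `Server` class, its fields, and the `invalidate` function are all hypothetical; this is a sketch of the idea, not of any existing interface.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    online: bool = True
    # path -> number of clients that currently have the file open
    files: dict = field(default_factory=dict)


def invalidate(servers, path):
    """Remove every copy of `path` that can be removed right now.

    Returns (removed, unresolved): the servers whose copy was dropped, and
    (server, reason) pairs for copies the message could not deal with.
    """
    removed, unresolved = [], []
    for server in servers:
        if path not in server.files:
            continue
        if not server.online:
            unresolved.append((server.name, "inaccessible"))
        elif server.files[path] > 0:
            unresolved.append((server.name, "in use"))
        else:
            del server.files[path]
            removed.append(server.name)
    return removed, unresolved
```

Idle copies on reachable servers are the easy case; the `unresolved` list is exactly the part that remains an open question.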

Are there other reasonable solutions? Yes and no. There are "more"
reasonable solutions. However, none will completely solve the problem.
Even designing a full-fledged distributed file system is insufficient as
one can trivially draw time-dependent scenarios that will produce
inconsistent views of a file.

So, what's the proposal here for dealing with the r/w case?

a) When a file is opened in r/w mode, the redirector can inject a message
into the system to remove all known alternate copies except for the chosen
r/w copy. Should there be two r/w copies, the system will prohibit access
to the file. This leaves open the question of in-use copies, inaccessible
copies, and system administrators manually creating copies (which does
happen).

b) Providing a check option that verifies, when an MSS file is opened in r/w
mode, that the file is consistent with the MSS copy. This does not address
the issue of files being modified in the MSS through some alternate means.

c) Aggressively working to finalize an official SRM interface that handles
the appropriate disk cache management for a site.  This leaves the site to
determine the tolerable level of overhead.
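The decision in item (a) can be sketched in a few lines of Python. The function name and the data layout are hypothetical, and the handling of the case where no r/w copy exists is a placeholder, since the proposal above does not cover it. Note that `copies` holds only the copies the redirector knows about.

```python
def plan_rw_open(copies):
    """Decide what to do when a file is opened in r/w mode.

    `copies` maps a server name to the area holding its copy: "ro" or "rw".
    Returns (chosen_server, servers_whose_copies_should_be_removed).
    """
    rw_servers = sorted(s for s, area in copies.items() if area == "rw")
    if len(rw_servers) > 1:
        # Two r/w copies: there is no way to tell which one is current.
        raise PermissionError("more than one r/w copy exists; access prohibited")
    if not rw_servers:
        # Placeholder: the proposal does not say what happens in this case.
        raise FileNotFoundError("no r/w copy to open")
    chosen = rw_servers[0]
    to_remove = sorted(s for s in copies if s != chosen)
    return chosen, to_remove
```

For example, `plan_rw_open({"s1": "ro", "s2": "rw", "s3": "ro"})` selects `s2` and marks the copies on `s1` and `s3` for removal. In-use copies, inaccessible copies, and manually created ones never appear in `copies`, which is why item (a) alone is not a complete answer.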

Andy