LISTSERV 16.5 - XROOTD-L Archives

XROOTD-L@LISTSERV.SLAC.STANFORD.EDU

Subject: Re: cache validity
From: Andrew Hanushevsky
Date: Sat, 5 Mar 2005 18:02:37 -0800 (PST)

Hi JY, et al.

Clearly we have some issues here, and I want to address as many (if not
all) of them as I can in one writing. So, this post will be longer than
usual.

Before I start into the details, let me go into some of the philosophical
questions here on the xrootd design point and how that interacts with some
of the r/w issues brought forth.

The design point is to provide high performance fault-tolerant data access
for primarily read-only data. This design point was chosen because this is
80% of the problem in HEP applications. That solution requires the
design of a highly distributed loosely coupled server cluster technology
where clients can be quickly dispersed across the disk server nodes. We
believe that our recent tests have shown we have accomplished this goal.

That leaves 20% of the problem (i.e., r/w access) which is probably the
hardest to solve to everyone's satisfaction. Of course, that's a canonical
ratio and is extremely liberal. In HEP, there is usually an order of
magnitude difference between r/o and r/w data. Even that is misleading
since most of that data is processed in write-once read-many mode. So, the
real problem is effectively much smaller, though not necessarily less
vexing.

The difficulties in r/w access are not in the areas of high performance
(in fact, xrootd does quite well writing files as well) or even
fault-tolerance; the problem is in file management in such an
architecture. The scalability in xrootd comes from the fact that there is
no over-all concentration of knowledge or power in any particular part of
the system. Before one starts pointing at the top-level olbd (initial
point of contact) one has to remember that even that olbd is only aware of
up to 64 of its immediate neighbors. This allows nodes to come and go, be
added/removed easily, and scale linearly as the system size increases.
However, this makes r/w file management, without severely impacting the
scalability and performance, relatively complex.
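
To make the 64-neighbor limit concrete, here is a small sketch of how many olbd levels would sit above the data servers. It assumes the nodes are arranged as a tree in which each olbd knows at most 64 children; the function name and the tree arrangement are illustrative assumptions, not a description of the actual olbd code.

```python
FANOUT = 64  # from the text: an olbd is aware of up to 64 immediate neighbors


def levels_needed(num_servers: int, fanout: int = FANOUT) -> int:
    """Number of olbd levels needed above the data servers (assumed tree layout)."""
    if num_servers < 1:
        raise ValueError("need at least one data server")
    levels = 1
    capacity = fanout  # servers reachable with one level of olbd
    while capacity < num_servers:
        capacity *= fanout  # each extra level multiplies the reach by the fanout
        levels += 1
    return levels


if __name__ == "__main__":
    for n in (6, 64, 65, 4096, 4097):
        print(n, "servers ->", levels_needed(n), "level(s)")
```

Under that assumption no single node ever tracks more than 64 others, which is why no part of the system holds complete knowledge of where files are.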

The complexity drove the architecture to allow for the implementation of
an SRM to handle disk cache management. xrootd was never meant to
eliminate the need for a disk cache manager especially for sites that have
significant r/w requirements. However, to provide for scalability and high
performance, the architecture necessitates that the disk cache be made
available for direct access. This does put certain constraints on the SRM,
which may or may not be possible for some implementations to meet.

The central issue here is the semantics of file access in the presence of
multiple copies of a file, either in the disk server nodes or underlying
support systems (e.g., MSS). This problem has been tackled, with varying
degrees of success, in many systems. I can't think of one where it has
been completely solved without sacrifices in performance, scalability,
fault tolerance, or file semantics.

Systems that have come closest to solving this problem rely on complete
knowledge of where every file resides in the system (i.e., a comprehensive
catalog). This has been shown to be generally not a scalable solution because
the management overhead is significant. Invariably, systems reduce the
overhead by increasing the granularity of what is meant by "replica". For
instance, AFS groups files into volumes and file location is determined on
a volume basis. This significantly reduces the size of the location
database. Furthermore, AFS only allows r/o files to reside on multiple
servers, further simplifying recovery after a failure.

The problem becomes even more severe in the presence of multiple
catalogs. The classic case is the addition of an MSS. Solutions vary
from trying to synchronize multiple catalogs to, effectively, providing a
logical single catalog. Neither approach eases the management overhead.
The overhead is simply pushed into some other part of the system. In
short, there is no "free lunch" in providing a consistent view of a file
in the presence of multiple copies with unconstrained writes.

This leads to another type of solution -- constrain the writes to get a
better handle on the problem. Typically, this means adopting a publish
type of model. Here a user can write files using the appropriate system to
do so and then publishes the file for read access. Once a file is
published it can only be deleted, never replaced with a different but
identically named file. This corresponds quite well to the way most
scientific data is handled but puts a large, for some unacceptable,
constraint on users who simply want a r/w file system. The key here is to
remember that such a system is not a general file system replacement but
an experimental data access system; and probably more similar to what the
xrootd/olbd combination provides.
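
The publish model described above can be sketched as a toy in-memory store. The class and method names are hypothetical, and the rule that a deleted name can never be reused is my reading of "never replaced with a different but identically named file"; this is not xrootd code.

```python
class PublishStore:
    """Toy publish-model store: write freely, publish once, then delete only."""

    def __init__(self):
        self._staging = {}     # name -> bytes, still writable by the owner
        self._published = {}   # name -> bytes, read-only from here on
        self._retired = set()  # names that were published and later deleted

    def write(self, name: str, data: bytes) -> None:
        # Assumption: a name that was ever published can never be rewritten.
        if name in self._published or name in self._retired:
            raise PermissionError(f"{name} was published; it cannot be replaced")
        self._staging[name] = data

    def publish(self, name: str) -> None:
        # Move the file from the writable area to the read-only area.
        self._published[name] = self._staging.pop(name)

    def read(self, name: str) -> bytes:
        return self._published[name]

    def delete(self, name: str) -> None:
        del self._published[name]
        self._retired.add(name)


if __name__ == "__main__":
    store = PublishStore()
    store.write("run1.root", b"v1")
    store.write("run1.root", b"v2")  # still unpublished, so rewriting is fine
    store.publish("run1.root")
    print(store.read("run1.root"))
```

Because a published name always refers to the same bytes, any replica of it is valid for as long as it exists, which removes the consistency question for read access.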

Now, on to the specifics.

In the particular case that JY saw, the servers were arranged to provide
for a r/o area and a r/w area in the same logical space. Nothing prevents
files from appearing in the r/o area. When they need to be modified, they
need to appear in the r/w area. Of course, nothing really prevents the
file from appearing in multiple places in the r/w area. The reason is that
a server that hosted a r/w file may be out of service and the file was
assigned to another server. So, while mixing r/o and r/w space in one
cluster exacerbates the problem, the problem exists even if the whole
space was r/w. So, the central issue is

a) what to do about alternate copies of files, even in inaccessible places,
and
b) (the harder part) what to do if one or more of those files are being used.

One suggestion is to check in the MSS (i.e., amortize the cost of catalog
synchronization at open time). This can be easily done but will severely
limit the open rate and scale badly as either more clients use the system
or more servers are added. For instance, we see an open rate of about
15/sec on a server. Apparently not very high until one considers that we
are actually talking about 6 servers yielding an open rate of 90/second.
Add more, and things simply get worse. I don't know of any MSS that can
sustain a very high query rate against its catalog. I certainly know that
HPSS would have a problem and that system is probably one of the best out
there.
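
The arithmetic behind that concern is simple enough to write down. The sketch below assumes every open triggers exactly one MSS catalog query and that servers contribute equally; the function name is illustrative.

```python
def aggregate_open_rate(opens_per_server_per_sec: float, num_servers: int) -> float:
    """MSS catalog queries per second if every open is checked against the MSS."""
    return opens_per_server_per_sec * num_servers


if __name__ == "__main__":
    # 15 opens/sec per server is the figure quoted in the text.
    for servers in (1, 6, 20, 64):
        rate = aggregate_open_rate(15, servers)
        print(f"{servers:3d} servers -> {rate:6.0f} catalog queries/sec")
```

The load on the MSS catalog grows in direct proportion to the number of servers, while the catalog itself does not get any faster.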

The apparent solution is to simply check only when the file is opened in
r/w mode. A similar option exists in the oss layer (i.e., oss.check) which
was put in to specifically prohibit the creation of duplicate files (it
only checks upon file create but can be extended to the generic r/w case).
This reduces the overhead to a tolerable level but even at that level
BaBar chose not to use it. Not because of the overhead but because the
system became unavailable when the MSS went down, creating a massive
single point of failure. Unfortunately, given all the mechanics of an MSS,
it did go down often enough to make life intolerable. Even if one accepts
this situation, it does not solve the problem of temporarily inaccessible
files. So, we're back to having to either synchronize the server's cache
after a bounce (intolerably slow) or check at every open, likely
overwhelming the MSS and bringing the whole system to a standstill.

The alternate solution that Pete proposes is to allow the MSS to inject a
message into the system indicating that a file has been changed and this
forces the removal of all copies. This is easy in some systems and quite
difficult in others, and still leaves open the question of what to do
about currently in-use and inaccessible copies. So, while this is a
significant improvement over constantly checking, it's not a complete
solution. It does, however, address the "backdoor" problem, which is
apparently what happened in JY's case. Of course, some people would point
out that allowing back doors is simply asking for trouble and should not
be allowed; case closed. Given the frequency at which this happens, I'm
inclined to agree. That doesn't mean that there shouldn't be an option to
guard against this problem, if a site elects to suffer the overhead of
doing so.
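
A rough simulation of such an injected "file changed" message shows which copies it can and cannot deal with. The Server type and its fields are hypothetical stand-ins, not xrootd or olbd structures.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    online: bool = True
    files: set = field(default_factory=set)   # paths cached on this server
    in_use: set = field(default_factory=set)  # paths currently open by clients


def invalidate(servers, path):
    """Drop every copy of `path` that can be dropped; report the ones that cannot."""
    removed, busy, unreachable = [], [], []
    for s in servers:
        if path not in s.files:
            continue
        if not s.online:
            # The simulation can see this copy; a real system could not reach it.
            unreachable.append(s.name)
        elif path in s.in_use:
            busy.append(s.name)  # open question: what to do with in-use copies
        else:
            s.files.discard(path)
            removed.append(s.name)
    return removed, busy, unreachable


if __name__ == "__main__":
    servers = [
        Server("a", files={"/data/f"}),
        Server("b", files={"/data/f"}, in_use={"/data/f"}),
        Server("c", online=False, files={"/data/f"}),
    ]
    print(invalidate(servers, "/data/f"))
```

The second and third lists in the result are exactly the in-use and inaccessible copies that the message-injection approach leaves unresolved.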

Are there other reasonable solutions? Yes and no. There are "more"
reasonable solutions. However, none will completely solve the problem.
Even designing a full-fledged distributed file system is insufficient as
one can trivially draw time-dependent scenarios that will produce
inconsistent views of a file.

So, what's the proposal here for dealing with the r/w case?

a) When a file is opened in r/w mode, the redirector can inject a message
into the system to remove all known alternate copies except for the chosen
r/w copy. Should there be two r/w copies, the system will prohibit access
to the file. This leaves open the question of in-use copies, inaccessible
copies, and system administrators manually creating copies (which does
happen).

b) Providing a check option that verifies, for MSS files opened in r/w
mode, that the file is consistent with the MSS copy. This does not address
the issue of files being modified in the MSS through some alternate means.

c) Aggressively working to finalize an official SRM interface that handles
the appropriate disk cache management for a site. This leaves the site to
determine the tolerable level of overhead.
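
Item (a) can be sketched as a small decision function for the redirector. The "ro"/"rw" labels, the function name, and the handling of the no-r/w-copy case are assumptions made for illustration; the proposal itself does not specify them.

```python
def plan_rw_open(copies: dict) -> tuple:
    """Decide what a redirector would do when a file is opened in r/w mode.

    `copies` maps server name -> "ro" or "rw", i.e. the area holding that copy.
    Returns (chosen_server, servers_whose_copies_should_be_removed).
    """
    rw = sorted(s for s, area in copies.items() if area == "rw")
    if len(rw) > 1:
        # Per the proposal: two r/w copies means access is prohibited.
        raise PermissionError(f"multiple r/w copies on {rw}; access prohibited")
    if not rw:
        # Not covered by the proposal; creating an r/w copy is out of scope here.
        raise LookupError("no r/w copy known")
    chosen = rw[0]
    to_remove = sorted(s for s in copies if s != chosen)
    return chosen, to_remove


if __name__ == "__main__":
    print(plan_rw_open({"srv1": "ro", "srv2": "rw", "srv3": "ro"}))
```

The function only sees copies the redirector knows about, so in-use copies, inaccessible copies, and copies created by hand remain outside its reach, as noted in (a).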

Andy
