Hi,

I have been investigating how to meet various needs in terms of 0-config
clusters and simplifying the overall setup. The two main items that were
brought up -- 1) 0-config, and 2) combining the xrootd and olbd so that one
does not have to deal with two daemons -- seem like workable solutions
that are, in fact, very related. So, here is a proposal and please feel
free to rip it apart :-)

In general, manager and supervisor nodes can combine the xrootd and olbd
as one daemon. This is because they perform one simple function -- lookup
and redirection. The thread demands are largely homogeneous and one would
expect a smooth flow through the daemon. Data servers, on the other hand,
pose a problem since combining the two functions is like mixing apples and
oranges. Data servers never really need olbd functionality and, indeed,
the thread demands for data serving would compete with the services a data
server olbd provides to the cluster. Depending on the thread contention,
long delays can be introduced into the olbd path that would cause
unpredictable behaviour in terms of locating files. Hence, the two
functions really need to live in separate processes.

So, here is what I can do to work within these constraints.

1) We introduce a new directive (optional):

xrootd.olb <path>

which specifies the location of the olbd "plugin", libXrdOlb.so. The
default is to use whatever LD_LIBRARY_PATH happens to be set to.
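
For instance (purely illustrative -- the library path below is made up and
would be whatever a given installation actually uses):

# hypothetical install location; substitute the site's real library path
xrootd.olb /opt/xrootd/lib/libXrdOlb.so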

2) Manager and supervisor xrootd's simply load the plugin and use it via
an object interface. Data server xrootd's load the plugin, then fork and
execute the plugin in a separate process. That process verifies that a
previous incarnation is not running and, if one is, exits, since the
xrootd will simply use the previous incarnation. This allows all
functionality to be controlled by simply starting an xrootd with the
appropriate parameters. No more starting a separate daemon.

3) Introduce a new directive (mandatory for auto-config clusters):

olb.xrootd <command line>

This directive specifies how to start an xrootd that will function as a
supervisor. I suppose we can come up with defaults, but the problem is
that the xrd layer strips out parameters before passing the command line
to xrootd, so we can never know things like where the log file should go.
However, I don't think this parameter is unwieldy since it's pretty much
fixed once you dream up the configuration.
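
To make that concrete, a hypothetical example (the paths, log file, and
config file below are invented placeholders; the point is simply that the
command line spells out the -l log file and -c config file so the
supervisor knows where to log and what to read):

# all paths here are made-up placeholders
olb.xrootd /opt/xrootd/bin/xrootd -l /var/log/xrootd/supervisor.log -c /opt/xrootd/etc/supervisor.cf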

4) Introduce a new directive (optional):

olb.ftlevel x%

This specifies the fault tolerance level (the default is 20%). The
manager will start enough supervisors to handle x% more data servers than
are really needed (e.g., by default 1.2 times as many supervisors as would
be needed are started).
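
For example, a site that wants a bigger cushion might say:

# keep 30% more supervisors around than strictly needed (default 20%)
olb.ftlevel 30%

Assuming, say, a 64-way supervisor fan-out (my assumption here, not part
of the directive), 640 data servers would need 10 supervisors at a
minimum, so a 30% ftlevel keeps 13 of them running instead of the 12 the
default would give.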

5) Modify the existing role directive:

olb.role {manager | server | supervisor | superserver} [if <conds>]

The difference is that you can specify that a data server olb can also
function as a supervisor olb if you specify

olb.role superserver

In general, auto-clusters would always have that directive (the default
being manual configuration). This also provides a convenient way to limit
which nodes can act as superservers.
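
Putting it together, a hypothetical pair of configs (the host pattern in
the if clause is made up, and I am assuming <conds> can take a host
pattern; it simply shows how one could restrict which boxes are eligible
to be drafted as supervisors):

# on the manager node
olb.role manager

# on the beefier data servers; other nodes would just use olb.role server
olb.role superserver if bigbox*.example.org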

The algorithm would work as follows:

1) As data server olb's connect, they tell the manager (as is now) how
many rounds they have done without finding a supervisor. Once that number
reaches 3 (arbitrary -- you can give me another one), the manager asks a
superserver that it has not asked before to start a supervisor, and the
data server olbd is asked to restart its search.

2) When a superserver olbd is asked to start a supervisor (and it has not
done so already) it launches a supervisor using the olb.xrootd command.
The trick here (which I haven’t figured out yet) is how to know that a
supervisor has been launched across a restart of the data server
xrootd/olbd. Tricky, very tricky.

3) The manager tallies how many data servers it knows about and always
makes sure that enough supervisors to satisfy olb.ftlevel have been
started.

4) This leads to a good possibility that the superserver requests can be
cascaded so that auto-clustering can work past 4,096 servers.

There are still many details to work out (the devil is in the details):

a) How one controls this algorithm in the presence of load balanced
managers. That is, you can start x managers and, somehow, one has to
prevent these managers from starting supervisors willy-nilly. This is not
an easy problem to solve as managers work independently and are loath to
contact each other (in fact that’s one of the strengths of the current
scheme).

b) What are the administrative interface relationships? This is another
one that I haven’t solved. It’s easy when xrootd’s and olbd’s are separate
but difficult to address when some are and some are not.

c) What are the cache side-effects in the presence of combined
xrootd/olbd’s? Not clear. What I do know is that there will be more cache
activity as things come and go at the supervisor level. How that sorts out
is unknown. However, one good thing here is that this investigation did
bring to light a failing in the reconfiguration algorithm. Currently, the
system does not completely handle port reassignment across partial
reconfigurations (i.e., xrootd going then coming back with a different
port number). Something to fix.

d) How will this affect existing schemes to automatically restart failed
servers? Note that data servers start an ephemeral olbd. This puts the
xrootd in the situation where it has to make sure that the ephemeral olbd
is restarted should it fail. It also adds in the big nit that this
knowledge is lost across data server restarts and it's not clear how to
handle that situation.

e) Should the architecture change in terms of the xrootd/olbd
relationship? Currently, olbd interactions occur at the ofs layer. In a
combined xrootd/olbd these interactions could occur at the xroot protocol
layer. In some ways this is cleaner, but it is also more restrictive in
how you can reuse components.

f) How long will it take to reach stability? The answer is obviously
longer than it takes now since, if for no other reason, supervisors cannot
be pre-started. Unknown what production effect this will have.

g) There are probably a lot of other end conditions that I don't know
about. So, please speak up and ask questions on how things would be
handled in strange situations.

All in all, the above is a workable solution but not something that I can
implement in a day. So, please comment, because once I start down this
road it will be hard to change things.

Andy