R*: An Overview of the Architecture
R. Williams, D. Daniels, L. Haas, G. Lapis, B. Lindsay, P. Ng,
R. Obermarck, P. Selinger, A. Walker, P. Wilms, and R. Yost
Summary by: Steve Gribble and Armando Fox
One-line summary:
R* is a distributed database system consisting of a set of voluntarily
cooperating (but mutually distrusting) sites; the architectural theme in
R* is the maintenance of the autonomy of each site.
Relevance
One of the first well-implemented distributed databases; identified many
of the difficult issues in distributed database systems, and presented
reasonable solutions to some of them. A good introductory paper
for distributed databases.
Flaws
- Many of the interesting complexities were glossed over, such as
access path selection cost modelling, authentication issues,
distributed crash detection and recovery, directory service
performance issues, cache consistency issues, the costs of
concurrency control, and any mention of performance at all.
- What about fault tolerance? Was that a design goal at all?
- They seem to have "rolled their own" for many of the
system components, such as object naming and location services.
Many distributed database issues are also distributed
systems issues, yet no distributed systems research or literature
was mentioned or referenced.
Overview/Main Points
- Goals:
- each site retains local privacy and control of its data
- present a single-machine interface to users and
programs for simplicity in coding/interaction
- no central dependencies (such as a central catalog,
compiler, deadlock detector, or scheduler) -
maximize site autonomy
- exploit parallelism wherever possible
- Data definitions and relations
- data is stored as tuples in relations. Relations can
be:
- dispersed: a relation in its entirety
is stored on a particular node in the network
- replicated: a relation is duplicated
in its entirety on two nodes in the network
- partitioned: data is logically a
single relation, but partitioned either into
sets of rows (horizontal partitioning) or
sets of columns (vertical partitioning) that
live on different sites in the network
- partitioning and replication can be applied to a
relation simultaneously (see the sketch after this list)
- snapshots (cached and possibly stale copies) of
relations can be made by R* for the sake of performance, if
consistency constraints don't forbid it
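A minimal sketch of the distribution options above (the relation
and site names are invented, and this is not R*'s catalog format):

    # Hypothetical placement of two logical relations at sites "SJ"
    # and "NY"; in R*, such information lives in the system catalogs.
    horizontal_partition = {             # EMP split into sets of rows
        "SJ": "EMP tuples with DEPTNO < 50",
        "NY": "EMP tuples with DEPTNO >= 50",
    }
    vertical_partition = {               # PROJ split into sets of columns
        "SJ": ["PROJNO", "NAME"],        # the key travels with each fragment
        "NY": ["PROJNO", "BUDGET"],
    }
    replicated = {"DEPT": ["SJ", "NY"]}  # DEPT stored in full at both sites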
- Objects are named according to the scheme
"USER@USER_SITE.OBJECT_NAME@BIRTH_SITE"
to achieve uniqueness across all sites (a system-wide name,
or SWN). A bare "OBJECT_NAME"
is shorthand for local objects (a print name).
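A minimal sketch of expanding a print name into an SWN under this
scheme (the defaulting rule shown is illustrative):

    def to_swn(object_name, user, user_site, birth_site=None):
        # A bare print name defaults the birth site to the site of use.
        birth = birth_site or user_site
        return f"{user}@{user_site}.{object_name}@{birth}"

    # e.g. to_swn("EMP", "SMITH", "SJRLAB") -> "SMITH@SJRLAB.EMP@SJRLAB"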
- Catalogs and Locating objects
- local-only and global objects are distinguished in R*
- system catalog is logically a relation, which can be
fragmented and replicated. R* maintains a cache of
catalog entries at each site.
- a site's catalog maintains information about objects
stored at that site and about objects that were born at
that site (but have since been moved elsewhere).
This allows any object to be located via its SWN by
checking at the birth site. SG: what about fault
tolerance?
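A sketch of the lookup path this enables (the catalog interfaces
here are invented; the birth-site fallback is the point):

    def locate(swn, local_catalog, catalog_cache, fetch_catalog_entry):
        # Try the site's own catalog first, then cached remote entries.
        for source in (local_catalog, catalog_cache):
            if swn in source:
                return source[swn]            # where the object now lives
        # Fall back to the birth site, the SWN's final component.
        birth_site = swn.rsplit("@", 1)[1]
        entry = fetch_catalog_entry(birth_site, swn)
        catalog_cache[swn] = entry            # cache for future lookups
        return entry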
- Transactions
- Each transaction is given a unique transaction number
made up of the site name and a locally unique
(monotonically increasing) sequence number.
The transaction number thus also gives ordering info.
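A sketch of such transaction numbers; comparing the (sequence,
site) tuples orders transactions from the same site by issue order:

    import itertools

    class TxnNumberGenerator:
        def __init__(self, site_name):
            self.site = site_name
            self.seq = itertools.count(1)   # locally monotonic counter
        def next_txn(self):
            # Unique across sites because the site name is included.
            return (next(self.seq), self.site)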
- A two-phase commit protocol is used for transaction
commits/aborts. R* uses a "presumed-to-commit"
variant to reduce the number of messages needed:
commits in the 2nd phase are presumed
successful unless somebody notifies otherwise. SG:
again, what about fault tolerance?
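A coordinator-side sketch of the protocol with that optimization
(send/recv are assumed messaging helpers; the logging and recovery
rules, which carry the real subtlety, are omitted):

    def coordinate_commit(participants, send, recv):
        for p in participants:
            send(p, "PREPARE")                   # phase 1: solicit votes
        votes = [recv(p) for p in participants]  # each answers YES or NO
        if all(v == "YES" for v in votes):
            for p in participants:
                send(p, "COMMIT")    # phase 2: no ACKs collected, since
            return "committed"       # absence of information means commit
        for p in participants:
            send(p, "ABORT")         # aborts, by contrast, are acknowledged
        return "aborted"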
- Query preparation
- First, print names are resolved to SWNs. Then catalog
entries for each named object are fetched.
User-level authorization is then performed at each site
(no mention is made of the protocol or method used).
- The site at which the SQL query is entered becomes the
master site. The master site compiles a global
query plan (using the usual Selingeresque access path
selection techniques, taking network I/O cost into
account). The global plan decides inter-site issues
(access strategies): the invocation sequence of
participating sites and the order of parameters.
- Remote apprentice sites receive the master's
global plan, and then compile a local access strategy
for the locally managed objects. For example,
apprentices can change join orders, etc., for local
relations as long as the result tuples are presented in
the order specified in the master plan.
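A sketch of the master/apprentice split (the site objects and every
method name below are invented for illustration):

    def prepare_query(sql, master, apprentices):
        swns = master.resolve_print_names(sql)        # print names -> SWNs
        entries = master.fetch_catalog_entries(swns)  # may consult birth sites
        master.check_authorization(entries)
        global_plan = master.select_global_plan(sql, entries)
        # Each apprentice refines its own portion, e.g. local join order.
        local_plans = {a.name: a.compile_local(global_plan)
                       for a in apprentices}
        return global_plan, local_plans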
- Access path cost is a linear function of I/O cost, CPU
cost, and message cost. Access path selection is very
tricky (e.g. consider accessing horizontally partitioned
relations in a cross-site join).
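A sketch of such a cost function (the weight values are invented
placeholders; in the system they are tuned constants):

    def plan_cost(io_ops, cpu_insns, msgs,
                  w_io=1.0, w_cpu=0.001, w_msg=2.0):
        # Linear combination of the three resource consumptions; the
        # message term can also fold in a per-byte shipping cost.
        return w_io * io_ops + w_cpu * cpu_insns + w_msg * msgs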
- 5 cross-site join strategies are considered:
- all tuples of the inner relation are sent to the outer
relation's site in one batch and stored in a
temporary relation; a local join is then done.
- all tuples of the outer relation are sent to the inner
site one tuple at a time; the join is pipelined
with the tuple sending.
- for each tuple of the outer relation, a request for the
matching tuples of the inner relation is
sent from the outer site to the inner site;
the join is then done at the outer site.
- all inner tuples are sent to a third site and
stored in a temporary relation; matching outer
tuples are sent to the third site and a pipelined
join is done.
- outer tuples are sent to a third site; for each
outer tuple, a request is sent from the third site to
the inner site for matching inner tuples, and the
join is done at the third site.
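A sketch contrasting the first and third strategies (the helpers
ship_all(), join_local(), and fetch_matching() are invented; tuples
are dicts keyed by column name):

    def ship_whole_inner(outer_tuples, inner_site, col,
                         ship_all, join_local):
        temp = ship_all(inner_site)                 # one batch transfer
        return join_local(outer_tuples, temp, col)  # join at outer site

    def fetch_as_needed(outer_tuples, inner_site, col, fetch_matching):
        result = []
        for t in outer_tuples:                  # one request per outer tuple
            for m in fetch_matching(inner_site, col, t[col]):
                result.append({**t, **m})       # merged joined tuple
        return result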
- Query Execution
- Deadlock detection is done at each site by periodically
examining wait-for information gathered locally and
retrieved from other sites. Potential cycles are
detected, and the cycle string is propagated along the
potential cycle path.
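A minimal local cycle check over a wait-for graph (R*'s algorithm
is distributed: descriptions of potential cycles are shipped to the
sites that can complete or refute them):

    def has_cycle(waits_for):
        # waits_for maps each transaction to the set it waits on.
        visiting, done = set(), set()
        def visit(t):
            visiting.add(t)
            for u in waits_for.get(t, ()):
                if u in visiting:               # back edge: a cycle
                    return True
                if u not in done and visit(u):
                    return True
            visiting.discard(t)
            done.add(t)
            return False
        return any(t not in done and visit(t) for t in waits_for)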
- Logging is as in System R.
- Some SQL additions were created for the distribution and
migration of relations.