R*: An Overview of the Architecture
R. Williams, D. Daniels, L. Haas, G. Lapis, B. Lindsay, P. Ng,
R. Obermarck, P. Selinger, A. Walker, P. Wilms, and R. Yost
Summary by: Steve Gribble and Armando Fox
One-line summary:
R* is a distributed database system consisting of a set of voluntarily
cooperating (but mutually distrusting) sites; the architectural theme in
R* is the maintenance of the autonomy of each site.
Relevance
One of the first well-implemented distributed databases; identified many
of the difficult issues in distributed database systems, and presented
reasonable solutions to some of them. A good introductory paper
for distributed databases.
Flaws
- Many of the interesting complexities were glossed over, such as
access path selection cost modelling, authentication issues,
distributed crash detection and recovery, directory service
performance issues, cache consistency issues, the costs of
concurrency control, and any mention of performance at all.
- What about fault tolerance? Was that a design goal at all?
- They seem to have "rolled their own" for many of the
system components, such as object naming and location services.
Many distributed database issues are also distributed
systems issues, yet no distributed systems research or literature
was mentioned or referenced.
Overview/Main Points
- Goals:
- each site retains local privacy and control of its data
- present a single-machine interface to users and
programs for simplicity in coding/interaction
- no central dependencies (such as a central catalog,
compiler, deadlock detector, or scheduler) -
maximize site autonomy
- exploit parallelism wherever possible
- Data definitions and relations
- data is stored as tuples in relations. Relations can
be:
- dispersed: a relation in its entirety
is stored on a particular node in the network
- replicated: a relation is duplicated
in its entirety on two nodes in the network
- partitioned: data is logically a
single relation, but partitioned either into
sets of rows (horizontal partitioning) or
sets of columns (vertical partitioning) that
live on different sites in the network
- partitioning and replication can be applied to a
relation simultaneously (see the sketch after this list)
- snapshots (cached and possibly stale copies) of
relations can be made by R* for the sake of performance, if
consistency constraints don't forbid it
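A minimal sketch of the distribution options above (the relation
and site names are invented, and this is not R*'s catalog format):

    # Hypothetical placement of two logical relations at sites "SJ"
    # and "NY"; in R*, such information lives in the system catalogs.
    horizontal_partition = {             # EMP split into sets of rows
        "SJ": "EMP tuples with DEPTNO < 50",
        "NY": "EMP tuples with DEPTNO >= 50",
    }
    vertical_partition = {               # PROJ split into sets of columns
        "SJ": ["PROJNO", "NAME"],        # the key travels with each fragment
        "NY": ["PROJNO", "BUDGET"],
    }
    replicated = {"DEPT": ["SJ", "NY"]}  # DEPT stored in full at both sites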
- Objects are named according to the scheme
"USER@USER_SITE.OBJECT_NAME@BIRTH_SITE"
to achieve uniqueness across all sites (a system-wide name,
or SWN). A bare "OBJECT_NAME"
is shorthand for local objects (a print name).
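A minimal sketch of expanding a print name into an SWN under this
scheme (the defaulting rule shown is illustrative):

    def to_swn(object_name, user, user_site, birth_site=None):
        # A bare print name defaults the birth site to the site of use.
        birth = birth_site or user_site
        return f"{user}@{user_site}.{object_name}@{birth}"

    # e.g. to_swn("EMP", "SMITH", "SJRLAB") -> "SMITH@SJRLAB.EMP@SJRLAB"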
- Catalogs and Locating objects
- local-only and global objects are distinguished in R*
- system catalog is logically a relation, which can be
fragmented and replicated. R* maintains a cache of
catalog entries at each site.
- a site's catalog maintains information about objects
stored at that site and about objects that were born at
that site (but have since been moved elsewhere).
This allows any object to be located via its SWN by
checking at the birth site. SG: what about fault
tolerance?
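A sketch of the lookup path this enables (the catalog interfaces
here are invented; the birth-site fallback is the point):

    def locate(swn, local_catalog, catalog_cache, fetch_catalog_entry):
        # Try the site's own catalog first, then cached remote entries.
        for source in (local_catalog, catalog_cache):
            if swn in source:
                return source[swn]            # where the object now lives
        # Fall back to the birth site, the SWN's final component.
        birth_site = swn.rsplit("@", 1)[1]
        entry = fetch_catalog_entry(birth_site, swn)
        catalog_cache[swn] = entry            # cache for future lookups
        return entry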
- Transactions
- Each transaction is given a unique transaction number
made up of the site name and a locally unique
(monotonically increasing) sequence number.
The transaction number thus also gives ordering info.
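A sketch of such transaction numbers; comparing the (sequence,
site) tuples orders transactions from the same site by issue order:

    import itertools

    class TxnNumberGenerator:
        def __init__(self, site_name):
            self.site = site_name
            self.seq = itertools.count(1)   # locally monotonic counter
        def next_txn(self):
            # Unique across sites because the site name is included.
            return (next(self.seq), self.site)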
- A two-phase commit protocol is used for transaction
commits/aborts. R* uses a "presumed-to-commit"
variant to reduce the number of messages needed:
commits in the 2nd phase are presumed
successful unless somebody notifies otherwise. SG:
again, what about fault tolerance?
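A coordinator-side sketch of the protocol with that optimization
(send/recv are assumed messaging helpers; the logging and recovery
rules, which carry the real subtlety, are omitted):

    def coordinate_commit(participants, send, recv):
        for p in participants:
            send(p, "PREPARE")                   # phase 1: solicit votes
        votes = [recv(p) for p in participants]  # each answers YES or NO
        if all(v == "YES" for v in votes):
            for p in participants:
                send(p, "COMMIT")    # phase 2: no ACKs collected, since
            return "committed"       # absence of information means commit
        for p in participants:
            send(p, "ABORT")         # aborts, by contrast, are acknowledged
        return "aborted"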
- Query preparation
- First, print names are resolved to SWNs. Then catalog
entries for each named object are fetched.
User-level authorization is then performed at each site
(no mention is made of the protocol or method used).
- The site at which the SQL query is entered becomes the
master site. The master site compiles a global
query plan (using the usual Selingeresque access path
selection techniques, taking network I/O cost into
account). The global plan decides inter-site issues
(access strategies): the invocation sequence of
participating sites and the order of parameters.
- Remote apprentice sites receive the master's
global plan, and then compile a local access strategy
for the locally managed objects. For example,
apprentices can change join orders, etc., for local
relations as long as the result tuples are presented in
the order specified in the master plan.
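A sketch of the master/apprentice split (the site objects and every
method name below are invented for illustration):

    def prepare_query(sql, master, apprentices):
        swns = master.resolve_print_names(sql)        # print names -> SWNs
        entries = master.fetch_catalog_entries(swns)  # may consult birth sites
        master.check_authorization(entries)
        global_plan = master.select_global_plan(sql, entries)
        # Each apprentice refines its own portion, e.g. local join order.
        local_plans = {a.name: a.compile_local(global_plan)
                       for a in apprentices}
        return global_plan, local_plans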
- Access path cost is a linear function of I/O cost, CPU
cost, and message cost. Access path selection is very
tricky (e.g. consider accessing horizontally partitioned
relations in a cross-site join).
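A sketch of such a cost function (the weight values are invented
placeholders; in the system they are tuned constants):

    def plan_cost(io_ops, cpu_insns, msgs,
                  w_io=1.0, w_cpu=0.001, w_msg=2.0):
        # Linear combination of the three resource consumptions; the
        # message term can also fold in a per-byte shipping cost.
        return w_io * io_ops + w_cpu * cpu_insns + w_msg * msgs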
- 5 cross-site join strategies are considered:
- all tuples of the inner relation are sent to the outer
relation's site in one batch and stored in a
temporary relation; a local join is then done.
- all tuples of the outer relation are sent to the inner
site one tuple at a time; the join is pipelined
with the tuple sending.
- for each tuple of the outer relation, a request for the
matching tuples of the inner relation is
sent from the outer site to the inner site;
the join is then done at the outer site.
- all inner tuples are sent to a third site and
stored in a temporary relation; matching outer
tuples are sent to the third site and a pipelined
join is done.
- outer tuples are sent to a third site; for each
outer tuple, a request is sent from the third site to
the inner site for matching inner tuples, and the
join is done at the third site.
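A sketch contrasting the first and third strategies (the helpers
ship_all(), join_local(), and fetch_matching() are invented; tuples
are dicts keyed by column name):

    def ship_whole_inner(outer_tuples, inner_site, col,
                         ship_all, join_local):
        temp = ship_all(inner_site)                 # one batch transfer
        return join_local(outer_tuples, temp, col)  # join at outer site

    def fetch_as_needed(outer_tuples, inner_site, col, fetch_matching):
        result = []
        for t in outer_tuples:                  # one request per outer tuple
            for m in fetch_matching(inner_site, col, t[col]):
                result.append({**t, **m})       # merged joined tuple
        return result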
- Query Execution
- Deadlock detection is done at each site by periodically
examining wait-for information gathered locally and
retrieved from other sites. Potential cycles are
detected, and the cycle string is propagated along the
potential cycle path.
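A minimal local cycle check over a wait-for graph (R*'s algorithm
is distributed: descriptions of potential cycles are shipped to the
sites that can complete or refute them):

    def has_cycle(waits_for):
        # waits_for maps each transaction to the set it waits on.
        visiting, done = set(), set()
        def visit(t):
            visiting.add(t)
            for u in waits_for.get(t, ()):
                if u in visiting:               # back edge: a cycle
                    return True
                if u not in done and visit(u):
                    return True
            visiting.discard(t)
            done.add(t)
            return False
        return any(t not in done and visit(t) for t in waits_for)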
- Logging is as in System R.
- Some SQL additions were created for the distribution and
migration of relations.