The Design of the Postgres Storage System

Michael Stonebraker

Summary by: Steve Gribble and Armando Fox

One-line summary: A novel storage system for Postgres is presented - transactions are supported without needing WAL (instantaneous crash recovery), full history of database can be kept, and the design is of many asynchronous processes, enabling parallelism.

Relevance

A cool storage system - much simpler than ARIES, although much less rigorously presented and analyzed.

Flaws

How often will people really want historical queries?
There is nothing in this paper that convinces me of the correctness of this storage system in all failure modes.
what is the real overhead of vaccuuming? Seems analogous to "cleaning" in LFS, biggest point of contention in LFS is cost of cleaning.
the concurrency control section is very ambiguous - I read it quite carefully, but cannot glean an understanding of exactly how concurrency control works in Postgres.

Overview/Main Points

No data is ever overwritten - all updates are inserts
transactions:
- transaction identifier (XID); transaction log has 2 bits per transaction indicating committed, aborted, or in progress.
- tail of log - from oldest active transaction to present. body of log - transactions are not in progress, so only need 1 bit to describe (committed/aborted).
- bloom filter to compress log
relational storage:
- one file per relation (arbitrary length achieved with continuation list)
- each record contains (in addition to data):
  - OID: system-assigned unique record ID
  - Xmin: transaction identifier of the interaction inserting record
  - Tmin: commit time of Xmin
  - Cmin: command ID of insertion interaction
  - Xmax: transaction id of interaction deleting record
  - Tmax: commit time of Xmax
  - Cmax: command ID of deletion interaction
  - PTR: forward pointer to next record for relation.
- when record is updated, store compressed deltas in new record. The initial record is called anchor point; successive updates are found through PTR chain, and are called delta records.
- Tmin and Tmax can be used for historical queries.
concurrency control:
- two-phase locking for concurrency, implying time-stamps in records must be set to commit time of each transaction "in order to avoid anomolous behaviour".
- time stamps are filled in asynchronously at time transaction commits (into a TIME relation which stores commit time of each transaction).
- each relation in POSTGRES is tagged with archival designation - either no archive (no historical queries), light archive (some historical queries, but little expected), or heavy archive. If no archive, Tmin and Tmax never filled in. If light archive, whenever historical query, commit Tmin and Tmax read for appropriate transaction from TIME relation (mega I/Os). For heavy archive, the first time you hit the TIME relation, but then you store Tmin and Tmax in the appropriate queried record.
archival system
- periodically (or continuously in background) "vaccuum" the disk, which sweeps old and invalid records to a WORM archive. If aborted, just reclaim space. If committed, copy to archive.
- build arbitrary number of secondary indexes on archive. Indexes are normally stored on magnetic device, but if they become large, may split them across magnetic and WORM.
- vacuuming phases:
  1. write an archive record and its associated index records
  2. write a new anchor point in the current DB
  3. reclaim the space occupied by the old anchor point and its delta records
- shows that survives crashes, with worst case being allocated but never used records in archive.
- combined mag and optical indexes - some pages of index on mag, some on optical. pages are moved from mag to optical as index grows.

Back to index