Back to index
The Design of the Postgres Storage System
Michael Stonebraker
Summary by: Steve Gribble and Armando Fox
One-line summary:
A novel storage system for Postgres is presented - transactions are
supported without needing WAL (instantaneous crash recovery), full history
of database can be kept, and the design is of many asynchronous processes,
enabling parallelism.
Relevance
A cool storage system - much simpler than ARIES, although much less
rigorously presented and analyzed.
Flaws
- How often will people really want historical queries?
- There is nothing in this paper that convinces me of the
correctness of this storage system in all failure modes.
- what is the real overhead of vaccuuming? Seems analogous to
"cleaning" in LFS, biggest point of contention in LFS
is cost of cleaning.
- the concurrency control section is very ambiguous - I read it
quite carefully, but cannot glean an understanding of exactly how
concurrency control works in Postgres.
Overview/Main Points
- No data is ever overwritten - all updates are inserts
- transactions:
- transaction identifier (XID); transaction log has 2
bits per transaction indicating committed, aborted, or
in progress.
- tail of log - from oldest active transaction to
present. body of log - transactions are not in
progress, so only need 1 bit to describe
(committed/aborted).
- bloom filter to compress log
- relational storage:
- one file per relation (arbitrary length achieved with
continuation list)
- each record contains (in addition to data):
- OID: system-assigned unique record ID
- Xmin: transaction identifier of the
interaction inserting record
- Tmin: commit time of Xmin
- Cmin: command ID of insertion interaction
- Xmax: transaction id of interaction deleting
record
- Tmax: commit time of Xmax
- Cmax: command ID of deletion interaction
- PTR: forward pointer to next record for
relation.
- when record is updated, store compressed deltas in new
record. The initial record is called anchor point;
successive updates are found through PTR chain, and are
called delta records.
- Tmin and Tmax can be used for historical queries.
- concurrency control:
- two-phase locking for concurrency, implying time-stamps
in records must be set to commit time of each
transaction "in order to avoid anomolous
behaviour".
- time stamps are filled in asynchronously at time
transaction commits (into a TIME relation which stores
commit time of each transaction).
- each relation in POSTGRES is tagged with archival
designation - either no archive (no historical
queries), light archive (some historical queries, but
little expected), or heavy archive. If no archive,
Tmin and Tmax never filled in. If light archive,
whenever historical query, commit Tmin and Tmax read
for appropriate transaction from TIME relation (mega
I/Os). For heavy archive, the first time you hit the
TIME relation, but then you store Tmin and Tmax in the
appropriate queried record.
- archival system
- periodically (or continuously in background)
"vaccuum" the disk, which sweeps old and
invalid records to a WORM archive. If aborted, just
reclaim space. If committed, copy to archive.
- build arbitrary number of secondary indexes on
archive. Indexes are normally stored on magnetic
device, but if they become large, may split them across
magnetic and WORM.
- vacuuming phases:
- write an archive record and its associated
index records
- write a new anchor point in the current DB
- reclaim the space occupied by the old anchor
point and its delta records
- shows that survives crashes, with worst case being
allocated but never used records in archive.
- combined mag and optical indexes - some pages of
index on mag, some on optical. pages are moved from
mag to optical as index grows.
Back to index