Jim Gray, Prashant Shenoy
Rules of Thumb in Data Engineering,
Proc. ICDE, Feb. 2000
(Summary by George Candea)
This paper is a reexamination of the rules of thumb that drive the
design of data storage systems.
- Many rules of thumb are a consequence of Moore's Law, so I'll
reiterate it here: "circuit densities increase 4x every 3 years"
(a.k.a. "things get 4x better every 3 years").
- In-memory data growth leads to the need for one extra bit of
addressing every 18 months (64-bit addressing should be sufficient
for another 2-3 decades).
- Storage capacity improves 100x / decade, while storage device
throughput increases only 10x / decade. At the same time, the ratio
between disk capacity and disk accesses/second is increasing by more
than 10x / decade. Consequently, disk accesses become more precious
and disk data becomes colder with time, at a rate of 10x / decade.
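The cooling-data claim is just the ratio of these growth rates (a
minimal sketch using the decade figures above):

    # Per decade: capacity grows 100x, device throughput only 10x.
    capacity_growth, throughput_growth = 100, 10
    # Time to read a full disk therefore grows 10x per decade...
    print(capacity_growth / throughput_growth)   # 10.0
    # ...and if accesses/second also improve only ~10x (an assumption
    # consistent with the ratio above), accesses per stored byte drop
    # 10x per decade: the stored data necessarily gets colder.
    print(capacity_growth / 10)                  # 10.0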
- Disk page sizes increase 5x / decade. In ten years, the typical
small transfer unit will be around 64 KB, while large transfer units
will be 1 MB or more.
- The cost per random tape access is about 100K times higher than
for disk. In five years this will probably be 1M times higher
(calculation includes cost of tape drive). Tape capacities are
expected to improve faster than tape speed, and access time is
expected to stay about the same.
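One way to see ratios of this magnitude is to amortize hardware cost
over the accesses a device can deliver in its lifetime (a sketch; the
prices, access rates, and 3-year service life below are illustrative
assumptions, not figures from the paper):

    # Amortized dollars per random access over the service life.
    def cost_per_access(price_usd, accesses_per_hour, life_years=3):
        return price_usd / (accesses_per_hour * 24 * 365 * life_years)

    tape = cost_per_access(20_000, 60)         # robot+drive, ~1 mount/min
    disk = cost_per_access(1_000, 80 * 3600)   # ~80 random I/Os per second
    print(f"tape/disk: ~{tape / disk:,.0f}x")  # on the order of 100K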
- The nearline tape : online disk : RAM storage price ratio
today is 1:3:300. Historically it has been 1:10:1000.
- In ten years 1 MB of DRAM will cost what 1 MB of disk costs today.
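That claim follows directly from the ratios above (a worked check):

    ram_to_disk_today = 300 / 3    # from 1:3:300, RAM is ~100x disk
    ram_drop_per_decade = 100      # RAM tracks the ~100x/decade price drop
    # RAM price in ten years, relative to today's disk price:
    print(ram_to_disk_today / ram_drop_per_decade)   # 1.0 -> parity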
- Disks are replacing tapes as backup devices. You may be able to
back up 1 TB to tape, but restoring it takes a very long time. As
sysadmins see petabyte stores looming on the horizon, they are
starting to replace tape backups with multiple disk versions
maintained online (perhaps geographically distributed), so that they
never have to restore from tape.
- Storage prices have dropped to the point where storage management
costs exceed storage costs, the same way PC management costs exceed
hardware costs. A sysadmin can generally administer $1 million worth
of storage. That used to be 1 GB, is 1 TB today, and will be 1 PB in
10 years.
- Amdahl's revised balanced system law: a system needs 8 MIPS per
MB/sec of I/O bandwidth (but the instruction rate and I/O rate need to
be measured on the relevant workload).
- The MB-of-RAM / MIPS-of-CPU ratio is rising (it is currently about
1-4).
- Random I/O happens once every 50K instructions. For sequential I/O
this number is much higher.
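Combining the three balance rules above gives a rough sizing
calculator (a sketch; the 25 MB/s and 80 random-I/O/s per disk, and
the 1000-MIPS example CPU, are my assumptions):

    def balanced_system(cpu_mips, disk_mb_s=25, disk_ios=80):
        io_mb_s = cpu_mips / 8              # 8 MIPS per MB/s of I/O
        ios = cpu_mips * 1e6 / 50_000       # one random I/O per 50K instr.
        disks = max(io_mb_s / disk_mb_s, ios / disk_ios)
        return io_mb_s, ios, disks

    bw, ios, disks = balanced_system(1000)  # a 1000-MIPS CPU
    print(f"{bw:.0f} MB/s, {ios:.0f} random I/O/s, ~{disks:.0f} disks")

Note that the random-I/O requirement (250 disks here) dominates the
bandwidth requirement (5 disks), which is why disk accesses being
"precious" matters so much.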
- Gilder's law: deployed network bandwidth triples every year.
- Network link bandwidth improves 4x every 3 years.
- The CPU cost of a SAN (System Area Network) message is 3K clocks +
1 clock/byte. Historically, a network message cost 10K instructions +
10 instructions/byte, while a disk I/O cost 5K instructions + 0.1
instructions/byte.
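The three cost models side by side (a direct transcription of the
formulas above; note the SAN figure is in clock cycles, while the
historical figures are in instructions):

    def san_msg_clocks(nbytes):       # SAN message: 3K clocks + 1 clock/byte
        return 3_000 + nbytes

    def net_msg_instr(nbytes):        # historical message: 10K + 10/byte
        return 10_000 + 10 * nbytes

    def disk_io_instr(nbytes):        # historical disk I/O: 5K + 0.1/byte
        return 5_000 + 0.1 * nbytes

    for n in (100, 10_000, 1_000_000):
        print(n, san_msg_clocks(n), net_msg_instr(n), disk_io_instr(n))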
- Currently, it costs more than $1 to send 100 MB via a WAN, while
local disk and LAN access are 10K times less expensive. This price
gap is likely to decline to 10:1 or even 3:1 over the next decade.
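In per-GB terms (simple arithmetic on the figures above):

    wan_per_gb = 1.00 * 10                # >$1 per 100 MB -> >$10 per GB
    local_per_gb = wan_per_gb / 10_000    # local is ~10K times cheaper
    print(f"WAN ${wan_per_gb:.0f}/GB vs local ${local_per_gb:.3f}/GB")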
- The memory subsystem cannot keep the CPU's pipelines full, even
with 3-level caches, unless programs have good data locality. As CPU
speeds continue to outpace memory speeds, it will become increasingly
important for software to have small instruction cache footprints,
predictable branching behavior, and good data locality. There is a
trend to build huge multiprocessors using shared memory; these
systems are prone to instruction stretch, in which bus and cache
interference from other CPUs causes each CPU to slow down.
- 5-minute random rule: cache all randomly accessed disk pages that
are touched at least once every 5 minutes.
- 1-minute sequential rule: cache all sequentially accessed disk
pages that are touched at least once every minute.
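Both rules fall out of the same break-even argument: keep a page
cached while the RAM it occupies costs less than the disk-arm time
needed to re-fetch it. A sketch of the random-rule calculation; the
late-1990s price/performance numbers are illustrative assumptions:

    # Break-even reference interval:
    #   (pages per MB of RAM / accesses per second per disk)
    # * (price per disk drive / price per MB of RAM)
    def break_even_s(page_kb, ios_per_s, disk_price, ram_price_mb):
        pages_per_mb = 1024 / page_kb
        return (pages_per_mb / ios_per_s) * (disk_price / ram_price_mb)

    # 8 KB pages, 64 random I/O/s, a $2000 disk, $15/MB RAM:
    print(break_even_s(8, 64, 2000, 15))   # ~267 s, about 5 minutes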
- Spend 1 byte of RAM to save 1 MIPS of CPU.
- Cache web pages if there is any chance they will be re-referenced
within their lifetime (the calculation is based on people-cost
savings). A major assumption is that server performance will continue
to be poor (3 seconds on average). With declining costs, however, web
site owners may buy more bandwidth and hardware, reducing response
times to under 1 second. In that case, the only reason to cache would
be to save network bandwidth and download time.
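The caching decision is again a break-even test (a minimal sketch;
every dollar and time figure below is an illustrative assumption in
the spirit of the paper's people-cost argument):

    # Cache a page if the expected cost of re-fetching it (user wait
    # time, valued in dollars) exceeds the cost of storing it locally
    # until it expires.
    def should_cache(p_reref, wait_s, people_per_hour,
                     page_mb, store_per_mb_month, lifetime_months):
        refetch = p_reref * wait_s * people_per_hour / 3600
        storage = page_mb * store_per_mb_month * lifetime_months
        return refetch > storage

    # 10% re-reference odds, a 3 s wait, $20/h, 100 KB page, 1 month:
    print(should_cache(0.10, 3.0, 20.0, 0.1, 0.01, 1))   # True

With sub-second response times the refetch term shrinks more than 3x
and the test starts failing, which is exactly the bandwidth-only
caching scenario above.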