Jim Gray, Prashant Shenoy
Rules of Thumb in Data Engineering,
Proc. ICDE, Feb. 2000
(Summary by George Candea)
This paper is a reexamination of the rules of thumb that drive the
design of data storage systems.
- Many rules of thumb are a consequence of Moore's Law, so I'll
reiterate it here: "circuit densities increase 4x every 3 years"
(a.k.a. "things get 4x better every 3 years").
- In-memory data growth leads to the need for one extra bit of
addressing every 18 months (64-bit addressing should be sufficient
for another 2-3 decades).
- Storage capacity improves 100x / decade, while storage device
throughput increases only 10x / decade. At the same time, the ratio
between disk capacity and disk accesses/second is increasing by more
than 10x / decade. Consequently, disk accesses become more precious
and disk data becomes colder with time, at a rate of 10x / decade.
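The cooling-data claim is just the ratio of these growth rates (a
minimal sketch using the decade figures above):

    # Per decade: capacity grows 100x, device throughput only 10x.
    capacity_growth, throughput_growth = 100, 10
    # Time to read a full disk therefore grows 10x per decade...
    print(capacity_growth / throughput_growth)   # 10.0
    # ...and if accesses/second also improve only ~10x (an assumption
    # consistent with the ratio above), accesses per stored byte drop
    # 10x per decade: the stored data necessarily gets colder.
    print(capacity_growth / 10)                  # 10.0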
- Disk page sizes increase 5x / decade. In ten years, the typical
small transfer unit will be around 64 KB, while large transfer units
will be 1 MB or more.
- The cost per random tape access is about 100K times higher than
for disk. In five years this will probably be 1M times higher
(calculation includes cost of tape drive). Tape capacities are
expected to improve faster than tape speed, and access time is
expected to stay about the same.
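One way to see ratios of this magnitude is to amortize hardware cost
over the accesses a device can deliver in its lifetime (a sketch; the
prices, access rates, and 3-year service life below are illustrative
assumptions, not figures from the paper):

    # Amortized dollars per random access over the service life.
    def cost_per_access(price_usd, accesses_per_hour, life_years=3):
        return price_usd / (accesses_per_hour * 24 * 365 * life_years)

    tape = cost_per_access(20_000, 60)         # robot+drive, ~1 mount/min
    disk = cost_per_access(1_000, 80 * 3600)   # ~80 random I/Os per second
    print(f"tape/disk: ~{tape / disk:,.0f}x")  # on the order of 100K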
- The nearline tape : online disk : RAM storage price ratio
today is 1:3:300. Historically it has been 1:10:1000.
- In ten years 1 MB of DRAM will cost what 1 MB of disk costs today.
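That claim follows directly from the ratios above (a worked check):

    ram_to_disk_today = 300 / 3    # from 1:3:300, RAM is ~100x disk
    ram_drop_per_decade = 100      # RAM tracks the ~100x/decade price drop
    # RAM price in ten years, relative to today's disk price:
    print(ram_to_disk_today / ram_drop_per_decade)   # 1.0 -> parity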
- Disks are replacing tapes as backup devices. You may be able to
back up 1 TB to tape, but restoring it takes a very long time. As
sysadmins see petabyte stores looming on the horizon, they are
starting to replace tape backups with multiple disk versions
maintained online (perhaps geographically distributed), so that they
never have to restore from tape.
- Storage prices have dropped to the point where storage management
costs exceed storage costs, the same way PC management costs exceed
hardware costs. A sysadmin can generally administer $1 million worth
of storage. That used to be 1 GB, is 1 TB today, and will be 1 PB in
10 years.
- Amdahl's revised balanced system law: a system needs 8 MIPS per
MB/sec of I/O bandwidth (but the instruction rate and I/O rate need to
be measured on the relevant workload).
- The MB-of-RAM / MIPS-of-CPU ratio is rising (it is currently about
1-4).
- Random I/O happens once every 50K instructions. For sequential I/O
this number is much higher.
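Combining the three balance rules above gives a rough sizing
calculator (a sketch; the 25 MB/s and 80 random-I/O/s per disk, and
the 1000-MIPS example CPU, are my assumptions):

    def balanced_system(cpu_mips, disk_mb_s=25, disk_ios=80):
        io_mb_s = cpu_mips / 8              # 8 MIPS per MB/s of I/O
        ios = cpu_mips * 1e6 / 50_000       # one random I/O per 50K instr.
        disks = max(io_mb_s / disk_mb_s, ios / disk_ios)
        return io_mb_s, ios, disks

    bw, ios, disks = balanced_system(1000)  # a 1000-MIPS CPU
    print(f"{bw:.0f} MB/s, {ios:.0f} random I/O/s, ~{disks:.0f} disks")

Note that the random-I/O requirement (250 disks here) dominates the
bandwidth requirement (5 disks), which is why disk accesses being
"precious" matters so much.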
- Gilder's law: deployed network bandwidth triples every year.
- Network link bandwidth improves 4x every 3 years.
- The CPU cost of a SAN (System Area Network) message is 3K clocks +
1 clock/byte. Historically, a network message cost 10K instructions +
10 instructions/byte, while a disk I/O cost 5K instructions + 0.1
instructions/byte.
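The three cost models side by side (a direct transcription of the
formulas above; note the SAN figure is in clock cycles, while the
historical figures are in instructions):

    def san_msg_clocks(nbytes):       # SAN message: 3K clocks + 1 clock/byte
        return 3_000 + nbytes

    def net_msg_instr(nbytes):        # historical message: 10K + 10/byte
        return 10_000 + 10 * nbytes

    def disk_io_instr(nbytes):        # historical disk I/O: 5K + 0.1/byte
        return 5_000 + 0.1 * nbytes

    for n in (100, 10_000, 1_000_000):
        print(n, san_msg_clocks(n), net_msg_instr(n), disk_io_instr(n))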
- Currently, it costs more than $1 to send 100 MB via a WAN, while
local disk and LAN access are 10K times less expensive. This price
gap is likely to decline to 10:1 or even 3:1 over the next decade.
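In per-GB terms (simple arithmetic on the figures above):

    wan_per_gb = 1.00 * 10                # >$1 per 100 MB -> >$10 per GB
    local_per_gb = wan_per_gb / 10_000    # local is ~10K times cheaper
    print(f"WAN ${wan_per_gb:.0f}/GB vs local ${local_per_gb:.3f}/GB")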
- The memory subsystem cannot keep the CPU's pipelines full, even
with 3-level caches, unless programs have good data locality. As CPU
speeds continue to outpace memory speeds, it will become increasingly
important for software to have small instruction cache footprints,
predictable branching behavior, and good data locality. There is a
trend to build huge multiprocessors using shared memory; these
systems are prone to instruction stretch, in which bus and cache
interference from other CPUs causes each CPU to slow down.
- 5-minute random rule: cache all randomly accessed disk pages that
are touched at least once every 5 minutes.
- 1-minute sequential rule: cache all sequentially accessed disk
pages that are touched at least once every minute.
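Both rules fall out of the same break-even argument: keep a page
cached while the RAM it occupies costs less than the disk-arm time
needed to re-fetch it. A sketch of the random-rule calculation; the
late-1990s price/performance numbers are illustrative assumptions:

    # Break-even reference interval:
    #   (pages per MB of RAM / accesses per second per disk)
    # * (price per disk drive / price per MB of RAM)
    def break_even_s(page_kb, ios_per_s, disk_price, ram_price_mb):
        pages_per_mb = 1024 / page_kb
        return (pages_per_mb / ios_per_s) * (disk_price / ram_price_mb)

    # 8 KB pages, 64 random I/O/s, a $2000 disk, $15/MB RAM:
    print(break_even_s(8, 64, 2000, 15))   # ~267 s, about 5 minutes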
- Spend 1 byte of RAM to save 1 MIPS of CPU.
- Cache web pages if there is any chance they will be re-referenced
within their lifetime (the calculation is based on people-cost
savings). A major assumption is that server performance will continue
to be poor (3 seconds on average). With declining costs, however, web
site owners may buy more bandwidth and hardware, reducing response
times to under 1 second. In that case, the only reason to cache would
be to save network bandwidth and download time.
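The caching decision is again a break-even test (a minimal sketch;
every dollar and time figure below is an illustrative assumption in
the spirit of the paper's people-cost argument):

    # Cache a page if the expected cost of re-fetching it (user wait
    # time, valued in dollars) exceeds the cost of storing it locally
    # until it expires.
    def should_cache(p_reref, wait_s, people_per_hour,
                     page_mb, store_per_mb_month, lifetime_months):
        refetch = p_reref * wait_s * people_per_hour / 3600
        storage = page_mb * store_per_mb_month * lifetime_months
        return refetch > storage

    # 10% re-reference odds, a 3 s wait, $20/h, 100 KB page, 1 month:
    print(should_cache(0.10, 3.0, 20.0, 0.1, 0.01, 1))   # True

With sub-second response times the refetch term shrinks more than 3x
and the test starts failing, which is exactly the bandwidth-only
caching scenario above.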