An empirical study of operating system errors

A. Chou, B Chelf, D. Engler

One-line summary: A rough draft, but nevertheless some interesting validations of commonly accepted things as well as an unexpected discovery. They use xgcc (their own checker-gcc) to collect on the order of 1700 bugs from Linux and draw conclusions based on them. One thing to keep in mind w.r.t. applicability is that, like any static checker, using this tool requires access to the source code. Additionally, the authors point out that they treat all bugs equally, whereas it is often the case that some bugs are significantly more important than others. They do not discuss or speculate at all on the relevance of bugs they found.

Overview/Main Points

Depending on the version of Linux, most of the code it consists of (in terms of lines-of-code) is driver code; the percentage ranges from 50% (for mature releases) to 70% (for new releases). Moreover, they find that driver code is up to 6 times buggier than the rest of Linux. This is attributed to driver developers being less familiar with the kernel interfaces and the fact that driver code is used by much fewer users.

They use seven checkers that look at the following types of bugs: - call a blocking function with interrupts disabled or spinlock held - use freed memory - forget to restore disabled interrupts - double-acquiring a lock - not checking potentially-NULL return pointers (e.g., from kmalloc) - use pointers from user land without checking them - allocate large stack variables (>1 KB) on the fixed-size kernel stack

The paper presents a graph of bug lifetimes, i.e. the amount of time from when the bug is introduced to when it is eliminated from the code base. The # of bugs vs. lifetime distribution is somewhat exponentially decreasing but long-tailed, starting at about 650 bugs with lifetimes of a few days, dropping to 400 bugs that lived a few months, down to 100 bugs that lived for 2 years, etc. The long tail is possible explained by latent bugs that are on rarely executed error paths. Overall, they find that 40%-60% of the bugs in any given release were introduced during the one year prior to release.

The paper makes a fuzzy argument that is meant to validate the belief that bugs cluster in files, due to the number of bugs being proportional to a developer's skill. Another hypothesis is that heavily used functions would have fewer bugs. To verify this, the authors did a topological sort of the kernel call graph to estimate how many flow-insensitive global paths pass through each function. The most frequently invoked functions turned out to be printk (~ 10mil paths) and spin_lock / spin_unlock (each ~2.5mil paths). Plotting the number of errors vs. paths seems to confirm the hypothesis. However, one alternate explanation could be that almost all functions have very few paths through them.

The surprising part is that, based on the measurements reported in this paper, OpenBSD 2.8 proved itself buggier than Linux 2.4.1 with all the checkers they ran, anywhere, by from 20% to almost 6 times worse.

The process the authors used is mainly automatic, which is cool. The manual work consists of writing the checkers and auditing found bugs. The automatic part includes finding a bug, cross-checking its existence in prior/future releases, locating its birth/death, etc. It also appears that the authors are confused w.r.t. how fault injection works -- they claim it inserts bugs in code, whereas FI is primarily about injecting faults in application state, rather than application code.

Relevance

Flaws

Back to index