Linear Hashing: A new Tool for File and Table Addressing

Witold Litwin

Summary by: Steve Gribble and Armando Fox

One-line summary: Linear hashing is a hashing scheme that exhibits near-optimal performance, both in terms of access cost and storage load.

Overview/Main Points

Hashing basics:
- records indexed with primary (unique) key
- hashing function h(c) assigns to a key c a unique bucket.
- If a bucket becomes full, we have a collision. Collision resolution is the process of storing c (which is now an overflow record) into an overflow bucket. Chaining is the usual way of doing it.
- Instead of chaining or overflow bucket creation, it would be nice to reorganize data or add more buckets, both of which require a dynamically created hash function, which is usually a modification to the currently used hash function, called a split.
Linear hashing:
- (formally)
```
 Let C be key space, h0 : C --> {0, 1, .., N-1 } be the base hash
function.  Functions h1, h2, ..., hi are called split functions for h0
if they obey:
  (1) hi : C --> {0, 1, ..., (2^i)(N-1)}
  (2) for any c, either
         (2.1) hi(c) = hi-1(c)
      or (2.2) hi(c) = hi-1(c) + 2^(i-1)N
```
- For example, if currently you have N=100 (100 buckets), and you try to insert into bucket 0 using h0(c) = c mod N, but it overflows, then you use h(c) = h1(c) if h0(c) = 0, and h(c) = h0(c) otherwise. Here, our family of split functions is hi(c) = c mod (2^i)N. This means that a bucket numbered 100 is added by the split, and half of bucket 0 is moved to bucket 100.
- The linear hashing algorithm performs splits in a deterministic order, rather than splitting at a bucket that overflowed. The splits are performed in linear order (bucket 0 first, then bucket 1, then 2, ...), and a split is performed when any bucket overflows. If the bucket that overflows is not the bucket that is split (which is the common case), overflow techniques such as chaining are used, but the common case is that few overflow buckets are needed.
- Using linear hashing, the address space (number of buckets) increases linearly and is exactly as large as is needed. For any number of insertions, most of the overflow records are moved into primary buckets by splits, and thus the number of overflow records is small. Thus, we expect to find our data in one access most of the time.
Wrinkles:
- Buckets can be partitioned across multiple storage, areas (files/memory buffers), rather than one contiguous area. If so, a table is used to map from bucket numbers to file and offset within file, in the obvious way.
- Instead of splitting on every collision, you can do a split when the "load" (which is bytes stored / (num buckets * bucket size), i.e. utilization of the data structure) crosses some watermark. This is called controlled splitting; the previously described is called uncontrolled splitting.
- Both growth and shrinking are supported; if items are deleted and the load drops below a watermark, the opposite of splitting (called grouping) is performed.
Performance:
- simulations of uncontrolled split showed that indeed number of accesses required per search remains very close to one, more so as the number of insertions increases.
- transient state as you populate new data structure, but quickly settles to stable state which exhibits periods in number of accesses per search (new period begins when you start using the next split function in the family).
- Split cost is on the order of 5-8 accesses; this means bucket size should be greater than 8 ( > 20 is recommended) to amortize out the split cost across the small insertion costs.
- For controlled splitting, performance degrades if we force the system to maintain high load (>90%), but there is negligible degradation at around >75%.

Relevance

This is a close to optimally performing search/insert/delete technique for hashable keys, and it is relatively trivial to implement.
The load factor for this data structure is high (which is good).

Flaws

The paper is grammatically/stylistically poor.
The exact simulation methodology used to derive their performance data is not described in complete detail; exactly what family of split functions did they use?

Back to index