Increasing Network Throughput by Integrating Protocol Layers
Mark B. Abbott and Larry L. Peterson, Univ. of Arizona
One-line summary
Collapsing protocol layers to increase performance and reduce copying and
buffering is straightforward in principle, but getting the implementation
right requires overcoming CPU-specific performance obstacles (cache performance,
register spilling, etc.), dealing with unusual protocol semantics (ordering/fragmenting,
etc.), and sacrificing some modularity, which is difficult to restore. The
benefit ultimately depends on the CPU-to-memory performance gap.
Main ideas
- Main obstacles to overcome:
- Awkward data manipulation (non "loop style")
- Different "views" of data (level N headers are part of
level N-1 data)
- Ordering/fragmenting constraints (updating connection state, etc.
in middle layers)
- Preserving modularity
- Previous work showed only marginal improvements from ILP, but that's
because it included DES encryption, making the protocol compute-bound rather
than memory- or I/O-bound, and because CPU cache effects were not considered.
- Word filters: conceptually operate on one machine word at a time;
can retain state to perform multibyte/multistage operations. "Output"
consists of passing word(s) to the next filter.
- Word filters are implemented as inline source macros to avoid function-call
overhead and to allow state across words to be maintained in registers
(see the sketch below).
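A minimal sketch of what such filters might look like, with hypothetical
names and a toy transformation (not the paper's actual macros):

    /* Toy word filters as inline macros. State lives in a
       caller-declared local so the compiler can keep it in a
       register across words; a filter "outputs" by invoking
       the next filter on the (possibly transformed) word. */
    #include <stdint.h>

    #define SWAP32(w) ((((w) & 0x000000ffu) << 24) | (((w) & 0x0000ff00u) << 8) | \
                       (((w) & 0x00ff0000u) >> 8)  | (((w) & 0xff000000u) >> 24))

    /* Stateless byte-swap filter. */
    #define SWAP_FILTER(w, NEXT) \
        do { uint32_t t_ = SWAP32(w); NEXT(t_); } while (0)

    /* Stateful checksum filter: fold the word into the local
       `csum`, then pass the word on unchanged. */
    #define CSUM_FILTER(csum, w, NEXT) \
        do { (csum) += (w); NEXT(w); } while (0)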
- Cache effects dominate data-manipulation performance; wide
variety of experiments across machine types, transformation "types",
and cache extremes (all hits and all misses)
- Benefits: eliminates loads/stores, reduces loop overhead, makes load-delay
slots more likely to be filled, and eliminates intermediate buffer management;
see the sketch below.
(Armando's observation: integration also yields larger basic blocks, which
is good for compilers, though loop bodies grow larger. These are probably
still too small to affect I-cache behavior, which is dominated by the OS
anyway, but a reference would have been nice.)
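To make the benefit concrete, a hypothetical two-layer example (reusing
the SWAP32 macro from the sketch above): the serial version makes two
passes and stores intermediate results to memory; the integrated version
keeps them in a register.

    #include <stddef.h>
    #include <stdint.h>

    /* Serial: two loops; the intermediate buffer is written to
       memory and read back. */
    uint32_t serial(const uint32_t *src, uint32_t *tmp, uint32_t *dst,
                    size_t n)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < n; i++)       /* layer 1: byte swap */
            tmp[i] = SWAP32(src[i]);
        for (size_t i = 0; i < n; i++) {     /* layer 2: checksum + copy */
            sum += tmp[i];
            dst[i] = tmp[i];
        }
        return sum;
    }

    /* Integrated: one loop; the intermediate word never leaves a
       register, eliminating a store/load pair per word and half
       the loop overhead. */
    uint32_t integrated(const uint32_t *src, uint32_t *dst, size_t n)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            uint32_t w = SWAP32(src[i]);     /* layer 1 */
            sum += w;                        /* layer 2 */
            dst[i] = w;
        }
        return sum;
    }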
- Register pressure is the key constraint on the "scalability"
of the approach; the performance knee corresponds to the onset of register
spilling, and the ratio of local variables to layers gives the slope after
the knee.
- A performance prediction model is given that depends on loop overhead,
computation, and cache performance (a possible form is sketched below).
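A plausible shape for such a model, in my own notation (not necessarily
the paper's; note the single memory-access parameter criticized below):

    t_{word} \approx \frac{O_{loop}}{W} + \sum_{i=1}^{N} C_i + A \cdot t_{mem}

where O_loop is the per-iteration loop overhead, W the words processed per
iteration, C_i the per-word computation cost of filter i, A the memory
accesses per word, and t_mem the cache-dependent average access cost.
Integration shrinks the first term by fusing loops and the last term by
keeping intermediate words in registers.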
- Headers vs. data: separate the application message into a new abstract
data type, the segregated message (sketch below). This can only be done when
application data boundaries coincide with lower-level boundaries, which calls
for application-level framing in future protocols.
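A hypothetical C rendering of such a segregated-message type (the field
names and layout are mine, not the paper's):

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_LAYERS 8    /* hypothetical bound for this sketch */

    /* Segregated message: per-layer headers are kept apart from
       the application data, so an integrated loop can run over one
       contiguous, word-aligned data region. Valid only when
       application data boundaries coincide with lower-level frame
       boundaries. */
    struct seg_msg {
        struct {
            void  *base;
            size_t len;
        } hdr[MAX_LAYERS];   /* one header region per protocol layer */
        int       nhdrs;
        uint32_t *data;      /* application payload, word-aligned */
        size_t    nwords;
    };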
- Ordering constraints: separate message delivery into initial,
manipulation, and final stages; only the manipulation stages can be
collapsed across layers (sketch below).
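A hypothetical sketch of that stage structure (the function-pointer
interface and toy transforms are mine):

    #include <stddef.h>
    #include <stdint.h>

    struct layer {
        int  (*initial)(void *state);  /* ordering/fragmentation checks,
                                          connection-state updates */
        void (*final)(void *state);    /* per-layer delivery bookkeeping */
    };

    /* Initial stages run per layer, in order; the per-word
       manipulation stages are fused into a single loop; final
       stages then run per layer. */
    int deliver(struct layer *ly, int nlayers, void *state,
                uint32_t *data, size_t nwords)
    {
        for (int l = 0; l < nlayers; l++)
            if (!ly[l].initial(state))
                return -1;                     /* e.g. out-of-order drop */
        for (size_t i = 0; i < nwords; i++) {  /* fused manipulation */
            uint32_t w = data[i];
            w ^= 0x5a5a5a5au;                  /* toy layer-1 transform */
            w = (w << 16) | (w >> 16);         /* toy layer-2 transform */
            data[i] = w;
        }
        for (int l = nlayers - 1; l >= 0; l--)
            ly[l].final(state);
        return 0;
    }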
- Throughput improvements of 30-60% were observed in most cases.
(Throughput is an important metric since today's fast LANs have bandwidth
in excess of what the protocol stack can handle; ILP should reduce hardware/synchronization
overhead as well.)
- Preserving modularity: prototype for automatic synthesis is
described, but it's very crude.
- Other barriers to integration include data delivery that is
not "streamlike" (e.g., a protocol that deliberately destroys data
locality to enhance error resistance) and protocols in which the next
"decapsulation" is not known until the current layer's processing completes
(demuxing, generalized routing, packet forwarding).
Comments/Flaws
The title undersells the paper: behind it lies a surprisingly detailed
performance analysis of the "low-level" (memory, CPU, etc.) effects of ILP,
making this a good architecture/networking "crossover" paper.
- "PES" (pseudo-encryption standard) example used to illustrate
"nontrivial" stateful word filters is in fact too simple to be
realistic.
- The single "data access" parameter to the prediction model is
too simplistic, especially given today's sophisticated cache architectures.
- The idea of ILP is straightforward, but this paper shows that getting
the implementation right is difficult for nonobvious reasons; to that end,
there should be more graphs.