Increasing Network Throughput by Integrating Protocol Layers
Mark B. Abbott and Larry L. Peterson, Univ. of Arizona
One-line summary
Collapsing protocol layers to increase performance and reduce copying and
buffering is straightforward in principle, but getting the implementation
right requires overcoming CPU-specific performance obstacles (cache performance,
register spilling, etc.), dealing with unusual protocol semantics (ordering/fragmenting,
etc.), and sacrificing some modularity, which is difficult to restore. The
benefit ultimately depends on the CPU-to-memory performance gap.
Main ideas
- Main obstacles to overcome:
- Awkward data manipulation (non "loop style")
- Different "views" of data (level N headers are part of
level N-1 data)
- Ordering/fragmenting constraints (updating connection state, etc.
in middle layers)
- Preserving modularity
- Previous work showed only marginal improvements from ILP, but that's
because it included DES encryption, making the protocol compute-bound rather
than memory- or I/O-bound, and because CPU cache effects were not considered.
- Word filters: conceptually operate on one machine word at a time;
can retain state to perform multibyte/multistage operations. "Output"
consists of passing word(s) to the next filter.
- Word filters are implemented as inline source macros to avoid function-call
overhead and to allow state across words to be maintained in registers
(see the sketch below).
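A minimal sketch of what such filters might look like, with hypothetical
names and a toy transformation (not the paper's actual macros):

    /* Toy word filters as inline macros. State lives in a
       caller-declared local so the compiler can keep it in a
       register across words; a filter "outputs" by invoking
       the next filter on the (possibly transformed) word. */
    #include <stdint.h>

    #define SWAP32(w) ((((w) & 0x000000ffu) << 24) | (((w) & 0x0000ff00u) << 8) | \
                       (((w) & 0x00ff0000u) >> 8)  | (((w) & 0xff000000u) >> 24))

    /* Stateless byte-swap filter. */
    #define SWAP_FILTER(w, NEXT) \
        do { uint32_t t_ = SWAP32(w); NEXT(t_); } while (0)

    /* Stateful checksum filter: fold the word into the local
       `csum`, then pass the word on unchanged. */
    #define CSUM_FILTER(csum, w, NEXT) \
        do { (csum) += (w); NEXT(w); } while (0)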
- Cache effects dominate data-manipulation performance; wide
variety of experiments across machine types, transformation "types",
and cache extremes (all hits and all misses)
- Benefits: eliminates loads/stores, reduces loop overhead, makes load-delay
slots more likely to be filled, and eliminates intermediate buffer management;
see the sketch below.
(Armando's observation: integration also yields larger basic blocks, which
is good for compilers, though loop bodies grow larger. These are probably
still too small to affect I-cache behavior, which is dominated by the OS
anyway, but a reference would have been nice.)
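To make the benefit concrete, a hypothetical two-layer example (reusing
the SWAP32 macro from the sketch above): the serial version makes two
passes and stores intermediate results to memory; the integrated version
keeps them in a register.

    #include <stddef.h>
    #include <stdint.h>

    /* Serial: two loops; the intermediate buffer is written to
       memory and read back. */
    uint32_t serial(const uint32_t *src, uint32_t *tmp, uint32_t *dst,
                    size_t n)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < n; i++)       /* layer 1: byte swap */
            tmp[i] = SWAP32(src[i]);
        for (size_t i = 0; i < n; i++) {     /* layer 2: checksum + copy */
            sum += tmp[i];
            dst[i] = tmp[i];
        }
        return sum;
    }

    /* Integrated: one loop; the intermediate word never leaves a
       register, eliminating a store/load pair per word and half
       the loop overhead. */
    uint32_t integrated(const uint32_t *src, uint32_t *dst, size_t n)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            uint32_t w = SWAP32(src[i]);     /* layer 1 */
            sum += w;                        /* layer 2 */
            dst[i] = w;
        }
        return sum;
    }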
- Register pressure is the key constraint on the "scalability"
of the approach; the performance knee corresponds to the onset of register
spilling, and the ratio of local variables to layers gives the slope after
the knee.
- A performance prediction model is given that depends on loop overhead,
computation, and cache performance (a possible form is sketched below).
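A plausible shape for such a model, in my own notation (not necessarily
the paper's; note the single memory-access parameter criticized below):

    t_{word} \approx \frac{O_{loop}}{W} + \sum_{i=1}^{N} C_i + A \cdot t_{mem}

where O_loop is the per-iteration loop overhead, W the words processed per
iteration, C_i the per-word computation cost of filter i, A the memory
accesses per word, and t_mem the cache-dependent average access cost.
Integration shrinks the first term by fusing loops and the last term by
keeping intermediate words in registers.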
- Headers vs. data: separate the application message into a new abstract
data type, the segregated message (sketch below). This can only be done when
application data boundaries coincide with lower-level boundaries, which calls
for application-level framing in future protocols.
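A hypothetical C rendering of such a segregated-message type (the field
names and layout are mine, not the paper's):

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_LAYERS 8    /* hypothetical bound for this sketch */

    /* Segregated message: per-layer headers are kept apart from
       the application data, so an integrated loop can run over one
       contiguous, word-aligned data region. Valid only when
       application data boundaries coincide with lower-level frame
       boundaries. */
    struct seg_msg {
        struct {
            void  *base;
            size_t len;
        } hdr[MAX_LAYERS];   /* one header region per protocol layer */
        int       nhdrs;
        uint32_t *data;      /* application payload, word-aligned */
        size_t    nwords;
    };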
- Ordering constraints: separate message delivery into initial,
manipulation, and final stages; only the manipulation stages can be
collapsed across layers (sketch below).
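A hypothetical sketch of that stage structure (the function-pointer
interface and toy transforms are mine):

    #include <stddef.h>
    #include <stdint.h>

    struct layer {
        int  (*initial)(void *state);  /* ordering/fragmentation checks,
                                          connection-state updates */
        void (*final)(void *state);    /* per-layer delivery bookkeeping */
    };

    /* Initial stages run per layer, in order; the per-word
       manipulation stages are fused into a single loop; final
       stages then run per layer. */
    int deliver(struct layer *ly, int nlayers, void *state,
                uint32_t *data, size_t nwords)
    {
        for (int l = 0; l < nlayers; l++)
            if (!ly[l].initial(state))
                return -1;                     /* e.g. out-of-order drop */
        for (size_t i = 0; i < nwords; i++) {  /* fused manipulation */
            uint32_t w = data[i];
            w ^= 0x5a5a5a5au;                  /* toy layer-1 transform */
            w = (w << 16) | (w >> 16);         /* toy layer-2 transform */
            data[i] = w;
        }
        for (int l = nlayers - 1; l >= 0; l--)
            ly[l].final(state);
        return 0;
    }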
- Throughput improvements of 30-60% were observed in most cases.
(Throughput is an important metric since today's fast LANs have bandwidth
in excess of what the protocol stack can handle; ILP should reduce hardware/synchronization
overhead as well.)
- Preserving modularity: prototype for automatic synthesis is
described, but it's very crude.
- Other barriers to integration include data delivery that is
not "streamlike" (e.g., a protocol that deliberately destroys data
locality to enhance error resistance) and protocols in which the next
"decapsulation" is not known until the current layer's processing completes
(demuxing, generalized routing, packet forwarding).
Comments/Flaws
The title undersells the paper: behind it lies a surprisingly detailed
performance analysis of the "low-level" (memory, CPU, etc.) effects of ILP,
making this a good architecture/networking "crossover" paper.
- "PES" (pseudo-encryption standard) example used to illustrate
"nontrivial" stateful word filters is in fact too simple to be
realistic.
- The single "data access" parameter to the prediction model is
too simplistic, especially given today's sophisticated cache architectures.
- The idea of ILP is straightforward, but this paper shows that getting
the implementation right is difficult for nonobvious reasons; to that end,
there should be more graphs.