Incorporating Memory Management into User-Level Network Interfaces
Matt Welsh, Anindya Basu, Thorsten von Eicken, Cornell Univ.
One-line summary: Allow network endpoints to be faulted in and
out of the memory mapped for the NI, rather than pinning all of that memory,
just as pages are faulted in and out of user memory. A dedicated on-NI
TLB coordinates its operations with the OS's memory management, so the
mechanisms are mostly OS-independent. (This addition to U-Net results in
the "U-Net/MM" architecture.)
Overview/Main Points
- Goal: Allow efficient use of high-performance NIs by multiple
applications without requiring all NI-mapped I/O memory to be
pinned.
- Paging the I/O buffers allows "zero copy" send/receive (directly
to/from the application's data space), but requires moving some
minimal intelligence into the NI to do message mux/demux and
virtual-to-physical address translation (and to signal "buffer
faults").
- NI maintains short queues of pre-translated (virtual-to-physical)
buffer addresses for message receives. If none is available when a
message arrives, the message may be dropped. (See the receive-queue
sketch after this list.)
- If the buffer address can be translated (TLB hit) but the buffer is
paged out, the kernel initiates a page-in; the message may be dropped
if the page-in has not completed by the time the message is ready to
be delivered.
- On a TLB miss, the NI asks the kernel to translate the buffer
address. If the translation would touch a non-resident page, the
kernel informs the NI that the translation is deferred while the
page is faulted in; the NI may handle other requests in the
meantime. (See the miss-handling sketch after this list.)
(Note: page-ins can't be initiated from inside an interrupt handler,
so an in-kernel thread is activated to handle the page-in and
eventually provide the translation. The thread is moved to the head
of the run queue.)
- NI TLB Consistency:
- Kernel sets max size of pinned memory by limiting
size of NI TLB.
- When the TLB evicts a page, it notifies the kernel to decrement
that page's reference count. ("Busy" pages are never evicted; see
the eviction sketch after this list.)
- Kernel never requests a TLB invalidation: since TLB is
used to set up DMA transfers that can take arbitrarily
long, only the NI knows when it's safe to invalidate an
entry.
- Result: TLB invalidations are atomic with respect to sends
and receives, including DMAs.
- 2 implementations: Linux router connected to ATM switch with
embedded i960, and Win NT on DEC Alpha with DMA-capable fast
Ethernet.
- Linux: TLB hit adds 1-2us to the critical path (about 10%
overhead) compared to U-Net without memory mgmt. A TLB miss
without a page fault costs 3-4x a hit; with a page fault, 5-6x.
- Win NT: added latency is comparable, but the send critical path is
shorter (only about 3us) on NT, so the overhead is relatively
larger (~30%), and page fault handling is almost an order of
magnitude increase in overhead. But hey, it's a page fault.
- TLB simulations: 1K- and 256-entry direct-mapped TLBs with a
16-entry fully-associative victim cache. The 1K-entry TLB seems
large enough to accommodate "all network buffers in a wide
variety of [presumably co-running] applications" (but this
implies a large amount of pinned memory). (See the lookup sketch
after this list.)
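
Receive-queue sketch: a minimal C illustration (not the paper's code) of
the per-endpoint receive state described above. The host posts
pre-translated buffer addresses into a short queue; the NI consumes them
as messages arrive and drops a message when the queue is empty or the
posted buffer is too small. All names (rx_buffer, endpoint, ni_receive,
dma_to_host) are invented for illustration.

#include <stdint.h>
#include <stdbool.h>

#define RX_QUEUE_DEPTH 8   /* short queue of pre-translated buffers */

struct rx_buffer {
    uint64_t phys_addr;    /* physical address usable by the DMA engine */
    uint32_t len;
};

struct endpoint {
    struct rx_buffer rx_free[RX_QUEUE_DEPTH];
    unsigned head, tail;   /* host produces at tail, NI consumes at head */
};

/* Called by NI firmware when a message for `ep` arrives. Returns false
 * (message dropped) when no pre-translated buffer is available. */
static bool ni_receive(struct endpoint *ep, const void *msg, uint32_t len,
                       void (*dma_to_host)(uint64_t phys, const void *src,
                                           uint32_t len))
{
    if (ep->head == ep->tail)           /* no pre-translated buffer: drop */
        return false;

    struct rx_buffer *buf = &ep->rx_free[ep->head % RX_QUEUE_DEPTH];
    if (len > buf->len)                 /* posted buffer too small: drop */
        return false;

    dma_to_host(buf->phys_addr, msg, len);  /* zero-copy into the user buffer */
    ep->head++;                             /* consume this queue entry */
    return true;
}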
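
Miss-handling sketch: how the kernel side of a translate request might
look, assuming a callback-style split between the interrupt-time fast
path and the page-in thread. The xlate_ops structure and its function
names are invented, not the paper's (or any real kernel's) API.

#include <stdint.h>
#include <stdbool.h>

struct xlate_ops {
    /* page-table lookup: returns true and fills *phys if the page is resident */
    bool (*lookup_resident)(int pid, uint64_t vaddr, uint64_t *phys);
    /* post the translation, or a "deferred" marker, back to the NI */
    void (*ni_reply)(uint64_t vaddr, uint64_t phys);
    void (*ni_defer)(uint64_t vaddr);
    /* schedule an in-kernel thread (moved to the head of the run queue) that
     * pages the buffer in and then calls ni_reply with the translation */
    void (*start_pagein_thread)(int pid, uint64_t vaddr);
};

/* Runs in the NI interrupt handler. A page-in cannot be started here, so a
 * non-resident buffer is handed off to the thread and the NI is told the
 * translation is deferred; it keeps servicing other endpoints meanwhile. */
void handle_ni_tlb_miss(const struct xlate_ops *ops, int pid, uint64_t vaddr)
{
    uint64_t phys;

    if (ops->lookup_resident(pid, vaddr, &phys)) {
        ops->ni_reply(vaddr, phys);           /* fast path: resident page */
    } else {
        ops->ni_defer(vaddr);                 /* NI marks the entry pending */
        ops->start_pagein_thread(pid, vaddr); /* thread completes the fault */
    }
}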
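
Eviction sketch: NI-side eviction under the consistency rules above. The
kernel never forces an invalidation (only the NI knows whether a DMA is
still using a translation), so busy entries are left alone; a successful
eviction notifies the kernel so it can decrement the page's pin
reference count. Names are illustrative only.

#include <stdint.h>
#include <stdbool.h>

struct ni_tlb_entry {
    uint64_t vaddr, paddr;
    int      pid;
    bool     valid;
    bool     busy;   /* set while a DMA using this translation is in flight */
};

/* Try to evict one NI TLB entry; returns true if the slot is now free. */
static bool ni_tlb_evict(struct ni_tlb_entry *e,
                         void (*unpin_notify)(int pid, uint64_t vaddr))
{
    if (!e->valid)
        return true;                 /* slot already free */
    if (e->busy)
        return false;                /* in-flight DMA: eviction must wait */

    unpin_notify(e->pid, e->vaddr);  /* kernel decrements the reference count */
    e->valid = false;
    return true;
}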
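
Lookup sketch: roughly what the simulated lookup structure looks like, a
direct-mapped TLB backed by a small fully-associative victim cache. The
sizes follow the simulation parameters mentioned above; the code itself
is an illustrative guess, not the authors' simulator.

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES    1024      /* or 256 in the smaller configuration */
#define VICTIM_ENTRIES 16
#define PAGE_SHIFT     12

struct tlb_entry { uint64_t vpn, pfn; bool valid; };

static struct tlb_entry tlb[TLB_ENTRIES];
static struct tlb_entry victim[VICTIM_ENTRIES];

/* Translate vaddr; returns false on a miss (the kernel must be asked). */
static bool tlb_lookup(uint64_t vaddr, uint64_t *paddr)
{
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    uint64_t off = vaddr & ((1u << PAGE_SHIFT) - 1);
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];   /* direct-mapped index */

    if (e->valid && e->vpn == vpn) {
        *paddr = (e->pfn << PAGE_SHIFT) | off;
        return true;
    }

    /* Search the fully-associative victim cache of recently displaced entries. */
    for (int i = 0; i < VICTIM_ENTRIES; i++) {
        if (victim[i].valid && victim[i].vpn == vpn) {
            struct tlb_entry hit = victim[i];
            victim[i] = *e;          /* displaced main-TLB entry becomes the victim */
            *e = hit;
            *paddr = (hit.pfn << PAGE_SHIFT) | off;
            return true;
        }
    }
    return false;
}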
Relevance
- Flexible per-process memory management for NI queues need not be
expensive (in the common case) and provides many of the benefits
of traditional paged memory while preserving the ability to do
zero-copy high-performance network transfers.
- Main tradeoff: host CPU overhead vs. message latency. Moving the
NI TLB handling onto a coprocessor reduces the load on the main
CPU, but increases messaging latency due to I/O bus transfers.
Flaws
- I had some trouble getting the big picture w/r/t performance, but
that might be my fault.
- Can TLB size be changed dynamically?
- How to avoid or deal with messages being dropped for lack of an
available buffer or TLB entry?