Incorporating Memory Management into User-Level Network Interfaces
Matt Welsh, Anindya Basu, Thorsten von Eicken, Cornell Univ.
One-line summary: Allow network endpoints to be faulted in and
out of the memory mapped for the NI, rather than pinning all of that memory,
just as pages are faulted in and out of user memory. A dedicated on-NI
TLB coordinates its operations with the OS's memory management, so the
mechanisms are mostly OS-independent. (This addition to U-Net results in
the "U-Net/MM" architecture.)
Overview/Main Points
- Goal: Allow efficient use of high-performance NIs by multiple
applications without requiring all NI-mapped I/O memory to be
pinned.
- Paging the I/O buffers allows "zero copy" send/receive (directly
to/from the application's data space), but requires moving some
minimal intelligence into the NI to do message mux/demux and
virtual-to-physical address translation (and to signal "buffer
faults").
- NI maintains short queues of pre-translated (virtual-to-physical)
buffer addresses for message receives. If none is available when a
message arrives, the message may be dropped. (See the receive-queue
sketch after this list.)
- If the buffer address can be translated (TLB hit) but the buffer is
paged out, the kernel initiates a page-in; the message may be dropped
if the page-in has not completed by the time the message is ready to
be delivered.
- On a TLB miss, the NI asks the kernel to translate the buffer
address. If the translation would touch a non-resident page, the
kernel informs the NI that the translation is deferred while the
page is faulted in; the NI may handle other requests in the
meantime. (See the miss-handling sketch after this list.)
(Note: page-ins can't be initiated from inside an interrupt handler,
so an in-kernel thread is activated to handle the page-in and
eventually provide the translation. The thread is moved to the head
of the run queue.)
- NI TLB Consistency:
- Kernel sets max size of pinned memory by limiting
size of NI TLB.
- When the TLB evicts a page, it notifies the kernel to decrement
that page's reference count. ("Busy" pages are never evicted; see
the eviction sketch after this list.)
- Kernel never requests a TLB invalidation: since TLB is
used to set up DMA transfers that can take arbitrarily
long, only the NI knows when it's safe to invalidate an
entry.
- Result: TLB invalidations are atomic with respect to sends
and receives, including DMAs.
- 2 implementations: Linux router connected to ATM switch with
embedded i960, and Win NT on DEC Alpha with DMA-capable fast
Ethernet.
- Linux: TLB hit adds 1-2us to the critical path (about 10%
overhead) compared to U-Net without memory mgmt. A TLB miss
without a page fault costs 3-4x a hit; with a page fault, 5-6x.
- Win NT: added latency is comparable, but the send critical path is
shorter (only about 3us) on NT, so the overhead is relatively
larger (~30%), and page fault handling is almost an order of
magnitude increase in overhead. But hey, it's a page fault.
- TLB simulations: 1K- and 256-entry direct-mapped TLBs with a
16-entry fully-associative victim cache. The 1K-entry TLB seems
large enough to accommodate "all network buffers in a wide
variety of [presumably co-running] applications" (but this
implies a large amount of pinned memory). (See the lookup sketch
after this list.)
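
Receive-queue sketch: a minimal C illustration (not the paper's code) of
the per-endpoint receive state described above. The host posts
pre-translated buffer addresses into a short queue; the NI consumes them
as messages arrive and drops a message when the queue is empty or the
posted buffer is too small. All names (rx_buffer, endpoint, ni_receive,
dma_to_host) are invented for illustration.

#include <stdint.h>
#include <stdbool.h>

#define RX_QUEUE_DEPTH 8   /* short queue of pre-translated buffers */

struct rx_buffer {
    uint64_t phys_addr;    /* physical address usable by the DMA engine */
    uint32_t len;
};

struct endpoint {
    struct rx_buffer rx_free[RX_QUEUE_DEPTH];
    unsigned head, tail;   /* host produces at tail, NI consumes at head */
};

/* Called by NI firmware when a message for `ep` arrives. Returns false
 * (message dropped) when no pre-translated buffer is available. */
static bool ni_receive(struct endpoint *ep, const void *msg, uint32_t len,
                       void (*dma_to_host)(uint64_t phys, const void *src,
                                           uint32_t len))
{
    if (ep->head == ep->tail)           /* no pre-translated buffer: drop */
        return false;

    struct rx_buffer *buf = &ep->rx_free[ep->head % RX_QUEUE_DEPTH];
    if (len > buf->len)                 /* posted buffer too small: drop */
        return false;

    dma_to_host(buf->phys_addr, msg, len);  /* zero-copy into the user buffer */
    ep->head++;                             /* consume this queue entry */
    return true;
}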
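
Miss-handling sketch: how the kernel side of a translate request might
look, assuming a callback-style split between the interrupt-time fast
path and the page-in thread. The xlate_ops structure and its function
names are invented, not the paper's (or any real kernel's) API.

#include <stdint.h>
#include <stdbool.h>

struct xlate_ops {
    /* page-table lookup: returns true and fills *phys if the page is resident */
    bool (*lookup_resident)(int pid, uint64_t vaddr, uint64_t *phys);
    /* post the translation, or a "deferred" marker, back to the NI */
    void (*ni_reply)(uint64_t vaddr, uint64_t phys);
    void (*ni_defer)(uint64_t vaddr);
    /* schedule an in-kernel thread (moved to the head of the run queue) that
     * pages the buffer in and then calls ni_reply with the translation */
    void (*start_pagein_thread)(int pid, uint64_t vaddr);
};

/* Runs in the NI interrupt handler. A page-in cannot be started here, so a
 * non-resident buffer is handed off to the thread and the NI is told the
 * translation is deferred; it keeps servicing other endpoints meanwhile. */
void handle_ni_tlb_miss(const struct xlate_ops *ops, int pid, uint64_t vaddr)
{
    uint64_t phys;

    if (ops->lookup_resident(pid, vaddr, &phys)) {
        ops->ni_reply(vaddr, phys);           /* fast path: resident page */
    } else {
        ops->ni_defer(vaddr);                 /* NI marks the entry pending */
        ops->start_pagein_thread(pid, vaddr); /* thread completes the fault */
    }
}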
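
Eviction sketch: NI-side eviction under the consistency rules above. The
kernel never forces an invalidation (only the NI knows whether a DMA is
still using a translation), so busy entries are left alone; a successful
eviction notifies the kernel so it can decrement the page's pin
reference count. Names are illustrative only.

#include <stdint.h>
#include <stdbool.h>

struct ni_tlb_entry {
    uint64_t vaddr, paddr;
    int      pid;
    bool     valid;
    bool     busy;   /* set while a DMA using this translation is in flight */
};

/* Try to evict one NI TLB entry; returns true if the slot is now free. */
static bool ni_tlb_evict(struct ni_tlb_entry *e,
                         void (*unpin_notify)(int pid, uint64_t vaddr))
{
    if (!e->valid)
        return true;                 /* slot already free */
    if (e->busy)
        return false;                /* in-flight DMA: eviction must wait */

    unpin_notify(e->pid, e->vaddr);  /* kernel decrements the reference count */
    e->valid = false;
    return true;
}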
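
Lookup sketch: roughly what the simulated lookup structure looks like, a
direct-mapped TLB backed by a small fully-associative victim cache. The
sizes follow the simulation parameters mentioned above; the code itself
is an illustrative guess, not the authors' simulator.

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES    1024      /* or 256 in the smaller configuration */
#define VICTIM_ENTRIES 16
#define PAGE_SHIFT     12

struct tlb_entry { uint64_t vpn, pfn; bool valid; };

static struct tlb_entry tlb[TLB_ENTRIES];
static struct tlb_entry victim[VICTIM_ENTRIES];

/* Translate vaddr; returns false on a miss (the kernel must be asked). */
static bool tlb_lookup(uint64_t vaddr, uint64_t *paddr)
{
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    uint64_t off = vaddr & ((1u << PAGE_SHIFT) - 1);
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];   /* direct-mapped index */

    if (e->valid && e->vpn == vpn) {
        *paddr = (e->pfn << PAGE_SHIFT) | off;
        return true;
    }

    /* Search the fully-associative victim cache of recently displaced entries. */
    for (int i = 0; i < VICTIM_ENTRIES; i++) {
        if (victim[i].valid && victim[i].vpn == vpn) {
            struct tlb_entry hit = victim[i];
            victim[i] = *e;          /* displaced main-TLB entry becomes the victim */
            *e = hit;
            *paddr = (hit.pfn << PAGE_SHIFT) | off;
            return true;
        }
    }
    return false;
}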
Relevance
- Flexible per-process memory management for NI queues need not be
expensive (in the common case) and provides many of the benefits
of traditional paged memory while preserving the ability to do
zero-copy high-performance network transfers.
- Main tradeoff: host CPU overhead vs. message latency. Moving the
NI TLB handling onto a coprocessor reduces the load on the main
CPU, but increases messaging latency due to I/O bus transfers.
Flaws
- I had some trouble getting the big picture w/r/t performance, but
that might be my fault.
- Can TLB size be changed dynamically?
- How to avoid or deal with messages being dropped for lack of an
available buffer or TLB entry?