Lazy Receiver Processing (LRP): A Network Subsystem Architecture for
Server Systems
Peter Druschel and Gaurav Banga, Rice Univ.
One-line summary:
Goal: alleviate receiver livelock and increase receiver-process throughput
and scheduling fairness under high network load. Strategy: replace the
single shared IP queue with per-socket queues, demultiplex packets as
early as possible (in the NI if possible), and do protocol processing
lazily (when possible) and always at the receiver's (not the
interrupt's) priority.
Overview/Main Points
Receiver livelock: the OS spends all its time receiving and delivering
packets (this code runs at interrupt level with high priority,
regardless of the receiving process's priority). No cycles are left
over for receiver processes, so packets get dropped on the floor after
resources have already been expended on them.
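
To make the failure mode concrete, here is a schematic of the eager
receive path (the queue name ipintrq follows BSD convention; the helper
functions are declared as stubs, so this is an illustration, not the
real 4.4BSD code):

    #include <stdbool.h>

    struct packet;
    struct pktqueue;

    extern struct pktqueue ipintrq;   /* ONE shared queue for all sockets */
    extern bool queue_full(struct pktqueue *q);
    extern void enqueue(struct pktqueue *q, struct packet *pkt);
    extern void drop(struct packet *pkt);
    extern void schedule_softintr(void);  /* kicks off IP/TCP processing */

    /* Runs at interrupt priority, so it always preempts user processes,
     * regardless of which process the packet is destined for. */
    void rx_interrupt(struct packet *pkt)
    {
        if (queue_full(&ipintrq)) {
            drop(pkt);        /* link-level work already spent, now wasted */
            return;
        }
        enqueue(&ipintrq, pkt);
        schedule_softintr();  /* still more work at interrupt level */
    }

Under overload the CPU executes only rx_interrupt and the software
interrupt behind it; the receiving process never runs, so packets are
dropped late, after all that effort.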
How to fix (a combined sketch follows this list):
- Replace the shared IP queue with per-socket queues
  Once a socket's queue fills, discard further packets for that
  socket. E.g., a TCP listening socket discards incoming SYNs when
  its listen queue is full.
- Demultiplex early (at the NI if possible, else in software close
  to the NI)
  The authors' demux function does no dynamic allocation, sets no
  timers, and doesn't block, so it is relatively safe to run at
  interrupt level.
- Perform remaining protocol processing at receiver's priority.
Functions such as ARP, ICMP, IP forwarding, etc. are charged to
daemon processes whose priorities are adjustable.
- Do lazy protocol processing (i.e. wait for the recv() call) when
  protocol semantics allow it. (TCP can't be processed lazily
  because of its flow-control semantics: ACKs and window updates
  must be generated promptly, so TCP is processed eagerly but at
  the receiver's priority.)
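
A minimal user-space sketch of the receive path these four points
describe; all names (lrp_demux, lrp_recv, SOCK_QLEN) are illustrative,
not taken from the authors' implementation:

    #include <stddef.h>

    #define SOCK_QLEN 64      /* per-socket queue depth (illustrative) */

    struct packet { struct packet *next; /* headers/payload omitted */ };

    struct lrp_socket {
        struct packet *head, *tail;  /* per-socket queue replaces the
                                      * single shared IP queue */
        int count;
    };

    /* Interrupt-level half. Like the authors' demux function it does no
     * dynamic allocation, sets no timers, and never blocks, so it can
     * run in the NI or in software close to it. Socket lookup by
     * addresses/ports is elided; sk is the already-identified receiver. */
    void lrp_demux(struct lrp_socket *sk, struct packet *pkt)
    {
        if (sk == NULL || sk->count >= SOCK_QLEN)
            return;  /* early discard: an overloaded socket sheds its own
                      * load instead of consuming other processes' cycles */
        pkt->next = NULL;
        if (sk->tail) sk->tail->next = pkt; else sk->head = pkt;
        sk->tail = pkt;
        sk->count++;
    }

    /* Process-level half: runs only when the receiver calls recv(), at
     * the receiver's scheduling priority, so protocol processing is
     * charged to the process that benefits from it. */
    struct packet *lrp_recv(struct lrp_socket *sk)
    {
        struct packet *pkt = sk->head;
        if (pkt == NULL)
            return NULL;      /* a real implementation would block here */
        sk->head = pkt->next;
        if (sk->head == NULL) sk->tail = NULL;
        sk->count--;
        /* Lazy protocol processing (checksums, header handling) happens
         * here for protocols like UDP; TCP is still processed promptly,
         * but at the receiver's priority, not at interrupt level. */
        return pkt;
    }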
Implementation:
- SPARCstation 20 model 61, Fore SBA-200 ATM (155 Mbit/s) adapter
  using U-Net software from Cornell (von Eicken et al.) for NI
  demux, and a new PF_LRP protocol family for socket calls
  (snippet below).
- Dedicated local ATM network.
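
At the API level the change is small: an application opts into the LRP
stack by creating its socket in the new protocol family. Only the name
PF_LRP comes from the paper; the numeric value below is hypothetical:

    #include <sys/socket.h>

    /* Hypothetical value; in practice defined by the modified kernel. */
    #define PF_LRP 27

    int open_lrp_udp_socket(void)
    {
        /* Same BSD socket API as before; only the family changes, so an
         * application moves onto the LRP stack with a one-line edit. */
        return socket(PF_LRP, SOCK_DGRAM, 0);
    }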
- Results: LRP adds no overhead when congestion is low, but gives
  fairer performance and higher per-process throughput under
  extreme congestion: as offered load is increased to 20K pkts/sec,
  all techniques show a linear increase in the rate of packets
  delivered to the application up to about 9K pkts/sec, but after
  that, 4.4BSD and early-demux-only fall off rapidly to zero while
  software demux and NI demux (the two LRP variants) level off at
  8K and 12K pkts/sec respectively.
- The LRP combination of early demux and receiver-priority
  scheduling is significantly better than early demux alone, even
  when no packets are being early-discarded. The authors speculate
  this gain is due to fewer context switches and improved
  system-memory locality (i.e. reduced stress on OS services)
  resulting from the architecture of their technique.
- Effect on round-trip latency in a ping-pong test, as "background
  offered load" is increased from zero to 18K pkts/sec: with
  4.4BSD, ping-pong latency rises roughly linearly from 500 to
  2500 usec (at 15K pkts/sec), but with LRP it stays nearly
  constant around 600 usec.
- Partially due to a Unix scheduling artifact (sketch below): when
  the ping-pong server is interrupted to let the OS receive a
  background-traffic packet, the interrupt time is charged to the
  ping-pong server even though it is not the receiver, which
  depletes the server's priority. Eventually we get priority
  inversion.
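
A sketch of the decay-usage accounting behind this artifact, assuming a
4.3BSD-style scheduler (the priority formula is roughly the classic
4.3BSD one; field names and details are simplified):

    struct proc { int p_estcpu; int p_nice; int p_usrpri; };

    #define PUSER 50  /* base user-mode priority */

    /* Clock-tick accounting: every tick is charged to whichever process
     * is "current" -- including ticks actually consumed by network
     * interrupts handling some other process's traffic. */
    void statclock_tick(struct proc *curproc)
    {
        curproc->p_estcpu++;  /* charge the tick */
        curproc->p_usrpri = PUSER + curproc->p_estcpu / 4
                                  + 2 * curproc->p_nice;  /* larger value
                                                           * = lower priority */
    }

So the ping-pong server's priority decays under background load it never
asked for, until lower-priority processes run ahead of it.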
- RPC experiment: an RPC server performs a memory-bound task in
  response to each RPC call; the experiment is arranged so the RPC
  server is never blocked on the network. With LRP, the RPC server
  spends more of its time in the RPC computation and less time
  receiving other RPC calls. Since the servers are not overloaded
  in this test, the improved LRP performance must be due to
  locality, reduced context switching, etc.
Relevance
A generalization of some Active Message-like techniques (early demux,
receiver-based message extraction) to IP-like servers under high load.
Prevents receiver-process thrashing and can help ensure fairness of
priority scheduling by improving accounting (of NI interrupts) and
limiting resources spent on receiving according to the receiver's
priority.
Flaws
- The local ATM testbed doesn't convince me that this will scale to
  the wide area. In particular, everyone (including, e.g., routers)
  has to use it, and early demux/priority scheduling requires
  knowledge of the receiving process's priority; currently that
  knowledge can't be deduced by a router looking at the packet.
- The experiments concentrate on short packets (e.g. HTTP, RPC).
  This is not a flaw, but it is not explicitly mentioned; it is not
  clear how LRP would fare with large packets, where additional
  steps (e.g. IP reassembly) may be necessary.