Lazy Receiver Processing (LRP): A Network Subsystem Architecture for
Server Systems
Peter Druschel and Gaurav Banga, Rice Univ.
One-line summary:
Goal: alleviate receiver livelock and increase receiver-process throughput
and scheduling fairness under high network load. Strategy: replace the
single shared IP queue with per-socket queues, demultiplex packets as
early as possible (in the NI if possible), and do protocol processing
lazily (when possible) and always at the receiver's (not the
interrupt's) priority.
Overview/Main Points
Receiver livelock: the OS spends all its time receiving and delivering
packets (this code runs at interrupt level with high priority,
regardless of the receiving process's priority). No cycles are left
over for receiver processes, so packets get dropped on the floor after
resources have already been expended on them.
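
To make the failure mode concrete, here is a schematic of the eager
receive path (the queue name ipintrq follows BSD convention; the helper
functions are declared as stubs, so this is an illustration, not the
real 4.4BSD code):

    #include <stdbool.h>

    struct packet;
    struct pktqueue;

    extern struct pktqueue ipintrq;   /* ONE shared queue for all sockets */
    extern bool queue_full(struct pktqueue *q);
    extern void enqueue(struct pktqueue *q, struct packet *pkt);
    extern void drop(struct packet *pkt);
    extern void schedule_softintr(void);  /* kicks off IP/TCP processing */

    /* Runs at interrupt priority, so it always preempts user processes,
     * regardless of which process the packet is destined for. */
    void rx_interrupt(struct packet *pkt)
    {
        if (queue_full(&ipintrq)) {
            drop(pkt);        /* link-level work already spent, now wasted */
            return;
        }
        enqueue(&ipintrq, pkt);
        schedule_softintr();  /* still more work at interrupt level */
    }

Under overload the CPU executes only rx_interrupt and the software
interrupt behind it; the receiving process never runs, so packets are
dropped late, after all that effort.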
How to fix (a combined sketch follows this list):
- Replace the shared IP queue with per-socket queues
  Once a socket's queue fills, discard further packets for that
  socket. E.g., a TCP listening socket discards incoming SYNs when
  its listen queue is full.
- Demultiplex early (at the NI if possible, else in software close
  to the NI)
  The authors' demux function does no dynamic allocation, sets no
  timers, and doesn't block, so it is relatively safe to run at
  interrupt level.
- Perform remaining protocol processing at receiver's priority.
Functions such as ARP, ICMP, IP forwarding, etc. are charged to
daemon processes whose priorities are adjustable.
- Do lazy protocol processing (i.e. wait for the recv() call) when
  protocol semantics allow it. (TCP can't be processed lazily
  because of its flow-control semantics: ACKs and window updates
  must be generated promptly, so TCP is processed eagerly but at
  the receiver's priority.)
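
A minimal user-space sketch of the receive path these four points
describe; all names (lrp_demux, lrp_recv, SOCK_QLEN) are illustrative,
not taken from the authors' implementation:

    #include <stddef.h>

    #define SOCK_QLEN 64      /* per-socket queue depth (illustrative) */

    struct packet { struct packet *next; /* headers/payload omitted */ };

    struct lrp_socket {
        struct packet *head, *tail;  /* per-socket queue replaces the
                                      * single shared IP queue */
        int count;
    };

    /* Interrupt-level half. Like the authors' demux function it does no
     * dynamic allocation, sets no timers, and never blocks, so it can
     * run in the NI or in software close to it. Socket lookup by
     * addresses/ports is elided; sk is the already-identified receiver. */
    void lrp_demux(struct lrp_socket *sk, struct packet *pkt)
    {
        if (sk == NULL || sk->count >= SOCK_QLEN)
            return;  /* early discard: an overloaded socket sheds its own
                      * load instead of consuming other processes' cycles */
        pkt->next = NULL;
        if (sk->tail) sk->tail->next = pkt; else sk->head = pkt;
        sk->tail = pkt;
        sk->count++;
    }

    /* Process-level half: runs only when the receiver calls recv(), at
     * the receiver's scheduling priority, so protocol processing is
     * charged to the process that benefits from it. */
    struct packet *lrp_recv(struct lrp_socket *sk)
    {
        struct packet *pkt = sk->head;
        if (pkt == NULL)
            return NULL;      /* a real implementation would block here */
        sk->head = pkt->next;
        if (sk->head == NULL) sk->tail = NULL;
        sk->count--;
        /* Lazy protocol processing (checksums, header handling) happens
         * here for protocols like UDP; TCP is still processed promptly,
         * but at the receiver's priority, not at interrupt level. */
        return pkt;
    }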
Implementation:
- SPARCstation 20 model 61, Fore SBA-200 ATM (155 Mbit/s) adapter
  using U-Net software from Cornell (von Eicken et al.) for NI
  demux, and a new PF_LRP protocol family for socket calls
  (snippet below).
- Dedicated local ATM network.
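
At the API level the change is small: an application opts into the LRP
stack by creating its socket in the new protocol family. Only the name
PF_LRP comes from the paper; the numeric value below is hypothetical:

    #include <sys/socket.h>

    /* Hypothetical value; in practice defined by the modified kernel. */
    #define PF_LRP 27

    int open_lrp_udp_socket(void)
    {
        /* Same BSD socket API as before; only the family changes, so an
         * application moves onto the LRP stack with a one-line edit. */
        return socket(PF_LRP, SOCK_DGRAM, 0);
    }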
- Results: LRP adds no overhead when congestion is low, but gives
  fairer performance and higher per-process throughput under
  extreme congestion: as offered load is increased to 20K pkts/sec,
  all techniques show a linear increase in the rate of packets
  delivered to the application up to about 9K pkts/sec, but after
  that, 4.4BSD and early-demux-only fall off rapidly to zero while
  software demux and NI demux (the two LRP variants) level off at
  8K and 12K pkts/sec respectively.
- The LRP combination of early demux and receiver-priority
  scheduling is significantly better than early demux alone, even
  when no packets are being early-discarded. The authors speculate
  this gain is due to fewer context switches and improved
  system-memory locality (i.e. reduced stress on OS services)
  resulting from the architecture of their technique.
- Effect on round-trip latency in a ping-pong test, as "background
  offered load" is increased from zero to 18K pkts/sec: with
  4.4BSD, ping-pong latency rises roughly linearly from 500 to
  2500 usec (at 15K pkts/sec), but with LRP it stays nearly
  constant around 600 usec.
- Partially due to a Unix scheduling artifact (sketch below): when
  the ping-pong server is interrupted to let the OS receive a
  background-traffic packet, the interrupt time is charged to the
  ping-pong server even though it is not the receiver, which
  depletes the server's priority. Eventually we get priority
  inversion.
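
A sketch of the decay-usage accounting behind this artifact, assuming a
4.3BSD-style scheduler (the priority formula is roughly the classic
4.3BSD one; field names and details are simplified):

    struct proc { int p_estcpu; int p_nice; int p_usrpri; };

    #define PUSER 50  /* base user-mode priority */

    /* Clock-tick accounting: every tick is charged to whichever process
     * is "current" -- including ticks actually consumed by network
     * interrupts handling some other process's traffic. */
    void statclock_tick(struct proc *curproc)
    {
        curproc->p_estcpu++;  /* charge the tick */
        curproc->p_usrpri = PUSER + curproc->p_estcpu / 4
                                  + 2 * curproc->p_nice;  /* larger value
                                                           * = lower priority */
    }

So the ping-pong server's priority decays under background load it never
asked for, until lower-priority processes run ahead of it.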
- RPC experiment: an RPC server performs a memory-bound task in
  response to each RPC call; the experiment is arranged so the RPC
  server is never blocked on the network. With LRP, the RPC server
  spends more of its time in the RPC computation and less time
  receiving other RPC calls. Since the servers are not overloaded
  in this test, the improved LRP performance must be due to
  locality, reduced context switching, etc.
Relevance
A generalization of some Active Message-like techniques (early demux,
receiver-based message extraction) to IP-like servers under high load.
Prevents receiver-process thrashing and can help ensure fairness of
priority scheduling by improving accounting (of NI interrupts) and
limiting resources spent on receiving according to the receiver's
priority.
Flaws
- The local ATM testbed doesn't convince me that this will scale to
  the wide area. In particular, everyone (including, e.g., routers)
  has to use it, and early demux/priority scheduling requires
  knowledge of the receiving process's priority; currently that
  knowledge can't be deduced by a router looking at the packet.
- The experiments concentrate on short packets (e.g. HTTP, RPC).
  This is not a flaw, but it is not explicitly mentioned; it is not
  clear how LRP would fare with large packets, where additional
  steps (e.g. IP reassembly) may be necessary.