Active Connection Management for Internet Services
Mike Chen, Stanford CS548 seminar, 5/22/02
- Problem: failover strategies and conn. mgt. in L4-L7 switches are
hardwired, a mismatch for the dynamism of large services (service
growth, machines getting replaced, etc.). Soln: extend the API of LB
switches to allow apps/infrastructure to dynamically control the
mapping of client conns to physical machines. Hypothesis: improves
availability, load conditioning, and admin/mgt.
- Goals include: dynamic resource alloc to handle peaks; automated
switch config to eliminate operator error; graceful starting/stopping
of services to enable rolling reboots and online upgrades; help in
detecting/recovering from svr failures.
Existing systems:
- Cisco Dynamic Feedback protocol: can set weighting vector for LB,
keepalive interval, server added/removed.
- IBM Director: relies on the failover mech. of the underlying sys.
E.g., to reboot a node behind an LB switch, Director has to shut off
the node and rely on the switch to detect this first.
Existing L4 connection mgt primitives (sketched as a Java interface
below):
- Add: bind physical to virtual IP/port (after verifying health of resource)
- Remove: stop forwarding new conns (force remove: breaks existing conns
too)
- Not supported: Drop some new conns (according to some criterion)
for admission ctrl
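A minimal sketch of these primitives as a Java interface; the names
and signatures are illustrative, not any vendor's actual API:

    import java.net.InetSocketAddress;

    // Hypothetical wrapper around an L4 switch's config channel.
    interface L4Switch {
        // Bind a physical server to a virtual IP/port; the caller is
        // expected to verify the server's health first.
        void add(InetSocketAddress virt, InetSocketAddress phys);

        // Stop forwarding new conns; existing conns are left to drain.
        void remove(InetSocketAddress virt, InetSocketAddress phys);

        // Remove the binding and break existing conns too.
        void forceRemove(InetSocketAddress virt, InetSocketAddress phys);

        // Note: no primitive for "drop some new conns by policy",
        // which is what the drop-server hack below works around.
    }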
Implementation of Mike's soln:
- Each app server generates events ("I want to bind to a virtual
address"); a separate Conn Mgr (CM) takes these and uses a Java
wrapper for the switch to implement the requests (see sketch below).
- CM is centralized with standby peers, maintains soft state (heartbeats
from servers), sends config deltas to switches
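A rough sketch of how the CM might consume bind events and
heartbeats, assuming the L4Switch wrapper above (all names
hypothetical):

    import java.net.InetSocketAddress;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical CM core: soft state rebuilt from heartbeats,
    // config pushed to the switch only as deltas.
    class ConnectionManager {
        private final L4Switch sw;
        // virtual addr -> physical servers bound to it (soft state)
        private final Map<InetSocketAddress, Set<InetSocketAddress>> bindings =
                new HashMap<>();

        ConnectionManager(L4Switch sw) { this.sw = sw; }

        // App-server event: "I want to bind to this virtual address."
        void onBindRequest(InetSocketAddress virt, InetSocketAddress phys) {
            Set<InetSocketAddress> servers =
                    bindings.computeIfAbsent(virt, v -> new HashSet<>());
            if (servers.add(phys)) {
                sw.add(virt, phys);   // push only the delta
            }
        }

        // A heartbeat refreshes the server's soft-state lease; a lease
        // timeout (not shown) would trigger sw.remove() for its bindings.
        void onHeartbeat(InetSocketAddress phys) { /* refresh lease */ }
    }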
- Switch fails: CM adds the known bindings back to the switch after
recovery, since the switch forgets its bindings on a hard reboot!
(See recovery sketch below.)
- Note - it takes as long for the standby to take over as it does to
restart the primary and repopulate its soft state! Mike admitted the
standby is not as useful as he'd expected. A way to achieve seamless
transition would be to run two primaries and have the switch simply
ignore messages from the second, until it starts seeing messages from
the second with no corresponding messages from the first...but then
you have to deal w/the possibility that the two will give different
instructions, and besides, the switches are designed to accept
messages only over a single dedicated TCP connection for now.
- CM failure: since all config of the switch goes via the CM, if the
CM fails and then another server fails before the CM is back up, the
switch will contain a stale binding for that failed server. So after
the CM comes back up, it collects known-good bindings from
heartbeats, then removes unknown bindings on the switches. ("The
heartbeats are the only truth.")
- Supports resource replacement through graceful removal: quiesce the
server, then reboot it. ("100% availability" - except it's really
not, since there is a temporary thruput reduction as the standby
warms up.)
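A sketch of that quiesce-then-reboot sequence, assuming a fixed drain
window (DRAIN_MILLIS and rebootOrUpgrade are made-up placeholders,
not part of the presented system):

    // Hypothetical graceful replacement of one server.
    static final long DRAIN_MILLIS = 30_000;   // assumed drain window

    void replaceServer(InetSocketAddress virt, InetSocketAddress phys)
            throws InterruptedException {
        sw.remove(virt, phys);        // no new conns; existing ones drain
        Thread.sleep(DRAIN_MILLIS);   // quiesce: wait out in-flight work
        rebootOrUpgrade(phys);        // operator/remote action, stubbed
        // when healthy again, the server sends a fresh bind event,
        // which re-adds it to the switch
    }

    void rebootOrUpgrade(InetSocketAddress phys) { /* not shown */ }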
- App-level exceptions can actively remove failed resources, in
addition to the middleware's existing health checks; this is one more
way to detect errors (snippet below). (This implies the exception
leaves the app in good enough state that it can request to be
removed; if it's in that good a shape, perhaps it should proactively
restart itself, since that's what the switch is going to do anyway?)
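The app-server side of this might be as simple as the following
(Request, FatalAppException, handleRequest, and requestRemoval are
all made-up names):

    // Hypothetical request loop on the app server.
    void serve(Request req) {
        try {
            handleRequest(req);
        } catch (FatalAppException e) {
            // One more error detector, beyond the middleware's checks:
            // ask the CM to pull this server out of rotation.
            cm.requestRemoval(myVirtualAddr, myPhysicalAddr);
        }
    }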
- Note rolling reboot vs. "big flip": rolling reboot only works if
the new version is software-compatible w/the old one, since the old
and new versions must coexist in the same system.
- Neat hack: since current switches can't do admission control, add
drop servers, which are like /dev/null - a drop server either
immediately RSTs the TCP connection, or holds the client connection
open as a "bottomless sink" (sketch below). Can use for preferential
treatment, e.g. E-Trade prioritizes paying customers over free
users.
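A minimal drop-server sketch in Java (the port is arbitrary). Setting
SO_LINGER to 0 makes close() send a TCP RST instead of the normal FIN
handshake:

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class DropServer {
        public static void main(String[] args) throws IOException {
            try (ServerSocket listener = new ServerSocket(8080)) {
                while (true) {
                    Socket s = listener.accept();
                    s.setSoLinger(true, 0);  // linger=0: close() sends RST
                    s.close();
                    // "bottomless sink" variant: keep s open and never
                    // read or write, instead of closing it
                }
            }
        }
    }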
- Interesting tidbit - COTS switches have a group of procs for data routing,
and a separate proc for SNMP/config. Add and remove take 20-25ms each
on Foundry ServerIron.