Oceano

German Goldszmidt et al, IBM Watson

One-line summary: Server farms are shared across many apps; monitoring and load balancing (LB) of servers, IP sprayers, and back ends can be used to dynamically reallocate "fungible" servers ("dolphins") to different apps, including loading an OS+app image on the fly when a server is reallocated; fixed servers ("whales") handle heavy-duty back-end work (DB servers, etc.) for each app. A rule-based system is fed observed conditions and decides what to do; hysteresis and heuristics prevent extreme oscillation (e.g., once a server is reallocated it cannot be reallocated again for some threshold amount of time).
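The anti-oscillation rule in the summary (a reallocated server is pinned for some threshold time) can be sketched roughly as follows. This is only an illustration of the idea, not Oceano's actual mechanism: the names `Dolphin`, `Farm`, `reallocate`, and the 300-second `cooldown` are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Dolphin:
    """A redeployable server (hypothetical model, not from the paper)."""
    name: str
    app: str
    last_moved: float = float("-inf")  # never moved yet


@dataclass
class Farm:
    dolphins: List[Dolphin]
    cooldown: float = 300.0  # assumed threshold in seconds

    def reallocate(self, to_app: str, from_app: str, now: float) -> Optional[Dolphin]:
        """Move one dolphin from from_app to to_app, skipping any server
        that was reallocated less than `cooldown` seconds ago."""
        for d in self.dolphins:
            if d.app == from_app and now - d.last_moved >= self.cooldown:
                d.app = to_app
                d.last_moved = now
                return d
        return None  # nothing eligible; the rule simply does not fire


farm = Farm([Dolphin("d1", "A")])
first = farm.reallocate("B", "A", now=0.0)     # d1 moves A -> B
bounce = farm.reallocate("A", "B", now=100.0)  # refused: still in cooldown
later = farm.reallocate("A", "B", now=300.0)   # allowed again
```

Under these assumptions, a rule that wants to bounce the same server straight back is ignored until the cooldown expires, which is one simple way to damp oscillation.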

Overview/Main Points

Motivation: expanding capacity (buying/allocating/installing new HW, etc.) takes O(weeks). Bill T: the IBM Olympics web farm was expanded in about 2 days because there were no internal bureaucratic obstacles; this may be a lower bound.
Peaks in different services sharing a farm can be accommodated as long as they don't all peak at once.
Key point is separation of "fixed" servers (whales) from "redeployable" ones (dolphins).
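A small worked example of the peak-sharing point above: with dedicated farms each app must be provisioned for its own peak, whereas a shared farm only needs to cover the peak of the combined load. The load numbers below are invented purely for illustration, and `servers_needed` is a hypothetical helper.

```python
def servers_needed(load_a, load_b):
    """Return (dedicated, shared) capacity for two per-interval load series,
    measured in servers."""
    dedicated = max(load_a) + max(load_b)                 # each app sized for its own peak
    shared = max(a + b for a, b in zip(load_a, load_b))   # sized for the combined peak
    return dedicated, shared


# Hypothetical loads over four time intervals; the peaks do not coincide.
load_a = [2, 8, 3, 2]
load_b = [3, 2, 7, 3]
dedicated, shared = servers_needed(load_a, load_b)  # (15, 10)
```

If both apps peaked in the same interval, the two numbers would be equal and sharing would save nothing, which is the caveat in the note.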

Relevance

Flaws

How do they avoid pathological hysteresis when a wrong assumption is made? E.g., if a decrease in query rate is due to a failure in the back end, adding more front ends (FEs) may make things worse.
It's a rule-based system (RBS), and a traditional criticism of these is that they become unmanageable/unpredictable when they get too big. (Bill Tetzlaff says there's a bunch of work from the 1980s where people unsuccessfully tried to apply RBSs to performance fault diagnosis.)
