Oceano
German Goldszmidt et al, IBM Watson
One-line summary:
Server farms are shared across many apps; monitoring and load balancing of
servers, IP sprayers, and back ends can be used to dynamically reallocate
"fungible" servers ("dolphins") to different apps, including
loading an OS+app image on the fly when a server is reallocated; fixed servers
("whales") handle heavy-duty back-end stuff (DB servers, etc) for each
app. A rule-based system is fed observed conditions and decides what to do;
hysteresis and heuristics prevent extreme oscillation (eg once a server is
reallocated, it cannot be reallocated again for some threshold amount of time).
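
The cooldown idea above can be sketched in a few lines. This is only an illustration, not Oceano's actual rule engine: the thresholds, the Dolphin class, and both function names are hypothetical, and "load" stands in for whatever conditions the real monitors report.

```python
from dataclasses import dataclass

# All three constants are assumptions made for this sketch.
COOLDOWN_S = 300.0   # minimum time between two moves of the same server
HIGH_LOAD = 0.80     # utilization above which an app needs more servers
LOW_LOAD = 0.40      # utilization below which an app can give one up


@dataclass
class Dolphin:
    name: str
    app: str
    last_moved: float = float("-inf")  # never moved yet


def pick_reallocation(load, dolphins, now):
    """Return (dolphin, target_app), or None if no rule fires.

    load maps app name -> observed utilization in [0, 1].
    """
    needy = [app for app, u in load.items() if u > HIGH_LOAD]
    donors = {app for app, u in load.items() if u < LOW_LOAD}
    if not needy or not donors:
        return None
    target = max(needy, key=lambda app: load[app])  # most loaded app first
    for d in dolphins:
        # Hysteresis: skip servers that were moved too recently.
        if d.app in donors and now - d.last_moved >= COOLDOWN_S:
            return d, target
    return None


def apply_reallocation(dolphin, target, now):
    """Record the move; a real system would load the OS+app image here."""
    dolphin.app = target
    dolphin.last_moved = now
```

With a single dolphin moved from "shop" to "news" at t=0, a load flip at t=60 yields no move, because the server is still inside its cooldown. The same flip at t=300 moves it back. The gap between LOW_LOAD and HIGH_LOAD acts as a second damper, since an app hovering near one threshold is neither donor nor recipient.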
Overview/Main Points
- Motivation: expanding capacity (buying/allocating/installing new HW, etc)
takes O(weeks). Bill Tetzlaff: the IBM Olympics web farm was expanded in
about 2 days since there were no internal bureaucracy obstacles; this may be
a lower bound.
- Peaks in different services sharing a farm can be accommodated as long as
they don't all peak at once.
- Key point is separation of "fixed" servers (whales) from
"redeployable" ones (dolphins).
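
A small worked example of the peak-sharing point, using made-up demand numbers (dolphins needed per time step; whales are fixed per app and left out). The app names and both function names are hypothetical.

```python
def dedicated_capacity(demand):
    """Servers needed if each app gets its own farm sized for its own peak."""
    return sum(max(series) for series in demand.values())


def shared_capacity(demand):
    """Servers needed if all apps share one farm sized for the combined peak."""
    return max(sum(step) for step in zip(*demand.values()))


# Invented demand curves whose peaks fall at different time steps.
demand = {
    "shop": [2, 3, 10, 3],
    "news": [9, 4, 2, 3],
    "mail": [3, 8, 3, 2],
}

print(dedicated_capacity(demand))  # 27
print(shared_capacity(demand))     # 15
```

If all three apps peaked at the same step, the two numbers would be equal and sharing would save nothing, which is the caveat in the bullet above.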
Relevance
Flaws
- How do they avoid pathological hysteresis? A rule may act on a wrong
assumption (eg if a decrease in query rate is due to a failure in the back
end, adding more FE's may make things worse).
- It's a rule-based system, and a traditional criticism of these is that they
become unmanageable/unpredictable when they get too big. (Bill
Tetzlaff says there's a bunch of work from the 80's where people
unsuccessfully tried to apply RBS's to performance fault diagnosis.)