Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer,
Cluster-Based Scalable Network Services,
16th ACM SOSP, Oct. 1997, pp. 78-91

(Summary by George Candea)

The introduction describes the advantages of clusters over big iron when it comes to Internet services: incremental scalability (grow by adding commodity nodes), high availability through natural redundancy, and the cost/performance of commodity hardware.

There are, however, major challenges: administration is cumbersome, software must be decomposed into loosely coupled modules, clusters exhibit partial failures (unlike SMPs), and the amount of state shared across the cluster must be kept small. One way to address some of these issues is to recognize that many Internet services require only BASE (Basically Available, Soft state, Eventually consistent) semantics, instead of traditional ACID (Atomic, Consistent, Isolated, Durable) semantics. BASE simplifies the implementation of fault tolerance and availability, while also permitting performance optimizations. In practice, services have components demanding varying data semantics; this paper focuses on services that have an ACID component but manipulate primarily BASE data.
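
To make "soft state" concrete, here is a minimal sketch (entirely invented for this summary; the class name and TTL policy are not from the paper) of a table whose entries expire and can always be rebuilt from an authoritative source, so losing the table costs performance, not correctness:

    import time

    class SoftStateTable:
        # Soft state: every entry can be regenerated on demand, so the
        # whole table may be discarded (e.g., after a crash) safely.
        def __init__(self, regenerate, ttl=30.0):
            self.regenerate = regenerate      # function: key -> fresh value
            self.ttl = ttl                    # seconds before an entry goes stale
            self.entries = {}                 # key -> (value, expiry timestamp)

        def get(self, key):
            value, expiry = self.entries.get(key, (None, 0.0))
            if time.time() >= expiry:         # missing or stale: regenerate
                value = self.regenerate(key)
                self.entries[key] = (value, time.time() + self.ttl)
            return value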

The proposed architecture consists of services built on top of a TACC (transformation, aggregation, caching, customization) layer, which sits on top of a SNS (scalable network service) layer.

SNS consists of front ends, a pool of workers (some of which are caches), a graphical monitor, a user profile database, and a manager, all interconnected by a system area network (SAN). The duties of these components are largely independent of each other, enabling linear scaling. Workers are kept as simple as possible by localizing most control decisions in the front ends. The manager tracks workers and spawns new ones as needed; it collects load information from the workers, synthesizes load balancing hints based on operator-controlled policies, and periodically sends them to the front ends, which then make local scheduling decisions based on these hints. There is also an overflow pool of machines that are not part of the cluster but can be harvested to absorb temporary load bursts. SNS components implement process peer fault tolerance: when a component fails, its peers detect the failure and restart it. Cached stale state carries the surviving components through the failure, and the reincarnated component rebuilds its soft state by listening to multicasts within the SAN.
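
As a rough sketch of this hint-based scheme (the data structures and method names below are invented, not the paper's), each front end keeps only the manager's latest load report as soft state and picks workers locally from it:

    class FrontEnd:
        # Holds the manager's most recent load hints as soft state and
        # makes purely local scheduling decisions from them.
        def __init__(self):
            self.load_hints = {}              # worker id -> reported load

        def on_load_hint(self, hints):
            self.load_hints = hints           # overwrite: stale hints are tolerable

        def pick_worker(self):
            if not self.load_hints:
                raise RuntimeError("no load hints received yet")
            # choose the least-loaded worker per the (possibly stale) hints
            return min(self.load_hints, key=self.load_hints.get)

Because the hints are only advisory, a lost or delayed multicast degrades balancing quality but never blocks request dispatch.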

The interface to SNS workers and front ends is embodied by stubs: worker stubs hide fault tolerance, load balancing, and multithreading considerations from the worker code, while front end stubs interact with the manager and provide the dispatch logic that selects a worker for each request.
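
A toy illustration of that division of labor (the stub API shown here is invented for this summary): the application author supplies only the per-request processing function, and the stub wraps it with fault containment:

    class WorkerStub:
        # The application supplies only process(data, profile); the stub
        # hides the SNS plumbing (and, in the real system, threading,
        # beacons to the manager, and load reporting).
        def __init__(self, process):
            self.process = process

        def handle(self, data, profile):
            try:
                return self.process(data, profile)
            except Exception:
                return None   # contain a crash on bad input to this one request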

TACC layer: transformation operates on a single data object and changes its content (e.g., encryption); aggregation collects data from several objects and collates it in pre-specified ways; customization captures/tracks per-user profile data and delivers it to the workers with each request; caching stores original as well as intermediate and post-transformation/aggregation content.
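
Because transformations map one object to one object and aggregations collate several, TACC workers compose naturally like functions; a minimal sketch (the sample operators are hypothetical, not TransSend's):

    def compose(*stages):
        # Chain single-object transformations into one pipeline.
        def pipeline(obj):
            for stage in stages:
                obj = stage(obj)
            return obj
        return pipeline

    # hypothetical transformations: each maps one object to one object
    def lowercase(text):
        return text.lower()

    def truncate(text):
        return text[:100]

    # a hypothetical aggregation: collate several objects in a pre-specified way
    def concatenate(objects):
        return "\n".join(objects)

    summarize = compose(lowercase, truncate)
    digest = concatenate([summarize(page) for page in ("Page ONE ...", "Page TWO ...")])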

An experimental system embodying this architecture is TransSend, which improves the surfability of web content over dialup links. Client browsers connect over HTTP; one front end thread handles each incoming connection, fetches the data from the caches (or from the Internet on a miss), pairs it with the user's customization profile, and sends it to the workers (called distillers), which perform the transformations dictated by the profile and send the results back to the client. Broken connections, timeouts, and lost beacons are used to infer intra-cluster component failures. Harvest-based caches are assembled into a single global virtual cache, with keys hashed across the individual caches. Although distillers sometimes crash on pathological inputs, process-peer fault tolerance ensures they get restarted. The only ACID component is the profile database; application and system data are BASE: load balancing info may be stale, transformed content is soft state (cached, but regenerable), and answers may be approximate. The authors found that, for this application, an approximate answer delivered quickly is more useful than the exact answer delivered slowly (users retain the option of fetching the exact original content if needed).
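
The global virtual cache amounts to partitioning the key space across the cache nodes so that every front end maps a given URL to the same cache; a minimal sketch of such hashing (the node list and hash choice are assumptions, not details from the paper):

    import hashlib

    cache_nodes = ["cache0", "cache1", "cache2", "cache3"]   # hypothetical node names

    def node_for(url):
        # Deterministically map a URL to the cache node responsible for it,
        # making the separate caches behave as one large virtual cache.
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        return cache_nodes[int(digest, 16) % len(cache_nodes)]

    # every front end computes the same mapping, so requests for the same
    # URL always go to the same cache node
    assert node_for("http://example.com/a.gif") == node_for("http://example.com/a.gif")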

Writing services on top of SNS/TACC appears to be quick and easy. The results show TransSend scaling to 10 Ultra-1 machines serving about 160 HTTP requests/second, and suggest that one such workstation would be enough to serve the entire dialup IP population of UC Berkeley (600 modems, 25,000 users). The paper points out SAN saturation as a potential problem: it causes unreliable multicast packets to be dropped and thereby cripples the exchange of system data. The authors' traces revealed that, although most content accessed on the web is much smaller than 1 KB, the average byte transferred is part of large content (3-21 KB). For web traffic, they found cache hit rates in SNS/TACC to increase monotonically with cache size, but to plateau at a level determined by the user population size.