Experience with Grapevine: The Growth of a Distributed System
Michael D. Schroeder, et al. - 1984
-------------------------------------------------------------

Summary: This paper presents the authors' experience in scaling Grapevine, an early distributed system that provided message delivery, naming, authentication, and resource location services to Xerox sites.

System overview: replication in Grapevine
- Message delivery: Every server accepts any message for delivery, thus providing a replicated submission service. All users have inboxes on at least two servers, thus replicating the delivery path.
- Registration: Hierarchical names of the form [name].[registry] are mapped to information about users, machines, access control lists, etc. in the registration database. The database is distributed and replicated at the level of a registry.

System scalability: design goals and experiences
- Ideal scalability: The cost of any computation performed by a single server should not grow as a function of the total system load or the number of servers in the system. This way, the power of an individual server won't limit the growth of the system.
- Limited scalability: The design goal was to be able to expand system capacity up to 30 servers and 10,000 users by adding more servers of fixed power rather than using more powerful servers.
- The cost of ideal scalability would have been added complexity and reduced performance. For example, in Grapevine, this would involve adding another layer to the naming hierarchy and keeping only partial configuration information in each server.

System configuration: manual configuration is hard
- For example, in Grapevine, many factors are involved in choosing a distribution of registry replicas and inboxes, and it is hard to develop guidelines for making such decisions.

Transparency of distribution and replication: not so transparent
- As in many distributed systems, the system appears to be one large computer except for performance and error handling.
- Lesson: Users want to know the system's state of availability. If a user cannot access his inbox, he wants to know what the problem is so that he has an idea of how long the problem might last.

Reliability:
- The primary technique for achieving high reliability is replication. Thus, reliability becomes a problem of resource management. Lack of resources can lead to system paralysis. Using redundancy requires the system to have spare resources under normal operation. Such extra capacity is also important for handling peak loads gracefully.

Questions:
- Should one design for ideal scalability or the limited scalability of Grapevine? What are the costs of making one decision over the other?
- Certain things in distributed systems simply take time and have different failure modes than they would in a single-computer system, and people accept that (e.g., registry propagation in Grapevine, DNS propagation). Should we give up on trying to achieve transparency when things like performance and error detection simply cannot be made transparent (e.g., RPC)?
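
Sketch: the replicated delivery path from the system overview
The following is a minimal Python sketch of two ideas from the overview: names of the form [name].[registry], and inboxes replicated on at least two servers. It is an illustration only, not Grapevine's actual code or protocol. The Server class, the split_name and deliver functions, the dictionary layout, and the names "alice.reg1", "server-a", and "server-b" are all hypothetical. Splitting at the last dot is an assumption, and raising an error when every inbox server is down is a simplification that these notes do not cover.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical mail server; `up` models whether it is reachable."""
    name: str
    up: bool = True
    inboxes: dict = field(default_factory=dict)

    def deposit(self, user: str, message: str) -> None:
        # Append the message to the user's inbox on this server.
        self.inboxes.setdefault(user, []).append(message)


def split_name(rname: str) -> tuple:
    """Split '[name].[registry]' at the last dot (an assumption of this sketch)."""
    name, sep, registry = rname.rpartition(".")
    if not sep or not name or not registry:
        raise ValueError(f"expected name.registry, got {rname!r}")
    return name, registry


def deliver(rname: str, message: str, registries: dict) -> str:
    """Deposit `message` in the first reachable inbox server listed for `rname`.

    `registries` maps registry -> {name -> ordered list of inbox servers}.
    Returns the name of the server that accepted the message.
    """
    name, registry = split_name(rname)
    for server in registries[registry][name]:
        if server.up:
            server.deposit(rname, message)
            return server.name
    # Simplification: the sketch gives up instead of queueing for later retry.
    raise RuntimeError(f"no inbox server reachable for {rname}")


# Usage: one user with inboxes on two servers.
primary = Server("server-a")
secondary = Server("server-b")
registries = {"reg1": {"alice": [primary, secondary]}}

print(deliver("alice.reg1", "hello", registries))
primary.up = False  # simulate a failure of the first inbox server
print(deliver("alice.reg1", "hello again", registries))
```

With both servers up, the first listed server accepts the message. After the first is marked down, the second accepts it, which is the sense in which replicated inboxes replicate the delivery path. The sender never chooses a server, so the replication stays hidden until every replica is unavailable, at which point the failure becomes visible to the user.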