Experience with Grapevine: The Growth of a Distributed System
Michael D. Schroeder, et al. - 1984
-------------------------------------------------------------

Summary: This paper presents the authors' experience in scaling
Grapevine, an early distributed system that provided message delivery,
naming, authentication, and resource location services to Xerox sites.

System overview: replication in Grapevine

  - Message delivery: Servers accept any message for delivery, thus
    providing replicated submission service.  All users have inboxes
    on at least two servers, thus replicating the delivery path.

  - Registration: Hierarchical names [name].[registry] are mapped to
    information about users, machines, access control lists, etc. in
    the registration database.  The database is distributed and
    replicated at the level of a registry.
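
The two overview points can be illustrated with a minimal sketch.  It
is not Grapevine's actual interface: the Server class, split_name,
deliver, and the example name "alice.pa" are all hypothetical, and
splitting at the last dot is an assumption of the sketch.  It only
shows how a message can still be delivered when one of a user's inbox
servers is down.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical mail server holding inboxes for some users."""
    name: str
    up: bool = True
    inboxes: dict = field(default_factory=dict)  # full name -> list of messages

    def store(self, user, message):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.inboxes.setdefault(user, []).append(message)


def split_name(full_name):
    """Split 'name.registry' at the last dot (an assumption of this sketch)."""
    name, sep, registry = full_name.rpartition(".")
    if not sep or not name or not registry:
        raise ValueError(f"expected name.registry, got {full_name!r}")
    return name, registry


def deliver(full_name, message, inbox_sites):
    """Try the user's inbox servers in order; return the name of the one that accepted."""
    split_name(full_name)  # reject malformed names early
    for server in inbox_sites[full_name]:
        try:
            server.store(full_name, message)
            return server.name
        except ConnectionError:
            continue  # this replica is unavailable, try the next inbox site
    raise RuntimeError(f"no inbox server available for {full_name}")


# Usage: the first inbox site is down, so the second one takes the message.
primary = Server("server-a", up=False)
secondary = Server("server-b")
sites = {"alice.pa": [primary, secondary]}
accepted_by = deliver("alice.pa", "hello", sites)
```

With two inbox sites per user, delivery fails only when both are
unavailable at the same time.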

System scalability: design goals and experiences

  - Ideal scalability: The cost of any computation performed by a
    single server should not grow as a function of the total system
    load or the number of servers in the system.  This way, the power
    of an individual server won't limit the growth of the system.

  - Limited scalability: The design goal was to be able to expand
    system capacity up to 30 servers and 10,000 users by adding more
    servers of fixed power rather than using more powerful servers.

  - The cost of ideal scalability would have been added complexity and
    reduced performance.  For example, in Grapevine, this would
    involve adding another layer to the naming hierarchy and keeping
    only partial configuration information in each server.
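
The trade-off in the last point can be made concrete with a toy
model.  This is an illustration, not an analysis from the paper: it
assumes one configuration entry per server in the flat design, and
assumes that an extra hierarchy layer groups servers so that each
server keeps entries only for its own group plus one entry per group.
The function names and the group size are hypothetical.

```python
import math


def entries_flat(num_servers):
    """Flat design (assumed): every server keeps one entry per server in the system."""
    return num_servers


def entries_two_level(num_servers, group_size):
    """Extra hierarchy layer (assumed): entries for the local group plus one per group."""
    num_groups = math.ceil(num_servers / group_size)
    local_entries = min(group_size, num_servers)
    return local_entries + num_groups


# Per-server state at Grapevine's design limit and at ten times that size.
for n in (30, 300):
    print(n, entries_flat(n), entries_two_level(n, group_size=20))
```

In this model the flat design's per-server state grows linearly with
the number of servers, while the layered design grows more slowly at
the price of an extra level of indirection on lookups.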

System configuration: manual configuration is hard

  - For example, in Grapevine, there are many factors involved in
    choosing a distribution of registry replicas and inboxes and it is
    hard to develop guidelines for making such decisions.

Transparency of distribution and replication: not so transparent

  - As in a lot of distributed systems, the system appears to be one
    large computer except for performance and error handling.

  - Lesson: Users want to know the system's state of availability.  If
    a user cannot access his inbox, he wants to know what the problem
    is so that he has an idea of how long the problem might last.

Reliability:

  - The primary technique for achieving high reliability is
    replication.  Thus, reliability becomes a problem of resource
    management.  Lack of resources can lead to system paralysis.
    Using redundancy requires the system to have spare resources under
    normal operation.  Such extra capacity is also important for
    handling peak loads gracefully.
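
The need for spare resources can be checked with simple arithmetic.
The sketch below is a back-of-the-envelope model, not something taken
from the paper: it assumes identical servers and a load that
redistributes evenly over the survivors, and the function names are
hypothetical.

```python
def max_normal_utilization(num_servers, failures_tolerated):
    """Highest average utilization that still fits on the surviving servers.

    Assumes identical servers and evenly redistributed load.
    """
    if not 0 <= failures_tolerated < num_servers:
        raise ValueError("failures_tolerated must be in [0, num_servers)")
    return (num_servers - failures_tolerated) / num_servers


def survives(num_servers, failures, capacity_per_server, total_load):
    """True if the servers left after `failures` can still carry `total_load`."""
    return (num_servers - failures) * capacity_per_server >= total_load


# Usage: with 5 servers, tolerating 1 failure caps normal utilization at 80%.
limit = max_normal_utilization(5, 1)
ok = survives(num_servers=5, failures=1, capacity_per_server=100, total_load=450)
```

Here a load of 450 units fits on five servers of 100 units each but
not on four, so a single failure would overload the system.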

Questions:

  - Should one design for ideal scalability or the limited scalability
    of Grapevine?  What are the costs of making one decision over the
    other?

  - Certain things in distributed systems simply take time and have
    different failure modes than they would in a single-computer
    system, and people accept that (e.g., registry propagation in
    Grapevine, DNS propagation).  Should we give up on trying to
    achieve transparency when things like performance and error
    detection simply cannot be made transparent (e.g., RPC)?