Experience with Grapevine: The Growth of a Distributed System

Schroeder84

 

Mail system.  Composed of two services (servers): registration servers & message servers.  Messages buffered in inboxes on message services until requested by users.  Outbound messages can be accepted for delivery by any message server.  Message servers can accept message retreival requests for inboxes stored on that server.  Each user has inboxes on at least two message servers.  Registration server has “RName” entries for each individual user and distribution list.  An individual RName entry contains a list of inbox/message servers for that user.  Email addresses are of the form username@registrationserver.  Either all or none of the RNames registered at a given registration server can be replicated at another registration server.  Any registration server can accept an update from a given user, and will propagate the change to all other registration servers for that user.

 

Human “registrars” independently maintain different registration servers.  GrapevineUser client package makes the entire system look like one server to the user, even if various registration and/or message servers may be down at a given moment.

 

Scalability goals: 1) be able to increase system capacity by adding more servers with fixed capabilities instead of replacing existing servers with more capabilities or adding more servers with more capabilities. 2) the cost of any computation in the system should have a fixed upper bound and should not grow with the total number of users or servers.

 

In grapevine, every registration server knows about every other registration server.  This is the only place in the system where data stored is aware of the total system size.  However, these configuration files only take up 15KB on each registration server for 4500 users and 1500 distribution lists.  To expand the system, a distributed directory should be used for the registration servers.  SMTP mail servers can use a DNS-like protocol on the Internet to access a distributed directory of mail server names based on domain names.  Registries can store a maximum number of RName entries; if there are more users than this maximum amount, a new registry must be created.  As such, by dividing the registration database into a number of registries, the system can scale.

 

Problems with Distribution Lists.  Distribution lists grew linearly with the total size of the user community.  Processing messages sent to distribution lists took a long time.  (The message server accepting a message sent to the list would be required to, for each user on the distribution list, determine the best message server to send the message to per user.)  Frequency of addition and deletion, as well as the time required to conduct these operations grew linearly.  The designers of the system were surprised at the usage of distribution lists and probably would have designed distribution lists differently.

The authors suggest a redesign where a distribution list “tax” is a list of lists (tax@stanford, tax@columbia, ...) where a message sent to the tax distribution list would be sent to all the message servers corresponding to users at each of the domains, and then each of the message servers in those domains locally deliver those messages into the inboxes for the corresponding users on the distribution list.  However, the authors do mention that as the number of users grow, they do not know how the number of messages sent to distribution lists will increase, and this may create further scalability problems.

 

Configuration Decisions. 

 

Transparency of Distribution and Replication

 

System was designed to present the abstraction of “one server” to the user, but there are several areas in which the system’s distributed-ness shows through.  Many problems that can be seen by users are due to delays in the propagation of updates to databases (i.e. someone creates a user, and adds the user to a distribution list that is stores on a different server—the server storing the distribution list will claim that the user does not exist until the new information is propagated to it)

 

Duplicate messages.  If a message is sent to two distribution list that contain the same user, that user can recieve two copies of the message—if one of the registries for one of the distribution lists is down, the system proceeds to expand the distribution list using the available registry, and the user receives a duplicate when the unavailable registry is back up, and the send to that distribution list continues.  Alternatively, they could have designed the system to wait until both registries for the two distribution lists were up to eliminate duplicates, but this would unnecessarily delay the messages from being sent to people on one of the distribution lists.  

 

Distribution lists.  Instead of having the message server that accepts a message expand a distribution list, it would be more efficient to have a message server that is close to the registry storing the distribution list send out the message. 

 

Problems.  When there are problems, it would have been useful the tell the users a little bit about what is going wrong (a server is down, etc) as opposed the way the system was originally built to hide all network goings-on—this sometimes leaves users completely in the dark when something is not functioning properly.

 

Load Balancing / Performance Problems

 

Messages describing database updates (i.e., additions/deletions from a distribution list) contained the entire update (the entire new distribution list).  It would have been more efficient to just send deltas describing the updates.

 

Registries stored usernames and passwords, and authors promoted use of grapevine for authentication and file access control.   (Originally, each user had a different username password for each file server.)  On a file access, Grapevine would check 1) if the user had the appropriate access permission for the file, and if not 2) would check if the user was the member of a group that had the appropriate access permission for the file.  Potential pitfalls were: 1) registries would have to be available to access any file, 2) access control checks would take longer since now it involved communication with registries, and 3) registries might become overloaded with access control checks.  To get around these potential pitfalls, caching of authentication and access control info was done for a period of 12 hours at a time.  Advantage: efficiency; Disadvantage: access control changes (revocations) could take up to 12 hours to be realized since file servers would continue to use the cached credentials!

 

Still, access control checks took a long time because groups (recorded by distributed registries) could be recursively nested.  To simplify, they added a “^” to the names of distribution lists and stored cached, pre-flattened versions of the nested group names.

 

Original design was that message server inboxes were buffers, and messages should be downloaded to local PC clients.  However, they allowed for messages to be retained on the server for users who needed to remotely access email via a terminal.  This became much more popular than expected and the system which was expected to do sequential operations to the inbox buffers was now being used to the random access, and drastically slowed down the system.

 

Operation of a distributed system.

 

Message bodies were not replicated... only message headers... the bodies took too much space, and so when a disk crash occured at a message server, the messages would have to be replaced by an “apology” but at least the user knows who sent them messages and what the subjects were.

 

Administrators were able to remotely control and monitor grapevine servers.  Logging of system activities came in handy when malfunctions occurred and for debugging.  (log was kept around for one week.)  

 

If a message couldn’t be delivered, it was a “dead letter” and it was sent back to the sender containing path and reasons for undeliverability with a cc to a distribution list of system administrators.

 

Reliability

 

Reliability through replication. 

 

Foundations of a reliable system: bug-free software, reliable hardware, reliable communications, enough resources (processor speed, main memory, bandwidth, disk space).  Grapevine servers refused to accept any connections / more work when disk space was less than 5%.  System very rarely deadlocked, and was dealt with by manual restart.  Extra capacity is needed for redundancy and for handling peak loads gracefully. 

 

 

Concluding remarks.

 

As time goes on, only low risk, high benefit changes were added to the system.