Petros Maniatis
A motivational/inspirational paper (particularly slanted toward databases, though) is [111].
[47] on KeyKOS.
[117] on service infrastructure extensibility (via ``active names'').
[29]Resources subject to deadlocks can be tangible things (like disk channels) or abstractions (like a table or a process). There are resources that can easily be preempted (like the CPU) or not so easily preempted (like the disk). There are resources that imply the acquisition of other resources (for example files imply having buffers through which to access those files).
Conditions for a deadlock:
Havender's approach to preventing all of the above from happening at the same time:
Requesting all needed resources up front: this can be costly, since some of the resources might only be needed temporarily.
Releasing held resources whenever a further request would block (i.e., allowing preemption): this can be really inconvenient when used on inherently non-preemptible resources, like disk files.
Imposing a fixed ordering on resource acquisition: this can be effective, although there can always be deadlocks caused among resources of the same type, especially when the instance of the type is not important.
Detecting the occurrence of a deadlock can be as simplistic as forming a resource request graph occasionally and checking it for cycles, or as educated as monitoring everything every task ever holds and requests and deducing when the combination describes a deadlock.
Acting on a detected deadlock means forcibly reclaiming the resources held by one or more of the participants. The author presents a simple, cost-based scheme to pick which deadlocked tasks to abort, or even minimize destructive aborts (i.e., aborts that involve non-preemptible resources).
Another approach between making deadlocks impossible and just waiting until they occur to fight them is searching ahead into the state space of the system and identifying approaching deadlocks (for instance, two current tasks seem to be likely to form a circular resource request path). In those cases, through the scheduler for instance, the approaching deadlock can be avoided by modifying current system behavior (in the above example, making sure that the two dangerous tasks don't get scheduled in an interleaved manner until the danger has passed).
In summary, there are three ways to deal with deadlocks: preventing them (by negating one of the necessary conditions), avoiding them (by steering the system away from approaching deadlocks), and detecting them after the fact and recovering.
At every level, the algorithm picks a task that requires fewer resources than those currently available. This means that, if all tasks take their chance to reserve the resources they require, this task will have its chance to advance and complete. After this determination, a task is marked ``able to progress'' and the resources it currently holds get added to the pool of available resources (since, either way, it will release them). When no more tasks can be marked ``able to progress'', and if there still exist unmarked tasks, then the algorithm concludes that a deadlock is in progress.
For an example, assume W=(5,8,13), V=(2,1,1), a request matrix Q with rows (2,2,2), (4,4,1), and (1,1,1), and an allocation matrix A with rows (1,0,2), (1,4,0), and (1,3,10). In the beginning, the future available resources vector is equal to V, or (2,1,1). The first pass of the algorithm locates the third row of the Q matrix, (1,1,1), all of whose elements are less than or equal to the future available resources. Therefore, the third line of Q is marked, the future available resources vector becomes (1+2,3+1,10+1)=(3,4,11) and the algorithm continues. Now the first row of Q fits the condition, so it is now marked and the future available resources vector becomes (3+1,4+0,11+2)=(4,4,13). Finally, the middle row of Q can satisfy the marking condition and is also marked, leaving no unmarked rows and, consequently, signifying the absence of any deadlocks.
If the third row of Q had instead been (1,1,2), then no line would be markable at the beginning of the algorithm, and we would therefore be in a deadlock.
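The marking pass above fits in a few lines of code. This is my own sketch, not from the paper; the W and V vectors follow the worked example, while the Q and A rows used here are illustrative choices consistent with its arithmetic.

```python
def detect_deadlock(Q, A, V):
    """Return the set of task indices that are deadlocked (empty if none).

    Q: per-task outstanding-request rows, A: per-task allocation rows,
    V: currently available resources.
    """
    avail = list(V)
    marked = [False] * len(Q)
    progress = True
    while progress:
        progress = False
        for i, row in enumerate(Q):
            # A task can run to completion if its outstanding requests
            # all fit within the future-available resources.
            if not marked[i] and all(q <= a for q, a in zip(row, avail)):
                marked[i] = True
                # Once it completes, it releases everything it holds.
                avail = [a + h for a, h in zip(avail, A[i])]
                progress = True
    return {i for i, m in enumerate(marked) if not m}

Q = [(2, 2, 2), (4, 4, 1), (1, 1, 1)]   # outstanding requests (illustrative)
A = [(1, 0, 2), (1, 4, 0), (1, 3, 10)]  # current allocations (illustrative)
V = (2, 1, 1)                           # currently available

print(detect_deadlock(Q, A, V))         # empty set: no deadlock
```

Bumping the third request row to (1,1,2) makes no row markable initially, and the function returns all three task indices as deadlocked.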
[120]The basic idea is the parallel exploitation of multiple sources of network information towards the same goal. A journal is created using information drawn from ICMP, ARP, SNMP (future work), DNS, and RIP. Although each method individually has results with limited coverage, overlapping the results from multiple methods can cover the gaps.
Most techniques are fairly intuitive (pinging address ranges, broadcasting pings, exploding TTLs as in traceroute, DNS zone transfers, ARPing subnet ranges, snooping for ARP replies, etc.).
[91]Network management handles 5 areas (according to the OSI definition): fault, configuration, accounting, performance, and security management.
Extensibility is a bit of a joke in SNMPv1. You can either mirror a table in the private subtree and use the same indices, or create a new pack of indices for the extension.
Important considerations include how message delivery works (it's unreliable, UDP-like), how actions should be separated into idempotent and others, and how a request ``package'' should fit in one message, to avoid unreliability due to fragments being treated differently (e.g., a fragmented packet whose first fragment was dropped but whose second wasn't, and so forth).
The whole trap scheme in SNMPv1 is very inflexible. I hope SNMPv2 works better.
[10]Flaws deriving from the Project Athena environment.
Protocol Flaws
Replay attacks are supposedly averted by authenticators. These have a very short lifetime (typically 5 minutes). However, it is fairly easy to snoop an authenticator and reuse it within its lifetime.
The only solution would be authenticator caching, to allow for replays. That hasn't proved easy to implement yet. For TCP, it's very difficult, since servers are normally spawned by the main listener. Only threading would work. For UDP, it would present problems with application-level retransmissions. If the application could reconstruct authenticators for each retransmission, the problem might be obviated.
Kerberos assumes that there exists an underlying authenticated time service that can be used. This is not true however, not to mention that even authenticated time servers could be subverted, off wire (by using a fake radio transmitter, for instance). Instead, challenge/response sounds like a better approach, proving (using a nonce) that each party in fact possesses the session key. UDP would still be hard though (would need to retain state per use).
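The challenge/response idea can be sketched quickly. This is my own illustration, not the Kerberos protocol: HMAC-SHA256 stands in for ``encrypt with the session key'', and all the function names are made up.

```python
# Nonce-based proof of session-key possession, as an alternative to
# timestamp-based authenticators. Illustrative only.
import hmac, hashlib, os

def respond(session_key: bytes, nonce: bytes) -> bytes:
    # Proves possession of session_key without revealing it.
    return hmac.new(session_key, nonce, hashlib.sha256).digest()

def verify(session_key: bytes, nonce: bytes, answer: bytes) -> bool:
    return hmac.compare_digest(respond(session_key, nonce), answer)

key = os.urandom(32)    # shared session key from the ticket exchange
nonce = os.urandom(16)  # fresh challenge: a replayed old answer won't verify
assert verify(key, nonce, respond(key, nonce))
assert not verify(key, os.urandom(16), respond(key, nonce))  # stale answer
```

Because the verifier picks a fresh nonce per exchange, no clock synchronization is needed; the cost, as the notes say, is per-exchange state on the verifier's side.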
It's relatively easy to exploit user weaknesses in choosing passwords to make educated guesses. Passwords alone are not a good idea to determine a client's key. Diffie/Hellman could be significantly better.
The same problem applies elsewhere as well. However, a good solution for other cases (e.g., One-Time Passwords) cannot be applied to Kerberos, since an initial request is always encrypted by Kc.
Message structure should be revised to prevent known-plaintext attacks.
They shouldn't be multi-session keys. They expose the system to cryptanalysis.
Ticket forwarding poses problems, since it assumes equal levels of trust among all sites participating in the forwarding operation. Furthermore, multi-realm authentication can have problems, especially in routing, since there is no global realm name service.
Hardware Design
Certain portions of the Kerberos Protocol do not belong there, namely, message encoding (something like ASN.1 or BER should be used instead) and encryption mechanisms (let them be pluggable - the protocol should be correct for any encryption mechanism).
Proposed changes:
An extremely interesting paper!
[95]Name services can connect names to physical addresses of objects, or to other identifiers (names) which are however easier to resolve (for instance, from human-readable names to object IDs).
Issues in heterogeneous name spaces:
One solution is to make everything look like ``something''. In most cases, this something is a file. So, give all objects a file interface (i.e., reads, writes, ioctls, and so on.). This is cumbersome, and in some cases impossible. Did I mention esthetically scary?
In SPRING, name spaces and contexts (a generalization of a directory) are first class objects; they can be shared and copied and duplicated and destroyed and combined and stuff.
Object references are not persistent (you can't ``save'' an object ``pointer'' - you need to reacquire the pointer after execution is resumed).
Access control is performed at the name server. If the object manager wants to implement it itself (for instance, it doesn't trust the name server) it can do that in addition to the name server's controls.
Names contain different fields with status information, version, and type (C program or object code, and so on).
There are three kinds of persistence:
Access is only checked when trust boundaries are crossed (i.e. when we're moving to a name server for a context that's not collocated with the previously checked name server). If at some point the existing authentication is not enough for the current name server (because we crossed a trust boundary), the original client is asked (through a raised exception) to authenticate itself with the name server directly. Even when a context is passed between domains owned by different principals, the exception will be raised when the current credentials (in the capability within the context) don't match the current principal. In those cases, the client owned by the different principal will be contacted to authenticate itself directly.
Freezing and melting are done outside the object.
``Virtual worlds'' can be created, with slightly altered hierarchies (say when a new release of the system is tested and only some of the executables have changed). In those cases, domains can run with the altered root context and carried name space, without any indication that this is not the world that the rest of the domains run under.
[81]This is the conference version of the technical report shown in [95], mainly focusing on the UNIX name space adaptation.
When a context (directory) comprises names from different scopes (i.e., the system/lib context of the village and the system/lib context of the machine), the names are merged in order. More local instances are visible first.
File system and NIS name service are not provided by Spring in the UNIX emulation subsystem.
Instead of making all objects be treated as files (the Plan 9 approach), objects that may be treated as files support the IO interface.
[94]This is the conference paper on the persistence-related issues in [95].
The issues in persistence and naming can be reduced to the following:
Persistent objects or persistent classes? Can persistent classes be created? Can persistent objects be shared?
Are both persistent and transient name spaces supported? Are naming and storage independent or conjoint?
Are inter-object reference types restricted? Can transient objects be referenced in another object?
Which algorithm (Mark & Sweep, Reference Count)? Is storage reclamation restricted in a single host?
The four approaches handled in the paper include:
The workspace approach (as seen in Smalltalk) allows programs to have a persistent environment, and is implemented as a programming environment. Objects and additional persistent object types (classes) can be stored. However, references within the workspace of a program cannot be externalized, therefore sharing is impossible.
The name space is flat and, in most cases, not easily manipulated. The entire workspace is reflected into a single file in the filesystem. Therefore, garbage collection is relatively easy (not to mention that the entire workspace can be collected, by unlinking the file it corresponds with).
Workspaces allow easy manipulation and all kinds of references to be persistent, at the expense of sharing flexibility and programming language selection.
In UNIX, the only persistent objects are files, directories and symbolic links. Other objects have to be transformed/stored inside a file to become persistent. Sharing among processes/users is permitted, with some limited access restrictions.
Transient name spaces are not supported in the basic UNIX system. Naming and storage are conjoint in the file system.
The only objects that can contain references are directories. No transient objects can be referenced in other objects.
Storage reclamation is handled through reference counting. The method however can change to fit the object types stored.
UNIX supports sharing, as opposed to Workspaces. However, naming is very limited and not at all uniform.
Many object types can be made persistent, in addition to files, directories and symbolic links. Sharing is handled similarly to UNIX.
Transient name spaces are supported, but are handled through a different interface. Naming and storage are combined.
Inter-object references are much more flexible than in UNIX. Still, however, transient objects cannot be referred by persistent objects.
Storage reclamation uses reference counting; algorithms that can handle cyclic structures need to be utilized. Inter-machine reclamation is straightforward, since only symbolic links are allowed across machine boundaries.
This approach combines the good features of all the above. Persistent objects or classes are extensible and sharable.
Both persistent and transient name spaces are supported. The servers for storage and naming are independent and separate.
References to other objects are unrestricted, and transient objects can be referenced.
Storage reclamation is dependent on stored object types.
Transcription is the service used to transform an object's representation on the client side or an object's data structure on the implementation side into a form that can be used to reconstitute an object in a different location or at a different time.
Time transcription can be tricky, since time related portions of the object (for instance, references to transient objects) cannot be stored. The following are related concepts:
Pickling is the kind of transcription that's suitable for persistent storage. It happens at the server implementing an object and is not directly visible to a client through the object's interface.
Tough problems: following pointers in data structures, handling pointer cycles, handling embedded objects, dealing with inheritance and unpickling a pickled form as a base class of the original object.
Freezing occurs at the client side, and involves an object representation. Freezing generates a freeze token that the client can use later to reconstruct the object. The reconstructed object is not the same as before being frozen; it shares, however, a large part of its state. The implementation decides what will constitute a freeze token. A server may well decide to return a pickled form of the object as a freeze token, or an identifier that will help it reconstruct the object's state later (for instance, the i-node of the file where the object's state has been stored).
Freezing invariably results in the object state being pickled.
A freeze token should be usable by clients other than the one that initiated the freezing. Therefore, no client-specific data should be part of the freeze token (for instance, no door identifiers).
Freezing and melting are performed by the freeze service of the manager of an object. Each object has an Object Manager ID (OMID). This can be stored along with a freeze token. The client that decides to melt this frozen object can locate the freeze service by asking the name server for an object manager with the stored OMID, and accessing its freeze service with the stored freeze token.
Freeze tokens are forgeable. Therefore, the freeze service needs to be authenticated.
Lost freeze token means lost object.
The name service is a good place to perform all freezing operations, insulating clients from them. When no open references exist to an object bound to a name in a name space, then the responsible name server freezes the object reference, storing a name-to-freeze-token binding in stable storage. When a client requests a resolution of a name bound to a frozen object, then the name server melts the freeze-token transparently to the client. Note that clients can always freeze objects directly.
This name-service interposition between the client and the freeze service also facilitates access control: the name server checks all the privileges required of a client about to melt a freeze token. Otherwise, complicated security protocols would have to be observed to transfer melting privileges from client to client.
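The freeze/melt interposition above can be sketched concretely. This is my own toy model, not Spring's API; all class and method names are made up, and ``pickling'' is just a dictionary store.

```python
# Toy model of name-server interposition over freeze/melt (illustrative).
class Obj:
    def __init__(self, state): self.state = state

class ObjectManager:
    def __init__(self):
        self._store = {}           # freeze token -> pickled state
    def freeze(self, obj):
        token = len(self._store)   # any opaque, client-independent value works
        self._store[token] = obj.state
        return token
    def melt(self, token):
        return Obj(self._store[token])

class NameServer:
    def __init__(self, manager):
        self._manager = manager
        self._bindings = {}        # name -> ("live", obj) or ("frozen", token)
    def bind(self, name, obj):
        self._bindings[name] = ("live", obj)
    def last_reference_dropped(self, name):
        # No open references remain: replace the object with a freeze token.
        _, obj = self._bindings[name]
        self._bindings[name] = ("frozen", self._manager.freeze(obj))
    def resolve(self, name):
        kind, val = self._bindings[name]
        if kind == "frozen":       # melt transparently to the client
            val = self._manager.melt(val)
            self._bindings[name] = ("live", val)
        return val

ns = NameServer(ObjectManager())
ns.bind("doc", Obj("hello"))
ns.last_reference_dropped("doc")   # object now exists only as a freeze token
assert ns.resolve("doc").state == "hello"
```

The key property the sketch shows: the client only ever calls resolve(), and never sees whether the binding held a live reference or a freeze token.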
The melt operation can be made to start the corresponding object manager (so that it can unpickle an object whose reference is being melted). This can deal with object manager crashes, as well as managers that terminate when they manage no active objects.
Different objects require different freeze services:
The representation of the object itself serves as the freeze token. A complex number would be such an object.
Transient objects can be frozen either by a per-machine transient-object freeze service, which stays up as long as the machine stays up, or by a library inside the object server, which stays up as long as the object's manager is up. The former requires some garbage collection (for those transient objects whose manager has died).
Externalization is a transformation intended to create a copy of the object at a different time or location, without any shared state with the original object. Tar'ing a file system would be ``externalizing'' it.
Files become much less important components of an OS, now. Only applications that need to handle their own storage more flexibly need use files (for instance, to store freeze-tokens, or pickled objects).
Open Issues:
[96]This is an elaboration of the naming policies in Spring, as seen in the technical report([95]).
In the case of remote domain invocations, children inherit namespaces that contain all ancestral local machines and villages, under the contexts with names local_machine_X, local_village_X, where X is an integer increasing down the inheritance path. The home contexts are those for which X=0. The currently local contexts are those for which X is maximum, and they are pointed at by the symbolic links local_machine and local_village. The parent, however, does have the option to keep the local contexts that the child inherits unchanged; in those cases, the child executes in its parent's naming environment, though remotely.
[45]The gist is: Plug and play RPC mechanisms.
The general idea is that different objects elect what kind of RPC they need to operate according to their specifications (i.e., replicated RPC, atomic RPC, persistence or migration, etc.). This ``kind'' is embodied in a subcontract associated with an object. All RPC operations go through this subcontract.
Even in the cases where a different subcontract is expected, the mechanism supports subcontract on-line discovery; thus, if a client needs to operate on an object that supports a subcontract that hadn't been anticipated, the correct subcontract can be found and loaded at run time.
The basic operations supported by subcontracts are:
The overhead over a more static approach is basically due to the additional indirect calls every time an object operation is used (first to the object stubs and then to the subcontract). This expense is justified, according to the authors, by the added flexibility and the enormous potential for extensibility, customization, and optimization without changing the base system.
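The double indirection is easy to see in code. This is my own illustration of the pattern, not Spring's subcontract interface; the logging subcontract stands in for policies like replication or atomicity.

```python
# Subcontract-style indirection (illustrative): the stub never performs the
# call itself; every operation goes through whatever subcontract object is
# currently attached, which can be swapped at run time.
class Subcontract:
    def invoke(self, op, *args): raise NotImplementedError

class DirectSubcontract(Subcontract):
    def __init__(self, impl): self.impl = impl
    def invoke(self, op, *args):
        return getattr(self.impl, op)(*args)

class LoggingSubcontract(Subcontract):
    # Stands in for replicated/atomic/persistent RPC policies.
    def __init__(self, inner): self.inner, self.log = inner, []
    def invoke(self, op, *args):
        self.log.append(op)
        return self.inner.invoke(op, *args)

class Stub:
    def __init__(self, subcontract): self.subcontract = subcontract
    def __getattr__(self, op):
        # This extra indirect call is exactly the measured overhead.
        return lambda *args: self.subcontract.invoke(op, *args)

class Counter:
    def __init__(self): self.n = 0
    def add(self, k): self.n += k; return self.n

stub = Stub(LoggingSubcontract(DirectSubcontract(Counter())))
assert stub.add(2) == 2 and stub.add(3) == 5
assert stub.subcontract.log == ["add", "add"]
```

Swapping the subcontract changes the call semantics without touching the stub or the object implementation, which is the extensibility claim in a nutshell.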
[86](1)CORBA was made to handle multiple architectures, multiple operating systems, multiple data formats.
Connecting components together will happen again and again. Having a way to do that without reinventing the wheel every time would be cool.
(1.4.2)Not only does CORBA handle the data translation between architectures etc., it also locates providers of the requested services.
[24]The basic idea is to improve filesystem performance and reliability, by preserving memory across crashes, to achieve write-through reliability with write-back performance. Crashes, not power outages, are the main villain attacked.
Goal: Survive operating system crashes without writing data to disk. They eliminate all reliability induced disk writes, by protecting memory through a crash.
To protect the file buffer cache in kernel memory, they use TLB protection: the buffer cache is read-only, until the kernel explicitly wishes to change it. On CPUs that support direct use of physical memory, bypassing the TLB, either that feature is disabled if possible, or ``sandboxing'' is used, by instrumenting the kernel to only allow writes, if the accessed memory is outside the file cache or authorized to be written. This is much slower than just disabling TLB bypasses.
To restore the buffer cache, an in-memory registry is maintained and used, where all buffers get tagged with their defining metadata (physical memory address, device number and i-node number, file offset and size). The registry imposes a space overhead of only 40/8192, or about 0.5%, of the buffer size. Before the VM and FS are booted during a warm reboot, memory is dumped into the swap partition. Also at this point, metadata are restored on disk, making sure the filesystem is intact. When the system has completely booted, a user-level process analyzes the memory dump and restores the file cache.
Effects:
To test reliability, they use synthetic, injected kernel faults: bit flips, instruction mashing, high-level scrambling (variable instantiation eliminations, overruns in bcopy, switches of loop conditions, futzing with loop base pointers, ...). To detect corruption, they checksum all buffers and go through them to look for inconsistencies after the crash.
Rio is neat: much faster than other means of protection (like turning all writes into write-throughs). More reliable than pure write through.
Catch: it requires that the CPU not erase memory across a reset!
[11]This paper is mainly used as an indication of how difficult things would be if we still had physical memory limitations. The creators of MULTICS went to amazing lengths to make sure that no physical memory is wasted on stuff that's not immediately needed.
Some VM facts:
Terminology:
[50] The key contribution of the system is the exposure of the low-level transaction mechanisms to servers, as opposed to the use of high-level transaction interfaces only. Therefore, servers can direct their own recovery, using these mechanisms.
Single-host transaction operations are handled locally. Only Transaction Managers need communicate with other hosts. Minimization of exchanged network messages was the motivating concern. The hierarchy is fault-tolerant and can be made more efficient, in that coordinators can be replicated (to avoid single-point failure), and coordinators can migrate, to minimize the distance between a coordinator and its participants.
Some refinements of two-phase commit protocols have been used. For instance, participants that do not modify recoverable resources, can be excluded from phase two for efficiency.
[110] Gripes about the way in which operating systems support DBMSs.
This is the earliest place I've seen the need for extensibility in filesystem caching (or as Stonebraker calls it, ``Buffer Pool Management''). Problems:
DBMSs use structured basic objects, as opposed to the unstructured character arrays (i.e., files), that normal OSes use, through their filesystems.
The next gripe argues UNIX imposes the use of a DB process-per-user, as opposed to having a server that accepts messages. I don't quite see how this applies any more (he considers pipes the only IPC mechanism available). Should that have been the case, the problems would have been:
As far as consistency is concerned, the two points apparently suggested above are:
Gripes on memory mapped files
In other words, OS extensibility is sorely needed.
[8] Wow! I think I just read an early description of a Remote Procedure Call, before it was called that! There is an interesting approach to canceling a request or a response, though.
The interprocessor protocol looks a lot like an early TCP clone. No slow start or congestion avoidance stuff is there, obviously, since the interprocessor buses are fairly reliable (less than one fault in a month reported). Processors broadcast a heartbeat every second. If two successive heartbeats are skipped by a processor (and detected by others, obviously), the processor is deemed dead.
The concept of process pairs is also brought up. Every process is backed by a checkpoint process, which receives status updates after every message operation of the primary. If the primary process dies, the backup becomes the primary and picks up where the dead one left off. To deal with non-idempotent requests handled by dead servers, a buffer of responses is retained, so that already-answered requests can be answered again without recomputing a response.
[112] A very interesting, dense assortment of information on page table and TLB technologies and how they scale up to larger address spaces than they were designed for.
First, different page table organizations are presented:
The paper introduces the concept of clustered page tables, which combine the closely-accessible neighboring pages of linear page tables with the relatively low overhead for sparsely populated address spaces of hashed page tables. The idea is that every node in the hashed page table corresponds to more than one PTE (2, 4, 8, ...), for adjacent virtual pages. Therefore, within a hash node the linear metaphor is used, but the hashed metaphor is used to locate a batch of PTEs. Clustered page tables exploit the apparent ``burstiness'' of address spaces, where large objects occupying more than one page are sparsely scattered in the address space. Therefore, a cluster can hold the page entries for such large objects with relatively low overhead.
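A minimal model of the clustering idea, assuming a power-of-two cluster size (my own sketch, not the paper's data structure): hash on the cluster tag (the VPN with its low bits dropped), then index linearly within the node.

```python
# Clustered page table sketch: one hash node holds PTEs for CLUSTER
# adjacent virtual pages.
CLUSTER = 4                       # PTEs per node (2, 4, 8, ...)

class ClusteredPageTable:
    def __init__(self):
        self.nodes = {}           # cluster tag -> list of CLUSTER PTE slots

    def map(self, vpn, pfn):
        tag, idx = divmod(vpn, CLUSTER)
        # One hash lookup/allocation covers a whole batch of adjacent
        # pages, which is what makes "bursty" sparse address spaces cheap.
        node = self.nodes.setdefault(tag, [None] * CLUSTER)
        node[idx] = pfn

    def lookup(self, vpn):
        tag, idx = divmod(vpn, CLUSTER)
        node = self.nodes.get(tag)
        return node[idx] if node else None

pt = ClusteredPageTable()
for i in range(6):                # a 6-page object needs only 2 hash nodes
    pt.map(0x1000 + i, 0x80 + i)
assert pt.lookup(0x1003) == 0x83
assert pt.lookup(0x9999) is None
assert len(pt.nodes) == 2
```

A linear page table would reserve space for the whole address range; a plain hashed table would pay one node per page. The cluster amortizes the per-node overhead over its batch of adjacent PTEs.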
The paper then goes on to present the hardware approach to larger virtual address spaces: TLBs that support superpages and sub-blocks.
Page table support for such extended TLBs is then suggested:
Support for partial-sub-blocking is easier and more worthwhile. Using clustered page tables gives a better solution for all three extensions.
[111] Yet another paper motivating OS extensibility. OSes provide transaction and recovery support, but not of the kind that Databases want. Here's the canonical version of it all (i.e., database-unspecific).
Three big issues exist:
Solutions? Here we go:
[51] A straightforward paper, introducing an implementation of monitors (using the infamous ``Hoare semantics'').
Interesting here are some ideas on inserting some smarts into the mechanism that picks which waiter on a condition variable to wake up. The idea is to assign a fixed priority to a wait operation. This priority will be used to schedule wake ups when the condition variable is signaled.
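The priority-wait idea can be sketched directly (my own sketch, not the paper's implementation; the blocking and monitor-lock machinery is elided to show just the wake-up ordering):

```python
# A condition variable whose signal() wakes the waiter with the lowest
# (most urgent) fixed priority rather than the oldest waiter.
import heapq, itertools

class PriorityCondition:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker: FIFO among equals

    def wait(self, priority, waiter):
        # In a real monitor this would also release the monitor lock and
        # block; here we only record who waits and how urgently.
        heapq.heappush(self._heap, (priority, next(self._seq), waiter))

    def signal(self):
        if self._heap:
            _, _, waiter = heapq.heappop(self._heap)
            return waiter               # the waiter that gets woken up
        return None

cv = PriorityCondition()
cv.wait(10, "background-task")
cv.wait(1, "disk-interrupt")
assert cv.signal() == "disk-interrupt"  # most urgent waiter wins
assert cv.signal() == "background-task"
```

A disk-arm scheduler is the classic use: each requester waits with its cylinder number as the priority, so signaling sweeps the requests in order.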
Hoare describes a very useful set of rules in this context, good enough that I'm going to quote them here:
[52] Hoare presents a language where input and output and parallelism are basic primitives, and describes how a number of simple exercises would be accomplished on this language. He concludes by justifying some of his design choices.
The language consists of
It is a cute and elegant enough language, with very powerful constructs. There is still some criticism though:
A very enjoyable paper altogether, with some interesting insights on language construction in a rather different domain.
S(i:1..100)::
*[ S(i-1)?least() -> S(0)!noneleft()
 []n:integer; S(i-1)?has(n) -> S(0)!false
 []n:integer; S(i-1)?insert(n) ->
    loaded:boolean; loaded := true;
    *[ loaded; S(i-1)?least() ->
         S(i+1)!least();
         [ l:integer; S(i+1)?l -> S(i-1)!n; n := l
         []S(i+1)?noneleft() -> S(i-1)!n; loaded := false
         ]
     []loaded; m:integer; S(i-1)?has(m) ->
         [ m<=n -> S(0)!(m=n)
         []m>n -> S(i+1)!has(m)
         ]
     []loaded; m:integer; S(i-1)?insert(m) ->
         [ m<n -> S(i+1)!insert(n); n := m
         []m=n -> skip
         []m>n -> S(i+1)!insert(m)
         ]
     ]
 ]
[49]The title says it all. Some applications need to play with their virtual memory. Examples are described in Stonebraker's rant in [110], in the case of databases. For another example, in the case of large scientific simulations, the application knows exactly how memory is going to be accessed (for instance, sequentially); therefore, it can best determine how page replacement is to proceed. Finally, applications can make educated decisions about space/time tradeoffs, based on how much physical memory is available to them.
To accomplish that goal, the authors have modified the kernel of V++ to associate an external page-cache manager with each segment of a process's address space. This manager gets a number of pages from the system's manager and deals with them as it sees fit, to react to pages demanded from the associated application.
An interesting side note, explained for real elsewhere, mentions the allocation of memory according to an economic model of fixed supply. Application-controlled page-cache managers can only maintain memory as long as they can afford it. They get a steady income, get penalized for saving, and are subject to exchange rates with I/O traffic.
[74]The main ideas: code synthesis and reduced synchronization inside the system.
The principle of frugality says that ``we should use the least powerful solution to a given problem''.
Code synthesis is used to specialize and optimize system interfaces for their intended use. Especially frequently traversed interfaces (like the system call interface) can save a lot of computation by being specialized and made short enough for their associated thread. The basic disadvantage is that every thread, or every thread group, needs to store its special versions of common code after its creation.
Reduced synchronization shares some of its motivation with specialization, above. The idea is that, instead of always using data structures that support synchronization, even when synchronization is not needed, a non-synchronized version of them is created and used when the extra functionality is not needed. Other ways of reducing synchronization are code isolation, procedure chaining, and optimistic synchronization.
Code isolation transforms algorithms so that synchronization is unnecessary. For instance, in a regular data structure supporting a put and a get, a different piece of the metadata in the structure is writable by the put and a different one is writable by the get. So locking is not necessary. Obviously, this works if there's only one process running get at a time, and only one process running put at a time, which is also where specialization is applied.
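The put/get partitioning reads well as a single-producer, single-consumer ring buffer (my own sketch, assuming exactly one putter and one getter; on real hardware the head/tail publishes would additionally need release/acquire ordering):

```python
# Code isolation: put() only ever writes `head`, get() only ever writes
# `tail`, so the one producer and one consumer never need a lock.
class SPSCQueue:
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0             # written only by put()
        self.tail = 0             # written only by get()

    def put(self, item):
        if self.head - self.tail == len(self.buf):
            return False          # full
        self.buf[self.head % len(self.buf)] = item
        self.head += 1            # the single write that publishes the item
        return True

    def get(self):
        if self.head == self.tail:
            return None           # empty
        item = self.buf[self.tail % len(self.buf)]
        self.tail += 1
        return item

q = SPSCQueue(2)
assert q.put("a") and q.put("b") and not q.put("c")    # full at 2
assert q.get() == "a" and q.get() == "b" and q.get() is None
```

The specialization connection is direct: this unsynchronized version is only correct under the one-putter/one-getter assumption, which is exactly the condition a specialized code path can check for and exploit.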
Procedure chaining replaces context switches among threads in the same address space with simple procedure calls. This is done by changing the return addresses in the stack. Optimistic synchronization can be seen, for example, in context switching, when a thread that doesn't use the floating point facilities of the processor does not save floating point registers when scheduled out. Floating point instructions cause an illegal instruction fault when first used, at which point the context switch code is made less optimistic, i.e., starts saving floating point registers as well.
[18]Local procedure call semantics works well, so extending it to work remotely is the next step.
Before RPC, programming on a distributed environment was too complicated. The authors' major aim was to provide a more conducive programming environment for remote computation. Other aims were to make this environment adequately efficient, and reasonably secure.
Local procedure semantics is followed as closely as possible. Client and server stubs are created automatically from the interface descriptions.
A client binds to a particular interface (the type) and a particular instance (a machine and a process exporting that interface). The instance can be ``any''. The name server stores interfaces and their instances and returns the appropriate one when asked. Exported interfaces contain entry points of each procedure, a unique id for the instance and a call id, which identifies a request and its response, as well as their order.
Interfaces are capabilities. Access controls on the naming service exporting them transfer to them.
The transport protocol is reliable (ACKs, timeouts, retransmissions, keep-alives for long calls). Time-based identifiers are used to distinguish responses from reset servers or duplicates. Exceptions are raised on network glitches. The protocol is tuned for short exchanges of small arguments. For larger exchanges it uses a bulk-transfer-like protocol, which is not as efficient as state-of-the-art bulk protocols (to avoid complexity).
Exception handling is defined in the interface. Exceptions are passed back, in the place of return packets.
To optimize server response, idle server processes are standing by. If a server is expecting a call, it gets it. Otherwise, the call is dispatched to an idle server.
Lots of neat ideas that have been widely used since.
[13]Main target: shared-memory multiprocessors.
Consistency can be guaranteed by mutex critical sections or non-blocking concurrent objects. Non-blocking objects have no (actually, only short) queues, i.e., they are not vulnerable to convoys, priority inversion, or deadlocks.
Basic problems:
Synthesizing CAS using locks, when no hardware CAS exists, is not non-blocking. Roll-out solves that by always releasing the lock on preemption; the code resumes at a different point, depending on where it was when preempted. This mechanism seems to exist only for the CAS mock-up. Roll-forward finishes up on preemption and is harder to implement (for instance, what do you do on page faults?). Software CAS is good for more complex conditions and multibyte swaps. Roll-out can be prevented from progressing and is vulnerable to processor failure.
How do we synchronize to mock-up a CAS? Synchronization operations are becoming costlier in faster architectures.
Spinlocks cause a lot of bus activity on all processors: when the lock is released, everyone tries to acquire it.
Queuelocks turn lock contention into a relay race. Each contender waits for a single holder to release the lock. Therefore, every sync operation is a successful operation. They are more complex to implement and geared towards lock-based mechanisms. However, ``success'' here refers only to acquiring the lock, not to making overall progress.
A general fix is to only request a lock when the CAS is likely to succeed, by checking the CAS value, before acquire-lock (something like ``compare*-compare-and-swap'').
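A sketch of the ``compare-compare-and-swap'' idea (all names are illustrative; the lock here is only a stand-in for the expensive synchronized path that a software CAS would need):

```python
import threading

class Cell:
    """A CAS-like cell; the lock stands in for the costly mock-up path."""
    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

    def cas(self, expected, new):
        with self._lock:                 # the expensive synchronized path
            if self.value == expected:
                self.value = new
                return True
            return False

    def compare_cas(self, expected, new):
        # ``compare-compare-and-swap'': a cheap unsynchronized read first,
        # so the costly path is only attempted when success looks likely.
        if self.value != expected:
            return False
        return self.cas(expected, new)

def increment(cell):
    while True:                          # classic non-blocking retry loop
        old = cell.value
        if cell.compare_cas(old, old + 1):
            return
```

The unsynchronized pre-check fails fast under contention without touching the lock, which is exactly what cuts the bus traffic in the hardware analogue (test-and-test-and-set).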
[14]RPC is cool as a message passing mechanism. It's designed to work cross-machine. However, the most common case is still single-machine calls (as shown in studies of V, Taos, SunOS/NFS). The authors attempt to create an optimized version of RPC.
Local RPC overhead comes from:
Optimizations have been proposed but don't go far enough.
LRPC calls go through the kernel, which validates the authorization and creates the link to the callee.
Clerks deal with interface exports. Interfaces are exported in structures much resembling virtual method tables. There's one in every domain. It registers with a name server. Procedures in an interface are given a descriptor, which contains an entry pointer in the server's domain and a list of argument stacks. A-stacks are mapped read/write in both domains, so no copying is needed to transfer arguments from caller to callee. A-stacks can even be shared among procedures of the same interface. Linkage records contain the return domain and address for an A-stack. The whole thing is contained in a binding object, which is unforgeable, much like a capability.
E-stacks are managed by the server. They are associated with an A-stack until the server runs out and reclaims one. E-stacks and A-stacks adhere to the Modula2+ procedure calling conventions, which separate argument from execution stacks.
More complicated stubs are done in a more conventional RPC way.
On multiprocessors, domains are cached on idle processors, so that redundant context switches are avoided.
In general, arguments are copied around fewer times. All kernel copies are eliminated. An extra copy is done when parameter immutability needs to be ensured.
Multiprocessor speedup is almost perfect, because locking of shared data structures is minimized.
[69]This is an introduction to the monitor construct, as implemented on the Pilot workstation and in the Mesa programming language. Pilot was intended to be a multiprogrammed single-user machine. Therefore, a lot of protection-oriented issues are dismissed as only producible through malicious mechanisms.
The authors didn't want to solve communication with message passing, mainly because it didn't mesh well with their programming language of choice, Mesa. After all, Lauer and Needham argued that message passing and process-oriented structures are, effectively, duals of each other.
To guarantee consistency, they did not want to use a non-preemptive scheduling mechanism, mainly because the whole idea of paging (and the necessary page faults) breaks non-preemption, anyway, and besides, in that case, multiprogramming would have to be done only through voluntary relinquishments of the processor. This is, by common agreement, very awkward.
Processes in Mesa are very easy to create. In fact, any procedure can be invoked as a separate process; in other words, procedures can be called synchronously or asynchronously, with explicit support in the language. Process creation in that context (especially since it shares the address space and the program text) is very cheap, making arbitrarily frequent process creations a viable approach. The only difference between synchronous and asynchronous processes comes in the context of exception handling, since detached procedures are, in fact, the root in the exception handling pyramid. Perhaps exceptions could be passed on to the creator of a process, but that would probably be too big a hassle, worse even when the creator has itself terminated.
Monitors have entry procedures (which need to reserve the lock before they're executed), internal procedures (which can only be called from inside an entry or internal procedure, and don't touch the lock), and external procedures, which don't require the lock. Things get a bit hairy when an internal or entry procedure contains a call into a different monitor's procedures. The lock of the first monitor doesn't get freed, in which case, if the second monitor causes the caller to sleep, the first monitor becomes bound on a sleeping process. Alternatives exist to some extent, but it's a generally difficult situation, which makes the whole monitor concept less than perfect.
To make sharing of large, cellular structures more efficient, the monitor mechanism has to allow for finer locking (the default case deals at the whole Mesa module's scale). An idea is to have monitored records (such as a node in a linked list, for instance), allow locking at that scale. Obviously, when futzing with the structural integrity of the entire construct (front and back pointers, etc., or counters), perhaps the global lock should also be acquired.
Mesa monitors/condition variables define a different semantics (differing from C.A.R. Hoare's initial suggestion). A notify (which Hoare called ``signal'') doesn't relinquish control of the CPU immediately. It merely is a hint that any waiters on a condition variable might start thinking about waking up. When they do, they explicitly need to check the condition on which they slept, mainly because it might still not be favorable to their causes (e.g., a different process might have entered the monitor first, before the former waiter woke up). This may result in fewer context switches, though not necessarily in the face of congestion.
The authors have also defined an additional operation on a condition variable, called ``broadcast''. This wakes up all processes waiting on a condition variable. It can be invaluable in cases where a single condition change might satisfy the requirements of multiple waiters.
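Mesa's notify-as-hint semantics survive in most modern condition variables; Python's threading.Condition follows the same convention, so the wait-in-a-loop and broadcast idioms above can be shown directly:

```python
import threading

# Mesa-style semantics: notify is only a hint, so a waiter must re-test
# its condition in a loop after waking up.
items = []
cv = threading.Condition()

def consume():
    with cv:
        while not items:        # re-check: another process may have run first
            cv.wait()
        return items.pop(0)

def produce(x):
    with cv:
        items.append(x)
        cv.notify_all()         # ``broadcast'': wake every waiter
```

With Hoare semantics the `while` could be an `if`; under Mesa semantics the loop is mandatory, since by the time the waiter runs, the condition may no longer hold.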
The authors define the concept of a naked notify. This is only performed by devices, via interrupt handlers. Since a device probably cannot go through the whole monitor entry story every time it needs to produce an interrupt (because others might be following), it just does an unprotected notify (``naked''). This can cause race conditions, since it might occur between the time a process has checked a condition and the time it actually goes to sleep; in effect, the notify is lost and the process might never be awakened again. A lock-free synchronization construct could be useful here (like compare-and-swap).
[71]There are two schools of thought on building systems with synchronization: message oriented systems and procedure oriented systems. The authors propose a correspondence, element to element, of the two that renders them duals of each other, thereby making the question moot.
relatively static, large processes; specific communication paths; the process is the resource manager.
Synchronization occurs at the message queues. Operations are serialized there.
The canonical resource manager is a huge case statement.
resources are protected by shared data structures; consistency is ensured by locks etc. Processes are short lived and numerous.
Synchronization occurs at mutex lock queues.
The canonical resource manager is a monitor.
A correspondence between elements of each class has been found. For instance, an asynchronous message-reply pair corresponds to forking out a process with a procedure call and waiting on the result.
The authors go even further, to suggest that the duality preserves performance as well. They break performance into three components, across all of which the duality is maintained:
unaffected by duality
they assert this without proof; they seem to miss, however, that process creation is much costlier than a procedure call (perhaps in a threading context the differences are not as pronounced)
scheduling should be mapped across in a straightforward way, since queues are involved in either model.
The conclusion is that the model or the performance considerations thereof shouldn't influence how a decision is made on which one to choose. Rather, considerations like which one is easier to implement could be used instead.
[75]Problems with the ``traditional'' UNIX file system:
First wave of changes:
The real FFS had the following enhancements
Additional Functionality, which wasn't coming as a direct response to shortcomings in the traditional Unix File System
[64]The design goals for this extensions were:
The implementation uses an object-oriented approach: operation vectors.
The Vnode is a virtual equivalent to an i-node. Vfs is a virtual mount entry. Different filesystems can appear in the same hierarchy without specialization.
[100]The general idea is pushing the vnode concept[64] further, by giving its object-oriented model inheritance.
16 trillion levels of indirection, since v-nodes can now stack, fan out, tree... All operations have to be called through a whole chain of invocations (one to go to the Vnode, then to go to the top of the stack, then to follow the head of the stack one level down and then to indirect through the operations vector).
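A minimal sketch of the stacking idea, assuming simple delegation (this is my illustration, not the paper's implementation):

```python
class Vnode:
    """Bottom layer: a concrete file object."""
    def __init__(self, data):
        self.data = data
    def read(self):
        return self.data

class Layer(Vnode):
    """A stacked layer delegates any operation it doesn't override."""
    def __init__(self, lower):
        self.lower = lower
    def read(self):
        return self.lower.read()

class UppercaseLayer(Layer):
    """Overrides one operation and indirects to the layer below."""
    def read(self):
        return self.lower.read().upper()
```

Each call hops down the stack one layer at a time, which is exactly where the chain-of-invocations overhead comes from.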
The description sounds cool structurally, but I'd have to see more convincing performance results before making judgment.
[36]A painful paper, with no less than 126 references...
A program's working set is the collection of pages it has recently referenced.
A bit unclear what exactly Saltzer's Problem is. A box with a knob we turn to tune performance? Not very clear. The knob here is the control parameter (the working-set window size), which governs the tradeoff between resident set size and the time between faults.
The swapping curve plots f(x), where x is the mean resident set size and f is the rate of page faults. The lifetime curve plots g(x)=1/f(x), i.e., the mean number of references between page faults. The global max of g(x)/x is the best a policy can do. Simple models for g fail to capture the behavior of real programs.
A restatement of the policy: the working set policy is that under which a program's resident set is the collection of pages that have been referenced in the last T time units (the window size).
A generalization deals with the ``retention cost'' of a segment, accumulated since its prior reference. The generalized working set policy retains pages whose retention cost doesn't exceed a certain limit.
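The working-set definition lends itself to a one-line sketch (purely illustrative; refs[i] is the page referenced at time i):

```python
def working_set(refs, t, T):
    """W(t, T): the distinct pages referenced in the window of the
    last T references ending at time t."""
    lo = max(0, t - T + 1)
    return set(refs[lo:t + 1])
```

Sliding t forward and varying T traces out the resident-set-size versus fault-rate tradeoff the control parameter governs.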
Simple program behavior models fail to capture phase transitions (they assume locality characteristics remain constant). ``slow-drift'' models are a compromise: they assumed a continuous degradation of locality. Real measurements proved these inadequate as well. In practice, slow-drift can only account for program behavior within short phases of its execution.
The phase/transition model specifies that long high-locality periods are separated from each other by short, erratic transition periods. Phases cover about 98% of the time. Transitions cause about 50% of the page faults.
Network queueing models have been used to determine which is the optimal ``multiprogramming level'' (load) of a paging system. Studies have shown that the ``L=S criterion'' (when the time between faults L is equal to the time it takes to service a fault), or the ``50% criterion'' (keeping disk utilization to 50%) yield close approximations to the optimal load.
The problem with the above criteria is that they attempt to apply across all programs, even when for example the max lifetime doesn't exceed the page-in time. Space-time minimization per job should be the optimality criterion. Load controls (limits on the load of the system) have been used to limit the drop-off in CPU utilization when thrashing ensues.
The basic components of a dispatcher are:
For local policies like WS, the selection of the window size for each job is a daunting task. However, it was found that for (allegedly) typical program mixes, a global WS yields performance within 10% of the optimum. Furthermore, WS performs the exact same sequence of page faults as the optimal lookahead method, displaced only by the lookahead time. It is very close to optimal, when fully tuned.
[99] Main assumption: read requests are easily satisfiable by using main memory; optimize for writes and many small files. As memory becomes more abundant and less expensive, more of it can be available as a file cache (for both reads and writes).
Gripes against older FSes:
To fix this, writes are performed sequentially in a log. Once enough write data are available, a single write operation into the log is initiated, requiring only one seek. This includes data and meta-data blocks.
Major issues:
The log is the main storage structure, i.e., it's not temporary. So, every time a file changes, it and its i-node get written in the log. An in-memory i-node map points to the current location of all i-nodes. This map gets checkpointed on disk periodically.
For writes in the log to be fast enough, contiguous long writes must be possible. The disk is split into 512k segments. Segments are always completely overwritten, if changed at all. Segments with little live data are copied over in a compacted form.
Liveness is determined by checking to see if a data block is pointed to by the appropriate pointer in the associated i-node (the i-node # and the block # for each block in the segment are stored in the segment's summary).
The ``cleaner'' is the component that compacts live data to open up more contiguous space (more empty segments) for the log. It attempts to only read in segments with a low liveness ratio, to minimize the cost of cleaning. In addition to the above criterion, recently modified blocks are considered likely to change soon, i.e., not worth compacting. So, the age of the live blocks in a segment is also taken under consideration when deciding whether to clean or not to clean.
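The cleaner's segment selection, combining liveness ratio and age, can be sketched with the paper's cost-benefit ratio, benefit/cost = (1-u)*age/(1+u), where u is the segment's utilization:

```python
def cleaning_priority(u, age):
    """Cost-benefit ratio: prefer segments with little live data (low u)
    whose live data is old (high age), i.e., unlikely to change soon."""
    return (1 - u) * age / (1 + u)

def pick_segment(segments):
    """segments: list of (utilization, age); return the index to clean."""
    return max(range(len(segments)),
               key=lambda i: cleaning_priority(*segments[i]))
```

A cold, sparsely utilized segment wins over a hot one with the same utilization, since recently modified blocks would likely just be rewritten.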
Crash recovery uses checkpoints and roll forward. A checkpoint contains the i-node map, the segment map and a time stamp. Given this initial info, segment summary blocks are used to roll-forward the FS with operations that happened after the checkpoint.
LFS is slower than a traditional FS only when sequentially reading a file that was written randomly.
Main performance factors: cleaning (about 25%) and reads/writes (75%); in UNIX, it's seeking (90%) and reads/writes (70%).
[35] Internal security (as opposed to external: armed guards, fire protection and such) is mainly focused on the following:
There are two major design patterns:
They limit information flow to prevent leaks from secret to less secure classifications. Once you're at class A, you can't talk to a lesser class. This scheme incurs severe overclassification, since all principals end up going up in classification. It also doesn't model the real world (for instance, NDAs temporarily increase a person's classification for a certain domain, with specific penalties in case of privilege incursion).
To fix this, information flow must be deduced (which input affects which output) and covert channels detected. This is very, very hard.
Restricted private info can be inferred from unrestricted aggregate info: ask controlled questions when part of the answer is known (``trackers'').
Solution: reduce accuracy of answers, use random samples.
It helps with authentication and guards against eavesdropping.
[80] Seminal paper on the basic authentication protocols used today. It covers both shared key ciphers and public key ciphers.
They're dealing with encryption and authentication protocols, in three contexts:
In all cases, there are two phases: one during which principals request a key from a trusted authentication server and one during which they authenticate themselves with each other. Nonces (one-use-only identifiers) are embedded in most messages that might be prone to replay attacks.
The basic, shared key, Needham-Schroeder protocol between two principals, A and B, and a trusted authentication server AS are as follows:
IA1 is to avoid having a bad person send us an older response by the AS, which it (the bad person) has had time to decrypt and play with.
Both IA1 and B are returned, to make sure this is the response to the request above, and not just any replayed response.
This is to certify that the real A is on the other side, not just a replayed stream.
Obviously, this reply couldn't be the nonce itself, since that could just as easily be replayed...
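A symbolic sketch of the shared-key exchange above (no real cryptography; enc/dec just tag data with a key name, and all names here are mine):

```python
def enc(key, payload):
    """Symbolic encryption: tag the payload with the key name."""
    return ("enc", key, payload)

def dec(key, msg):
    kind, k, payload = msg
    assert kind == "enc" and k == key, "wrong key"
    return payload

def run_protocol(ka, kb, ck, na, nb):
    """ka/kb: keys A and B share with AS; ck: conversation key;
    na/nb: the nonces.  Returns True if both sides end up convinced."""
    # A -> AS: A, B, na        (na is IA1, tying the response to this request)
    # AS -> A: {na, B, CK, {CK, A}Kb}Ka   (both na and B are echoed back)
    ticket = enc(kb, (ck, "A"))
    msg2 = enc(ka, (na, "B", ck, ticket))
    rna, peer, a_ck, tkt = dec(ka, msg2)
    assert rna == na and peer == "B"     # not a replayed response
    # A -> B: the ticket; B learns CK and who it's from
    b_ck, sender = dec(kb, tkt)
    # B -> A: {nb}CK ;  A -> B: {nb - 1}CK  (not nb itself!)
    challenge = enc(b_ck, nb)
    response = enc(a_ck, dec(a_ck, challenge) - 1)
    return dec(b_ck, response) == nb - 1
```

The nb - 1 transformation is what convinces B that a live A, holding CK, is on the other side rather than a replayed stream.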
The same scheme is followed for public/secret key encryption algorithms. Given that A and B have already located the public keys of each other, here's how they initiate a conversation.
Unfortunately, the protocol above has been found to be broken. Specifically, at the end of the exchange, B thinks he's definitely talking to A, although there's no guarantee of that. Gavin Lowe presented the first mention of the problem in [73].
[73] The author presents a way in which the public-key authentication protocol presented in [80] can be subverted with a man-in-the-middle attack. The general idea is that A starts a conversation with the intruder. The intruder can make someone else (B in the example below) think he's talking to A.
A starts a normal authentication conversation with I, the bad guy.
I decrypts the message by A, since it was addressed to it. Then I, impersonating A, replays the message, encrypted with B's public key.
B naively responds with its own nonce. The message is intercepted by I.
I just replays the message as received to A.
A replies as normal, exposing the nonce to I.
Now I resends the nonce encrypted with B's public key. For all intents and purposes, B thinks it's talking to A now, although A thinks it's talking to I.
The protocol can be fixed against this attack, by including the identity of B in the middle step of the protocol, so that A cannot be tricked into colluding with I.
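A sketch of the fix, assuming a reply tuple that carries the responder's identity (names are illustrative):

```python
def reply(responder, na, nb):
    """Original NS sends back (na, nb); Lowe's fix adds the responder's
    identity to the middle message."""
    return (responder, na, nb)

def a_accepts(expected_peer, na, msg):
    responder, rna, nb = msg
    # A initiated with expected_peer; a reply naming anyone else means
    # the message was relayed through someone in the middle.
    return responder == expected_peer and rna == na
```

In the attack trace, I relays B's reply to A; with the fix, that reply names B while A expected I, so A aborts instead of exposing the nonce.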
[48] The general idea is to do striping across multiple storage servers, with parity, just like RAID. Zebra globs writes together, just like LFS [99].
Striping helps with
Striping is not done per file, which would limit the benefit (since the file would need to be chopped into tiny pieces), but per log segment. Per-file striping causes a discontinuity in the way small and large files are handled. If, on the other hand, small files are stored on fewer servers than the full array, then the space overhead of parity becomes huge.
The LFS approach was borrowed. Here, logging is used between a client and its fileservers (as opposed to the original LFS case, where logging was used between a client and its disk). Each filesystem client uses its own log and stripes its log segments, instead of the individual files.
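The parity mechanics can be sketched with plain XOR, as in RAID (illustrative only):

```python
def parity_of(fragments):
    """XOR parity over equal-length stripe fragments."""
    parity = bytes(len(fragments[0]))
    for f in fragments:
        parity = bytes(a ^ b for a, b in zip(parity, f))
    return parity

def reconstruct(fragments, parity, lost_index):
    """XOR of the parity and the surviving fragments recovers the lost one."""
    survivors = [f for i, f in enumerate(fragments) if i != lost_index]
    return parity_of(survivors + [parity])
```

Because XOR is its own inverse, any single lost fragment of a stripe can be rebuilt from the rest plus the parity fragment.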
File meta-data are managed centrally by a file manager. The logs contain only data blocks. The file manager also deals with cache coherence and namespace management. It has to be contacted for all opens and closes. It has a local ``shadow'' file for each file accessible through the Zebra filesystem. The local shadow file contains block pointers and such.
Segments are reclaimed by a stripe cleaner, similar to a segment cleaner in LFS. Instead of using segment summaries, as in LFS, the cleaner keeps track of usage information in stripe status files, which it keeps as normal ZebraFS files.
Deltas contain meta-information that clients write into their logs; deltas don't contain data. However, deltas are also read by the file manager, so that it can manage meta-data accordingly.
Recovery needs to cover:
The only critical resource is the file manager. Any single crash by another component can be dealt with on-line, without reducing availability. Recovery of any single failure can occur while the system is running.
Some points where things might not be as rosy:
[4] The general idea is: ZebraFS [48] completely decentralized. Any participant can take over the role of any failed component. Also, xFS employs cooperative caching, to use participants' memory as a global file cache.
Major assumption: a cluster of cooperating workstations (NoW), whose kernels trust each other completely. A cool application of this is using a NoW as a fast NFS server.
Three main problems solved:
Metadata are handled by managers. A shared, globally replicated manager map assigns a metadata manager to a file. The assignment of files to managers is done through assigning different bit areas of the inumber to different managers. New files are given appropriate index numbers so they are managed by a manager living on the same node that created them.
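A toy version of the manager map, assuming the low bits of the inumber index a globally replicated table (the bit width is my choice, not xFS's):

```python
def manager_for(inumber, manager_map, bits=4):
    """Some bit area of the i-number indexes the replicated manager map;
    everyone computes the same manager for a given file."""
    return manager_map[inumber & ((1 << bits) - 1)]
```

Picking a new file's inumber so its map slot points at the creating node is what keeps management local, as described above.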
Each metadata manager has an IMAP to locate the blocks of its files. In fact, an IMAP entry contains the disk location of the file's i-node, which contains the block addresses or the addresses of indirect blocks.
Storage servers are split into stripe groups. A stripe is split among fewer storage servers than all those available, to make sure that stripe fragments don't become too small or, conversely, that clients don't have to buffer enormous amounts of data. Each stripe group has its own parity. Therefore, simultaneous storage server failures in different groups can be recovered from. Also, clients can write to different stripe groups at full speed at the same time.
i-nodes are only cached at the managers, so that the manager doesn't get bypassed when a local cache can't satisfy a request. This was decided so that cooperative caching could always be preferred to disk access. Caching is done per file block, not per file and follows the write-invalidate protocol.
Recovery is accomplished through checkpointing, roll-forward and distributed consensus for membership list recovery.
[102] This is about one of the most popular remote filesystems on the Unix platform.
Design goals:
Built on top of SunRPC. Calls are idempotent, so lack of UDP reliability is no problem; retries can be used safely. "Stateless" means that all necessary information for a call is given inside the parameter list. The protocol is stateless, not its components (i.e., the server keeps state for the clients and each client keeps state for the server).
The fhandle(file handle) represents the server's FS-specific identifier for a file at a client, and it's opaque as far as the client is concerned.
The file access protocol (NFS) is different from the filesystem mount protocol. Mount protocols need OS-specific information to put a client-server association together, whereas the file access protocol is OS-unspecific.
Statelessness implies synchronous disk operations on the server. In other words, all modifications on the server are flushed before a call returns.
i-nodes are versioned, to avoid i-node aliasing (i.e., using a stale file handle on an i-node that has been recycled).
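A toy sketch of i-node versioning with generation numbers (all names are mine):

```python
class InodeTable:
    """A file handle is (inumber, generation); recycling the i-node bumps
    the generation, so all outstanding handles on it go stale."""
    def __init__(self, n):
        self.gen = [0] * n

    def handle(self, ino):
        return (ino, self.gen[ino])

    def recycle(self, ino):
        self.gen[ino] += 1     # invalidates every handle issued before

    def valid(self, fh):
        ino, g = fh
        return self.gen[ino] == g
```

A stateless server can thus reject a stale fhandle without remembering which clients hold handles.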
NFS motivated the design of v-nodes, as a way to provide uniform access to different filesystems inside the kernel.
Both protocols (NFS and MOUNT) were implemented in user space. The kernel vnode implementation of NFS jumped out of the kernel to have requests completed.
The hard issues on which the paper comments are, strangely enough, those which still haven't been fixed (give or take one or two).
All in all, a great idea but we should have moved along quite a bit since.
[63] Main goal: highly available operation for a network file system, even in the face of disconnection.
Workload is balanced at the granularity of subtrees. High availability is achieved through:
Caching is done on whole files. This simplifies disconnected operation and allows for more straightforward failure modes. Clients are given most functionality, to promote scalability.
The contents of the cache, in the face of disconnection, can be determined by the user, in the same way this is done when a user copies interesting files on a laptop before leaving a network access point and then reintegrates them back into the file repository, on reconnection.
When disconnected, the pessimistic approach would be to enforce consistency, by requiring an a priori acquisition of a shared or exclusive lock for access. In the case of involuntary disconnection, this is unrealistic. The optimistic approach, which is the one taken by Coda, allows all updates and hopes no conflicts will ensue; when conflicts do arise, they are resolved upon reconnection.
System calls are intercepted through v-nodes and serviced by a user level cache daemon (Venus). Venus can be in any one of three states:
[72] Basic idea is replicating storage servers through a primary storage manager. Furthermore, writes are aggregated through the use of the write-behind strategy, in conjunction with a UPS, to make sure that power failures don't kill the volatile write log.
Harp is not a file system. Rather, it's a replication extension written using the VFS interface. It strives to maintain Unix and NFS file system semantics.
Every file is assigned a server group. A server group consists of a primary server, a bunch of backups and a bunch of "witnesses". When any of the primary or backups is disconnected, or partitioned away, a new view of the server group is created, where one of the witnesses is promoted to serve as backup. Witnesses only store event logs (i.e., sequences of modifications on the file system). Logs are played back during recovery to resynchronize the state of the file system.
Modifications on the file system, as stored in event logs, are applied by an applier kernel thread. This thread goes through the event log and calls the appropriate local filesystem calls to have them propagated to disk.
[39] One of the earliest suggestions of device drivers, device-independent I/O interfaces, standard I/O, redirection.
Device-independent I/O was made possible by the memory mapped nature of Multics. Once a segment was made known (no matter what device it resided on), it could be accessed independently of its implementation or capabilities.
Device-specific data manipulations are done inside the ``Device Interface Module'', which is pretty much an object-oriented device driver in current terms. Its interface even provided for device-specific commands, similar to ioctls.
A very exciting paper to read, given the context in which it was written.
[43] Replication of files in a distributed, transaction-enabled environment.
The main points are:
Votes are assigned to a replica depending on the reliability of the host on which the replica resides. A very, very reliable machine causes replicas living on it to have many votes. When r, for instance, is lower than the number of votes of a very reliable replica, then reading that replica alone is good enough. Obviously, reliability must also cover consistency in this regard.
Caches of a file can be maintained close to a client, by creating a copy with 0 votes. They don't participate in the consistency considerations, except as far as their version number is concerned.
Updates to the vote configuration are supported exactly like a normal file modification (every replica contains a directory of all other replicas, which is 1-1 associated with the version number of the replica).
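The quorum arithmetic behind weighted voting can be sketched as follows (a simplification of Gifford's scheme; r and w are the read and write quorums):

```python
def quorums_ok(votes, r, w):
    """Read and write quorums must overlap (r + w > total votes) and two
    writes must conflict (2w > total), so every read sees the latest write."""
    total = sum(votes)
    return r + w > total and 2 * w > total

def can_read(collected_votes, r):
    """A read succeeds once the replicas contacted together hold r votes."""
    return sum(collected_votes) >= r
```

With votes [3, 1, 1] and r = 3, the single three-vote replica satisfies a read by itself, matching the point above about very reliable replicas.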
[106] A very interesting survey, supported by a large number of large enterprise installations. The issues brought up are somewhat outdated, but a large portion thereof still have a great significance.
Why is I/O an issue? Why do we want it optimized?
Main algorithms used: Scan, SSTF (shortest seek time first), cyclical Scan, others.
In some large systems, disk arms hardly move (because the disk is exclusively used for a particular kind of sequential application), so such matters don't really count. It doesn't seem that many current systems follow this pattern.
The farther ahead a sequential run is, the more likely it is to jump into a different run soon. So, the prefetch size should be adjusted according to the age of a run.
For both schemes, compaction can be a good general solution, although it comes at a cost of its own.
[19] The return of the Virtual Machines.
The goal is to take current operating systems to shared memory multiprocessors. Also, allow fixed-parallelism operating systems (like NT) to share more resources.
Main technique: virtualize everything (CPUs, memory, I/O devices). The virtualization layer is much smaller than a complete operating system, so it's also faster to develop and less prone to bugs. Fault containment comes naturally since each virtual machine can fail independently of others running alongside it (although more is needed for hardware fault containment, e.g., cellular DISCO).
The traditional problems with virtual machines have been:
Each virtual CPU is treated like a process in a conventional operating system. A table is maintained of most recent register contents (including privileged registers) and TLB entries, so that virtual CPUs can resume where they left off, when scheduled on a real CPU. Virtual kernels are executed in supervisor mode, which is not quite as powerful as kernel mode in the MIPS architecture. When a privileged instruction is attempted by the virtual kernel, the trap is handled by Disco, which changes the appropriate virtual privileged registers and jumps to the trap handler of the virtual kernel.
There are three levels of memory in DISCO:
When a virtual TLB is reloaded, we replace the mapping from virtual memory to physical memory with a mapping from virtual to machine memory. Any subsequent accesses to virtual memory get correctly translated to the corresponding machine memory without additional overhead. The ``page table'' that translates physical memory page numbers to machine page numbers is the pmap, which is kept for each virtual machine. The pmap also contains back pointers from physical memory page numbers to their corresponding virtual page numbers, to assist with invalidations when a page is taken away from the virtual machine.
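A toy sketch of the two-level translation, assuming dict-based page tables (illustrative only):

```python
class VirtualMachine:
    """The guest maps virtual -> physical; Disco's pmap maps physical ->
    machine.  On a virtual TLB reload, the composed virtual -> machine
    mapping is what gets installed, so later accesses pay no extra cost."""
    def __init__(self, guest_pt, pmap):
        self.guest_pt = guest_pt   # virtual page  -> physical page
        self.pmap = pmap           # physical page -> machine page
        self.tlb = {}              # virtual page  -> machine page

    def tlb_reload(self, vpage):
        mpage = self.pmap[self.guest_pt[vpage]]
        self.tlb[vpage] = mpage
        return mpage
```

Invalidation on page migration would walk the pmap's back pointers to find and drop the affected TLB entries.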
In MIPS, kernel code lives in an unmapped address space. Since this bypasses the TLB, it makes virtualization either impossible or too slow (as would be the case if every single instruction had to be emulated to fix memory references). Therefore, OSes on MIPS have to be slightly modified to put their kernels in mapped memory (this is, in fact, a change to the operating system itself; however, other architectures don't have such problems with memory mapping, so this shouldn't be a problem for Disco on them).
Virtual machines suffer more TLB misses than when the OS runs standalone. This is because the kernel also has to go through the TLB. Also, TLB misses are more expensive, since traps have to be emulated. To alleviate the problem, Disco maintains a second level, software TLB, practically extending the size of virtual TLBs. So, some virtual TLB misses are never sent to the reload code, since they're satisfied under the virtual machine.
Under the virtual machines, the NUMA-ness of the system is hidden from the operating systems. Pages migrate or get duplicated transparently to improve performance. Page access trends are measured through the cache miss counting facility provided by FLASH. When a page needs to migrate, the virtual TLB entries are invalidated and the page gets copied. Similarly, virtual TLB entries are downgraded to read share a page across multiple real processors. To find the right TLB entries, a memmap is maintained, mapping machine pages to sets of virtual machine/virtual address pointers.
I/O devices are virtualized through Disco-aware device drivers installed into the virtual OSes. Each device defines a monitor call command, which is used to pass all arguments of a device command to the monitor in one fell swoop. These calls perform virtual-to-machine memory translations for DMA requests, before a request is issued. Thereafter, the Disco device driver can talk to the device directly. Especially in the case of devices accessed by a single virtual machine, the I/O resource doesn't get virtualized at all, except for DMA requests. Disco's internal device driver interface closely resembles that of other operating systems, to simplify the use of ready-made device drivers within it. Disco, in fact, internally uses IRIX device drivers.
Real, modifiable disks are not virtualized. Only one virtual machine may mount a given disk. Others can share data on those disks via NFS, through the virtual machine that mounts the disk.
For non-persistently modifiable disks, Disco uses the concept of a copy-on-write disk. This is for disks with code or read-only data, where modifications don't need to be kept after a virtual machine reboots. Those are read-shared among the virtual machines, so that disk accesses to their pages are almost instant once the disk page is in memory. In other words, a shared secondary buffer cache exists beneath all virtual machines.
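A minimal copy-on-write sketch makes the sharing concrete; this is a toy model with made-up names, not Disco's implementation.

```python
# Sketch of a copy-on-write virtual disk: all VMs read-share pages of a
# read-only disk image; a write gives the writing VM a private copy.
# Names are illustrative.

shared_cache = {}          # disk page -> bytes, shared by all VMs
base_disk = {0: b"code"}
private_pages = {}         # (vm, page) -> private copy after a write

def read(vm, page):
    if (vm, page) in private_pages:
        return private_pages[(vm, page)]
    if page not in shared_cache:        # first access: one real disk read
        shared_cache[page] = base_disk[page]
    return shared_cache[page]           # later accesses hit memory

def write(vm, page, data):
    private_pages[(vm, page)] = data    # copy-on-write: private to this VM

assert read("vm0", 0) == b"code"
write("vm1", 0, b"patched")
assert read("vm1", 0) == b"patched"
assert read("vm0", 0) == b"code"        # vm0 still sees the shared copy
```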
Network interfaces are virtualized and given their own data-link addresses on different virtual machines. When a distributed file system protocol like NFS is used, pages are shared instead of copied whenever possible.
Most commodity operating systems have some sort of base hardware abstraction layer, to simplify portability. Making minor changes to that layer makes virtualization faster and easier. For instance, an OS can be made to use loads/stores to a specific address instead of certain privileged instructions that only read or change the contents of a privileged register.
It is extremely easy to run a specialized, thin operating system that can scale up to the entire multiprocessor without much work. Such is the case for special operating systems supporting heavy scientific applications. Commodity operating systems can run alongside unburdened scientific applications, which will take up all available computing resources whenever they find them unused.
[119] LOCUS was a distributed operating system running on heterogeneous machines. It had a distributed, replicated file system, supported nested transactions, provided location-transparent naming and so on and so forth.
Replication in a file system is great since it enhances availability and performance. However, it can be hellish for maintaining concurrency control.
There are multiple physical devices per filesystem (since it is replicated). Normally, physical devices are incomplete (i.e., they don't contain the entire logical file system). Version numbers are used to maintain concurrency. Access is transparent, regardless of the physical location of a file (except for differences in performance).
The synchronization site (CSS), the synchronization manager for a file system, is assigned per filesystem and covers all files within it. Updates are possible even when not all copies are accessible (i.e., when the network is partitioned). However, upon reintegration of the network partitions, a reconciliation mechanism is applied to bring all copies of a file in sync, especially if copies in different partitions were updated in parallel.
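The notes above mention version numbers; a version-vector comparison, sketched below with hypothetical names, shows the kind of decision reconciliation has to make (LOCUS's actual mechanism differs in detail).

```python
# Hedged sketch of reconciliation after a partition heals: if one
# replica's history dominates the other, the newer copy is propagated;
# if neither dominates, the copies were updated in parallel and the
# conflict must be escalated. Version vectors here are assumed, not
# LOCUS's exact scheme.

def reconcile(vv_a, vv_b):
    """vv_* map replica name -> committed update count."""
    keys = set(vv_a) | set(vv_b)
    a_ge = all(vv_a.get(k, 0) >= vv_b.get(k, 0) for k in keys)
    b_ge = all(vv_b.get(k, 0) >= vv_a.get(k, 0) for k in keys)
    if a_ge and b_ge:
        return "in sync"
    if a_ge:
        return "propagate A to B"
    if b_ge:
        return "propagate B to A"
    return "conflict: updated in parallel, escalate"

assert reconcile({"s1": 2}, {"s1": 1}) == "propagate A to B"
assert reconcile({"s1": 2, "s2": 0},
                 {"s1": 1, "s2": 1}).startswith("conflict")
```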
The CSS is maintained in the mount table, which is globally replicated. Subsequent accesses to a file (after the CSS has been consulted) don't have to involve the manager.
Directory path traversals are done a component at a time, but shipping multiple components to a remote storage server for a directory was considered.
File operations are atomic. Commit and abort are supported. Old data versions are kept through shadow paging. Update propagation is also done in an atomic way, in case contact is lost halfway through the update.
File creations and deletions are handled just like file modifications. Storage Servers (SS) for new files are chosen among those of the parent directory.
For architecture-specific file location (different executables for different architectures), a hidden directory is created with the name of the ambiguous file. Inside this directory are files named after the different architectures (e.g., /bin/ls/i386 is the file that gets executed when I run /bin/ls on an i386 machine; on a SPARC machine, /bin/ls/SPARC is executed instead).
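The hidden-directory lookup amounts to a one-level indirection, sketched here with made-up names.

```python
# Toy sketch of LOCUS-style hidden-directory resolution: /bin/ls is
# really a hidden directory whose entries are named after architectures.
# The fs structure and names are illustrative.

fs = {"/bin/ls": {"i386": "/bin/ls/i386", "SPARC": "/bin/ls/SPARC"}}

def resolve(path, arch):
    entry = fs.get(path)
    if isinstance(entry, dict):     # hidden directory: pick by machine type
        return entry[arch]
    return path                     # ordinary file: no indirection

assert resolve("/bin/ls", "i386") == "/bin/ls/i386"
assert resolve("/bin/ls", "SPARC") == "/bin/ls/SPARC"
```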
Processes can be forked locally or remotely in a transparent way. A combined fork and exec has been provided, to avoid futile address space copying (especially over a network link). Shared resource semantics (open files, pipes) is handled through a token technique, whereby access is only allowed when the token is held by the node.
Within a partition, shared resources are strictly synchronized. Upon reintegration, conflicts are resolved automatically for understood resource types (say, mailboxes); otherwise a higher-level arbiter is required. Partitions aren't just disconnections through router failure; they can be due to maintenance, negligence and so on.
Primary copy, majority consensus and weighted voting can allow updates on only one of the partitions.
Both bundled updates and unrelated updates are handled.
Reconfiguration involves:
The three reconfiguration protocols are combined, preventing deadlocks by using protocol ordering (i.e., being strict on what runs when) and node ordering.
[65] An intermediate construct, between tightly-coupled multiprocessors and loosely-coupled distributed systems. It's like a NetworkOfWorkstations approach, only for VAX machines.
The interconnect ports provide both a guaranteed, reliable, in-order message service and a UDP-like, unreliable service. They also allow direct transfers of block data from virtual memory to virtual memory without the intervention of the operating system. Data is sent through the interconnect ports using 7 queues (4 outgoing data queues at 4 different priorities, 1 response queue for incoming data, and 2 pool queues holding free messages and datagrams). The host program need only place the information it wants sent in those queues; the interconnect hardware reads the queue elements in priority order and sends them out without bothering the CPU. Incoming data is placed in a queue, and an interrupt handler can deal with all incoming packets in the queue before returning, significantly reducing interrupt overhead.
Disk systems (a disk with an interconnect component) can also participate in the network, providing storage facilities for all machines in the cluster. Access to those disks is also provided through the message-oriented interface described above.
Device Management is greatly simplified:
On each node, connection managers deal with locating resources, detecting recoverable failures, integrating new cluster members and so on. A quorum scheme is used to determine cluster membership. When operational members become fewer than the quorum value, then the cluster suspends activity until enough nodes have come back up. If a node fails and restarts before its membership times out, it's allowed to rejoin. Otherwise, it's told to reboot, before it can reenter the cluster.
Synchronization is handled by the distributed lock manager. This deals with awarding, maintaining and releasing locks of different types (shared, exclusive read-only, exclusive, ...) on objects of different types (from an entire file system to specific records of a file). For each hierarchical tree of objects, the master is the node that first requested a lock on the root of the tree. It has to keep track of granted locks and queues of waiters for incompatible locks. Lock masters are located through a lock directory, which associates a master with a resource name. Directories are also handled in a distributed manner. Given a resource name, a hash function yields the owner of the directory for that name (assuming a specific, well-known number of members in the cluster).
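The directory lookup can be sketched in a couple of lines; the hash choice here is illustrative, not the one the real system used.

```python
# Sketch of the lock directory lookup: any node, given a resource name
# and the well-known cluster membership count, computes the same
# directory owner. The hash function is an assumption for illustration.

import zlib

def directory_node(resource_name: str, n_members: int) -> int:
    return zlib.crc32(resource_name.encode()) % n_members

# Every node agrees on the directory owner for a given name:
assert directory_node("fs/vol0/rec17", 4) == directory_node("fs/vol0/rec17", 4)
# Changing the membership count can move directories to other nodes,
# which is why a cluster transition forces locks to be redistributed.
```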
During a cluster transition (e.g., the addition of a new member, or the departure of an old one), operations halt temporarily. All lock managers deallocate locks acquired on behalf of other systems. Then, each lock manager reacquires the remote locks it was holding in other systems, thereby redistributing masters and directories (since the membership number has changed).
All lock manager operations are optimized when applied locally.
This system was unique in that software and hardware were being designed together, with a specific common goal: the creation of a clustered system of simpler, commodity-like machines. When the hardware is given, many more compromises than listed here have to be made.
[23] It is an operating system for a shared-memory multiprocessor. It strives to minimize the impact of failures of individual processors on the operational state of the entire multiprocessor. Main tools:
This approach creates the following challenges:
Fault containment has three components:
A firewall is a feature of the FLASH architecture, whereby a processor can associate a 64-bit vector with each memory page in its local memory. Each bit in this vector represents a processor in the system (if there are more processors than bits, some sort of aggregation is done). A write attempt by a processor not allowed to write a remote page results in a bus error.
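The per-page check is just a bit test, sketched here with hypothetical names.

```python
# Sketch of the FLASH firewall check: each local page carries a 64-bit
# vector with one bit per processor; a write by a processor whose bit
# is clear raises a bus error. Names are illustrative.

class BusError(Exception):
    pass

def write_page(firewall_vector: int, writer_cpu: int, do_write):
    if not (firewall_vector >> writer_cpu) & 1:
        raise BusError(f"cpu {writer_cpu} may not write this page")
    do_write()

vector = (1 << 0) | (1 << 5)          # only CPUs 0 and 5 may write
write_page(vector, 5, lambda: None)   # allowed
try:
    write_page(vector, 3, lambda: None)
    denied = False
except BusError:
    denied = True
assert denied
```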
Memory sharing can occur in two ways:
Faults can appear in both software and hardware. In hardware, a faulting processor can cause the interconnect to malfunction, or cause the memory lines it has access to to misbehave. In software, the errant kernel is itself the component responsible for fault containment; so when the kernel is misbehaving, wild writes cannot be prevented unless there is hardware support (for example, in the coherence controller) or a trusted, lower-level software component (as is the case in a microkernel architecture).
Three types of fault ``contamination'' have been identified and dealt with:
All cells monitor all other cells for signs of insanity using heuristics (for example, timed-out RPC calls, bus errors trying to access remote memory, remote clock monitoring, failed consistency checks on remote, internal pointers). Once an indication has been registered, normal operation of the system halts temporarily, and all cells engage in a distributed consensus protocol, to decide whether the suspect cell is in fact faulty. If the consensus is affirmative, then the faulty cell is rebooted. This second phase guards against insane cells pronouncing every other cell faulty and automatically rebooting them. Furthermore, cells accusing others of insanity, if voted down twice in a row for the same accusation, are considered corrupt.
Recovery is performed in two phases. In the first phase, TLBs are flushed and remote mappings are removed from process address spaces. After a first barrier has been crossed by all, the second phase cleans up virtual memory, resetting all firewall write permissions and discarding pages that were writeable by the faulting node. After the second barrier is crossed, cells can resume operations.
[15] Extend the operating system services, to make the system correspond better to the requirements of specific applications. In this way, different clients can get independent specializations.
Provide
All that is given through the following techniques:
Comparison to other approaches:
Protection in SPIN is supported through capabilities and protection domains.
Capabilities are unforgeable language-level pointers. They're supported and protected by the language (Modula 3) and impose no run-time overhead. The kernel externalizes them through indices into a kernel-level capability pointer table (much like file descriptors in Unix).
Protection domains are interface name spaces. They're created from ``safe'' object files (i.e., object files signed by a trusted Modula-3 compiler, or asserted to be trusted by an administrator). Such namespaces can be combined, resolved and so on.
The Dispatcher is a big bottleneck, especially when multiple guards need to be evaluated. Event handlers are, in general, very deeply trusted once their credentials have been validated. In other words, a trusted misbehaving extension (for example, a buggy extension), can wreak havoc in the system in many ways (an example is holding down a lock on a resource, causing starvation on threads of control needing that resource).
[42] The idea behind the whole thing is being able to provide compression algorithms with better statistical models of code, so that they can improve compression efficiency.
This is important, since code compression is becoming very attractive, both to reduce transmission speed of code over a network, and also to reduce access times from secondary storage and footprint.
The statistical model of a chunk of code is useful, because it determines how output symbols are assigned to input symbols: frequently occurring input symbols are assigned shorter output symbols, since the overall goal is to minimize the length of the output symbol stream.
The objective of a modeller is to output a bunch of contexts, which are identifiable states in the linear scanning of a code stream where symbol probabilities are relatively stable, and their associated probability distributions. To determine such contexts and probability distributions, a modeller checks a specific set of predictors, such as stack length, instruction under the cursor, etc., to output something like the following decision tree, which, for each decision through the predictor set, yields a probability distribution:
if instruction = comparison then use distribution_that_favors_branches
else if stackHeight = 1 then use distribution_that_precludes_binary_operators
else use other_distribution

These decision trees are created by a simple algorithm that attempts to minimize the total entropy of the code stream. Implementations of the algorithm are still too slow for on-the-fly model inference, but they are very useful for long-term compression (such as that applied before code is stored on stable storage).
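A toy computation shows why splitting on a predictor lowers total entropy; the streams and predictor here are made up, and the paper's algorithm is considerably more involved.

```python
# Illustrative sketch: total entropy of a symbol stream drops when a
# predictor separates the stream into more predictable sub-streams.

from collections import Counter
from math import log2

def entropy_bits(symbols):
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c * log2(c / n) for c in counts.values())

# Pairs of (instruction-under-cursor, next symbol), wholly predictable
# from the instruction in this contrived stream:
stream = [("cmp", "branch")] * 8 + [("add", "store")] * 8

unsplit = entropy_bits([sym for _, sym in stream])
split = sum(
    entropy_bits([sym for ins, sym in stream if ins == key])
    for key in {"cmp", "add"}
)
assert split < unsplit   # the predictor "instruction" lowers total entropy
```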
Commonly used predictors are:
[104] Hardware considerations for a ring protection mechanism.
Criteria to judge access control mechanisms:
The above criteria are usually combined by keeping the latter three constrained within reasonable ranges, and trying to provide as much of the former.
In this work, the classic Multics segmentation architecture is assumed, ignoring paging.
Protection rings were motivated by the need to change a process' privileges, as it is executing different code. Note that in Multics, a process is loosely equivalent to a user shell. Different segments are equivalent to programs/processes in Unix.
A protection ring is a generalization of protection domains. Instead of only having user and supervisor domains, there now exist r domains, totally ordered in terms of the privileges they convey (0 being the most powerful). Each process executes in a single ring at any time. Each segment defines a privilege bracket for every possible right (read, write, execute) when the right is enabled by the associated flag. A process has the right if and only if its ring lies within the associated bracket for the segment accessed.
A process can increase its ring freely, since ring increase means monotonic reduction of privileges. Ring decrease, however, is heavily controlled, since it imparts additional privileges. Special locations within a segment are defined, called gates. A process can raise its ring number to that necessary to execute a particular segment only when it jumps directly into one of those gates. The ring decrease is only allowed when jumping into the gates from a particular range of rings (called the gate extension). The gate extension effectively describes the rings from which a downward jump into a privileged segment is allowed. For rings above the gate extension, a downward jump is disallowed. Gates and gate extensions are stored in the descriptor of a segment and are originally read from the access control list associated with a segment on storage.
Execute and write rights always overlap only in a single ring, if they are both allowed on a segment. This makes sure that only the ring with the highest execute privileges is allowed to write a segment.
For an example, consider a segment A with a 0-3 write bracket, a 3-4 execute bracket and a +2 gate extension. A process in ring 5 may jump into one of A's gates and lower its ring to 4, since ring 5 is within 2 of ring 4 (the top of the execute bracket of A). However, a process in ring 7 may not jump into one of A's gates (and therefore may not lower its ring through A), because ring 7 is not within 2 of ring 4 (the top of the execute bracket of A).
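The gate rule from this example can be expressed as a one-line predicate; this is a minimal sketch of the check, under the reading of the brackets given above.

```python
# Sketch of the ring gate rule: a downward jump into a segment's gate
# is allowed only from rings in (exec_top, exec_top + gate_extension].

def can_enter_gate(caller_ring, exec_top, gate_extension):
    return exec_top < caller_ring <= exec_top + gate_extension

# Segment A from the example: execute bracket 3-4, gate extension +2.
assert can_enter_gate(5, exec_top=4, gate_extension=2)       # ring 5: allowed
assert not can_enter_gate(7, exec_top=4, gate_extension=2)   # ring 7: too far
```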
Calls and returns can cause problems when the protection domain changes between caller and callee. Caller and callee environments must be independent (i.e., the environment of lower-ring segments should not be accessible by higher-ring segments). For example, caller and callee stacks must be independent: stack records written at ring n should not be accessible at ring m > n. Otherwise, consider segment A executing at ring n and calling segment B executing at ring m. If B had access to A's stack records, it could alter them, redirecting A's eventual return to code A never intended to execute. This is solved by having a different stack segment for each ring, making sure that the stack segment for ring n is not accessible at any ring m > n.
In the case where the call lowers the ring number, the callee must ensure it doesn't rely on the caller for its correct operation (since the caller is, in fact, less trusted). The caller must make sure it doesn't allow the callee to circumvent trust boundaries:
Calls at the same protection domain work exactly as described above. However, upward calls also have problems.
A suggested ring organization for Multics would be:
Protection rings don't deal with ``mutually suspicious programs'', running with the same amount of privileges over a trusted arbiter.
[5] Integration of heterogeneous networks at the application level.
Target application is telephony (and telephony-related services, like voice mail). The main contribution is an architecture for connecting telephony networks with the Internet, and the associated protocols for using packet-based transports to carry logical channels.
Its main components are:
[41] Basic motivation: networked devices are very diverse in capabilities. How can the same content be delivered and used at all these different devices, without being unduly duplicated or delayed?
Axes of variability:
General idea: perform lossy on-demand content type conversion, to minimize network latency and bandwidth requirements. This goal can be achieved through the following three techniques:
The semantic type (text, image, video) and the encoding (postscript, mpeg, ...) are logically independent. It is the content that's distilled, to reduce detail or quality (e.g., image resolution or color depth), but some operations are easier in some encodings than others. Refinement, on the other hand, allows the user to decide whether the full-blown, ``expensive'' version of some content should be downloaded, by looking at the ``cheap'', distilled version. Refinement amounts to an informed zoom-in operation, in terms of quality.
Besides, on-the-fly compression can also be customized to the particular needs of the client that initiated it (as opposed to using pre-distilled stored versions of the original content, and pigeonholing clients to available versions).
Current techniques for dealing with client variability include providing multiple versions of the content (which means a bigger administrative mess and still leaves some requirements out), or just supporting one side of the fight (i.e., either the low-end clients or the extra-heavy-duty high-end clients). Even when coarse-grained refinement is allowed (as is the case with web browsers supporting progressive image encoding techniques), it's done in a way that ignores the semantics, or value, of different parts of the content, thereby treating bundled content indistinguishably.
The basic components of the proposed architecture are as follows:
[1] INS is a resource discovery system for mobile networks, based on attribute-value pair ``names''. Names describe what is sought, not where the sought object lives. Message routing and name resolution are bound together (the authors call this ``late binding''), so that physical name mappings can change even during a long-lived session. However, reliable delivery cannot be guaranteed with late binding.
The expected mobility can come in different flavors:
Example abstract names one can resolve in this scheme are: ``the closest geographically, least loaded printer'', ``the least loaded fileserver with the best network performance to where I am'', ``all mobile cameras in section A of this building''.
Names belong to an attribute hierarchy. The nodes of the hierarchy are not attributes; they are attribute-value pairs. This allows the children of a node to change structure, depending on the value of the parent. A good example from the paper is that ``country=us'' has a child ``state=CA'', but ``country=canada'' has a child ``province=BC''. If we, instead, had hierarchies of attributes and values, under country we'd have to deal with both provinces and states.
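A tree keyed on attribute-value pairs, rather than alternating attribute and value levels, captures the point; this is a toy sketch with the paper's example values.

```python
# Sketch of an INS-style name tree whose nodes are attribute-value
# pairs, so children can differ by parent value: "country=us" has
# "state=..." children while "country=canada" has "province=..." ones.
# The tree contents are illustrative.

tree = {
    ("country", "us"): [("state", "CA"), ("state", "NY")],
    ("country", "canada"): [("province", "BC")],
}

def children(attribute, value):
    return tree.get((attribute, value), [])

assert ("state", "CA") in children("country", "us")
assert all(attr == "province" for attr, _ in children("country", "canada"))
```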
Intentional Name Resolvers are both name resolvers and message routers (since, in INS, messages piggyback data with the intentional name of their destination). Services attach themselves to an INR, which advertises the attribute-value characteristics of its attached services to its neighboring INRs.
Delivery to names can be one of the following:
The first two options belong to the ``late binding'' technique, where name resolution and forwarding are done by the naming infrastructure. The third option separates forwarding from naming.
For example, the name ``color printer'' would resolve to the closest, fastest, highest-resolution color printer to where I am (if my cost function is proximity, speed, resolution) in ``intentional anycast''; in ``intentional multicast'' the name would resolve to all color printers I can reach. In early binding, it would just resolve to the network addresses (IP addresses) of the color printers I know are out there, along with their metrics, as I have them in my cache.
INRs organize themselves into a (not necessarily minimum) spanning tree, using average delay (round-trip time) as the cost metric. Every new INR in the INR cloud pings all others to figure out which currently active INR is the closest one (using the delay metric). It picks that one as its next hop neighbor. The tree is not minimum and it's not robust (it has many single points of failure).
Periodic system-wide updates take the form of flooding. A new name-specifier is made known, along with its associated IP address, port number and transport type, a routing metric, the next hop INR along with its delay metric and a unique identifier of the announcer.
The list of currently running INRs is maintained and provided by a central Domain Space Resolver.
Clients can either request name resolution (along with forwarding, in the case of late binding), or discovery. Clients themselves also have intentional names (so that services can send them their results).
Services pick the metrics on which they wish to receive routing (for instance, their average load, or their print resolution). They also set an announcer ID for themselves (to differentiate them from other similar services running on the same node). The ID is a concatenation of the IP address with the startup time of the server process.
The main design goals addressed are:
Performance is significantly hurt by the CPU-intensive name resolution and update processes. To alleviate the problem, partitioning services into different virtual spaces has been proposed and tried.
In p. 188, fourth paragraph, why is forwarding the message to the ``best'' end node so much better and dynamic than early binding? Is this to save against changes effected between the time of the resolution and the actual use of the return network addresses?
In p. 189, figure 1, the service announcing an intentional name is in the middle right of the figure, not the bottom right.
In p. 190, last paragraph. Why do we need the routing tables? We already have routing tables at the IP level. Why do we route through INRs? Perhaps to catch as yet unsettled updates?
There's no information on how a client finds and/or picks an INR.
It's not clear how v-spaces remain balanced.
The performance evaluation is rather superficial. Arbitrary distributions are assumed for the growth of name trees and the behavior of virtual spaces.
It is not clear why late binding is useful. If the first-hop INR is rapidly updated, which is what the architecture strives to accomplish, then why should the system go through the INR network to deliver the data?
[84] A short, fairly forward-looking, thesis paper on using content-based naming. The examples are not exactly obvious. Better examples exist now of why content-based (or intention-based[1]) naming is rapidly becoming preferable. Mobility is definitely one very good example domain.
It's interesting to read pre-web literature. And this is merely 7 years old.
[97] A very dense, very impressive paper (received the best student paper award in SigComm '99), modeling and analyzing soft-state protocols.
Soft state was first coined by Clark in a SigComm '88 paper on the design philosophy of Internet Protocols, though not quite fully defined. What makes state soft is that it can expire. As long as there are refresh actions keeping it fresh, the state remains. When the refreshes cease to appear, the state eventually goes away. This makes recovery under failure much easier than having to deal with explicit, ``hard'' state that requires active restitution in the face of failure.
Soft state requires extra messages, since it relies on refresh.
The Internet is the perfect domain for soft state protocols, since it contains endpoints with a wide range of failure rates, reliability guarantees and performance characteristics. Soft state-based design fits the Internet better, since it calibrates the design process to anticipate conditions that occur quite often on the Internet.
This paper proposes a model for soft state protocols and defines a consistency metric on it. It also suggests a soft state-based transport protocol framework as a building block for the design of soft state protocols.
The consistency metric c(k,t) for an item k at time t between sender S and receiver R is defined as c(k,t) = 1 if S.val(k) = R.val(k) at time t, and 0 otherwise, where P.val(k) is P's value of k. Given this definition, instantaneous system consistency can be defined as the average consistency over all items in the database at time t and, similarly, average system consistency can be defined as the time average of the instantaneous system consistency over a long time period. A system is eventually consistent when the instantaneous system consistency approaches 1 a long time after the last item has been inserted in the sender's database.
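The per-item metric and its average over the database can be sketched directly; the databases here are hypothetical.

```python
# Sketch of instantaneous system consistency: the fraction of the
# sender's items whose values match at the receiver.

def instantaneous_consistency(sender_db, receiver_db):
    if not sender_db:
        return 1.0
    matches = sum(receiver_db.get(k) == v for k, v in sender_db.items())
    return matches / len(sender_db)

sender = {"route1": "a", "route2": "b", "route3": "c"}
receiver = {"route1": "a", "route2": "stale"}   # route3 never arrived
assert instantaneous_consistency(sender, receiver) == 1 / 3
```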
Periodic updates are good enough to foster eventual consistency over a lossy medium, but they can be expensive (redundancy) and can cause undue propagation delays.
Unfortunately, most of this section went right over my head. However, the gist of it describes a simple refresh scheduler (FIFO) and analyzes its performance and bandwidth wastage. The simple FIFO scheduler maintains system consistency between 85 and 95% for loss rates in the 1-10% range (with a 15% probability that attribute-value pairs will be removed from the sender). For loss rates as high as 50%, and a pair removal probability of 10%, bandwidth waste is calculated at around 90%.
A better way to manage refresh transmissions is to use the tried and true technique of ageing data. Retransmissions (refreshes) are considered less important than new transmissions (of new attribute-value attributes or updated ones). Therefore, there now exist two transmission queues at the scheduler, a ``cold'' one (refreshes) and a ``hot'' one (new). Through simulation, the authors have shown that when the bandwidth used by the hot queue is slightly higher than the rate of update at the sender, system consistency is maximized. Furthermore, update latency decreases as the bandwidth used by the cold queue is increased.
To make retransmissions more effective, a receiver feedback mechanism is added in the scheme, where detected losses by the receiver create a negative acknowledgement. When an item is declared missed by the receiver, it is moved into the hot queue again by the sender. In other words, the hot queue contains not only new transmissions but also requested retransmissions; the cold queue contains only spontaneous refreshes. When the bandwidth used by feedback traffic (NACKs) is between 20 and 60% of the total bandwidth, consistency reaches almost perfect levels.
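The two-queue scheduler with receiver feedback can be sketched as follows; the class and policy details are a simplification, not the paper's exact design.

```python
# Minimal sketch of the hot/cold refresh scheduler: new or updated
# items and NACKed items go to the hot queue; spontaneous refreshes
# cycle through the cold queue. Names are illustrative.

from collections import deque

class Scheduler:
    def __init__(self):
        self.hot = deque()    # new transmissions + requested retransmissions
        self.cold = deque()   # periodic refreshes of already-sent items

    def update(self, item):   # sender changed or added an item
        self.hot.append(item)

    def nack(self, item):     # receiver reported a loss
        self.hot.append(item)

    def next_to_send(self):
        # Hot traffic gets priority bandwidth; everything sent is
        # requeued for later spontaneous refresh.
        item = self.hot.popleft() if self.hot else self.cold.popleft()
        self.cold.append(item)
        return item

s = Scheduler()
s.cold.append("k0")       # k0 was sent earlier and is being refreshed
s.update("k1")            # a fresh update
s.nack("k0")              # the receiver missed k0
assert s.next_to_send() == "k1"   # new update goes out first
assert s.next_to_send() == "k0"   # then the NACKed retransmission
```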
This is just a nice, packaged form of the theory derived in the first part of the paper. It allows applications to specify what levels of consistency they expect from the transport layer. Then SSTP arranges the bandwidths used by the hot and cold queues to achieve that consistency. It also relies on receiver reports to be informed of the observed loss rate. Finally, it requires some separate module to tell it what the available bandwidth is.
SSTP provides upcalls to the application, to allow it to change its behavior when the consistency ``guarantees'' are not maintained. Such is a case when the application feeds data into SSTP faster than the bandwidth it has allocated for the hot queue.
To deal with large data spaces, SSTP allows applications to put their ``soft'' data in hierarchical name spaces. Transmission and update of such name spaces is done in a recursive descent-based manner. In this way, some receivers might decide to skip parts of the name tree (and, therefore, not request those parts).
Despite the importance of the contained work, this is a really sloppy paper, apparently not very carefully proofread. Especially the analysis sections (3-5) lack basic continuity, like explaining what some of the symbols used signify and the like. The graphs are minimally captioned and not adequately explored in the text. There's a figure that's never even mentioned in the text. This is a real shame, since this is an otherwise extremely cool paper. Pity.
[87] This deals with the concept of picking one Internet host among many of the same class, based on some application-specific characteristics.
Should this be implemented at the IP level, there are several important and difficult issues to deal with:
This is an early attempt at ``what''-not-``where'' addressing and naming. Obviously, the criteria selecting members from anycasting groups can only be those understood by routers (hop counts or link bandwidths at the most). For more advanced criteria, application-level smarts will be required. Furthermore, the hackiness of it all is just mindboggling.
Bottomline is, too much overloading (in the OO sense) can be bad for your designing health.
[113] A high-level presentation of active network research.
Basic idea of active networks: network switches perform customized computation on the messages flowing through them.
There are two ways in which active networks have been visualized:
Program encoding must, in general, support mobility (i.e., be independent of platform and location), safety (i.e., provide ways to restrict the accessible system resources), and efficiency (i.e., make sure that in most common cases performance is not horribly penalized).
The kind of functionality available would cover packet manipulation, access to switch environment (time, address, link status, etc.), packet flow manipulation.
The program model must also define a common resource model (routing tables, link status, routing capabilities). Defining a secure resource allocation mechanism is a hard problem.
Systems with use in this framework include Safe-Tcl, Java, Omniware and Proof-carrying code.
Some additional issues/disadvantages not clearly mentioned are
[117] Another take on the deployment of wide-area services.
The spin is that instead of plugging the code where the data are going to pass through (which is what Active Networks do [113]), we map the path name to a sequence of operators (i.e., we let the path be calculated dynamically).
This is an overlay mechanism, in that IP-level routing is not supplanted.
Example applications would be replica selection in a replicated framework, mobile distillation (where the distillers can move to minimize perceived latency), or customizable client caches.
This is yet another approach to naming things using the ``what'', not the ``where'', otherwise called Intentional Naming [1]. Intentional Naming bundles location and transport in one nice package, which allows location to be swayed by available transports or, conversely, transport be swayed by how we named the endpoint. In contrast, ``Extensional Naming'' breaks up location and transport. This might seem simpler (since it is conducive to modularized solutions), but it implies some hard state: If a name resolves to a transport address, this address must be continuously supported, though it might have nothing to do with what is named (e.g., in mobile systems).
Active names pay the mobile code premium once per session, whereas active networks pay it per packet. This allows active networks to change their behavior per packet, which active names cannot.
Their main design goals are:
The scheme works like this:
In the scheme above, the location of NS may also be an active name. Obviously, there needs to be a ground case in this recursion. Such is the case for some primitive services (like HTTP or DNS).
Resolvers are pretty much compute servers with some necessary functionality:
The paper compares active naming to active networks and active services before it has even explained what exactly active naming is (the authors do discuss it, but in terms too fluffy to actually be helpful). This can be very annoying.
Unfortunately, according to the authors, there are many security issues that still need to be resolved.
The described applications of the scheme are pretty convincing, especially in the case of mobile distillation.
[67] A great quote from the paper:
"There are already enough names.
One must know when to stop.
Knowing when to stop averts trouble." (Tao Te Ching)
Context: Name service for all computers and all people in the world. In general, only very few pairs of principals will have any communication exchanges at all. Assume the names change slowly, and the properties associated with names also change slowly. The name service is also large, long lived, highly available, invulnerable to local faults and tolerant of mistrust. It all sounds a lot like a hierarchy.
From the client's standpoint, a hierarchical name tree is seen. Symbolic links are supported. The top part of the tree consists of directories. Each directory has a unique identifier (something like an i-node in a Unix File System hierarchy). From each super-directory, subdirectories can be reached through named arcs (file names in a Unix File System hierarchy). So, nodes in the tree are independent of the names through which they're reachable. From directory leaves, value trees spawn. Value trees are also name trees where an arc is a name. Nodes contain time stamps and a present/absent flag.
In the above scheme, we can name single values, sets of values, or subtrees. Each directory has an access control function that maps principals and name paths into rights (R, W, test). Each directory has an authentication function from an encryption key to a principal. Some mappings in those functions are maintained externally, since no Public Key Infrastructure is assumed (i.e., by a courier).
At the administrative level, replication and subtree/server associations are visible. Every tree directory arc, except for the directory identifier it points to, also contains a list of servers on which that directory identifier lives (any of them can be picked). Servers are also named in the name tree.
Updates originate at one of the copies and are later propagated to the rest of them. This happens during a sweep operation, which collects updates from all copies of a directory and then spreads them through all of them, resetting their last-sweep timestamp. The entire set of servers can't be reliably retrieved from the parent (after all, the parent might be undergoing an update as we check it). So all servers are connected in a ring, through which a sweep proceeds to update all of them. Disconnections, network partitions and failures disrupt the ring. Ring reformations are performed manually by an administrator, and they mark the advent of a new epoch.
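The two passes of a sweep can be sketched as follows (the class and function names are mine, not the paper's; the ring is just a list of replicas here, and failure handling is omitted):

```python
# Sketch of the sweep described above: collect the newest timestamped
# entries from every replica around the ring, then redistribute the
# merged view and reset each replica's last-sweep timestamp.

class Server:
    def __init__(self, name):
        self.name = name
        self.entries = {}      # key -> (value, timestamp)
        self.last_sweep = 0

def sweep(ring, now):
    # Pass 1: merge, keeping the entry with the latest timestamp per key.
    merged = {}
    for server in ring:
        for key, (value, ts) in server.entries.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    # Pass 2: redistribute the merged state and stamp the sweep time.
    for server in ring:
        server.entries = dict(merged)
        server.last_sweep = now
    return merged

a, b = Server("a"), Server("b")
a.entries = {"x": ("old", 1)}
b.entries = {"x": ("new", 2), "y": ("only-b", 1)}
sweep([a, b], now=10)
assert a.entries["x"] == ("new", 2) and a.entries["y"] == ("only-b", 1)
```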
Administration operations under different subtrees are independent and need not be coordinated. When, however, trees are merged together under a new root (i.e., name services are combined into a new one), a scheme has to map from the old well known root (for which an identifier, not a name, was known) to a node in the new tree. This is done in a table of well known nodes, which point from a directory identifier that used to be a root, to a path in the current scheme. For every such reorganization, the well known node table has to be translated to the new tree; this is done infrequently enough, that the extra computational overhead is not an issue.
Restructuring, that is moves of entire subtrees under another subtree, is handled using symbolic links, so that paths leading to the old location of a subtree have pointers to the new location of the same subtree.
To permit caching, each entry has an expiration time. No entry can change before its expiration time has passed (except, perhaps, it can move to a different subtree, as long as there's a symbolic link to its new location). Therefore, clients can safely cache copies of past lookups, until their expiration time.
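The caching rule above reduces to a tiny expiry-checked lookup table: since an entry cannot change before its expiration time, a cached copy is safe until then (a minimal sketch; the names are mine, not the paper's):

```python
# Minimal sketch of expiry-based client caching: an entry is trusted
# until its expiration time, after which the client must re-resolve.

class LookupCache:
    def __init__(self):
        self.entries = {}  # name -> (value, expires_at)

    def put(self, name, value, expires_at):
        self.entries[name] = (value, expires_at)

    def get(self, name, now):
        hit = self.entries.get(name)
        if hit is None:
            return None
        value, expires_at = hit
        if now >= expires_at:      # stale: must go back to a server
            del self.entries[name]
            return None
        return value

cache = LookupCache()
cache.put("bob", "server-1", expires_at=100)
assert cache.get("bob", now=50) == "server-1"
assert cache.get("bob", now=100) is None   # expired, evicted
```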
Value updates must be:
On this basis, updates are specified on timestamped paths (i.e., every component of the path has a timestamp identifying the particular node the client intended to follow). Earlier timestamps are superseded by later timestamps.
Directory lookups reflect the state of a directory after all updates prior to the sweep time of the directory, and some of the updates after that sweep time.
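The supersession rule for timestamped updates can be sketched in a few lines (an illustrative store keyed by path; the structure is my assumption, not the paper's):

```python
# Sketch of timestamp-superseded updates: each node keeps the value with
# the latest timestamp seen so far; updates with earlier timestamps,
# arriving late or out of order, are simply ignored.

def apply_update(store, path, value, ts):
    current = store.get(path)
    if current is None or ts > current[1]:
        store[path] = (value, ts)

store = {}
apply_update(store, ("com", "dec", "bob"), "mailbox-1", ts=5)
apply_update(store, ("com", "dec", "bob"), "mailbox-0", ts=3)  # late: ignored
assert store[("com", "dec", "bob")] == ("mailbox-1", 5)
```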
Every server stores a pointer to a server that's closer to the root than itself, unless it stores the root itself. Also, a server cannot contain the only copy of the directory that leads to it from the root. Otherwise, an infinite loop trying to locate that directory would ensue.
Network partitions require user intervention to be dealt with. That sounds kind of silly.
No mention of time skew. Times are assumed globally synchronized.
This pretty much describes how DNS works except for a few differences:
[26] What do you do when you want to actually use the name service much much much more frequently than [67] allows? Say, for global file name service. Assume the existence and full deployment of multicast.
Three target classes of operations:
General idea is to split the name hierarchy into multiple levels with different characteristics:
At the managerial level, a name server is collocated with the object manager of the objects it handles. Once it receives a named request, it can immediately dispatch it to the right object. Similarly, when the manager is down, so is the name server. Getting to the right name server/manager from a client is the job of prefix caches, and realizing a name doesn't belong to any managers at all (i.e., it's a non existent name), is the job of higher-level name servers.
The prefix cache contains path prefixes (i.e., a number of path components) which map to specific managers. For example, the mount point of an entire file server would be a path prefix and the server itself would be the specific manager. Even deeper prefixes can be accommodated, since for each entry, a manager identifier and a manager-specific directory identifier is returned. In the example above, for a subtree of the mount point of a file server, the file server's ID and the i-node of the subtree would be returned.
If the cache lookup is a near miss, i.e., it identifies a directory which is not a local managerial directory, then a probe operation on that directory is initiated. This just multicasts the request on all those managers covering (i.e., having mappings within) the local administrative directory. When one of them responds, the entry is put into the cache and operations proceed normally. If the near miss points to a name outside the local administration, then a liaison server is probed for the right managerial directory identifier. There's never a cache miss, since there's always a cache entry for the root global name servers. If that entry becomes stale, the client may resort to probing all nearby managers and liaison servers (through scoped multicast), trying to locate a new root entry.
When a cache entry becomes stale, then the server it was obtained from is asked to wake up and smell the coffee, which gives feedback to the server that it's caching stale entries. Stale entry detection is done either through timeouts (i.e., looking for a manager that's not there any more), or through appropriate error codes (for example, a manager replying that it's not managing the named object any more; for that to work, requests from clients have to contain both the [managerID, directoryID] tuple and the name they think maps to that).
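The prefix-cache lookup described above is essentially a longest-prefix match that always bottoms out at the root entry (a sketch under that assumption; the cache contents and names below are illustrative, not from the paper):

```python
# Sketch of the prefix cache: map the longest cached path prefix to a
# (manager id, manager-specific directory id) pair; the remainder of the
# path is resolved by that manager. A root entry is always present.

def lookup(prefix_cache, path):
    # path is a tuple of components; try longer prefixes first
    for cut in range(len(path), 0, -1):
        entry = prefix_cache.get(path[:cut])
        if entry is not None:
            manager_id, dir_id = entry
            return manager_id, dir_id, path[cut:]
    raise KeyError("no root entry cached")  # shouldn't happen per the design

cache = {
    ("/",): ("root-ns", 0),
    ("/", "fs1"): ("fileserver-1", 2),            # mount point of a file server
    ("/", "fs1", "proj"): ("fileserver-1", 117),  # deeper prefix: i-node 117
}
assert lookup(cache, ("/", "fs1", "proj", "src"))[:2] == ("fileserver-1", 117)
assert lookup(cache, ("/", "other")) == ("root-ns", 0, ("other",))
```

A near miss is then just the case where the returned manager disclaims the remainder, triggering the probe described above.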
The advantages here are:
Administrational directory servers store, for each of their subtrees, sets of managers that cover the associated name space. Obviously, they are authoritative only for those prefixes that do not exist (i.e., those with a subtree name they don't have).
At the global level, designs presented for instance by [67] are used. Liaison servers become the front end to global level name servers, making it easier for clients to deal with fewer naming protocols.
To improve caching performance, cache inheritance is used. Parent processes bequeath their naming caches to their children.
For nearby objects, mapping only fails when the object manager itself fails or is inaccessible. Otherwise (i.e., for objects outside the local administrative domain), mapping can also fail if all the administrative directory servers for the target organization are down and the client does not have their member managers cached.
Queries (i.e., binding checks and listings) are resilient against most managerial failures (since the administrational servers can answer them). Complete resilience can be guaranteed only if all clients are caching all global information.
Security covers:
The reverse authentication problem in above is the most difficult, since a client doesn't even know which manager it wishes to hear from, only that it should be "the right manager" for the particular query. This is customarily solved using a capability-like scheme of certificates. A certificate authority issues time-limited certificates for each server stating that "server X is authorized to answer query Y for time T". A server returns such certificates along with its signed answers. When a certificate expires, the server needs to get a new one from the certificate authority.
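The certificate check reduces to validating issuer, binding and lifetime (a sketch; a trusted set stands in for the certificate authority's signature verification, which is my simplification, not the paper's mechanism):

```python
# Sketch of the time-limited authorization certificates described above:
# "server X is authorized to answer query Y until time T". Membership in
# ca_issued stands in for checking the CA's signature.

def answer_is_authorized(cert, server, query, now, ca_issued):
    if cert not in ca_issued:          # stands in for signature verification
        return False
    cert_server, cert_query, expires = cert
    return cert_server == server and cert_query == query and now < expires

cert = ("server-X", "query-Y", 100)
issued = {cert}
assert answer_is_authorized(cert, "server-X", "query-Y", now=50, ca_issued=issued)
assert not answer_is_authorized(cert, "server-X", "query-Y", now=150, ca_issued=issued)
assert not answer_is_authorized(cert, "server-Z", "query-Y", now=50, ca_issued=issued)
```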
It's mentioned that object managers may choose to switch directory identifiers fairly often, and administrational managers may choose to switch the identifiers of managers fairly often (say for load balancing). This means that invalidation of cache entries can be a fairly frequent occurrence, unlike what's implied in the discussion in 3.2.
[56] The ITU-T take on the Directory service.
A basic objective is to provide ``user-friendly naming'', whereby named objects have names suitable for citing by human users. This is not all-restrictive, i.e., some names might also be ``unfriendly''.
The other basic objective, naturally, is to contain ``name-to-address mappings'' for indexed objects which are dynamic. In other words, the served mappings may change over time.
This is not a database service. Queries are assumed to be much more frequent than updates. Rates of change are assumed to follow ``human standards'' as opposed to ``computer standards''. In other words, a mapping is assumed to change at rates no faster than those that a human could accomplish.
Eventual consistency is what counts here. Cases where both old and new versions of some information float transiently through the system are acceptable.
Except for access rights issues, results of a query are independent of the identity or location of the inquirer.
The data model is assumed to be structured as a tree (and the interfaces take that under consideration). However, other structures are not precluded from existing in the future.
The data model (as described in X.501, which is as yet not electronically available), is hierarchical, based on value-attribute nodes. A node normally follows a schema that defines its type (class). Administratively, the party responsible for a node defines its own class and nothing else. Of course, defining the class of a node also defines the possible classes of the node's descendants.
Aliasing is allowed. An alias node may point at a node somewhere else in the data hierarchy.
Nodes (i.e., entries in the directory), have distinguished, unambiguous names which are derived from the path connecting them to the root of the hierarchy.
Intermediate and leaf nodes in the entry tree are identical in structure (i.e., there's no special leaf node type).
The directory provides support for querying and modifying it. It provides responses to all types of requests. Responses are however specific to the request type.
The Directory offers optional support for mutual authentication of the user and the directory.
It is possible to qualify how a request is going to be handled:
Query requests include read, compare (for instance, for password checking), list, search and abandon (which tells the directory a previous request is no longer relevant or needed). Query requests may contain a context, that can be used when multiple versions of values are kept for some of the attributes of a sought entry. A good example of context is ``language''.
Modification requests include add (leaf), remove (leaf), modify entry (which is an atomic operation) and move (which moves an entire subtree into a new location). Contexts may also be included in the request.
Exceptional conditions include errors (for anything from security issues to schema rule violations or service controls) and referrals (which allow a directory access point to refer a user agent to another, more suitable access point - perhaps because the latter is closer to the requested data).
Requests are handled by directory service agents either recursively (in which case, if they don't personally know the answer, they'll find it from other agents and return it to the requester), or by referral (in which case, they'll point the requester to a more likely place for the posed request).
The Directory service supports redundancy, in the form of shadowing (as defined in X.525) or caching (which is considered an implementation issue, as yet unspecified).
The shadowing model assumes a single owner of specific data. Shadowing is read-only. All modification requests are forwarded to the data owner. Partial shadowing is also permitted (whereby, not all components within a subtree are shadowed; a filter predicate is used to define which parts of a subtree are to be shadowed).
Inconsistencies are tolerated. In the case of shadowing, inconsistencies are always temporary (although no time constraints are specified). In the case of caching, no declaration is made. The only stipulation mentioned for shadowing inconsistencies is that they never occur internally to an object, i.e., the fields of a single object all contain information from the same version of that object, whereas a shadowed copy of that object in a different directory agent might contain an (entire) previous version of it.
Replication is not transparent to the user. The user knows if a request was satisfied from a shadow copy of the authoritative data. The user is also allowed to specify that a request should be satisfied by the master directory server, for better accuracy, at the expense of performance.
Access to the Directory is allowed through the Directory Access Protocol. LDAP is a light-weight version of that protocol.
It is not clear how to deal with transient inconsistencies, when, for example, both old and new information floats on the system. Is there an upper bound to the duration of the transient state? How can we (eventually) tell the information we received was about to be purged from the system?
It's not clear what kind of filtering they intended to support in requests, as shown in 8.2.3.
It's not clear what the use of a context in add or remove requests might be, in 8.4.
It would be interesting and useful to allow aliasing inside an object (i.e., to give different names to the same attribute inside an object).
[82] X500 [56] names objects contained in a directory, using their path location (and assumes a hierarchical organization): the name of an object is the path to it from the root of the directory tree. This work tries to assign a better, more descriptive name to objects that's independent of their path names.
A new piece of information, that was not present in [56] is that ``first-level'' directory service agents, i.e., DSAs that handle subtrees of the DIT right underneath the root, can also handle the root itself. In other words, the 0-th level of the tree is handled by all 1-st level DSAs.
The problem lies in that the organization of the directory (who owns what pieces of it), affects how objects are named. If the directory is reorganized, all names have to change.
Descriptive names fulfill the following two criteria:
Any number of attribute-value pairs that unambiguously characterize an object (it is not required that the attribute-value pairs belong only to the named object) can constitute a descriptive name. For example, if we want to name person X, and we know no one shares X's office phone number Y, then [office-phone-number = Y] is a descriptive name for person X.
Resolving such a descriptive name can be very expensive. You'd have to search a lot of the entire tree just to locate one object that matches all of the attribute-value pairs in the name. You'd also have to keep looking, just in case there are more matches in the rest of the tree.
Stopping at the first match is not a good solution, since the results of a query would be dependent on where we start from. For example, if I start searching at my local domain, perhaps I'll find a local match before I find remote matches. Different directory user agents would find their local match first. This is unacceptable, since the resolution of a name in the directory should have a single semantics.
The solution preferred is one that restricts the tree space to be searched for a descriptive name. This is done by means of registered names. A registered name consists of attribute-value pairs that are protected from duplication outside the owner object. More specifically, if an object X has a registered name Y whose component is [Z=A], then no object within the subtree rooted at X's parent may contain Z=A as one of its attribute-value pairs.
Registered names are used to limit searches for descriptive names within a subtree much smaller than the entire directory. It is assumed that most directory system agents that own (i.e., manage) part of the directory will have at least registered the name of their local root (i.e., the root of the subtree they own). Therefore, if we find the longest registered name within a descriptive name, chances are that registered name lies at the root or within a local subtree of a directory system agent. The search can be more elaborate within that limited subtree, for the remaining name components, which are not registered.
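The two-phase resolution described above can be sketched like this: route directly on the registered part, then search exhaustively only within the subtree it identifies (the data structures and names below are mine, for illustration):

```python
# Sketch of search restriction via registered names: the registered
# component routes straight to a subtree root; only the remaining,
# unregistered attribute-value pairs need an exhaustive search of that
# (much smaller) subtree.

def search_subtree(node, wanted):
    # Exhaustive search: nodes whose attributes include all wanted pairs.
    hits = []
    if wanted.items() <= node["attrs"].items():
        hits.append(node)
    for child in node.get("children", []):
        hits.extend(search_subtree(child, wanted))
    return hits

def resolve(registered_index, registered_part, descriptive_rest):
    subtree = registered_index[registered_part]    # direct route, no search
    return search_subtree(subtree, descriptive_rest)

office = {"attrs": {"office-phone": "555-1234", "name": "X"}, "children": []}
dept = {"attrs": {"org": "cs"}, "children": [office]}
index = {("org", "cs"): dept}                      # registered name -> subtree
assert resolve(index, ("org", "cs"), {"office-phone": "555-1234"}) == [office]
```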
Is it clear that subsequent reorganizations of the directory might not render currently unambiguous descriptive names ambiguous?
The designation of an attribute as a ``naming attribute'' is rather arbitrary. As shown in more recent work (for example, [1]), any kind of qualifier might be useful in a name.
It is claimed that descriptive name resolution is no less efficient than distinguished name resolution. However, it seems to me that the constants might be very different despite the asymptotic similarity. Distinguished name resolution involves one traversal of the path from top to bottom. Descriptive name resolution requires one traversal of the registered path from top to bottom, and then an exhaustive search of the found subtree (unless very restrictive heuristics are used to guide the local search).
[92] Another descriptive naming service, like that in [82] or [1]. It's a confederate approach, in that clients are solely responsible for contacting as many name servers as they need; the name servers themselves don't talk to each other.
An interesting characteristic about Profile is that names are allowed to contain limited regular expressions instead of merely literal strings. So, names like ``John|[J[ohn]] Doe'' can be attributed to an object, to mean any of ``John Doe'', ``J Doe'', ``John'', ``Doe''.
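A toy model of such names: treat each token as a set of alternatives that may be optional, and enumerate the literal forms a single registered name covers (the concrete representation below is my simplification, not Profile's actual syntax):

```python
from itertools import product

# Toy expansion of Profile-style limited regular expressions: each token
# of a name is (alternatives, optional); expand() enumerates the literal
# forms the name matches, dropping the empty form.

def expand(tokens):
    choices = []
    for alternatives, optional in tokens:
        choices.append(list(alternatives) + ([""] if optional else []))
    forms = set()
    for combo in product(*choices):
        form = " ".join(part for part in combo if part)
        if form:
            forms.add(form)
    return forms

# First name "John" or "J" (optional), surname "Doe" (required):
name = [(["John", "J"], True), (["Doe"], False)]
assert expand(name) == {"John Doe", "J Doe", "Doe"}
```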
Unlike most other naming systems, Profile does not enforce a hierarchy. The sole instrument of low-level organization is the attribute-value pair. The named objects (principals, in the terminology of the paper) have no relation to each other except for that indicated by an arbitrary attribute.
The use of multiple name servers in the ``confederacy'' is effected at the user agent. A resolution operation contains the name to be resolved, the search path and a bunch of operational flags.
The name is just a set of attributes. It can contain additional alternation or optionality (is this really a word?) operators. It may also contain the interpret function that should be used to match the given name.
The search path is an ordered list of name server addresses known to the user a priori. This can also be replaced using the indirection operator ``@'' by a list of names naming name servers (lovely sentence...). In other words, instead of saying ``use the name server at address X'', you can say ``use the name server by name of 'blah blah blah' ''. The 'blah blah blah' part is also a Profile name, and the named principal must have an attribute pointing to the network address of a name server.
The operational flags determine whether the first result or all results should be returned, and whether the search should proceed outside the local name server automatically or not.
Name servers can maintain and return referrals to other name servers. Other than that, they have no organization amongst themselves. Any kind of name server hopping is performed solely by the user agent.
The modification API is pretty coarse: users change a public text file and the database is reconstructed (compiled) from it at regular intervals. There's also a protected partition for the portions of the database that should be administered in a more restricted manner. The protected partition contains most complete attributes (e.g., everybody's user name etc.) whereas the unprotected partition contains most incomplete attributes (i.e., things that people want to register for themselves).
There is no defined protection scheme other than that provided by the systems on which a nameserver relies, e.g., the filesystem.
Using an attribute-based naming scheme has the following pitfalls associated with it:
Attributes are split into two categories: complete and incomplete. A complete attribute is defined for all principals in a namespace. For example, if all principals have a ``name'' attribute, then ``name'' is a complete attribute. This is similar to the closed world assumption in that, if we know anything at all about a property then we know all there is to know about that property.
Incomplete attributes are considered less authoritative as name components than complete ones. An example of how an incomplete attribute can be a problem is this: Consider principals P and Q and attributes a and b. If a is registered for both P and Q, but b is only registered for P, then a is complete and b is incomplete. Now, if a name contains both a and b, then it is intuitively wrong not to return Q just because b is not registered for it. To the extent of what is known about Q, the name describes it just as well as it does P.
In the Profile world, matching attribute-based names can be done either in an exact or a partial fashion:
A name (attribute set) A matches another B exactly if A is fully contained within B.
A name A matches another B partially if the intersection of A and B is not empty.
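The two matching modes above reduce to set operations on the attribute-value pairs of a name (a sketch; Profile's interpret functions layer the completeness rules on top of these primitives):

```python
# Exact and partial matching of attribute-based names as set operations.

def matches_exactly(a, b):
    # every attribute-value pair of name A appears in name B
    return set(a) <= set(b)

def matches_partially(a, b):
    # names A and B share at least one attribute-value pair
    return bool(set(a) & set(b))

principal = {("name", "John Doe"), ("dept", "CS")}
assert matches_exactly({("name", "John Doe")}, principal)
assert not matches_exactly({("name", "John Doe"), ("room", "42")}, principal)
assert matches_partially({("name", "John Doe"), ("room", "42")}, principal)
assert not matches_partially({("room", "42")}, principal)
```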
Using the concepts of complete/incomplete attributes and exact/partial name matching from above, Profile defines a set of interpret functions, aimed at fulfilling different levels of accuracy or selectiveness. As mentioned above, for example, incomplete attributes are considered less useful to name a principal than complete attributes.
The paper describes a way to bake Profile into a Unix shell by using translation mechanisms that take a standard command (like cp or mail) and yield a Profile-enabled command.
This paper is in serious need of a translation. It doesn't look like anyone who didn't work on it ever read it and helped shape its language. The concepts are not always clearly described, the examples are less helpful than you'd expect, and things enumerated in the text are never given unambiguous numbers.
[17] Grapevine is a distributed messaging system that provides service location, naming and authentication. It contains an early presentation of a distributed email system. The system described bears close resemblance to how Sendmail and SMTP work.
A basic concept presented here is that of content independence. The transport mechanism, whether SMTP or the Grapevine transport protocol, doesn't concern itself with interpreting the content of messages. End-point applications deal with the interpretation of the payload of a message.
The level of security described in the paper is weak. A better version was in the works.
Users are registered in a database. User entries can be grouped within group entries. The entry for a user contains an authenticator, a list of mailbox locations and an address for synchronous communication (mainly used for server entries - not for people). A user's multiple inboxes are independent from each other and do not support the abstraction of a single inbox; a user must check all of them for new messages.
Access control allows an individual to be an owner of a group (in which case he may change the entry at will) or a friend of a group (in which case he may insert or remove himself from the group).
Only ``signed'' messages may be sent (i.e., the sender must authenticate himself to Grapevine, before he can send a message). Message polling is not authenticated. Successful message retrieval may cause messages to be erased from the protected inbox or retained until later.
Once a message is constructed, the message server authenticates the sender, flattens out the recipients list (i.e., recursively expands all groups contained therein) and validates it (returns error conditions for non-existent recipients). It then receives the message body and dismisses the user agent. The recipients are then split into ``steering lists'', i.e., lists of recipients per destination message server. The message is shipped to each message server once, accompanied by the corresponding steering list. Transiently undeliverable messages are queued for later delivery. Race conditions (such as that occurring when a validated user is removed by the time the message is ready to be delivered) are dealt with using appropriate error messages, sent to the original sender. Messages are postmarked with the time at the first-hop message server. This postmark can be used as a hint for later sorting.
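Building the steering lists amounts to grouping the flattened recipient list by destination message server, so the body is shipped to each server exactly once (a sketch; the mapping and names are illustrative):

```python
# Sketch of ``steering lists'': group recipients by the message server
# holding their inbox; the message body travels once per server, with
# the corresponding per-server recipient list attached.

def steering_lists(recipients, inbox_server_of):
    lists = {}
    for recipient in recipients:
        server = inbox_server_of[recipient]
        lists.setdefault(server, []).append(recipient)
    return lists

servers = {"alice": "srv1", "bob": "srv2", "carol": "srv1"}
assert steering_lists(["alice", "bob", "carol"], servers) == {
    "srv1": ["alice", "carol"],
    "srv2": ["bob"],
}
```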
Messages are shared when delivered. When a message is addressed (and subsequently delivered) to multiple users with inboxes at the same message server, then the message body is stored on disk only once.
The name space is split in two levels: the top level denotes the name registry and the bottom level denotes the person within that registry. Registries are distributed and replicated to multiple name servers. Replication is at the grain of a registry. Updates can be submitted to any server containing a registry. That server must propagate the update to the rest of the name servers serving the same registry.
Servers talk to each other (for example, to propagate updates) using the same messaging protocol. So, a server with updated data will send a message to those servers it knows should also hold that data. Servers check their messages every 30 seconds. Deleted items are retained along with their deletion timestamp, to deal with cases where an addition and a subsequent deletion arrive out of order. Every 14 days, deleted items are purged. Nightly, registration servers perform sanity checks by comparing notes with other servers serving the same registries.
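The deletion handling above is essentially a tombstone scheme; a minimal sketch (the registry representation and function names are mine, not Grapevine's):

```python
# Tombstone sketch: deletions are kept as timestamped markers so that an
# add with an earlier timestamp, arriving out of order, cannot resurrect
# a deleted entry; tombstones older than the purge window are collected.

PURGE_AFTER = 14  # days, per the paper

def apply(registry, op, key, ts):
    current = registry.get(key)
    if current is not None and ts <= current["ts"]:
        return                      # stale operation arriving out of order
    registry[key] = {"deleted": op == "delete", "ts": ts}

def purge(registry, now):
    for key in [k for k, e in registry.items()
                if e["deleted"] and now - e["ts"] >= PURGE_AFTER]:
        del registry[key]

reg = {}
apply(reg, "delete", "bob", ts=5)
apply(reg, "add", "bob", ts=3)      # out of order: must not resurrect bob
assert reg["bob"]["deleted"]
purge(reg, now=20)                  # 20 - 5 >= 14: tombstone collected
assert "bob" not in reg
```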
The metadata of the registration database are replicated everywhere. The metadata registry contains entries for all registration servers in service (as individuals) and for all registries (as groups of registration servers). To bootstrap the protocol with a registration server, two methods are used: an external naming service that resolves ``GrapevineRServer'' to a set of registration servers and a local broadcast service query.
Resource location works through the registry. Services have their own groups in some registry. Servers are members of the service they perform. A client retrieves the membership of a service and then picks that server which is best, according to some metric.
Caching is used to store hints about where an entry's preferred inbox lives. If a cache entry is outdated (for example, the preferred inbox has been removed), it returns a cache flush notification. No other explicit invalidation takes place.
Some of the administrative functions (for instance, the membership function) sound inherently extremely expensive (e.g. determining group membership in a group closure).
A general characteristic of Grapevine is that it understands the semantics of the registered entries' structure (for example, it knows what a ``connect site'' is, in page 263). Marrying the registry with the transport module in such a way means that either one (or both) will have trouble evolving. For example, if the transport changes to a different protocol that uses more than one ``connect sites'', the registry API will have to change to accommodate the upgrade.
Authenticating users before messages are accepted implies that there's a trust base among multiple organizations: If organization A accepts a message from user B, does that mean that organization C should deliver the message to one of its users?
Assuming that a human will serialize modifications to a registry is a rather outdated idea. Global distributed registries cannot be assumed to be directly serialized by humans any more.
[103] Experiences and lessons learned on the system described in [17].
An interesting point is that Grapevine had an upper limit on how scalable it should be: 30 servers and 10000 users tops. It is useful to know these numbers before the design phase.
A significant scaling problem faced the system due to global distribution lists, whose membership roughly remained proportional to the user population. Since groups are expanded fully at a message server before a group message is dispatched, messages addressed to global groups could result in excessive latency. One way to deal with that would be to add greater indirection (i.e., intermediate lists within the list), and avoid total expansion of group memberships. For example, a global list ``news.global'' could be further partitioned into ``news.west'', ``news.east'', etc. In this way, message delivery to ``news.global'' would be offloaded to message servers for the ``west'' and ``east'' (and so on) registries, parallelizing the task.
Another interesting point has to do with the concept of a moderator in distribution lists. Regardless of what the system can or cannot do fast enough, people start getting overwhelmed when, at a point, they start receiving large numbers of messages.
Duplication of servers (message servers in this case) is not only a solution for overloading, but also for availability. It's good to have a local server, dealing with unreliable connections, instead of making the lack of reliability visible to end users. Registry replica location and distribution is dependent on five factors, according to the authors:
A detailed description of the administrative facilities of the system is provided in the paper.
There are several problems arising from transparent replication. The paper (in section 6) puts together a list of bad design decisions that led to problems with regard to replication transparency. Below I'm abstracting the design decisions into a list of things not to do in distributed systems design.
There are several problems arising from unexpected uses of Grapevine. The paper (in section 7) puts together a list of encountered problems. Below I'm abstracting the lessons learned.
Here are the lessons learned from the authors' experience in the areas of reliability and robustness:
The solution to the global distribution list problem is inadequate for a more scalable system (i.e., one that doesn't have such a low upper scaling limit as Grapevine), since it relies on administrators figuring out, by themselves, what a good partitioning is. Such configuration tasks are better done automatically, especially if a system is to grow and scale gracefully.
(In section 7) Using expired cached credentials in the absence of connectivity is extremely disconcerting. How do we break a file server's security? We yank the network cable out and then we brute-force its capabilities.
It is a common implied stance in this paper to assume that possible bad things that haven't happened so far won't happen in the future. Invariably, the bad things in question are pathological conditions caused by excessive workloads. It is quite naive to assume such conditions won't occur once the system is fully deployed, just because they haven't in a limited deployment. Worse, this stance is never spelled out in the paper.
[83] This is a single-page article on naming systems (good or bad ones) and why it's very important for them to be well designed.
The most important distinction made is between identification and authentication. The common practice (mainly off-line) of using an identifier as an authenticator is extremely dangerous and has been proven harmful. Names, even private names (such as Social Security Numbers in the USA), are not as good as passwords: names are customarily used as references (pointers); if one of the places where such a private name is used as a reference is lax about keeping the reference secret, this entire fragile authentication scheme falls apart.
[33] This is an internet draft on the protocol requirements of a common-base personal presence and instant messaging system.
It covers (at the requirement level) such issues as access control of presence information, notification infrastructure, security affordances, and so on. This is still a work in progress as of late 1999.
[34] This is a common model for the IETF IMPP working group's discussion on Presence and Instant Messaging. It defines such things as the wonderful new term Presentity (i.e., Presence Entity), Watcher (an agent that looks for changes in a presentity), Subscriber (a Watcher that waits for future changes) and Fetcher (a Watcher that just extracts the current value of a presentity).
[58] The Identity Infrastructure presented here is an attempt to deal with the personal communication problem, similar to [5,101] and others.
What's presented here mainly is the communications protocols between:
The external IDIP is spoken among identity servers. An identity server speaks the internal, downward IDIP with an application. Similarly, an application speaks the internal, upward IDIP with an identity server.
Applications register themselves with their user's identity server, so that when a request arrives from another identity, this server can represent its owner's current connectivity state.
No support is provided for content conversion, only for content negotiation. Location privacy is preserved, although it isn't stated explicitly in the draft: the draft assumes that identity servers merely bring end applications in contact with each other. However, the draft seems open enough to allow proxies to be used as intermediaries to preserve users' privacy.
No mention of a personal naming scheme. Email-like identifiers are assumed.
[61] This is a high-level vision white-paper on the arrival of ``identity services'' as the next killer app.
Much like [101], this advocates the institution of the ``extended individual'', or on-line identity, which represents the person in the networked world. This identity encapsulates information about the person (connectivity status, names, etc.), functionality of the networked person (privacy behavior, for instance) and membership information, or what the paper calls a ``channel'' (i.e., the addressability of a person by his interests, capabilities and so on).
[37] This is an overview of the current available modes of personal communication (on the Internet).
It goes over communication media both in the real-time category (instant messaging, text chat, internet phone, and collaborative/whiteboard applications) and in the asynchronous category (email, newsgroups, and shared spaces).
Then the author proceeds to present the application contexts within which these communications media are combined into personal communications applications. Buddy lists are systems where a person subscribes to his circle's members; when a member ``comes on-line'', the subscriber is notified. This is the general framework within which on-line presence work is being conducted (see also [34]). Other such contexts include internet telephony, virtual communities (e.g., multi-player gaming, real-time newsgroups/topicgroups), and virtual worlds.
The main barrier preventing these applications from becoming predominant is standardization, especially in the Instant Messaging/Presence arena: there are too many proprietary protocols striving to secure a market share, completely precluding interoperability among all users' software.
[30] This is a corporate strategy paper, which addresses the issue of identity management within the limited scope of the Enterprise.
[20] This is an attempt at designing a logic for reasoning about authentication mechanisms.
The main constructs on which the logic is based are as follows:
Using these constructs and some common-sense rules about authentication, a set of inference rules has been put together:
Some additional rules dealing with instantiation of variables and other inference tools are also included.
To use the logic described above to derive facts about authentication protocols, a transformation of those protocols to the vocabulary of the logic must first be performed. The transformation was arbitrary and interpretational in this paper, i.e., there was no formal method to perform it.
Given a protocol, once its idealized protocol counterpart is derived (i.e., its equivalent protocol in the vocabulary of the logic described above), the inference rules may be applied to it, to derive a concluding set of facts about the protocol: a set of opening assumptions are stated, which are augmented by inferred facts after each step of the idealized protocol. The goal of this process is to prove in the end that both parties of an authentication exchange have a common shared key K for subsequent communication (i.e., A believes that K is a good key for A and B, and B believes likewise). Sometimes, a stronger goal is necessary (or possible), namely that both parties know that the other party has the same information about the exchange as they do, i.e., A believes that B believes that K is a good key, and similarly, B believes that A believes it.
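The flavor of these rules and goals can be sketched in BAN-style notation (the typography here is my reconstruction, not the paper's exact layout):

```latex
% Message-meaning rule: if A believes K is a good key for A and B,
% and A sees X encrypted under K, then A believes B once said X.
\frac{A \mid\equiv A \overset{K}{\leftrightarrow} B
      \qquad A \triangleleft \{X\}_K}
     {A \mid\equiv B \mid\sim X}

% First-level goal of the analysis:
A \mid\equiv A \overset{K}{\leftrightarrow} B
\qquad
B \mid\equiv A \overset{K}{\leftrightarrow} B

% Stronger, second-level goal:
A \mid\equiv B \mid\equiv A \overset{K}{\leftrightarrow} B
\qquad
B \mid\equiv A \mid\equiv A \overset{K}{\leftrightarrow} B
```

Here $\mid\equiv$ is ``believes'', $\triangleleft$ is ``sees'', $\mid\sim$ is ``once said'', and $A \overset{K}{\leftrightarrow} B$ says K is a good key for A and B.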
The authors proceed to apply their logical analysis to:
The constructs and the mechanics of the paper are pretty interesting, although lots of pieces are missing, most notably the transformation mechanism into idealized form.
[68] This is an extremely dense paper describing a detailed theory for authentication and its practical applications.
Principals are the main acting entities in this authentication theory. Principals can request actions. Actions are performed on objects (resources such as files, devices, processes, etc.). The objective of this work is to provide a language in which rules may be defined, according to which requests by a principal on an object are granted or not. Rules are attached to the objects they govern. A reference monitor decides whether to allow a specific action, taking under consideration the involved objects with their associated access rules, principals and requests.
The job of the reference monitor is non-trivial, especially in distributed systems, since it must make logical inferences leading it to a decision. For example, principals almost never directly attempt an action on an object themselves, since principals are humans and objects are bits; most usually, a principal utilizes a channel, such as a network connection, a pipe or a function call (e.g., to access object X, Alice, a principal, types on a keyboard, which is attached to a system K, connected through a TCP pipe P to another system L, where object X lives. The reference monitor on system L sees an action attempted by a process Z on object X.) Channels are principals, too; however, they're more than that, because they can request things directly inside a computer, whereas human principals can't.
The reference monitor must decide whether the immediate channel attempting an action is trusted at least as much as the principal mentioned in the access rule for an object. In the example above, if X specifies in a rule that it will accept all actions from Alice, then the reference monitor must decide whether process Z is at least as trustworthy as principal Alice. To do that, it has to reason about Z, L, P, K and Alice.
Trust can be bestowed or transferred by a principal on another principal. For example, I may trust my brother to do anything that I can do. To prove my trust, I give my brother a legally binding document, such as a power of attorney. This provable trust is represented in the authentication theory described in this paper by the ``speaks for'' relation. When A speaks for B, A can do at least as much as B can do (i.e., if a request should be granted for B, then it should also be granted for A). Since only channels can say things directly inside a computer, invariably human principals must have channels that speak for them, so that they can communicate their requests through a computer system.
In a distributed system, where many components from different devices conspire to see a request through to its destination, it is inevitable to place different levels of trust in the reliability or the loyalty of every single component. A well-designed authentication system relies for its security only on few trusted components, its Trusted Computing Base, or TCB; in other words, no component outside the TCB should be allowed to affect the security of the system by misbehaving. Furthermore, the system is fail-secure when it is conservative in the face of failure: when it gives the wrong answer, it always denies a request it should have granted, never the other way around.
To connect principals to other principals, define statements, etc., the following constructs are possible:
Some basic, intuitive axioms or inference rules using the above constructs are defined in the paper, including modus ponens, distribution of ``says'' over conjunction, monotonicity of the ``speaks for'' relation, and so on.
One of the primary tools in this theory is the ability to express delegation of authority. A principal can utter a statement that allows another principal (most commonly, a channel) to speak for it. The main handoff rule states that if A says that B speaks for A, then B does indeed speak for A.
Encrypted channels are very powerful. They allow the untrusted, arbitrary handling of messages, without compromising the contents' secrecy or integrity, through public key cryptography. Even if existing public-key cryptographic algorithms turn out to be insecure or too expensive, shared-key cryptography can be used to simulate public-key functionality. Therefore, even with a small TCB, a sane authentication model can be built using encryption.
Given that a computer node is trusted as a whole (including its operating system and administration), the first step to building secure end-to-end channels, for use in the authentication scheme described, is to build secure channels from node to node. A single node-to-node secure channel between any two nodes is sufficient, since the theory allows multiplexing over a single channel. The node-to-node scheme must allow rekeying (since encryption keys must change regularly).
The authors present the following mechanism to establish secure node-to-node channels (from A's view point; B does the symmetric). It is assumed that both A and B have their own node key-pair, which has been assigned to the node by a certification authority, and their own boot key pair, computed when the node boots up.
Up to this point, A and B can communicate using K, up until they have to change keys, or one of them dies. However, to be able to talk about this channel, they have to name the key they're using (so that, for example, innocent or malicious replays may be identified). To create an identifier for this newly created key, they take the next step.
After this exchange, A knows that K speaks for Kb, i.e., everything that arrives encrypted by K can be safely trusted as much as anything that could arrive encrypted by Kb-1. Since Ka and Kb are not used by the actual channel, they can remain in use for much longer (as long as the systems stay up, in this case) and still, the channel bears their trust.
To extend the chain of trust a little bit further, we need to show that someone ``meaningful'' sits behind Ka. That can be done by connecting A's node key to Ka. Remember that Kan bears a certificate stating that Kan speaks for the ``meaningful principal'' A. Now, if node A can present something like ``Kan says that Ka speaks for Kan'' (for example, sign Ka with Kan-1), then Ka can be trusted as much as Kan, or transitively, A.
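The full chain of trust can be written out in the paper's ``speaks for'' notation (a sketch of my own, reading $\Rightarrow$ as ``speaks for''):

```latex
% Channel key K speaks for the boot key, the boot key for the node key
% (via the signed certificate), and the node key for the principal
% (via the certification authority):
K \Rightarrow K_a, \qquad
K_a \Rightarrow K_{an}, \qquad
K_{an} \Rightarrow A

% Hence, since the speaks-for relation is transitive:
K \;\Rightarrow\; A
```

A reference monitor that sees a request arriving under K can therefore grant anything it would grant to A directly.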
[77] How the Domain Name System was born.
In the beginning, there was the HOSTS.TXT file. This was a file whose lines contained an association between a name and a network address. This file was centrally maintained and distributed to all hosts on the network periodically. This worked fine while there was a single, humongous host per organization; when, however, host numbers started paralleling user numbers, HOSTS.TXT became a huge administrative bottleneck: centralization was incompatible with the higher rates of change.
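For flavor, the whole pre-DNS scheme amounts to a flat table loaded from one file; a minimal parser sketch (the address and host names below are illustrative, and the real HOSTS.TXT format carried more fields than this):

```python
# Minimal sketch: parse HOSTS.TXT-style "address name" lines into a map.
# Every host on the network periodically fetched the whole file and did
# lookups against its local copy -- hence the centralization bottleneck.

def parse_hosts(text):
    table = {}
    for line in text.splitlines():
        line = line.split("#")[0].strip()   # drop comments and blanks
        if not line:
            continue
        addr, name = line.split()[:2]
        table[name.upper()] = addr          # names treated case-insensitively
    return table

table = parse_hosts("""
# master file, redistributed periodically to every host
10.0.0.73  SRI-NIC
10.2.0.52  USC-ISI
""")
print(table["SRI-NIC"])  # -> 10.0.0.73
```

Every change to any host anywhere meant re-editing and redistributing this single file, which is exactly what stopped scaling.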
The main constraints that the DNS design strove to uphold were:
[59] One interesting aspect here is that the authors try to motivate lifelong numbering, yet no actual motivation for it is given; instead, a viable way to obtain lifelong numbering is provided.
Life long numbering is provided by a hierarchical database scheme. Higher level databases store pointers to lower level databases. Leaf databases store the actual profile. Every database must contain a pointer towards all profiles included in its subtree. The top level database must contain pointers towards all existing profiles.
Queries start from the database of the caller going upward until a pointer to the callee is located (This happens for the first time in the database that is the least common ancestor of the caller's and the callee's databases); then the search proceeds downward.
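The up-then-down traversal can be sketched as follows (database names and structure are hypothetical; the key property is that every ancestor of the callee's leaf database holds a pointer to the callee's profile, so the climb stops at the least common ancestor at the latest):

```python
# Sketch of the hierarchical lookup: climb from the caller's leaf
# database until a database holding a pointer to the callee is reached,
# then descend along the pointers to the callee's leaf.

PARENT = {"us.west.pa": "us.west", "us.west": "us",
          "us.east.ny": "us.east", "us.east": "us", "us": None}

def is_ancestor(a, b):
    """True if database a is b or an ancestor of b in the hierarchy."""
    while b is not None:
        if b == a:
            return True
        b = PARENT[b]
    return False

def lookup(caller_db, callee_db):
    """Return the path of databases visited to locate callee_db."""
    path = [caller_db]
    node = caller_db
    # upward phase: stop at the first database that covers the callee
    while not is_ancestor(node, callee_db):
        node = PARENT[node]
        path.append(node)
    # downward phase: follow pointers from there to the callee's leaf
    down = []
    n = callee_db
    while n != node:
        down.append(n)
        n = PARENT[n]
    path.extend(reversed(down))
    return path

print(lookup("us.west.pa", "us.east.ny"))
```

For the example call, the turning point is ``us'', the least common ancestor of the two leaves.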
HiPER uses this hierarchical architecture along with replication to both provide lifelong numbering and minimize query and update latencies. The location of replicas is determined in a way similar to that described in [Shivakumar 94], modified for hierarchies instead of graphs. The number of replicas is determined from the available database space allocated to each of the system's users. Additionally, the more replicas there are in the system, the higher the update cost. Therefore, the number of replicas per person should be limited.
Replicas are placed in the hierarchy to minimize the update cost while minimizing the lookup cost. A useful metric for calculating those costs is one for calling frequencies between the sites of two people participating in a communication. This is called the local call to mobility ratio. It is defined as the number of calls from a user to a zone over the number of moves of that user in a fixed time period. When this number is high, it makes sense for the calling zone to replicate the profile locally. This ratio can be compared to a threshold to determine whether that user's profile will be replicated in the called zone.
Since ancestor databases contain the covered area of all their descendant databases, the local call to mobility ratio is always higher than the threshold in the ancestors when it is higher than the threshold in a leaf. So, naively, it would seem beneficial to replicate all profiles at as high level a database as possible. Unfortunately, higher level databases service more profiles than lower level databases which means that the update costs incurred could overwhelm them. A tradeoff has to be maintained between overwhelming higher level databases with updates and populating the hierarchy with ineffective profile replicas.
For high enough values of the local call to mobility ratio, replication can be considered justified. Similarly, for low enough values of the ratio, replication can be considered unjustifiable. For the values in between, replicas could be allocated in the most beneficial way, i.e., so that only those with the highest ratio among the non-clearcut databases get a replica. The upper and lower threshold values are picked to unambiguously make replication good or bad.
The algorithm described in the paper first assigns replicas to as many of the databases with a ratio above the upper threshold as possible, starting from the bottom of the hierarchy, until all such databases are given a replica or the maximum number of replicas has been assigned. In the second phase, the rest of the unassigned replicas are allocated to databases with as high a ratio as possible, starting from the top.
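The two-phase assignment can be sketched like this (database names, depths and ratio values are made up, and the exact tie-breaking in phase two is my guess rather than the paper's):

```python
# Sketch of the two-phase replica assignment. Each database is a
# (name, depth, lcmr) triple, where a larger depth means lower in the
# hierarchy and lcmr is its local call to mobility ratio.

def assign_replicas(dbs, upper, max_replicas):
    chosen = []
    # Phase 1: bottom-up (deepest first) among databases whose ratio
    # clearly justifies replication (above the upper threshold).
    hot = [d for d in dbs if d[2] > upper]
    for name, depth, lcmr in sorted(hot, key=lambda d: -d[1]):
        if len(chosen) >= max_replicas:
            return chosen
        chosen.append(name)
    # Phase 2: leftover replicas go to the highest-ratio remaining
    # databases, starting from the top of the hierarchy.
    rest = [d for d in dbs if d[0] not in chosen]
    for name, depth, lcmr in sorted(rest, key=lambda d: (d[1], -d[2])):
        if len(chosen) >= max_replicas:
            break
        chosen.append(name)
    return chosen

dbs = [("root", 0, 0.4), ("west", 1, 0.9), ("pa", 2, 1.5), ("ny", 2, 0.6)]
print(assign_replicas(dbs, upper=0.8, max_replicas=3))
```

With these numbers, ``pa'' and ``west'' are picked bottom-up in phase one, and the last replica goes to ``root'' in phase two.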
Another contribution of this paper is a fairly elaborate framework for simulating location management techniques. The simulator takes into consideration geographical features like freeways or local population sizes when implementing user mobility and calling patterns, as well as network communication between sites.
The actual simulation setup in section 4 of the paper is a very interesting combination of statistical national averages with local counters in a simulation.
It is unclear why this algorithm was picked. Especially, it is unclear why exactly all of the second phase replicas are assigned top down, or why the split between bottom up and top down assignments is set to be the upper ratio threshold.
One big hole in the paper is the lack of proof of scalability of HiPER compared to other schemes. The authors admit that all their simulated systems are nothing special given current hardware and software.
[32] A major difference from current trends in this paper is its pre-web approach to Internet-wide traffic. They say that the predominant source of request/response traffic on the Internet comes from name server traffic. This is not true any more in post-web times, in all probability.
The basic issue dealt with here is how DNS was flawed, how that led to performance problems, and how the problems could be solved.
One of the implementation and deployment issues observed is the predominance of collocated caching-only servers with client resolvers. The absence of centralized caches within a site, that could share name caches among multiple local resolvers, leads to multiple queries issued for the same result from the same site.
The paper includes a technique for calculating the packet loss rate on a single ethernet, given the numbers of unmatched requests and responses for local queries.
It seems that the paper is less applicable to directory services other than the DNS, especially since it tends to focus on DNS bugs and how they have been manifest (for example, some resolvers' mistake of adding the local domain name to all queries, assuming client programs always gave them unqualified host names).
An interesting contribution of the paper is a classification scheme for server errors (bugs, incorrect assumptions, etc.). One of the conclusions drawn is that negative caching is in general overstated, mainly because it is intended as a safety valve for faulty implementations of servers or resolvers.
An interesting data point is the amount of replication per directory unit in a large set of servers. The authors use these data to support the claim that performance degradation ``leaks'' unfairly between multiple domains serviced by the same server.
The authors claim that active error detection is more important than widespread deployment of bug free implementations. The reason is that bugs may resurface, despite the efforts of programmers. Active error detection can limit the impact of current or future bugs. One proposed technique is "playing dead". According to it, a name server can occasionally play dead for a particular queried name and record the requester's reaction. This can show problems like untuned retransmission intervals, for example. Another approach would be to keep track of how many times the same query has been submitted by the same client host. Erratic hosts can be identified in this way and notified of their errant ways.
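The ``playing dead'' probe could be sketched as follows (a hypothetical illustration of the idea, not the paper's implementation; names, rates and the one-second interval are invented):

```python
# Sketch of active error detection by "playing dead": occasionally
# withhold the answer to a query and record when the client retries,
# to spot untuned retransmission intervals.

import random

class ProbingServer:
    def __init__(self, zone, probe_rate=0.01, seed=None):
        self.zone = zone               # name -> address map
        self.probe_rate = probe_rate   # fraction of queries to ignore
        self.retries = {}              # client -> retry timestamps
        self.rng = random.Random(seed)

    def handle(self, client, name, now):
        if self.rng.random() < self.probe_rate:
            # play dead: give no answer, just log the attempt time
            self.retries.setdefault(client, []).append(now)
            return None
        return self.zone.get(name)

    def suspicious(self, client, min_interval=1.0):
        """Flag clients whose retransmission gaps look untuned."""
        ts = sorted(self.retries.get(client, []))
        gaps = [b - a for a, b in zip(ts, ts[1:])]
        return any(g < min_interval for g in gaps)

srv = ProbingServer({"host.example": "10.1.2.3"}, probe_rate=1.0, seed=1)
for t in (0.0, 0.1, 0.2):          # an erratic client hammering retries
    srv.handle("client-a", "host.example", t)
print(srv.suspicious("client-a"))  # -> True
```

The second proposed technique, counting identical repeated queries per client host, would slot into the same `handle` path.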
[46] This scheme is intended to provide user-friendly naming within a larger directory service, such as X.500. The distinguished name of an entry is separate from the user-friendly name for the same entry. The distinguished name is a unique identifier, whereas the user-friendly name may omit features of the entry that don't interest the requester. The scheme is heavily dependent on X.500, which is in turn heavily dependent on organizational structure. User-friendly names are reduced to their structure using defaulting rules; for instance, when the inferred country is the USA, a two-letter component preceding it is assumed to be a state.
There is a long list of heuristics used to resolve a user-friendly name to a distinguished name, assuming that the user-friendly name follows certain conventions, such as hierarchical ordering. Perhaps in current technology that deals with searches of large distributed data stores, such heuristics wouldn't necessarily be of much use.
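One such defaulting heuristic could be sketched as follows (a toy illustration of the USA/state rule above; the attribute abbreviations follow X.500 convention, but the specific rules and defaults here are my own):

```python
# Sketch of defaulting rules for structuring a user-friendly name:
# components are assumed hierarchically ordered, most specific first.

def structure_name(components):
    """Map a user-friendly name's components to attribute guesses."""
    attrs = []
    comps = list(components)
    if comps and comps[-1].upper() in ("USA", "US"):
        attrs.append(("c", "US"))                   # country
        comps.pop()
        if comps and len(comps[-1]) == 2 and comps[-1].isalpha():
            attrs.append(("st", comps.pop().upper()))  # inferred state
    if comps:
        attrs.append(("o", comps.pop()))            # organization
    attrs.extend(("cn", c) for c in comps)          # whatever is left
    return attrs

print(structure_name(["Petros Maniatis", "Stanford", "CA", "USA"]))
```

The fragility is visible even in the sketch: the heuristics only work when the conventions (ordering, two-letter states) actually hold.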
One important advantage of this scheme is that, similarly to X.500, it doesn't have to invent a new name for an entry. Instead it uses the commonly used name for an entry. Unfortunately, some entries may share the same "commonly used name", for example roles of the same person, or people with the same name. To resolve such ambiguities, much longer names would have to be devised.
It seems that this scheme is merely a glorified method for resolving a local alias to the globally unique name for an entry. The resolution takes place locally and might come across ambiguities when transferred to a third party.
[66] One obvious change from the previous paper ([59]) is that online, as opposed to offline, replication is investigated. In other words, replication decisions are made while the system is in operation.
A new thing in this paper is a reasonable albeit slightly fluffy method for maintaining the metrics used to calculate the local call to mobility ratio.
In this online version of the scheme (called Hopper), the threshold variables are replaced by a threshold variable for creating a new replica and a threshold variable under which a current replica is removed. The values of these thresholds can tolerate non-optimality, as the simulations show. The thresholds for replication and deletion are picked from the basic cost inequality, comparing update cost and lookup cost for any two pairs of zones. Values are picked by instantiating the hop distance between the caller and the callee with their minimum and maximum values respectively. In other words, what used to be the lower threshold in the offline case, becomes the delete threshold and what used to be the upper threshold in the offline case becomes the replicate threshold.
The scheme uses a time threshold to determine when calling or mobility measurements are too old to use. When the time threshold elapses, calling and mobility measurements are flushed. To make sure that early estimates after a metric flush are not too inaccurate, calling and mobility measurements are not used to calculate new ratios until their sum surpasses another predefined threshold.
When, after a lookup, it is determined that the caller deserves a replica, but there are no more replicas available to create there, then the replica exchange protocol is used. This is done asynchronously. During this exchange, a random replica is considered for deletion. If the randomly selected replica is not as beneficial as the caller's, then the former is deleted and the latter is notified that it now contains a replica of the profile in question. The exchange is not performed if it's going to cause excessive transfer costs, for example if it has to travel too far, or if the system is currently overloaded. Replicas may also be deleted during updates: if a replica, when updated with a move, determines that it no longer surpasses the deletion threshold, it notifies the profile's current location that it has removed itself from the replica set.
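The per-zone online decision described over the last few paragraphs can be sketched as follows (the threshold and minimum-sample values are made-up numbers, not Hopper's):

```python
# Sketch of Hopper-style online replication decisions: replicate a
# profile into a zone whose call-to-mobility ratio exceeds the
# replicate threshold; drop a replica whose ratio falls under the
# delete threshold; trust no ratio until enough measurements accrue.

REPLICATE_T = 0.8    # from the cost inequality at max hop distance
DELETE_T = 0.2       # from the cost inequality at min hop distance
MIN_SAMPLES = 10     # guard against post-flush noisy estimates

def decide(calls, moves, has_replica):
    """Return 'replicate', 'delete', or 'keep' for one (zone, user)."""
    if calls + moves < MIN_SAMPLES:
        return "keep"                     # estimates still too noisy
    ratio = calls / max(moves, 1)         # local call to mobility ratio
    if not has_replica and ratio > REPLICATE_T:
        return "replicate"
    if has_replica and ratio < DELETE_T:
        return "delete"
    return "keep"

print(decide(calls=9, moves=10, has_replica=False))   # -> replicate
print(decide(calls=1, moves=20, has_replica=True))    # -> delete
print(decide(calls=2, moves=3, has_replica=False))    # -> keep
```

A ``replicate'' verdict when the replica budget is exhausted is what triggers the asynchronous exchange protocol.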
Another departure from the previous, offline algorithm, is that replicas are now only placed into leaf databases. The authors justify this choice with its lower complexity, especially in the case of online replica allocation.
The simulation part of the paper presents a multi-day run involving the ten most populous cities in the USA. The basic home location registry technique (with translation for lifelong numbering), basic home location registry with caching, simple hierarchical, HiPER and Hopper are simulated. The simulations show that Hopper can achieve fairly consistent lookup locality, due to its online nature, while maintaining low requirements on database storage.
[22] This work seems to be attacking directories from a knowledge-theoretic point of view. The idea is to use directories in a distributed system to share information about processes among processes, given constraints on latency and overhead. When constraints limit the amount of knowledge sharing, then knowledge estimation is used.
The concept of process knowledge, as defined in the theoretical literature, is too strong to cover practical distributed systems. The problem is that "knowledge" is too strong a word. Traditionally, knowledge of a fact X about process B by process A means that fact X is true about process B. Unless something is done by process A to change its X-knowing state, fact X will have to remain true regardless of process B's will. Since in practical systems the state of process B can change with or without A's consent, the concept of knowledge cannot be supported.
An argument is made about the dependency between global clocks and knowledge gain or loss. The argument is that if I know X is true between times a and b, then at time a I gain the knowledge that X is true, whereas at time b I lose that knowledge. However, that seems a bit naive, since the knowledge was ``X is true between times a and b'' all along; the knowledge is not lost after it has become non-applicable. A minor philosophical point.
Estimation is based on conditional probabilities. In practical systems, a process can usually know a fact about another process only with a probability lower than 1. This work tries to derive useful conclusions from sets of such probabilities connecting different facts in a distributed system. The goal for a directory design would be to optimize the estimate probabilities for the contained properties, for a given set of infrastructure constraints. Problem-specific properties, such as temporal or spatial locality, can affect how these estimate probabilities vary. Moreover, ways to increase probabilities are explored, as well as the price incurred.
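A toy instance of such estimation (the model is entirely my own, not the paper's: exponential state dwell times and a 0.5 prior after a change):

```python
# Sketch of knowledge estimation via conditional probability: the
# directory last heard that process B was "up"; with temporal locality,
# the probability that this is still true decays with elapsed time.

import math

def estimate(last_report_up, elapsed, mean_dwell=60.0):
    """P(B is up now | last report), under an exponential dwell model."""
    p_unchanged = math.exp(-elapsed / mean_dwell)
    p_up_if_changed = 0.5              # no information after a change
    p_up = p_unchanged + (1 - p_unchanged) * p_up_if_changed
    return p_up if last_report_up else 1 - p_up

print(round(estimate(True, elapsed=0.0), 3))    # -> 1.0 (fresh report)
print(round(estimate(True, elapsed=600.0), 3))  # decayed toward 0.5
```

The infrastructure constraint shows up as how often reports refresh `elapsed` back to zero; tightening it buys a higher estimate probability at a communication price.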
The directory design space is broken into two axes: organization (centralized, hierarchical, distributed) and maintenance (active or passive, push or pull).
The paper proceeds to describe the simple case of a two-state process communicating with a directory, in both lossless and lossy communications environments. Finally a specific case study is presented for a videoconferencing application.
[57] This standard describes exactly how things are modelled inside X.500 ([56]).
An administrative domain, as far as X.501 is concerned, consists of multiple subtrees not necessarily communicating with each other. This means that the domain may control multiple subtrees rooted at different points of the global tree.
X.501 seems to have a strange fascination with administrations of the national level. They even make the distinction between the domain belonging to a national ``administration'' (an administration being the national telecommunications carrier) and private administrations. They even go so far as mentioning national regulations governing administrations within the borders of a single country.
The specification recognizes the existence of multiple roles of a single person. They don't go far enough to define a correspondence between real objects (i.e. the real world) and online objects.
The administration of the directory is served by entries in the directory itself. Although those entries are mentioned as "administrative entries", they look exactly the way other normal entries look.
Attribute types may be subtyped. Every entry in the directory contains an attribute type and a non-empty list of values. A value may be annotated with a context. A particular entry may have many alternative values for various contexts. A common context is language.
Additional object schemas may be defined, beyond those standardized by central organizations, national organizations or private organizations. It is not clear how easy that will be to do.
Besides attribute types, there are also object types. Those are also subtyped. Multiple inheritance is supported.
There is a fine distinction between what is called a structural object class and an auxiliary object class. Roughly speaking, the former is an interface whereas the latter is an implementation. Structural object classes represent the way an entry looks inside the directory. An associated auxiliary class determines the semantics of that entry.
In section 8.8, matching rules are discussed which are used for searching through the directory. Although a set of basic matching rules is defined in X.520, no restrictions are placed on the kind of rules supported in X.500. The most basic rule is an attribute-value assertion. This checks an entry for the presence of an attribute and compares its value with that given in the assertion. Such rules may have a context associated with them, so that a particular instance of the sought attribute is checked. Every attribute type contains its own equality matching rule (somewhat like the equals method in Java). That can be used in attribute-value assertions. Rule evaluation is extremely flexible, to allow the same rule to match different things depending on where the directory resides. For example, if no context preference has been included in an assertion, then the context may default to what the particular answering directory considers default. Built-in rules include presence, equality, ordering, substring matching, and approximate matching.
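An attribute-value assertion with context defaulting could be sketched like this (the data model is drastically simplified relative to the standard, and the default-to-English choice stands in for whatever the answering directory considers its default):

```python
# Sketch of attribute-value-assertion evaluation: check presence of the
# attribute, then apply the type's equality matching rule to the values
# carried under the requested (or directory-default) context.

def match_ava(entry, attr, value, context=None, default_context="en"):
    """True if `entry` holds `attr` with `value` in the given context."""
    values = entry.get(attr)
    if values is None:
        return False                     # presence test fails
    ctx = context or default_context     # directory-chosen default
    for v, v_ctx in values:              # each value carries a context
        if v_ctx == ctx and v == value:  # type's equality matching rule
            return True
    return False

entry = {"cn": [("Petros", "en"), ("Πέτρος", "el")]}
print(match_ava(entry, "cn", "Petros"))                 # -> True (default "en")
print(match_ava(entry, "cn", "Πέτρος", context="el"))   # -> True
print(match_ava(entry, "sn", "Maniatis"))               # -> False (absent)
```

Two directories with different defaults would answer the context-free assertion differently, which is exactly the flexibility (and ambiguity) the standard allows.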
Section 9: A name is a sequence of Relative Distinguished Names (RDNs). X.501 acknowledges that alternate ways to name objects in the directory may be possible in the future. An entry's RDN is the set of attribute-value pairs for those of its attributes that have a distinguished value. If an attribute doesn't have a distinguished value, then it doesn't participate in the RDN of its enclosing entry. An entry may have multiple RDNs if some of its distinguished values have instances for different contexts. An RDN is modifiable.
Among the multiple RDNs for an entry, one is designated the "primary" RDN for it. It consists of the primary attribute-value pairs for all participating distinguished attribute-value pairs. This primary RDN, which is unique for an entry, is what most directory operations would normally return.
A distinguished name for an entry is the sequence of RDNs from the root of the directory to the entry, through the hierarchy. A distinguished name for an alias is not the same as the distinguished name for the entry the alias points to. Clearly, a primary distinguished name for an object is that consisting of primary RDNs.
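The naming scheme above can be sketched in a few lines: a distinguished name is the sequence of RDNs along the path from the root to the entry, and each RDN is a set of attribute-value pairs drawn from the entry's distinguished values. The string rendering below is purely for readability and is not the standard's encoding.

```python
# Toy model: an RDN is a dict of {attribute_type: distinguished_value};
# a distinguished name (DN) is the root-to-entry sequence of RDNs.

def rdn_to_str(rdn):
    # A multi-valued RDN is rendered with "+" joining its pairs.
    return "+".join(f"{t}={v}" for t, v in sorted(rdn.items()))

def dn(rdns):
    # `rdns` is ordered from the root down to the entry.
    return ",".join(rdn_to_str(r) for r in rdns)

path = [{"c": "US"}, {"o": "Stanford"}, {"cn": "Petros Maniatis"}]
print(dn(path))  # c=US,o=Stanford,cn=Petros Maniatis
```

An alias entry would carry its own DN built this way; only dereferencing it yields the DN of the object it points to.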
Alias entries are always leaf nodes. They do not necessarily have to point to the name of a real object.
Section 10: Administered areas are partitions of the global tree that are administered by a single administrative organization. For example, the subtree belonging to the U.S. constitutes an administrative area administered by the actual organization in the U.S. dealing with its share of the X.500 global hierarchy. The root entry of such a subtree is the administrative point.
Inside an autonomous administrative area, four kinds of specific administration are supported: names (i.e., deciding which names can or cannot be given), schemas (deciding how governed entries will be structured), security, and collective data. Specific administrative areas deal with subtree schemas, security (i.e., access control), collective entries and default contexts (i.e., which context will be considered the default one when a default context is needed).
Areas dealing with specific kinds of administration among those above may overlap; for example an administrative area that deals with security might overlap with a different administrative area that deals with the default context. A domain as described above consists of potentially multiple autonomous administrative areas.
Section 11: A subtree refinement is defined as a subtree whose shape has been carved either by depth (i.e., excluding a number of upper and/or lower levels), by name (i.e., excluding a number of its subtrees, with or without the subtree root) or by type (i.e., excluding, or limiting to, a set of entry types).
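The three carving criteria can be sketched as filters over a flat list of entries. The `(depth, name, entry_type)` tuple model and the parameter names are assumptions made for this sketch; for brevity, name-based exclusion here drops only the named entries rather than whole subtrees.

```python
# Illustrative subtree refinement: filter entries by depth, by name,
# and by entry type, mirroring the three criteria described above.

def refine(entries, min_depth=0, max_depth=None,
           excluded_names=(), allowed_types=None):
    out = []
    for depth, name, etype in entries:
        if depth < min_depth:                            # carve off upper levels
            continue
        if max_depth is not None and depth > max_depth:  # carve off lower levels
            continue
        if name in excluded_names:                       # carve by name
            continue
        if allowed_types is not None and etype not in allowed_types:  # by type
            continue
        out.append((depth, name, etype))
    return out

tree = [(0, "root", "country"), (1, "Stanford", "org"), (2, "cs", "orgUnit")]
print(refine(tree, min_depth=1, excluded_names={"cs"}))
# [(1, 'Stanford', 'org')]
```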
Section 12: A directory schema, associated with an administrative area, contains information about the kinds of names possible in the area, the possible object classes, attribute types, matching rules, contexts, associations of attributes to possibly containing object classes, and possible associations of object classes with subordinate object classes in the hierarchy.
Class hierarchy is explicit in the directory. More specifically, every object contains the identifier of all the superclasses of its own class. This need not necessarily be reflected in the actual storage. The directory does make sure that entries of a particular object class follow the rules (contain all the mandatory attributes, do not contain unanticipated attributes, do not allow multiple values in singleton attributes and so on).
Attribute types contain information such as their superclass, their syntax, matching rules, whether they are operational (i.e., meta-informational) or user controlled, whether they are collective or not and whether they are singleton or not. Operational and user attributes form separate hierarchies. Attributes lacking equality matching rules cannot be used for naming and lose their type when queried.
To fully specify the structure of an administrative area, its administrative point will contain enough subentries describing allowed structural classes, their naming and their content rules. Name forms are assigned to an object class and determine which of the mandatory or optional attributes of that class can be used in its RDN.
Structure rules specify how entries can be linked to each other in the hierarchy within an administrative area. They associate a name form (which, in turn, refers to an object class) with a set of possible superior structure rules (each of which points to an object class as well). For example, to make it possible for an entry of object class A to have a subordinate entry of object class B, first a name form for class A has to be created (call it name form C) that defines an RDN scheme for class A, and then a structure rule D for name form C (connecting it to its superiors). Then, a name form E for class B should exist, along with a structure rule F containing E as its name form and designating structure rule D as its superior rule.
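The running A/B/C/D/E/F example can be encoded in a toy model: a name form binds an object class to an RDN scheme, and a structure rule binds a name form to its allowed superior rules. The class and attribute layout below is an illustration only, not the standard's definition.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NameForm:
    name: str
    object_class: str
    rdn_attributes: tuple       # attributes that may appear in the RDN

@dataclass
class StructureRule:
    name: str
    name_form: NameForm
    superior_rules: tuple = ()  # rules whose entries may sit above this one

# Name form C gives class A an RDN scheme; structure rule D anchors it.
C = NameForm("C", object_class="A", rdn_attributes=("cn",))
D = StructureRule("D", name_form=C)
# Name form E and structure rule F let class B entries sit under class A.
E = NameForm("E", object_class="B", rdn_attributes=("ou",))
F = StructureRule("F", name_form=E, superior_rules=(D,))

def may_be_subordinate(child_rule, parent_rule):
    return parent_rule in child_rule.superior_rules

print(may_be_subordinate(F, D))  # True
print(may_be_subordinate(D, F))  # False
```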
The actual contents of an entry within the specified administrative area are dictated by a content rule. A content rule takes a particular object class and turns it into a specialized/customized class for this administrative area. It can give it additional mandatory and optional attributes (beyond those defined in the class definition), specify additional auxiliary object classes ("implementation" classes), and preclude some of the inherited optional attributes from appearing.
Finally, contexts within the area are defined in two phases: context types are primary, fundamental contexts that can be used within the area. Those are defined independently of the attributes where they may apply. In the second phase, context use rules associate an attribute with a set of mandatory contexts (they all have to be defined whenever this attribute is met in the area) and a set of optional contexts.
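The two-phase context definition above suggests a simple validity check: every mandatory context must accompany the attribute wherever it appears in the area, and only declared contexts may appear. The table layout and validation function below are assumptions for illustration.

```python
# Illustrative context use rules: per-attribute mandatory and optional
# context sets, with a check over the contexts present on a value.

CONTEXT_USE = {
    "cn": {"mandatory": {"language"}, "optional": {"locale"}},
}

def contexts_valid(attr_type, present_contexts):
    rule = CONTEXT_USE.get(attr_type)
    if rule is None:
        return True                       # no use rule for this attribute
    present = set(present_contexts)
    if not rule["mandatory"] <= present:  # every mandatory context must appear
        return False
    allowed = rule["mandatory"] | rule["optional"]
    return present <= allowed             # nothing beyond declared contexts

print(contexts_valid("cn", ["language"]))            # True
print(contexts_valid("cn", ["locale"]))              # False: language missing
```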
Section 13: The directory system schema is defined within an area to determine how administrative and operational properties of the area can be included. It determines the kinds of subentries that may be created and defines operational attributes for the area, so it deals with subentry name forms. Other operational attributes include timestamps, creator or actor identities, associations of entries with their governing subentries, access control, and context assertions.
Section 14: In this section the actual structure of a subentry is described and how all pieces of schema information from sections 12 and 13 come together into the subentry of an administrative point.
Section 15: Authentication procedures work through the association of a user identity and an access control identity. The latter is used in all access control decisions.
An access controlled specific administrative area is governed by the access control scheme defined in a subentry of its administrative point.
All arguments, results or errors in directory operations may use a protection mapping. There are three defined protection mappings: signing, encryption, and signing-and-encryption.
Section 16: Several access control schemes can be utilized inside the directory, to suit the requirements of the current administrative area. Basic access control is fairly flexible and somewhat complicated. Simple access control is a subset of basic access control. Rule-based access control can set up procedural kinds of access decision functions.
In access control schemes, access control items are the basic units of control specification. They can be placed at various locations throughout an administrative area, with different semantics. Sets of them can be combined and precedence is defined among access control items in a set according to an integer priority. An item can have different levels of fidelity requirements placed on the authentication of the requester of an operation. Specifically, the identity of the requester might have no fidelity (i.e., we assume the requester is who it claims to be), simple fidelity (i.e., password based) or strong fidelity (i.e., based on strong authentication).
Access control can apply to a number of different kinds of "protected items". Such are entire entries, attributes, attribute values or names. More specific instances of those may be specified as protected items, such as "all attribute values of attribute X". Additionally, access controls can be associated with constraints, such as "do X if by doing so there won't be more than y values for attribute z".
Permission categories include standard permissions like read, browse, add, modify, permissions for moving entire subtrees to a different location, for using attributes or attribute values in filter matches etc.
Permission items can apply to an entire administrative area (in which case they are called prescriptive and are placed within a subentry of the administrative point), to the subentries of an administrative point (in which case they are called subentry and are placed within an administrative point), or to a single entry (in which case they are placed within the entry itself). When making an access control decision, all access control items of the scope within which the protected item resides are taken under consideration. Those are the entry item, if one exists, and all prescriptive items for all administrative areas within which the item lives, up to the top of an autonomous area. Similarly, for a subentry, its own entry item, its owning administrative point subentry item and all superior prescriptive items up to the top of an autonomous area are taken under consideration.
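The scoping and precedence rules above can be sketched as follows: the items relevant to a decision are the entry's own item plus the prescriptive items of every enclosing administrative area up to the autonomous area, ordered by the integer precedence among items in a set. The dict-based item model and field names are assumptions made for this sketch.

```python
# Illustrative collection of the access control items (ACIs) relevant to
# a decision about a protected item, ordered by precedence (higher wins).

def relevant_acis(entry_aci, enclosing_area_acis):
    """entry_aci: optional ACI stored in the entry itself (or None).
    enclosing_area_acis: prescriptive ACIs of the administrative areas
    containing the item, up to the top of the autonomous area."""
    acis = ([entry_aci] if entry_aci is not None else [])
    acis += list(enclosing_area_acis)
    # Precedence among ACIs in a set is an integer; consider highest first.
    return sorted(acis, key=lambda aci: aci["precedence"], reverse=True)

ordered = relevant_acis(
    {"precedence": 5, "grant": "read"},
    [{"precedence": 10, "deny": "read"}, {"precedence": 1, "grant": "browse"}],
)
print([a["precedence"] for a in ordered])  # [10, 5, 1]
```

Because all of these items live within one autonomous area, this ordering can be computed locally, consistent with the rule that no remote directory system agent needs to be consulted.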
A general rule governing how basic access control schemes operate is that any access control decision should be answerable without having to inquire remote directory system agents.
The ``simplified access control'' scheme is the same as the ``basic access control scheme'', except that it doesn't support inner security areas (i.e., no nesting) and it doesn't support access control items within single entries.
Section 17: In rule-based access control schemes, access control associated with specific data is not explicit. Instead, a general label indicating how restricted a piece of information is gets attached to the relevant entry. This label, along with the properties of the requester and the policy used, determines whether access will be granted or not.
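A minimal sketch of such label-based access: data carries a sensitivity label rather than an explicit ACL, and the policy compares the requester's clearance against it. The linear ordering of levels below is an assumption for illustration; real policies may use richer structures such as lattices.

```python
# Illustrative rule-based (label) access control: grant access when the
# requester's clearance dominates the entry's sensitivity label.

LEVELS = ["public", "restricted", "secret"]  # ordered least to most sensitive

def access_granted(entry_label, requester_clearance):
    return LEVELS.index(requester_clearance) >= LEVELS.index(entry_label)

print(access_granted("restricted", "secret"))  # True
print(access_granted("secret", "public"))      # False
```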
Section 18: X.500 provides support for maintaining integrity information within the directory. This is done by supporting digital signatures within the directory along with the information itself.
Similarly, X.500 provides support for maintaining confidentiality of the stored information (i.e., keeping stored information secret even from the directory itself). This is done by designating stored information as encrypted using a particular algorithm.
[2] This book contains fingerprints for some of the most important public keys used on the Internet at the time of its publication (early 1999).
The reasons for putting the functionality of a global CA into paper form have been to
X.509 and PGP key fingerprints are included. The keys have been given a level of trust, ranging from D (minimal trust) to A (full personal trust).
In the process of accumulating the keys printed within this book, the editors have identified some ominous problems in the way that certificates are handled in web browsers. They have unearthed implementation flaws (giving incorrect results), design flaws (ignoring important protocol-mandated fields from certificates), and distribution flaws (hard-coding root CA keys into an application whose lifetime might surpass that of the hard-coded CA key).