Lightweight Remote Procedure Call
Bershad, Anderson, Lazowska, Levy, ACM TOCS 8(1):37-55 (Feb. 1990)

LRPC combines:

* Fine-grained protection using capabilities (hard to implement): per-object protection domains, with all objects living in a single address space. Objects can only be touched via protected procedure calls, which transfer control into the object's domain.

* RPC-like large-grained protection mechanisms, which were proven to work locally in Mach. Boundaries correspond to machine boundaries.

Insight: RPC was originally optimized for calls across the LAN as the common case. The key observation is that today the common case is embodied by cross-domain calls on the same machine (over 95% of the time), and simple calls with small arguments/results (<= 50 bytes) still dominate. LRPC aims to improve RPC for this new common case.

Traditional RPC overhead is due to 7 factors: use of stubs, message copying, access validation, message transfer, scheduling of abstract threads onto concrete ones, context switches, and dispatching in the server.

LRPC borrows the execution model of protected procedure calls, while maintaining the programming semantics and protection model of RPC.

Binding: Each server domain has a clerk, which registers the service's interface with a name service. When a client binds to the interface, it issues an import call to the kernel, which passes it to the clerk; the clerk in turn returns a procedure descriptor list to the kernel. For each procedure, this list contains an entry address, the number of simultaneous calls allowed, and the size of its argument stack (A-stack). The kernel allocates the A-stacks, shared between client and server, plus a linkage record for each A-stack; it then returns to the client a capability called a binding object.

Calling: The client's stub places the args on the A-stack, places {A-stack address, binding object, procedure identifier} in registers, and traps to the kernel.
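The calling sequence can be mocked up as a single-address-space simulation. Everything here (the function names, the `struct` wire format, the in-use set) is invented for the sketch, not the paper's actual interface:

```python
import struct

in_use = set()   # A-stack/linkage pairs currently claimed by a call

def client_stub_add(astack, a, b):
    """Client stub: marshal two ints directly onto the shared A-stack
    (referents of by-reference args would be copied here too), then
    'trap' to the kernel with the A-stack and a procedure identifier."""
    struct.pack_into("<ii", astack, 0, a, b)
    return kernel_trap(astack, proc_id=0)

def kernel_trap(astack, proc_id):
    """Kernel: after validating the binding object and proc_id, claim
    the A-stack/linkage pair, record the return linkage, and upcall
    into the server's stub (on an E-stack in the server domain)."""
    assert id(astack) not in in_use              # pair must not be reused
    in_use.add(id(astack))
    linkage = {"return_to": "client_stub_add"}   # stands in for the linkage record
    try:
        return server_stub_add(astack)           # the upcall
    finally:
        in_use.discard(id(astack))               # released on return

def server_stub_add(astack):
    """Server stub: unmarshal args from the A-stack, run the procedure,
    and write the result back into the same shared A-stack."""
    a, b = struct.unpack_from("<ii", astack, 0)
    struct.pack_into("<i", astack, 0, a + b)
    return struct.unpack_from("<i", astack, 0)[0]

print(client_stub_add(bytearray(64), 2, 3))  # -> 5
```

Note how no message is ever copied between domains: arguments and results live in the one shared A-stack, which is exactly where the asynchronous-modification caveat below comes from.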
If any args are passed by reference, the referents must be copied onto the A-stack, to protect servers from bad pointers. The kernel verifies everything, ensures the A-stack/linkage pair is not in use by another call, and pushes the linkage onto the A-stack. It then finds an execution stack (E-stack) in the server domain, pushes a pointer to the A-stack's args onto it, switches context into the server's domain, and performs an upcall into the server's stub. When the server procedure is done, it returns to the stub, which returns to the caller's domain. Note that this scheme allows clients or servers to asynchronously modify the A-stack after control has been transferred across domains.

Stub generation: Most stubs are generated directly in assembly; this is simple because most of what they do is moves and traps. More complicated work (binding, exception handling, call failure) and the marshalling of complex or large arguments are handled by Modula2+ code.

Performance hacks: On multiprocessors, LRPC uses domain caching to reduce context-switch overhead: if a processor is idling in the server domain, the caller is switched to that processor when it enters the server domain (and the idle thread moves to the caller's processor). If, by the time the call completes, the idle thread switched into the client domain is still idle, the caller is returned to its previous processor. To make this happen often, LRPC has idle processors spin in domains known to have high call activity. Performance could be improved further with PID-tagged TLBs.

Miscellaneous performance tricks:

* avoid locking shared data during call and return, to avoid contention
* use a bit in the binding object to indicate whether a call should be local or remote; remote calls go through traditional RPC stubs

* a calling thread is allowed to issue multiple outstanding LRPC calls (using different A-stacks), unlike in traditional RPC

Domain termination: The kernel can terminate a domain without synchronously terminating its outstanding threads. However, all binding objects involving the domain (as client or server) are revoked, so no further out-calls or in-calls can be made.
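Stepping back to the binding step: the clerk/import handshake can be modeled with a few toy structures. `ProcDesc`, `Binding`, `import_interface`, and the field names are all assumptions made for this sketch, not the kernel's real data layout:

```python
from dataclasses import dataclass, field

@dataclass
class ProcDesc:
    """One entry in a procedure descriptor list (PDL)."""
    entry_addr: int     # server procedure entry point
    max_calls: int      # simultaneous calls permitted on this procedure
    astack_size: int    # bytes of argument stack (A-stack) to allocate

@dataclass
class Binding:
    """Capability returned to the client: the 'binding object'."""
    interface: str
    astacks: list = field(default_factory=list)  # (A-stack, linkage) pairs

def import_interface(name, registry):
    """Kernel-side 'import': take the PDL the clerk registered,
    allocate a shared A-stack plus a linkage record per allowed
    simultaneous call, and hand the client a binding object."""
    pdl = registry[name]
    binding = Binding(interface=name)
    for proc in pdl:
        for _ in range(proc.max_calls):
            astack = bytearray(proc.astack_size)       # shared client/server
            linkage = {"return_addr": None}            # one per A-stack
            binding.astacks.append((astack, linkage))
    return binding

# A clerk has registered a two-procedure interface with the name service.
registry = {"fs": [ProcDesc(0x1000, 2, 64), ProcDesc(0x2000, 1, 128)]}
b = import_interface("fs", registry)
print(len(b.astacks))  # -> 3 A-stack/linkage pairs
```

Pre-allocating the pairs at bind time is what lets the later call path avoid allocation entirely: a call only has to claim a free pair.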
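The domain-caching heuristic from the performance hacks above can be illustrated with a toy placement routine; `Processor`, `place_call`, and the domain strings are invented for the sketch:

```python
from dataclasses import dataclass

@dataclass
class Processor:
    current_domain: str
    idle: bool

def place_call(caller_cpu, cpus, server_domain):
    """On an LRPC into server_domain: if some processor is idling in
    that domain, migrate the caller there (parking the idle thread on
    the caller's old processor) instead of paying for a full context
    switch on caller_cpu. Returns the processor the call runs on."""
    for cpu in cpus:
        if cpu.idle and cpu.current_domain == server_domain:
            cpu.idle = False          # caller runs in the cached domain
            caller_cpu.idle = True    # idle thread swaps onto the old CPU
            return cpu
    caller_cpu.current_domain = server_domain   # fall back: switch domains
    return caller_cpu

cpus = [Processor("client", False), Processor("server", True)]
target = place_call(cpus[0], cpus, "server")
print(target is cpus[1])  # -> True: call ran on the cached server-domain CPU
```

Spinning idle processors in hot domains raises the hit rate of the loop's first branch, which is the whole point of the optimization.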