5/20/01 Notes from HotOS
- JX: are language-based, type-safe OS's becoming reasonable? Do we
care? (Many OS fault modes, such as resources not being reclaimed b/c
of zombies, unclaimed interrupts, etc., are not any better addressed)
- Sys support for programming routers: Statically proving safety of
functions is hard (deps on code, inputs, dynamic env), so need abstractions
that don't limit expressiveness of extension code despite this. Soln:
expose protection HW of router at lowest level: fixed invocation overhead
(switching uproc protection domain), native execution (vs VM), hard
protection guarantees [had a pointer to last SOSP?]. Performance: hard
to predict CPU needs of router's core tasks, so use priorities (not
proportional sharing) as primitive, and have extensions adapt to transient
unavailability of CPU. Event-driven control flow: localize state in
fns, carry invocations to fns, so fns can share state across flows
(interflow priority scheduling, etc). How do you carry the state
around? Fn global variables? Do fns have to be reentrant?
Fine-grained sched is only possible if functions voluntarily yield...no?
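The event-driven control-flow idea above can be sketched roughly as follows. This is my illustration, not the paper's code: the names (Invocation, CountingHandler) are invented, and it shows state localized in the handler function, invocations carrying per-flow context, and priority-ordered dispatch — which only works if handlers return promptly (the voluntary-yield worry).

```python
# Sketch: handler-local state shared across flows, with invocations
# dispatched in priority order rather than per-flow threads.
import heapq

class Invocation:
    """Carries per-flow context to a handler; no thread owns it."""
    def __init__(self, flow_id, priority, payload):
        self.flow_id = flow_id
        self.priority = priority
        self.payload = payload
    def __lt__(self, other):            # lets heapq order invocations
        return self.priority < other.priority

class CountingHandler:
    """State lives in the function, so it can see across flows
    (the basis for interflow priority scheduling)."""
    def __init__(self):
        self.per_flow_counts = {}
    def __call__(self, inv):
        self.per_flow_counts[inv.flow_id] = \
            self.per_flow_counts.get(inv.flow_id, 0) + 1

def run(handler, invocations):
    # Fine-grained priority scheduling: pick the highest-priority
    # pending invocation each time; depends on handlers not blocking.
    q = list(invocations)
    heapq.heapify(q)
    order = []
    while q:
        inv = heapq.heappop(q)
        handler(inv)
        order.append(inv.flow_id)
    return order
```

Note that the handler's own dict answers the "how do you carry the state around?" question here with function-global state, which in turn means the handler must be careful about reentrancy.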
- Lazy thread switching (Jochen Liedtke) has an interesting use of soft
state as an optimization. idea: for multiple kernel threads inside
same user addr space, keep a twin of TCB info in user space. When t1
IPC's to t2, just switch from t1 to t2 by modifying user-space copy of TCB
(so you save a kernel crossing); when a "real" kernel activation
occurs (exception, timeslice expires, etc), the kernel can notice that the
UTCB and KTCB are inconsistent and force the UTCB to match KTCB.
Threads can destroy their own tasks/hose themselves by stomping on UTCB, but
KTCB is still "the truth". Result: you save 2 kernel
crossings per IPC (you can have several "user-space-only" thread
switches before any kernel event forces a "real" kernel thread
switch). A nice use of soft state! Gives you most(?) of the fault and performance isolation of
kernel threads without the overhead of kernel crossings. (Their
tagline: "Might reunify kernel threads and user threads") Then
they tried it on the new P-IV (which has a non-P6 core!) and it took a lot
longer - and inserting NOP's actually made it faster! (probably RS
skid, or RA stall, or screwing up Load Value Prediction, or something like
that...a good example of uarch working against the OS!)
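A toy model of the UTCB/KTCB soft-state scheme (my reconstruction of the mechanism as described above, not Liedtke's code): user-level IPC switches only touch the user-space twin, and the next real kernel event reconciles the two, with the KTCB as the truth if the UTCB has been stomped on.

```python
# Soft-state TCB twins: UTCB is the cheap user-writable copy, KTCB is
# authoritative. Several user-space-only switches can happen before a
# kernel event forces reconciliation.
class TCBs:
    def __init__(self, current):
        self.utcb_current = current   # soft state: user-writable twin
        self.ktcb_current = current   # hard state: kernel-only

    def user_ipc_switch(self, target):
        # t1 IPCs to t2: modify only the user-space copy, saving the
        # kernel crossings a real thread switch would take.
        self.utcb_current = target

    def kernel_event(self):
        # Exception / timeslice expiry: kernel notices UTCB != KTCB.
        # A sane UTCB commits the lazy switches; a corrupted one is
        # overwritten from the KTCB (threads can only hose themselves).
        if self._is_sane(self.utcb_current):
            self.ktcb_current = self.utcb_current
        else:
            self.utcb_current = self.ktcb_current
        return self.ktcb_current

    def _is_sane(self, tid):
        # Stand-in validity check; the real kernel would validate the
        # thread ID against its own tables.
        return isinstance(tid, int) and tid >= 0
```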
- Fail-stutter FT (Remzi and Andrea): things don't "just fail" or
fail-stop: fault masking, slow death, and geometry-based performance
degradation (disks); fault masking (ECC, etc.) in processors, non-determinism;
deadlock, unfairness, congestion (networks). Remzi argues: even
hardware does fail and fault-mask, so shouldn't trust your life to
it. OK, but can we apply redundancy and isolation/randomness (to
enforce indep failure assumption)? Fail-stutter FT attempts to
capture this "intermediate" between Byzantine and fail-stop.
Esp. a problem in parallel performance assumptions when one component is
fail-stuttering (but therefore looks "normal"). Toward a
model: try to capture performance fault (compared to performance
spec). (Gribble also suggested introspecting based on "known
steady state" behavior. See Richardson et al. on Internet performance
failure detection for ideas...and add to reading list)
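A hypothetical detector in the spirit of the "performance fault vs. spec" idea and Gribble's known-steady-state suggestion: flag a component as fail-stuttering when it still responds but runs far below its learned baseline. The class name, window size, and threshold are all made up for illustration.

```python
# Learn a steady-state baseline from an initial window, then flag
# observations that fall well below it: alive, hence "normal"-looking
# to fail-stop detectors, but stuttering on performance.
from collections import deque

class StutterDetector:
    def __init__(self, window=10, slowdown=0.5):
        self.window = window
        self.slowdown = slowdown            # flag below 50% of baseline
        self.samples = deque(maxlen=window)
        self.baseline = None

    def observe(self, throughput):
        """Returns True when the component looks fail-stuttering."""
        self.samples.append(throughput)
        if self.baseline is None:
            if len(self.samples) == self.window:
                # adopt the first full window as "known steady state"
                self.baseline = sum(self.samples) / self.window
            return False
        return throughput < self.slowdown * self.baseline
```

This is exactly the intermediate regime between fail-stop and Byzantine: the node answers correctly, just pathologically slowly, which is what wrecks parallel-performance assumptions.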
- What's been missing from Remzi and Steve's papers: what precisely is the
model? what precisely are the design guidelines for building a system?
- Jay Lepreau: these techniques (discussed in FT sessions) really only work
when systems are loosely coupled and state-coupling-based dependencies are
eliminated. Also, there's a cost to retrofitting existing systems (or
creating new ones) that use these nice interface boundaries. To
what extent can we use machine virtualization to fix existing systems?
- Marvin: Byzantine FT is bunk, because assumes 3f+1 good nodes; in
practice, successful attacks take down large numbers of nodes at once, so
it's hard to argue that Byzantine assumption holds. (Unless you make
your Byzantine group enormous, but BFT message cost grows as n^2)
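The 3f+1 arithmetic behind Marvin's objection, as a quick sketch: a BFT group of n nodes tolerates only f = floor((n-1)/3) simultaneous faults, so covering a correlated attack that takes down many nodes at once requires an enormous (and, given the n^2 message cost, expensive) group.

```python
# BFT sizing arithmetic: n >= 3f + 1.
def bft_tolerated_faults(n):
    """Max simultaneous Byzantine faults an n-node group survives."""
    return (n - 1) // 3

def nodes_needed(f):
    """Group size required to tolerate f Byzantine faults."""
    return 3 * f + 1
```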
- PAST: p2p storage system with Tapestry-like routing - in fact I'm not sure
how it is different from Plaxton mesh.
- Chord (distributed lookup for p2p): maps DocID to nodeID w/consistent
hashing. Nodes and docs share ID space; docID N is stored on node N if it
exists, otherwise on the nearest successor-numbered node. Each node
knows of logN other nodes, corresponding to those nodes whose ID's are 1/2,
1/4, ..., 1/2^N away in the node ID space
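A minimal sketch of the Chord mapping above, assuming a global view of the live nodes (the real protocol resolves lookups in O(log N) hops using each node's finger table instead). The toy ring size and helper names are mine.

```python
# Chord-style successor mapping and finger table over an m-bit ID ring.
import bisect, hashlib

M = 8  # m-bit identifier space (toy size; Chord uses e.g. 160 bits)

def chord_id(key, m=M):
    """Consistent hashing: map a key into the shared ID space."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (2 ** m)

def successor(doc_id, node_ids):
    """Doc with ID k lives on the first node whose ID >= k,
    wrapping around the ring."""
    nodes = sorted(node_ids)
    i = bisect.bisect_left(nodes, doc_id)
    return nodes[i % len(nodes)]          # wrap to lowest node ID

def fingers(n, node_ids, m=M):
    """Node n's logN-ish routing state: the successors of
    n + 2^i for i = 0..m-1 (i.e., points 1/2, 1/4, ... of the
    ring away, counted from the far end)."""
    return [successor((n + 2 ** i) % (2 ** m), node_ids)
            for i in range(m)]
```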
- Greg Ganger: for better security, make subsystems have their own security
perimeters (disk, NIC, display, etc). Challenges: how to do
delegation, what should each device do behind security perimeter, etc.
Strongly reminiscent of "orthogonal security" as espoused by
Goldberg, Wagner & Brewer (but was never published).
- Rob Ricci and Jay Lepreau: to make p2p networks more censor-resistant, use
"protocol objects" to deploy and replace transport protocols
hop-by-hop. A PO identifies the protocol it will use by a hash.
Virtualization (Brian Noble, Peter Chen):
- Idea: better to run some services/apps on VM on top of host OS. Can
introspect what the app/guest OS is doing (eg writes to PTBR indicate addr
space changes on VM), apply well known FT techniques (log all VM activity,
then replay log to replay a complete execution). Examples: secure
logging. Current systems vulnerable to hackers that turn logging off;
do logging of VM activities to do intrusion analysis. To figure out
what to log, use methods from Thy & Practice of Failure Tol. (log
nondeterministic events + network messages, etc)
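The log-and-replay idea can be sketched like this (event shapes and names are invented for illustration): treat the VM as a deterministic state machine, record every nondeterministic input (network messages, timer deliveries, etc.) to a log that lives outside the VM where an intruder can't turn it off, then replay the log to reproduce the complete execution for intrusion analysis.

```python
# Deterministic replay: given the same nondeterministic inputs, the
# VM computes the same states, so logging only those inputs suffices.
def run_vm(nondet_source, log=None):
    """A stand-in 'VM': deterministic apart from the events it is fed."""
    state = 0
    for event in nondet_source:
        if log is not None:
            log.append(event)        # secure log kept outside the VM
        kind, value = event
        if kind == "net_msg":
            state = state * 31 + value
        elif kind == "timer":
            state += value
    return state

# Original run: record all nondeterministic events.
log = []
original = run_vm([("net_msg", 7), ("timer", 3), ("net_msg", 2)], log)

# Later analysis: replaying the log alone reproduces the execution.
replayed = run_vm(log)
```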
One.world (Robert Grimm)
- Exterminate complex/unsuitable abstractions: distrib objects (hard to
evolve, formats set by standards bodies, difficult to control from security
standpoint since all object accesses involve remote code execution);
transparent distribution (eg RPC, because they embody single-node-model
assumptions that failure is uncommon case). We have similar
motivations and have come up with somewhat different approaches; should
compare point by point.
- Model they adopt: tuples and event handlers. Environments serve as
containers for those (and can contain other environments).
Environments can be migrated; the idea is that the environment doesn't have
residual dependencies/pointers outside itself that are implicit.
Migration == isolate migration logic in separate environment, then embed
application's environment in that environment.
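A toy rendering of the model (my guess at the shapes, not one.world's actual API): environments contain tuples, event handlers, and nested child environments, and migration moves the whole container, which is legal precisely because the state has no implicit pointers outside it.

```python
# Environments as self-contained containers for tuples, handlers, and
# child environments; migration re-parents the container wholesale.
class Environment:
    def __init__(self, name):
        self.name = name
        self.tuples = []        # data is tuples, not distributed objects
        self.handlers = {}      # event name -> handler
        self.children = []      # nested environments

    def nest(self, child):
        self.children.append(child)
        return child

    def on(self, event, handler):
        self.handlers[event] = handler

    def emit(self, event, tup):
        self.tuples.append(tup)
        if event in self.handlers:
            self.handlers[event](tup)

def migrate(child, src, dst):
    """Move an environment between parents; everything the app needs
    (no residual dependencies) travels inside the container."""
    src.children.remove(child)
    dst.children.append(child)
    return dst
```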
Case for Resilient Overlay Networks (Dave Andersen, Hari B.)
- Make small resilient overlays on the real Internet, to synthesize a
meta-Internet with a small number of nodes. This lets you do things
that would be hard if network scalability was a real issue. The real
Internet isn't many-to-many anyway; it has choke points, and [Labovitz 00]
says Internet routing convergence is an order of magnitude slower than
previously thought - 3 min. recovery time, 15 min max, for simple
failures. Claim: scalability/heterogeneity fundamentally lead to slow
recovery. [No proof offered of this claim] So instead, make
small/homogeneous groups, and do fast recovery in those groups.
- Resilient overlay networks (RON) can be used to make more sophisticated
routing decisions, multiple route tables, packet inspection for
content-policy-based routing, etc.
- "Conduits" emulate sendto() and recvfrom() so existing apps will