Castro/Liskov "proactive recovery"; Ben Zhao's paper including Plaxton trees; TACT; DDS.
Miguel Castro & Barbara Liskov, Proactive Recovery in a Byzantine-fault-tolerant file system. Idea: node recovery in a BFT system (read: rebooting) can fix a corrupted/compromised node, but it may be too hard to detect a failure until it's too late; so instead, proactively "recover" nodes on a schedule, even when you have no reason to believe anything's wrong. Define quorums of 2f+1 replicas each (total of 3f+1 replicas in the group); then every quorum intersection contains at least 1 correct replica. A quorum certificate, which verifies a quorum's reply and ordering on a message, may be corrupt if an attacker has taken over the primary (which is in charge of generating the cert?). So the MAC key must be changed periodically (i.e., revoke possibly-bad/stale signatures).
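The quorum arithmetic is easy to sanity-check: two quorums of size 2f+1 drawn from 3f+1 replicas must overlap in at least f+1 replicas, and since at most f are faulty, every intersection contains at least one correct replica. A quick sketch of the counting argument:

```python
def min_overlap(n, q):
    # Two size-q subsets of an n-element set share at least 2q - n elements.
    return max(0, 2 * q - n)

for f in range(1, 10):
    n, q = 3 * f + 1, 2 * f + 1   # total replicas, quorum size
    # Worst-case overlap is f+1, one more than the f possibly-faulty
    # replicas, so at least one replica in the intersection is correct.
    assert min_overlap(n, q) == f + 1
```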
Steve Gribble et al., DDS. Claim: getting max performance from a service isn't an interesting problem to work on; getting enough performance while maintaining the "ilities" is. (This reinforces the current trend in systems, expounded by Dave Patterson at HotOS-7.) Insights: (a) it's OK to say no, at every level, to avoid making guarantees that may be difficult to keep when weird conditions/corner cases obtain; (b) enforce invariants to simplify analysis/reasoning; (c) many small partitions are easier to manage, and a fast SAN makes them practical. A 128-node, Java-based DDS can support read/write workloads on the order of ~8B queries/day, which is the correct order of magnitude for a very large infrastructure service.
There seems to be a "compiler renaissance": applying compiler techniques (or the compiler itself, as a point of very high leverage) to application-level-semantic things, either with explicit programmer annotations (MC, Devil) or without (Angela Demke Brown - "scheduler activations for managing physical RAM"). We should be sure the SNRC integrated grant exploits this in conjunction with restart-based availability as much as possible.
Angela Demke Brown and Todd Mowry, Taming the Memory Hog (Compiler-Inserted Physical Memory Releases).
Idea: "scheduler activations meets Mach-style application-level page-cache management". The compiler inserts code that prefetches, releases, etc., because it can see things like reuse patterns (cycling over arrays) and locality. This code makes decisions at runtime informed by extra info about VM/physical-memory usage that the OS exposes to apps, and can include things like a "priority list" for releases. The OS can still force paging out, etc., if the code doesn't release enough memory.
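A toy sketch of the kind of release call the compiler might emit (all names here are hypothetical; a real implementation would issue an OS hint such as madvise(MADV_DONTNEED), not a Python function): after the last use of a buffer it can prove is a single pass, the compiler inserts a release so the VM system can reclaim those pages first.

```python
released = []

def release_pages(buf_name):
    # Stand-in for a compiler-emitted OS hint (e.g. madvise) telling the
    # VM system these pages won't be reused and can be reclaimed first.
    released.append(buf_name)

def sum_once(data):
    total = 0
    for x in data:            # compiler sees a single pass over `data`...
        total += x
    release_pages("data")     # ...so it inserts a release after the last use
    return total

print(sum_once(range(10)))    # → 45
```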
TDB - trusted database on untrusted storage (InterTrust Inc.)
A client-side, single-user, encrypted (DES/CBC), cryptographically sealed (SHA-1) DB for online and offline e-commerce transactions, pay-per-use digital content, etc. Must be tamper-evident/resistant and must resist replay attacks (e.g., an attacker saves a copy of the DB before making a purchase, then restores it afterward). Idea: store a secret key and a hash of the DB in trusted storage; use these to unlock and verify (resp.) the DB on untrusted storage. (Optimization: a tree of hashes can be stored in untrusted storage, so an individual item can be verified in lg N time.)
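The "tree of hashes" optimization is what's now called a Merkle tree: the trusted root hash authenticates everything, and verifying one item needs only the lg N sibling hashes along its path to the root. A minimal sketch (assuming the item count is a power of two; TDB's actual layout differs):

```python
import hashlib

def h(b):
    return hashlib.sha1(b).digest()  # TDB seals with SHA-1

def build_tree(leaves):
    """Levels of the hash tree, leaves first, root last (len power of 2)."""
    level = [h(x) for x in leaves]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def proof(levels, i):
    """Sibling hashes from leaf i up to the root: lg N of them."""
    path = []
    for level in levels[:-1]:
        sib = i ^ 1
        path.append((sib % 2, level[sib]))  # 1 = sibling on the right
        i //= 2
    return path

def verify(root, item, path):
    cur = h(item)
    for side, sib in path:
        cur = h(cur + sib) if side else h(sib + cur)
    return cur == root

leaves = [b"item0", b"item1", b"item2", b"item3"]
levels = build_tree(leaves)
root = levels[-1][0]                  # this is what lives in trusted storage
assert verify(root, b"item2", proof(levels, 2))
assert not verify(root, b"tampered", proof(levels, 2))
```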
End-to-end authorization (Jon Howell, David Kotz, Dartmouth)
Problem: how do you do fine-grained delegation/resource sharing across administrative domains? Extend BAN logic with a new operator: B=>A means "B speaks for A". Then if B says X, we can assume A said X; and the relation is transitive. Implementation for viewing web pages via a proxy: links can contain proofs, including delegations, so I delegate to a principal by constructing the right kind of link. Cryptographic sealing uses SPKI (simple PKI). Also an RMI-over-ssh implementation. Not a very well organized talk, but it sounds like interesting and useful work.
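A toy sketch of the transitive "speaks for" check (a hypothetical bare-relation representation; the real system carries cryptographic proofs in SPKI certificates rather than a set of pairs):

```python
def speaks_for(delegations, b, a):
    """True if b speaks for a via some chain b => ... => a."""
    frontier, seen = {b}, set()
    while frontier:
        p = frontier.pop()
        if p == a:
            return True
        seen.add(p)
        # follow every delegation p => q we haven't explored yet
        frontier |= {q for (s, q) in delegations if s == p} - seen
    return False

# C => B and B => A, so by transitivity C speaks for A (but not vice versa).
delegations = {("C", "B"), ("B", "A")}
assert speaks_for(delegations, "C", "A")
assert not speaks_for(delegations, "A", "C")
```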
SR disk arrays (Randy Wang)
Rotational replication: make symmetric copies of disk sectors; this cuts rotational delay in half, at the expense of space efficiency. Since replication makes average seek time worse, stagger sector layouts across separate disks.
In a 3x2 SR array, each disk holds 1/6 of the original data, with 1/3 the original seek distance and 1/2 the rotational delay. In general, in a Ds x Dr array, Ds controls seek reduction (seek reduced to 1/Ds), Dr controls rotational reduction (to 1/Dr), and space efficiency is reduced by a factor of Ds*Dr. If D = total number of disks, the non-overhead part of latency is reduced by about sqrt(D).
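Plugging the 3x2 example back into the general Ds x Dr formulas, as a quick check of the numbers above:

```python
def sr_factors(ds, dr):
    # Ds-way seek staggering and Dr-way rotational replication:
    # seek drops to 1/Ds, rotational delay to 1/Dr, and each disk
    # holds only 1/(Ds*Dr) of the original data (the capacity cost).
    return 1 / ds, 1 / dr, 1 / (ds * dr)

seek, rot, data = sr_factors(3, 2)     # the 3x2 array from the talk
assert (seek, rot, data) == (1 / 3, 1 / 2, 1 / 6)
```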
Keynote - Danny Lewin, Akamai
Most "access traffic" is shared among many, many small ISPs. UUNET is the largest, but carries only about 6% (no more, despite all their acquisitions), followed by @Home (each of whose users creates more traffic, since they're exclusively broadband). To reach 95% of access traffic, you need to see ~7000 networks, and it's trending toward spreading even thinner. These are smaller "edge" networks.
Peering (in the sense of BGP between a pair of networks) is critical, but there's no incentive to peer (a necessary evil, and expensive), and BGP makes it easy to screw the person you're advertising routes to. Also, it's hard to maintain route-table consistency with more than 5-10 peering points between two networks.
Peak speedup for Akamaized sites ranges from 3x to 136x, with a median of 13x. (Yahoo's Internet access peaks around 1 PM ET.)
Panel - stock options vs. research grants? Some key points that were made: