Technical sessions | Benchmarking panel | Breakout | Trivia
March 29-30, Rio Rico, AZ
I didn't take notes on the Outrageous Opinions session because nearly everyone there
(including me) was too drunk to type.
Technical sessions
- Elephant: versioning FS that separates retention from file ops.
Versioning policy (which files, how often, etc) can be set manually (down to the per-file
level) or using heuristics based on file traces (but hard to tell which versions are
"landmark" versions, and disks aren't so cheap that it can be pure
copy-on-write). Cleaner: moves old versions to tertiary storage, compression, or
disposal (currently only option). Implemented as a VFS; add'l level of indirection
("inode log", which replaces single inode) is used for versioning.
- If copy-on-write could be implemented for deltas only, is it that much worse to keep
everything?
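A minimal sketch of the inode-log indirection described above: each name maps to a time-ordered list of (epoch, inode) entries rather than a single inode, so old versions stay reachable. The structure and names here are my illustration, not Elephant's actual code.

```python
import bisect

class InodeLog:
    """Toy sketch of Elephant's "inode log": a file name resolves through
    a list of (timestamp, inode) versions instead of one inode."""

    def __init__(self):
        self.versions = []  # list of (timestamp, inode_id), ascending by time

    def append_version(self, timestamp, inode_id):
        # Copy-on-write produced a new inode; log it so history is retained.
        self.versions.append((timestamp, inode_id))

    def lookup(self, at_time=None):
        """Return the inode current at `at_time` (default: latest version)."""
        if not self.versions:
            return None
        if at_time is None:
            return self.versions[-1][1]
        times = [t for t, _ in self.versions]
        i = bisect.bisect_right(times, at_time) - 1
        return self.versions[i][1] if i >= 0 else None
```

The cleaner would walk these logs and move or discard entries according to the retention policy.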
- Caching documents w/active properties: Active properties are code that
attach to documents (consequences); degenerate property is "bit
provider". Documents become placeless, being property-based. Placeless
middleware triggers properties on events, one of which may be document access via the
middleware. Very close to a service-centric view of the Internet! So
there are no distinct document repositories, just the unified middleware. Verifiers
(callback code, generated by properties themselves) determine whether consistency actions
are needed to respond to document changes that occur out-of-band with respect to PM, by
cross checking document state against external information after the change has taken
place. Sounds like there is a lot of hacking here to lasso together a wild
variety of repository type things.
- How do you specify which properties should trigger? (Suppose you can provide
document in multiple formats) ...do you have a naming issue? (Or: would fixing
naming also solve the content/consistency problem?)
- Web caches don't "throw their hands up" on personalized documents...they use
the "Vary" header
- Why is the caching not in the middleware/infrastructure? (Presumably the
middleware runs in the infrastructure...)
- Compared to DOLF (assuming some combination of Web caches understanding make
can handle consistency for DOLF), this says consistency shouldn't be completely
orthogonal but should be assisted by application-specific knowledge (verifiers).
It's also property-driven rather than data-driven.
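Rough sketch of the property/verifier relationship as I understood it: a property generates a verifier closure that snapshots external state, and the verifier later reports whether an out-of-band change means a consistency action is needed. The hash-free content comparison and all names are my invention, not the Placeless design.

```python
class Property:
    """Toy "active property": code attached to a document, which can
    generate a verifier to detect out-of-band changes."""

    def __init__(self, name, on_access):
        self.name = name
        self.on_access = on_access  # middleware runs this on document access

    def make_verifier(self, document):
        # Snapshot current state; later, the verifier cross-checks it
        # against the live document and says whether action is needed.
        snapshot = document["content"]
        def verifier():
            return document["content"] != snapshot  # True -> act
        return verifier
```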
- Context-based modeling for smarter file access: Use semantic knowledge
of the context in which files are used (e.g. make is typically followed by cc,
ld) in the same way that filesystems exploit intrinsic structure of abstractions
(linear prefetch buffers, track buffers, etc.) A model tracks accesses and
provides probabilistic prediction of next file to be opened. Surprise!
Simplest model ("if B followed A, it always will") does very well, with 72%
correctness rate. Surprise! Results were consistent across 4 machines with
quite different access characteristics (server, desktop WS, etc.) Space requirements
for modeling are severe (O(n) for simple last-successor, O(n^2) for best-performing
graph-based methods where access patterns are represented using tries).
Solution: partition trie based on 1st level (below root) nodes and limit number of nodes
in a partition. Each partition is then a predictive-node that gets added
directly to a file (few hundred bytes, can add onto the inode). The additive
accuracy of this scheme beats last-successor and approaches multiple-context
(graph-based). Conclusion: linear-space correlation is enough to do a good job.
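The last-successor model is simple enough to sketch in a few lines; this is my toy version of the idea ("if B followed A last time, predict B after A again"), not the paper's code.

```python
class LastSuccessor:
    """Toy last-successor predictor: remember, for each file, the file
    that was opened immediately after it, and predict that again."""

    def __init__(self):
        self.successor = {}  # file -> file last observed to follow it
        self.prev = None     # most recently opened file

    def record_open(self, name):
        if self.prev is not None:
            self.successor[self.prev] = name
        self.prev = name

    def predict(self):
        """Predicted next open after the most recent one (None if unseen)."""
        return self.successor.get(self.prev)
```

Per-file state is just one successor name, which is what makes the "few hundred bytes per file" partitioned-trie packing plausible.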
- ISTORE philosophy: scalability, maintainability, reliability, etc. are
more important than peak performance or cost. Introspective systems: monitor
themselves and work around detected problems. ISTORE hopes to provide framework for
building these. ProxiWare is introspective, to a certain degree, anyway; but
it's a specific example. Need hardware support for device monitoring,
shared-nothing for redundancy & fault containment, S/W support in the form of
monitoring intelligence and rule-based policy engine (plus policy compilers, etc.) that
runs on an embedded CPU accompanying a device (combination is a "brick").
- Does performance tolerance mean graceful degradation?
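The "monitoring intelligence plus rule-based policy engine" combination could look something like this toy loop; rule format, metric names, and actions are all invented for illustration, not ISTORE's actual design.

```python
class PolicyEngine:
    """Toy introspective "brick" policy engine: evaluate rules over
    device monitoring samples and queue repair actions."""

    def __init__(self):
        self.rules = []         # list of (predicate, action_name)
        self.actions_fired = []

    def add_rule(self, predicate, action_name):
        self.rules.append((predicate, action_name))

    def on_metrics(self, metrics):
        # Run every rule against the latest sample from the device.
        for predicate, action in self.rules:
            if predicate(metrics):
                self.actions_fired.append(action)
        return self.actions_fired
```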
- Two papers on congestion control. Balakrishnan, Seshan, Rahul
propose unified congestion mgt "stack" at each host using callbacks and late
sending decisions on the application's part (rather than aggressive buffering and letting
the stack pace the sending). Stefan Savage et al.: TCP congestion control designed
for long flows over thin pipes, not short but intense flows over fat ones: today, hosts
start out knowing nothing, and by the time the short Web flow is over, it's too late to do
anything with the info. (Ex.: SYNs getting lost at conn establishment (when RTT is
conservatively set to seconds), background "random drop" rate fools TCP backoff
and leads to BW underutilization, etc.) Goal: at edges of network, make lots of
little flows to same place behave like one big flow, by watching the aggregate behavior of
the little flows. Various ways to exploit the info (SPAND, I-TCP thru proxy, ECN,
etc.) How often do flows share a network bottleneck? UW/Harvard traces suggest
60-80% of the time, someone else is hitting the same sites you are. If you exclude
multiple conns from same host (which could be solved with unified congestion
management or smarter HTTP), 20-50%.
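The "make lots of little flows behave like one big flow" idea can be sketched as shared per-destination congestion state that new flows inherit instead of starting cold. The AIMD arithmetic here is a toy, not real TCP, and the interface is my invention.

```python
class AggregateController:
    """Toy edge-side congestion sharing: all flows to one destination
    share a single learned congestion window."""

    def __init__(self):
        self.cwnd = {}  # destination -> shared window (segments)

    def window_for(self, dest):
        # A brand-new flow inherits the aggregate's window, not 1 segment.
        return self.cwnd.setdefault(dest, 1)

    def on_ack(self, dest):
        self.cwnd[dest] = self.window_for(dest) + 1           # additive increase

    def on_loss(self, dest):
        self.cwnd[dest] = max(1, self.window_for(dest) // 2)  # mult. decrease
```

This is exactly what a short Web flow lacks today: by the time it has learned anything, it's done.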
- Paper on adaptation proxies for legacy apps: don't address mobility,
security, etc. Propose per-packet filtering (a la Zenel), environment monitoring
with callbacks (as in Rutgers Env API, TranSend, DARPA GloMo adaptation arch).
- "Don't want the app, server, or network to know that data has been
altered"...is this really true?
- Nothing new here, hence uncompelling...the problems presented here have been solved
before at lower cost, and author didn't differentiate from any previous work.
- Hikers Buddy and low power software: Use specific application semantics
to optimize power management; this was in context of busy vs. idle times for GPS-equipped
PalmPilot. Conclusions and OS wishlist: would like to decouple power states of
individual components (e.g. should be able to communicate w/GPS over serial w/o lighting
screen); warning for impending power-related events (e.g. about to sleep); "on
wakeup" handlers.
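The OS wishlist above (decoupled per-component power states, pre-sleep warnings, on-wakeup handlers) might look like this; component names, states, and the API are all made up for illustration.

```python
class PowerManager:
    """Toy power manager matching the talk's wishlist: independent
    per-component states plus pre-sleep and on-wakeup callbacks."""

    def __init__(self, components):
        self.state = {c: "on" for c in components}
        self.pre_sleep_handlers = []
        self.wakeup_handlers = []
        self.log = []

    def set_state(self, component, state):
        # Components change state independently, e.g. serial on, screen off.
        self.state[component] = state

    def sleep(self):
        for h in self.pre_sleep_handlers:   # warn apps before sleeping
            h(self)
        self.log.append("sleep")

    def wakeup(self):
        self.log.append("wakeup")
        for h in self.wakeup_handlers:      # run registered wakeup handlers
            h(self)
```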
- Reliability hierarchies for storage systems. Make analogies
between performance hierarchies and reliability hierarchy. Higher levels have better
overhead/performance but potentially lower reliability. When should you "write
back" data to a slower but more reliable level? MTTDL (mean time to data loss)
and data loss rate (fraction of new data lost over time) seems analogous to harvest &
yield. They got Rio to work on PC's.
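Back-of-envelope version of the "when to write back" question; this is my arithmetic, not the paper's model. If failures at the fast level hit with mean time MTTDL, a failure lands in a write-back interval of length T with probability roughly T/MTTDL (for T much less than MTTDL), and on average half that interval's new data is unsaved, so the expected fraction of new data lost is about T / (2 * MTTDL).

```python
def max_writeback_interval(mttdl_seconds, loss_fraction_budget):
    """Toy model: expected fraction of new data lost ~= T / (2 * MTTDL),
    so keeping the loss fraction under budget f requires writing back
    at least every T = 2 * f * MTTDL seconds."""
    return 2 * loss_fraction_budget * mttdl_seconds

# Ex.: fast level's MTTDL is 30 days; tolerate losing 0.01% of new data.
interval = max_writeback_interval(30 * 86400, 1e-4)   # ~518 seconds
```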
- Command management for next generation user input. Today's
systems fundamentally don't handle (architecturally) some of the challenges presented by
new input modalities such as speech. People will expect to use these at a much
higher level of abstraction than mice or keyboards, and in more
context-sensitive/ambiguous ways. Also current systems expect one input at a time;
"multitasking" speech into it is hard. So speech recognition accuracy is
not the only problem, and may be the wrong target-- today's apps assume they're getting
perfect input. Solution: command management services in the OS. (Is
this command "safe"? idempotent? reversible? critical?)
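The safe/idempotent/reversible/critical attributes suggest per-command metadata that a command manager could consult before acting on low-confidence speech input. This sketch, including field names and the confirmation rule, is my illustration of the idea, not the talk's design.

```python
from dataclasses import dataclass

@dataclass
class Command:
    """Toy per-command metadata for a command management service."""
    name: str
    safe: bool        # harmless even if misrecognized?
    idempotent: bool  # repeating it has no extra effect?
    reversible: bool  # can it be undone?
    critical: bool    # e.g. "delete all files"

def should_confirm(cmd, confidence, threshold=0.9):
    # Low-confidence recognition of a risky command warrants confirmation;
    # safe or reversible commands can just be executed.
    risky = cmd.critical or not (cmd.safe or cmd.reversible)
    return risky and confidence < threshold
```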
Panel: Benchmarking (the good, the bad, the ugly)
- (Mike Jones) We tried to build a soft real-time NT scheduler. Generally it worked
great on modern hardware; worst scheduling glitches were due to I/O device braindead
design or "malicious" (greedy) behavior! Often, it's because a device was
being a Bad Citizen with a purely benchmarketing-driven design. There are no
benchmarks to penalize devices that unbalance a system. Sociological as well as
technical problem here.
- (Jeff Mogul) "Brittle metrics": firm (quantitative), hold their shape
(repeatable), break when applied to any hard problem. Statistical significance &
repeatability are necessary but not sufficient. Important areas disenfranchised
by brittle metrics: sizing a new system, capacity planning for increasing loads,
understanding performance failures for commercially important systems. We should
favor predictive power over repeatability (note subtle difference) and get more
measurements of real systems under real loads.
- (Margo Seltzer) There's no such thing as a general purpose benchmark; really need
application-specific, workload-specific ones, so you know whether the benchmark predicts your
expected improvement. Characterize both system & application.
Approach: combine a system vector & application vector to characterize each
individually and the combination. New apps should come with app vectors so you can
benchmark them on different systems.
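One natural reading of the vector idea: the application vector counts primitive operations, the system vector prices each operation on a machine, and their dot product predicts the app's time on that system. The operation names and the linear-cost assumption are mine, not Seltzer's formulation.

```python
def predicted_time(app_vector, system_vector):
    """Toy vector benchmarking: dot product of an app's operation
    counts with a system's per-operation costs (seconds)."""
    return sum(count * system_vector[op] for op, count in app_vector.items())

# Hypothetical app and two hypothetical systems:
app   = {"read": 1000, "write": 200, "fork": 5}
sys_a = {"read": 2e-6, "write": 5e-6, "fork": 1e-3}
sys_b = {"read": 1e-6, "write": 8e-6, "fork": 2e-3}
```

Shipping `app` with the application would let anyone estimate its behavior on a new system without rerunning it.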
Breakout
Worthwhile trivia
- Coda traced every syscall on 33 machines for 2.5 years, available on 38 CD-ROMs.
- We need a new conference forum to encourage "crossover" and other experimental
research in OS; how to assure high quality submissions, who should be the poster boy/girl
for it, etc.?