Technical sessions | Benchmarking panel | Breakout | Trivia
March 29-30, Rio Rico, AZ
I didn't take notes on the Outrageous Opinions session because nearly everyone there
(including me) was too drunk to type.
Technical sessions
- Elephant: versioning FS that separates retention from file ops.
Versioning policy (which files, how often, etc) can be set manually (down to the per-file
level) or using heuristics based on file traces (but hard to tell which versions are
"landmark" versions, and disks aren't so cheap that it can be pure
copy-on-write). Cleaner: moves old versions to tertiary storage, compression, or
disposal (currently only option). Implemented as a VFS; add'l level of indirection
("inode log", which replaces single inode) is used for versioning.
- If copy-on-write could be implemented for deltas only, is it that much worse to keep
everything?
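A minimal sketch of the inode-log indirection described above: each name maps to a time-ordered list of (epoch, inode) entries rather than a single inode, so old versions stay reachable. The structure and names here are my illustration, not Elephant's actual code.

```python
import bisect

class InodeLog:
    """Toy sketch of Elephant's "inode log": a file name resolves through
    a list of (timestamp, inode) versions instead of one inode."""

    def __init__(self):
        self.versions = []  # list of (timestamp, inode_id), ascending by time

    def append_version(self, timestamp, inode_id):
        # Copy-on-write produced a new inode; log it so history is retained.
        self.versions.append((timestamp, inode_id))

    def lookup(self, at_time=None):
        """Return the inode current at `at_time` (default: latest version)."""
        if not self.versions:
            return None
        if at_time is None:
            return self.versions[-1][1]
        times = [t for t, _ in self.versions]
        i = bisect.bisect_right(times, at_time) - 1
        return self.versions[i][1] if i >= 0 else None
```

The cleaner would walk these logs and move or discard entries according to the retention policy.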
- Caching documents w/active properties: Active properties are code that
attach to documents (consequences); degenerate property is "bit
provider". Documents become placeless, being property-based. Placeless
middleware triggers properties on events, one of which may be document access via the
middleware. Very close to a service-centric view of the Internet! So
there are no distinct document repositories, just the unified middleware. Verifiers
(callback code, generated by properties themselves) determine whether consistency actions
are needed to respond to document changes that occur out-of-band with respect to PM, by
cross checking document state against external information after the change has taken
place. Sounds like there is a lot of hacking here to lasso together a wild
variety of repository type things.
- How do you specify which properties should trigger? (Suppose you can provide
document in multiple formats) ...do you have a naming issue? (Or: would fixing
naming also solve the content/consistency problem?)
- Web caches don't "throw their hands up" on personalized documents...they use
the "Vary" header
- Why is the caching not in the middleware/infrastructure? (Presumably the
middleware runs in the infrastructure...)
- Compared to DOLF (assuming some combination of Web caches understanding make
can handle consistency for DOLF), this says consistency shouldn't be completely
orthogonal but should be assisted by application-specific knowledge (verifiers).
It's also property-driven rather than data-driven.
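Rough sketch of the property/verifier relationship as I understood it: a property generates a verifier closure that snapshots external state, and the verifier later reports whether an out-of-band change means a consistency action is needed. The hash-free content comparison and all names are my invention, not the Placeless design.

```python
class Property:
    """Toy "active property": code attached to a document, which can
    generate a verifier to detect out-of-band changes."""

    def __init__(self, name, on_access):
        self.name = name
        self.on_access = on_access  # middleware runs this on document access

    def make_verifier(self, document):
        # Snapshot current state; later, the verifier cross-checks it
        # against the live document and says whether action is needed.
        snapshot = document["content"]
        def verifier():
            return document["content"] != snapshot  # True -> act
        return verifier
```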
- Context-based modeling for smarter file access: Use semantic knowledge
of the context in which files are used (e.g. make is typically followed by cc,
ld) in the same way that filesystems exploit intrinsic structure of abstractions
(linear prefetch buffers, track buffers, etc.) A model tracks accesses and
provides probabilistic prediction of next file to be opened. Surprise!
Simplest model ("if B followed A, it always will") does very well, with 72%
correctness rate. Surprise! Results were consistent across 4 machines with
quite different access characteristics (server, desktop WS, etc.) Space requirements
for modeling are severe (O(n) for simple last-successor, O(n^2) for best-performing
graph-based methods where access patterns are represented using tries).
Solution: partition trie based on 1st level (below root) nodes and limit number of nodes
in a partition. Each partition is then a predictive-node that gets added
directly to a file (few hundred bytes, can add onto the inode). The additive
accuracy of this scheme beats last-successor and approaches multiple-context
(graph-based). Conclusion: linear-space correlation is enough to do a good job.
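The last-successor model is simple enough to sketch in a few lines; this is my toy version of the idea ("if B followed A last time, predict B after A again"), not the paper's code.

```python
class LastSuccessor:
    """Toy last-successor predictor: remember, for each file, the file
    that was opened immediately after it, and predict that again."""

    def __init__(self):
        self.successor = {}  # file -> file last observed to follow it
        self.prev = None     # most recently opened file

    def record_open(self, name):
        if self.prev is not None:
            self.successor[self.prev] = name
        self.prev = name

    def predict(self):
        """Predicted next open after the most recent one (None if unseen)."""
        return self.successor.get(self.prev)
```

Per-file state is just one successor name, which is what makes the "few hundred bytes per file" partitioned-trie packing plausible.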
- ISTORE philosophy: scalability, maintainability, reliability, etc. are
more important than peak performance or cost. Introspective systems: monitor
themselves and work around detected problems. ISTORE hopes to provide framework for
building these. ProxiWare is introspective, to a certain degree, anyway; but
it's a specific example. Need hardware support for device monitoring,
shared-nothing for redundancy & fault containment, S/W support in the form of
monitoring intelligence and rule-based policy engine (plus policy compilers, etc.) that
runs on an embedded CPU accompanying a device (combination is a "brick").
- Does performance tolerance mean graceful degradation?
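The "monitoring intelligence plus rule-based policy engine" combination could look something like this toy loop; rule format, metric names, and actions are all invented for illustration, not ISTORE's actual design.

```python
class PolicyEngine:
    """Toy introspective "brick" policy engine: evaluate rules over
    device monitoring samples and queue repair actions."""

    def __init__(self):
        self.rules = []         # list of (predicate, action_name)
        self.actions_fired = []

    def add_rule(self, predicate, action_name):
        self.rules.append((predicate, action_name))

    def on_metrics(self, metrics):
        # Run every rule against the latest sample from the device.
        for predicate, action in self.rules:
            if predicate(metrics):
                self.actions_fired.append(action)
        return self.actions_fired
```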
- Two papers on congestion control. Balakrishnan, Seshan, Rahul
propose unified congestion mgt "stack" at each host using callbacks and late
sending decisions on the application's part (rather than aggressive buffering and letting
the stack pace the sending). Stefan Savage et al.: TCP congestion control designed
for long flows over thin pipes, not short but intense flows over fat ones: today, hosts
start out knowing nothing, and by the time the short Web flow is over, it's too late to do
anything with the info. (Ex.: SYNs getting lost at conn establishment (when RTT is
conservatively set to seconds), background "random drop" rate fools TCP backoff
and leads to BW underutilization, etc.) Goal: at edges of network, make lots of
little flows to same place behave like one big flow, by watching the aggregate behavior of
the little flows. Various ways to exploit the info (SPAND, I-TCP thru proxy, ECN,
etc.) How often do flows share a network bottleneck? UW/Harvard traces suggest
60-80% of the time, someone else is hitting the same sites you are. If you exclude
multiple conns from same host (which could be solved with unified congestion
management or smarter HTTP), 20-50%.
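The "make lots of little flows behave like one big flow" idea can be sketched as shared per-destination congestion state that new flows inherit instead of starting cold. The AIMD arithmetic here is a toy, not real TCP, and the interface is my invention.

```python
class AggregateController:
    """Toy edge-side congestion sharing: all flows to one destination
    share a single learned congestion window."""

    def __init__(self):
        self.cwnd = {}  # destination -> shared window (segments)

    def window_for(self, dest):
        # A brand-new flow inherits the aggregate's window, not 1 segment.
        return self.cwnd.setdefault(dest, 1)

    def on_ack(self, dest):
        self.cwnd[dest] = self.window_for(dest) + 1           # additive increase

    def on_loss(self, dest):
        self.cwnd[dest] = max(1, self.window_for(dest) // 2)  # mult. decrease
```

This is exactly what a short Web flow lacks today: by the time it has learned anything, it's done.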
- Paper on adaptation proxies for legacy apps: don't address mobility,
security, etc. Propose per-packet filtering (a la Zenel), environment monitoring
with callbacks (as in Rutgers Env API, TranSend, DARPA GloMo adaptation arch).
- "Don't want the app, server, or network to know that data has been
altered"...is this really true?
- Nothing new here, hence uncompelling...the problems presented here have been solved
before at lower cost, and author didn't differentiate from any previous work.
- Hikers Buddy and low power software: Use specific application semantics
to optimize power management; this was in context of busy vs. idle times for GPS-equipped
PalmPilot. Conclusions and OS wishlist: would like to decouple power states of
individual components (e.g. should be able to communicate w/GPS over serial w/o lighting
screen); warning for impending power-related events (e.g. about to sleep); "on
wakeup" handlers.
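The OS wishlist above (decoupled per-component power states, pre-sleep warnings, on-wakeup handlers) might look like this; component names, states, and the API are all made up for illustration.

```python
class PowerManager:
    """Toy power manager matching the talk's wishlist: independent
    per-component states plus pre-sleep and on-wakeup callbacks."""

    def __init__(self, components):
        self.state = {c: "on" for c in components}
        self.pre_sleep_handlers = []
        self.wakeup_handlers = []
        self.log = []

    def set_state(self, component, state):
        # Components change state independently, e.g. serial on, screen off.
        self.state[component] = state

    def sleep(self):
        for h in self.pre_sleep_handlers:   # warn apps before sleeping
            h(self)
        self.log.append("sleep")

    def wakeup(self):
        self.log.append("wakeup")
        for h in self.wakeup_handlers:      # run registered wakeup handlers
            h(self)
```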
- Reliability hierarchies for storage systems. Make analogies
between performance hierarchies and reliability hierarchy. Higher levels have better
overhead/performance but potentially lower reliability. When should you "write
back" data to a slower but more reliable level? MTTDL (mean time to data loss)
and data loss rate (fraction of new data lost over time) seems analogous to harvest &
yield. They got Rio to work on PC's.
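Back-of-envelope version of the "when to write back" question; this is my arithmetic, not the paper's model. If failures at the fast level hit with mean time MTTDL, a failure lands in a write-back interval of length T with probability roughly T/MTTDL (for T much less than MTTDL), and on average half that interval's new data is unsaved, so the expected fraction of new data lost is about T / (2 * MTTDL).

```python
def max_writeback_interval(mttdl_seconds, loss_fraction_budget):
    """Toy model: expected fraction of new data lost ~= T / (2 * MTTDL),
    so keeping the loss fraction under budget f requires writing back
    at least every T = 2 * f * MTTDL seconds."""
    return 2 * loss_fraction_budget * mttdl_seconds

# Ex.: fast level's MTTDL is 30 days; tolerate losing 0.01% of new data.
interval = max_writeback_interval(30 * 86400, 1e-4)   # ~518 seconds
```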
- Command management for next generation user input. Today's
systems fundamentally don't handle (architecturally) some of the challenges presented by
new input modalities such as speech. People will expect to use these at a much
higher level of abstraction than mice or keyboards, and in more
context-sensitive/ambiguous ways. Also current systems expect one input at a time;
"multitasking" speech into it is hard. So speech recognition accuracy is
not the only problem, and may be the wrong target-- today's apps assume they're getting
perfect input. Solution: command management services in the OS. (Is
this command "safe"? idempotent? reversible? critical?)
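The safe/idempotent/reversible/critical attributes suggest per-command metadata that a command manager could consult before acting on low-confidence speech input. This sketch, including field names and the confirmation rule, is my illustration of the idea, not the talk's design.

```python
from dataclasses import dataclass

@dataclass
class Command:
    """Toy per-command metadata for a command management service."""
    name: str
    safe: bool        # harmless even if misrecognized?
    idempotent: bool  # repeating it has no extra effect?
    reversible: bool  # can it be undone?
    critical: bool    # e.g. "delete all files"

def should_confirm(cmd, confidence, threshold=0.9):
    # Low-confidence recognition of a risky command warrants confirmation;
    # safe or reversible commands can just be executed.
    risky = cmd.critical or not (cmd.safe or cmd.reversible)
    return risky and confidence < threshold
```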
Panel: Benchmarking (the good, the bad, the ugly)
- (Mike Jones) We tried to build a soft real-time NT scheduler. Generally it worked
great on modern hardware; worst scheduling glitches were due to I/O device braindead
design or "malicious" (greedy) behavior! Often, it's because a device was
being a Bad Citizen with a purely benchmarketing-driven design. There are no
benchmarks to penalize devices that unbalance a system. Sociological as well as
technical problem here.
- (Jeff Mogul) "Brittle metrics": firm (quantitative), hold their shape
(repeatable), break when applied to any hard problem. Statistical significance &
repeatability are necessary but not sufficient. Important areas disenfranchised
by brittle metrics: sizing a new system, capacity planning for increasing loads,
understanding performance failures for commercially important systems. We should
favor predictive power over repeatability (note subtle difference) and get more
measurements of real systems under real loads.
- (Margo Seltzer) There's no such thing as a general purpose benchmark; really need
application-specific, workload-specific ones, so you know whether the benchmark predicts your
expected improvement. Characterize both system & application.
Approach: combine a system vector & application vector to characterize each
individually and the combination. New apps should come with app vectors so you can
benchmark them on different systems.
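One natural reading of the vector idea: the application vector counts primitive operations, the system vector prices each operation on a machine, and their dot product predicts the app's time on that system. The operation names and the linear-cost assumption are mine, not Seltzer's formulation.

```python
def predicted_time(app_vector, system_vector):
    """Toy vector benchmarking: dot product of an app's operation
    counts with a system's per-operation costs (seconds)."""
    return sum(count * system_vector[op] for op, count in app_vector.items())

# Hypothetical app and two hypothetical systems:
app   = {"read": 1000, "write": 200, "fork": 5}
sys_a = {"read": 2e-6, "write": 5e-6, "fork": 1e-3}
sys_b = {"read": 1e-6, "write": 8e-6, "fork": 2e-3}
```

Shipping `app` with the application would let anyone estimate its behavior on a new system without rerunning it.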
Breakout
Worthwhile trivia
- Coda traced every syscall on 33 machines for 2.5 years, available on 38 CD-ROMs.
- We need a new conference forum to encourage "crossover" and other experimental
research in OS; how to assure high quality submissions, who should be the poster boy/girl
for it, etc.?