Caching
Latency-based caching
- Server download latency alone does worse than LRU, LFU, and Size!
- Hybrid: ((ConnTime + WB/Bandwidth)*(Nrefs^WN))/Size
- Estimate ConnTime, Bandwidth using TCP-like smoothing;
- Modified Harvest to use their alg (actually uses buckets, not
servers).
- Note: per-server, not per-URL. More robust estimates even when
per-URL data is stale.
- Hybrid formula more robust to variety of workloads
(trace-playback as well as real users); better on average than
minimizing any single metric, in terms of e2e latency
- Minimizing cache misses (server hits) is sensitive to WB; minimizing
Bandwidth is sensitive to WN; but minimizing e2eTime is
insensitive to both!
- "With 120K refs from aol.com, results inconclusive". (AOL gave
them traces!) Hypothesis: less locality (or different locality)
compared to BU and VT traces.
- Problems:
- Self-selection: some people tend not to visit docs whose URLs
indicate they're far away. Hypothesis: we may be underestimating
the improvement.
- Variance in download times etc. is high in practice.
- How about caches that adapt their algs according to
traffic patterns? (future)
Points I brought up:
- Modified Harvest, cool! Customizable eviction?
- Sharing traces and playback engines
- How did you get AOL traces?
- How big cache, and how does perf of each alg depend on size?
They used a cache that was 10% of "infinite size". Relative
performance is invariant to cache size down to about 1% of
"infinite size", at which point LFU gets much better.
Action items: We should share Harvest mods (they are very
interested in partitioned Harvest) and traces (they don't distribute on
CDROM, but have them online and queriable by Java applet, which they'd
be happy to give us)
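To make the Hybrid formula above concrete, here is a minimal sketch of how I read it. Assumptions on my part, not from the talk: WB and WN are tunable weighting constants, the document with the lowest value is the one evicted, and the smoothing gain of 0.125 is borrowed from TCP's RTT estimator. All class and function names are hypothetical, and stats are kept per server as in the notes.

```python
from dataclasses import dataclass
from typing import Optional

ALPHA = 0.125  # smoothing gain; an assumption borrowed from TCP's RTT estimator


def smooth(old, sample, alpha=ALPHA):
    """TCP-like exponentially weighted moving average."""
    if old is None:  # the first sample seeds the estimate
        return sample
    return (1 - alpha) * old + alpha * sample


@dataclass
class ServerStats:
    """Per-server (not per-URL) estimates."""
    conn_time: Optional[float] = None  # smoothed connection time, seconds
    bandwidth: Optional[float] = None  # smoothed bandwidth, bytes/second

    def observe(self, conn_time, bandwidth):
        self.conn_time = smooth(self.conn_time, conn_time)
        self.bandwidth = smooth(self.bandwidth, bandwidth)


def hybrid_value(stats, nrefs, size, wb, wn):
    """((ConnTime + WB/Bandwidth) * Nrefs^WN) / Size, as written in the notes."""
    return ((stats.conn_time + wb / stats.bandwidth) * nrefs ** wn) / size


def pick_victim(docs, servers, wb, wn):
    """Return the URL with the lowest hybrid value (assumed eviction rule)."""
    def value(url):
        d = docs[url]
        return hybrid_value(servers[d["server"]], d["nrefs"], d["size"], wb, wn)
    return min(docs, key=value)


# Two equally sized, equally popular documents: one from a nearby fast
# server, one from a distant slow server.
servers = {"near": ServerStats(), "far": ServerStats()}
servers["near"].observe(conn_time=0.01, bandwidth=1_000_000)
servers["far"].observe(conn_time=0.5, bandwidth=10_000)
docs = {
    "a": {"server": "near", "nrefs": 1, "size": 1000},
    "b": {"server": "far", "nrefs": 1, "size": 1000},
}
victim = pick_victim(docs, servers, wb=8000, wn=1)
```

With these made-up numbers the document from the nearby server is the one chosen for eviction, since it is the cheapest to fetch again.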
Finding salient features by looking for word clusters
- Goal: extract "word clusters" from documents, then use them to
perform the query "other documents like this one" (Excite does
something like this)
- get "word clusters" based on word counts, syntactic analysis,
etc. -- no semantics or "prior knowledge"
- future: rank-ordering rather than raw word counts; word stemming;
combinations of terms (logical connectives, etc); hierarchical
cluster refinement
- Problems:
- no word-sense disambiguation (i.e., by context), since
purely statistical (solution: since cluster size small,
try to determine semantic relationships between words in a
cluster using lexical database; can also do the same on
orig. query and compare semantic similarity)
- subject to "spam words", outliers, etc (above mech also
gives a formal metric for "cohesiveness", which should throw
these out)
- Conclusion: categorization of documents less useful than word
clusters for doing "similarity" searches
- Flaw: a big leap from statistics/syntax to semantics. The natural
language folks have tried this time and again and most semantic efforts
have foundered on the amount of context really needed.
- Flaw: document sample size is 85. Yes, 85.
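The talk didn't spell out how the clusters are built, so this is only a rough sketch of the "other documents like this one" query using raw word counts. Cosine similarity is my choice for illustration, not necessarily what the authors use, and the function names are hypothetical. It has no stemming and no semantics, matching the "purely statistical" setup described above.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercased raw word counts; no stemming or stop-word handling."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def most_similar(query, docs):
    """Rank the other documents by similarity to docs[query], best first."""
    q = word_counts(docs[query])
    scores = {
        name: cosine(q, word_counts(text))
        for name, text in docs.items()
        if name != query
    }
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "a": "cache proxy cache",
    "b": "proxy cache eviction",
    "c": "tic tac toe",
}
ranking = most_similar("a", docs)
```

Because only counts are compared, a document stuffed with repeated "spam words" shifts the ranking, which is the weakness noted in the Problems list.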
NSTP - notification service transport protocol for groupware (Lotus)
- Toolkit for "synchronous groupware", using Java or C++. Looks
similar to what McCanne et al. are doing with MASH and object
libraries, but far more stupid.
- Server-multiple clients model; one TCP conn per client. Forget it.
- "What about consistency in multiuser apps": "It's an application
level problem" (they provide locks, etc.)
- "What about scalability" question got a fudgy
answer and handwaving (sounds like it's not designed for
wide-area anything)
- demo: playing tic-tac-toe and chatting using Java applets that
have the notification toolkit under them
- Sources available for noncommercial use at nstp.research.lotus.com
- This doesn't sound useful to me.
WebRule
- Web server plug-in that contains a rules database that allows
rules to be triggered by actions.
- Actions can be local (startup, shutdown, URL access request,
permissions violation, etc.) or remote (another WebRule server
sends you an action request, rule update, etc.)
- Actions trigger rules, which are basically little scripts with
various attributes attached (permissions, etc.)
- Rule example: "When such-and-such page changes [the action part],
go get it, plus the
following other pages, and then run them through this
table-merging program [the rule part]".
- Can build little groups of collaborating WebRule servers to
support such services. Examples they gave weren't terrifically
well motivated but I think it has potential.
- Flaws:
- Server plug-in written in Java and C. Clearly this
application has more leverage on the proxy (imagine
scalable proxy augmented with rule/action paradigm)
- Not scalable for the obvious reasons, and also not clear
what happens to scalability if rules cause lots of
cross-server interactions.
- As far as I can tell, individual users cannot modify or
upload rules -- only WebRule admins can.
- Wouldn't the scalable proxy be a great place to run a rule/action
system like this one?
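To pin down the action/rule idea for myself, a toy sketch of the dispatch: actions are named events, and rules are small scripts registered against them. Everything here (the class, method names, action names, URLs) is hypothetical and is not WebRule's actual interface. The "rule" only lists the pages it would fetch rather than fetching or merging anything.

```python
from collections import defaultdict


class RuleEngine:
    """Toy action -> rule dispatcher; not WebRule's real API."""

    def __init__(self):
        self._rules = defaultdict(list)  # action name -> list of rule callables

    def add_rule(self, action, rule):
        self._rules[action].append(rule)

    def fire(self, action, **details):
        """Run every rule registered for `action` and return their results."""
        return [rule(**details) for rule in self._rules.get(action, [])]


def pages_to_merge(url, related):
    # Stand-in for "go get it, plus the following other pages, and run them
    # through this table-merging program"; here we only list what to fetch.
    return [url, *related]


engine = RuleEngine()
engine.add_rule(
    "page_changed",
    lambda url: pages_to_merge(url, ["/b.html", "/c.html"]),
)
result = engine.fire("page_changed", url="/a.html")
unhandled = engine.fire("startup")  # no rules registered for this action
```

A remote action would just be another server calling `fire` over the network, which is where the cross-server scalability worry comes in.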
Pseudo-Serving
- Idea: let clients "bid" CPU/disk resources to get faster
service from servers
- Interesting idea, but the implementation and simulation results
were half-baked and thoroughly unconvincing, and the author didn't
handle questions particularly well.