Back to index
Explaining World Wide Web Traffic Self-Similarity
Mark E. Crovella and Azer Bestavros
One-line summary:
WWW traffic is self-similar, and user's usage patterns exhibits ON/OFF
behaviour with heavy-tailed distributions of ON and OFF periods; these
characteristics can be demonstrated to stem from people's access patterns
and the underlying distribution of file sizes on the web.
Overview/Main Points
- WWW traffic is self-similar, although more so during busy
periods than non-busy periods. (This statement is intrinsically
flawed, since self-similarity must apply across all time periods
or it doesn't apply at all.) This self-similarity was
verified by a number of well-known statistical methods.
- Data on user-access patterns was gathered by instrumenting the
Mosaic browser of Boston University CS department users - Mosaic
was still popular when the study was done. Note that traffic
from a set of users to the rest of the web was measured, not
traffic from all users on the web to a single web site.
- User traffic was shown to exhibit ON/OFF behaviour. In other
words, there were periods of high activity followed by periods of
no activity. Both the ON and OFF period lengths were
heavy-tailed (ie. there is a non-trivial probability of a very
large period length).
- The heavy-tails of ON periods were shown to be related to the
heavy-tailed distribution of file sizes on the web. All types of
files were seen to be heavy tailed, although multimedia files
(video/audio) were slightly more so. The results were verified
by consulting the logs of 32 web servers around North America;
the file size distribution observed in their client browser
traces matched those in the web server logs quite well. This
further suggests that sampling web traffic via client browser
traces provides a representative view of web traffic in general.
- There was a strong inverse relationship between file size and the
frequency that the file was requested - the files that were most
frequently requested were small files (256 - 512 bytes - this
must be .html files). This implies that client
browser caches tend to satisfy many small file requests,
increasing the weight of the tail of observed file requests over
the network. (It was stated that client browser caches satisfied
77% of web requests.)
- There was a knee in the curve for OFF period length
distributions. The two components observed (period lengths
between 1ms and 1 sec, and period lengths above 30 secs,
therefore with the knee between 1 sec and 3 secs) were attributed
to two scenarios in which OFF periods could be observed. The
first is due to the workstation rendering
previously retrieved data before requesting more - this is
called "Active OFF". The second is due to the user
inspecting data and not using the web at all - this is called
"Inactive OFF".
Relevance
- Client traces are publically available. We should use them
in our scalability study.
- Results suggest web traffic patterns are related to human's usage
patterns and file sizes on the web, rather than artifacts of
web protocol and document processing by machines, and therefore
these patterns are likely to remain.
- We must come to grips with these patterns when designing our
proxy. Fundamentally the news is good - we can expect bursty
accesses, and small documents to be access more frequently than
large documents. The bad news is that peak traffic is going to
be very very large.
Flaws
- Analysis of user traffic was done by instrumenting web browsers at
their university. If people were aware of the instrumentation,
it may affect their usage patterns. Furthermore, the populace of
a university is not a typical cross-section of world-wide web
users.
- Self-similarity analysis done for the busiest hours measured.
By definition, this is atypical traffic.
- Not nearly enough data was analyzed for the strength of the
conclusions they presented, IMHO.
Back to index