Explaining World Wide Web Traffic Self-Similarity

Mark E. Crovella and Azer Bestavros

One-line summary: WWW traffic is self-similar, and users' usage patterns exhibit ON/OFF behaviour with heavy-tailed distributions of ON and OFF periods; these characteristics can be shown to stem from people's access patterns and the underlying distribution of file sizes on the web.

Overview/Main Points

WWW traffic is self-similar, although more so during busy periods than non-busy periods. (This statement is intrinsically flawed, since self-similarity must apply across all time periods or it doesn't apply at all.) This self-similarity was verified by a number of well-known statistical methods.
Data on user-access patterns was gathered by instrumenting the Mosaic browser of Boston University CS department users - Mosaic was still popular when the study was done. Note that traffic from a set of users to the rest of the web was measured, not traffic from all users on the web to a single web site.
User traffic was shown to exhibit ON/OFF behaviour. In other words, there were periods of high activity followed by periods of no activity. Both the ON and OFF period lengths were heavy-tailed (i.e. there is a non-trivial probability of a very long period).
The heavy tails of ON periods were shown to be related to the heavy-tailed distribution of file sizes on the web. All types of files were seen to be heavy-tailed, although multimedia files (video/audio) were slightly more so. The results were verified by consulting the logs of 32 web servers around North America; the file size distribution observed in the client browser traces matched that in the web server logs quite well. This further suggests that sampling web traffic via client browser traces provides a representative view of web traffic in general.
There was a strong inverse relationship between file size and the frequency with which the file was requested - the files that were most frequently requested were small files (256 - 512 bytes - these must be .html files). This implies that client browser caches tend to satisfy many small file requests, increasing the weight of the tail of observed file requests over the network. (It was stated that client browser caches satisfied 77% of web requests.)
There was a knee in the curve for OFF period length distributions. The two components observed (period lengths between 1ms and 1 sec, and period lengths above 30 secs, therefore with the knee between 1 sec and 30 secs) were attributed to two scenarios in which OFF periods could be observed. The first is due to the workstation rendering previously retrieved data before requesting more - this is called "Active OFF". The second is due to the user inspecting data and not using the web at all - this is called "Inactive OFF".
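To make "heavy-tailed ON/OFF behaviour" concrete, here is a minimal Python sketch. It assumes a Pareto distribution as the heavy-tailed model, i.e. P[X > x] = (x_min / x)^alpha with 0 < alpha < 2. The shape parameters (1.2 for ON, 1.5 for OFF), the slot-based trace and all function names are illustrative assumptions, not values or code taken from the paper.

```python
import random

def pareto_sample(rng, alpha, x_min=1.0):
    """Draw one value with tail P[X > x] = (x_min / x) ** alpha for x >= x_min."""
    u = 1.0 - rng.random()          # uniform on (0, 1], avoids division by zero
    return x_min / (u ** (1.0 / alpha))

def on_off_trace(rng, n_slots, alpha_on, alpha_off):
    """Return n_slots values: 1 while the simulated user is ON, 0 while OFF."""
    trace = []
    state = 1                        # start in an ON period (arbitrary choice)
    while len(trace) < n_slots:
        alpha = alpha_on if state == 1 else alpha_off
        length = max(1, int(pareto_sample(rng, alpha)))
        length = min(length, n_slots - len(trace))  # cap rare, enormous draws
        trace.extend([state] * length)
        state = 1 - state            # alternate ON -> OFF -> ON ...
    return trace

def tail_fraction(samples, threshold):
    """Empirical estimate of P[X > threshold]."""
    return sum(1 for s in samples if s > threshold) / len(samples)

rng = random.Random(42)              # fixed seed so runs are repeatable
samples = [pareto_sample(rng, alpha=1.2) for _ in range(20000)]
trace = on_off_trace(rng, n_slots=10000, alpha_on=1.2, alpha_off=1.5)

print("P[X > 10]  ~", tail_fraction(samples, 10.0))    # theory: 10 ** -1.2
print("P[X > 100] ~", tail_fraction(samples, 100.0))   # theory: 100 ** -1.2
print("fraction of slots ON:", sum(trace) / len(trace))
```

The point of the tail check is that very long periods remain noticeably likely: under this assumed model, a period 100 times the minimum length still occurs a few times per thousand draws, whereas under an exponential model with a comparable scale it would essentially never occur.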

Relevance

Client traces are publicly available. We should use them in our scalability study.
Results suggest web traffic patterns are related to human usage patterns and file sizes on the web, rather than being artifacts of web protocols and document processing by machines, and therefore these patterns are likely to remain.
We must come to grips with these patterns when designing our proxy. Fundamentally the news is good - we can expect bursty accesses, and small documents to be accessed more frequently than large documents. The bad news is that peak traffic is going to be very large.

Flaws

Analysis of user traffic was done by instrumenting web browsers at their university. If people were aware of the instrumentation, it may have affected their usage patterns. Furthermore, the populace of a university is not a typical cross-section of world-wide web users.
Self-similarity analysis was done only for the busiest hours measured. By definition, this is atypical traffic.
Not nearly enough data was analyzed for the strength of the conclusions they presented, IMHO.
