We collected traces of several different Web servers to answer the following questions:
We selected four Web sites for trace collection: the globally popular www.ncsa.uiuc.edu (home of the Mosaic browser), our locally popular fas.harvard.edu (a campus-wide information server), hcs.harvard.edu (a computer club's server), and das-www.harvard.edu (the computer science department's server).
We modified the three servers on our campus to record when the object
being transferred was last modified. This information is not usually
logged, but is essential for performing an accurate simulation of
consistency mechanisms. We use this information in
chapter
to realistically simulate cache
consistency issues.
Table: Summary of the Web server access traces. The Requests Serviced
column gives the total number of requests that appeared in the
trace for that server, while the % cacheable column gives the
percentage of those requests that are valid and are for static objects (as
opposed to dynamic pages, for example). The Documents column gives
the number of cacheable documents on the server. Note that
the NCSA trace covers one day, while the other three cover one
month.
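As an illustration of how the three columns could be derived from a raw access log, the following is a minimal sketch, not the procedure actually used to build the table. It assumes the trace is in Common Log Format, treats a request as "valid" when it is a GET with status 200 or 304, and treats a URL as dynamic when it contains `?` or `/cgi-bin/`. These rules, and the names `is_cacheable` and `summarize`, are assumptions made for the example. The document count here covers only cacheable URLs that appear in the trace, so it is a lower bound on the number of cacheable documents on the server.

```python
import re

# Common Log Format: host ident user [time] "METHOD url PROTO" status bytes
CLF = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)')


def is_cacheable(method, url, status):
    """Assumed rule: a successful GET for a URL that does not look dynamic."""
    dynamic = "?" in url or "/cgi-bin/" in url
    return method == "GET" and status in (200, 304) and not dynamic


def summarize(lines):
    """Return request count, percentage cacheable, and distinct cacheable URLs."""
    total = 0
    cacheable = 0
    documents = set()
    for line in lines:
        m = CLF.match(line)
        if m is None:
            continue  # skip malformed log entries
        total += 1
        method, url, status = m.group(3), m.group(4), int(m.group(5))
        if is_cacheable(method, url, status):
            cacheable += 1
            documents.add(url)
    pct = 100.0 * cacheable / total if total else 0.0
    return {"requests": total, "pct_cacheable": pct, "documents": len(documents)}


# Hypothetical log lines, invented for illustration only.
sample = [
    'a.example.com - - [01/Jan/1995:00:00:01 -0500] "GET /index.html HTTP/1.0" 200 1024',
    'b.example.com - - [01/Jan/1995:00:00:02 -0500] "GET /cgi-bin/search?q=x HTTP/1.0" 200 512',
    'c.example.com - - [01/Jan/1995:00:00:03 -0500] "GET /index.html HTTP/1.0" 200 1024',
    'd.example.com - - [01/Jan/1995:00:00:04 -0500] "GET /missing.html HTTP/1.0" 404 -',
]
summary = summarize(sample)
```

On the four sample lines, two requests are cacheable and both are for the same document, so the sketch reports four requests, 50% cacheable, and one document.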
Of the four traces we used to drive our simulator, the NCSA trace is most representative of the globally popular servers that our algorithms are designed to help. The other three traces are more useful for exploring the effect of server-initiated caching on the less popular but more numerous small-scale servers.
Before we began analyzing these traces, we expected that some pages would be much more popular than others, if only because most Web sites have a "home page," or table of contents, that lists the contents of that server. There are usually several other files associated with this home page, and almost every visitor to a Web site therefore sees these files. These pages, at least, will be exceptionally popular relative to the rest of the Web site. We also expected that access patterns would not be consistent across servers: some popular servers, such as the Boston Restaurant Guide [9] or the New England Alpine Ski Report [32], are of specific geographic interest, while others, such as the White House home page [20], have uniform appeal. Finally, we expected geography to predict topology to some extent: there should be more network hops between a site in California and a site at Harvard than between a site at M.I.T. and the same site at Harvard.
Our trace analysis revealed the following facts:
In the rest of this section, we will explore each of these facts in greater detail.
First, we examined the distribution of Web accesses per object. This
distribution is shown in the popularity figure below. As we
had expected, access patterns are highly skewed. The graph indicates
that a small percentage of the files available on a given server
account for a disproportionate share of the requests to that
server. For example, the top 5% of the files on the NCSA server
accounted for 90% of the total requests to that server.
Figure: Popularity
analysis of requested files from four different Web sites. The skewed
distribution of requests indicates that caching a minority of the
files can satisfy a majority of the requests.
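The skew described above can be measured directly from a list of requested URLs. The sketch below is one way to compute the share of requests that go to the most popular fraction of files. The function name `top_share` and the sample data are hypothetical and are not taken from the traces.

```python
from collections import Counter
import math


def top_share(urls, top_fraction):
    """Fraction of all requests that go to the most popular `top_fraction` of files."""
    counts = Counter(urls)
    if not counts:
        return 0.0
    # Request counts per file, most popular first.
    ranked = sorted(counts.values(), reverse=True)
    # Number of files in the top fraction, always at least one file.
    k = max(1, math.ceil(top_fraction * len(ranked)))
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical trace: four files, one of which dominates the requests.
urls = ["/a"] * 90 + ["/b"] * 5 + ["/c"] * 3 + ["/d"] * 2
share = top_share(urls, 0.25)
```

In this made-up trace, the top 25% of files is the single file `/a`, which receives 90 of the 100 requests. Evaluating `top_share` over a range of fractions gives the points of a cumulative popularity curve like the one the figure describes.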
Bestavros confirms these results [2], adding that the more globally popular a server, the smaller the fraction of pages that account for most of its accesses. Our results agree with this observation: the two most popular servers, NCSA and FAS, are also the two on which the smallest percentage of files accounts for most of the requests. These results are encouraging because they suggest that caching a small subset of a server's files will reduce the server's load significantly.