We collected traces of several different Web servers to answer questions about their access patterns and cacheability.
We selected four Web sites for trace collection: the globally popular www.ncsa.uiuc.edu (home of the Mosaic browser), our locally popular fas.harvard.edu (a campus-wide information server), hcs.harvard.edu (a computer club's server), and das-www.harvard.edu (the computer science department's server).
We modified the three servers on our campus to record the last-modified time of each object transferred. This information is not usually logged, but it is essential for accurately simulating consistency mechanisms; we use it in a later chapter to simulate cache consistency realistically.
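As a sketch of why the last-modified time matters, the following shows a hypothetical augmented trace record (the record layout and field names are illustrative, not the actual modified-server format) and the staleness test a consistency simulator would apply to it:

```python
from dataclasses import dataclass

# Hypothetical augmented log record: the usual fields plus the
# object's last-modified time, which a stock server does not log.
@dataclass
class TraceRecord:
    timestamp: float      # time the request was serviced (seconds)
    url: str              # object requested
    last_modified: float  # object's modification time when served

def is_stale(cached: TraceRecord, current: TraceRecord) -> bool:
    """A cached copy is stale if the object was modified after the copy was fetched."""
    return current.last_modified > cached.last_modified

# Example: the object changed between the two requests, so a cache
# holding the first copy would serve stale data without this check.
first = TraceRecord(100.0, "/index.html", last_modified=50.0)
second = TraceRecord(200.0, "/index.html", last_modified=150.0)
print(is_stale(first, second))  # True
```

Without the last-modified field, a simulator cannot tell whether a repeated request could have been a cache hit or required a refetch.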
Table: Summary of the Web server access traces. The Requests Serviced column gives the total number of requests that appeared in the trace for that service, while the % cacheable column gives the percentage of those requests that are valid and for static objects (as opposed to, for example, dynamically generated pages). The Documents column gives the number of cacheable documents on the server. Note that the NCSA trace covers one day, while the other three cover one month.
Of the four traces we used to drive our simulator, the NCSA trace is most representative of the globally popular servers that our algorithms are designed to help. The other three traces are more useful for exploring the effect of server-initiated caching on the less popular but more numerous small-scale servers.
Before we began analyzing these traces, we had expected that some pages would be much more popular than others, if only because most Web sites have a ``home page,'' or table of contents, that lists the contents of that server. There are usually several other files associated with this home page, and almost every visitor to a Web site therefore sees these files. These pages, at least, should be exceptionally popular relative to the rest of the Web site. We also expected that access patterns would not be consistent across servers; some popular servers, such as the Boston Restaurant Guide or the New England Alpine Ski Report, have specific geographic interest, while others, such as the White House home page, have uniform appeal. Finally, we expected geography to predict topology to some extent: there should be more network hops between a site in California and a site at Harvard than between a site at M.I.T. and the same site at Harvard.
Our trace analysis revealed several facts, each of which we explore in greater detail in the rest of this section.
First, we examined the distribution of Web accesses per object. This distribution is shown in Figure . As we had expected, access patterns are highly skewed: a small percentage of the files available on a given server is responsible for a disproportionate share of the requests served. For example, the top 5% of the files on the NCSA server accounted for 90% of all requests to that server.
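This kind of skew measurement can be reproduced from any request trace by ranking objects by request count and summing the share captured by the most popular few percent. A minimal sketch, using a made-up request stream in place of a real server log:

```python
from collections import Counter

# Hypothetical access trace: one requested URL per serviced request.
requests = [
    "/index.html", "/index.html", "/logo.gif", "/index.html",
    "/logo.gif", "/papers/p1.ps", "/index.html", "/logo.gif",
]

counts = Counter(requests)
ranked = counts.most_common()  # objects ordered most- to least-requested

total = sum(counts.values())
# Top 5% of distinct files, rounded down, but at least one file.
top_n = max(1, len(ranked) * 5 // 100)
share = sum(c for _, c in ranked[:top_n]) / total
print(f"Top {top_n} file(s) account for {share:.0%} of requests")
```

On a real trace with thousands of documents, the same computation yields the figures quoted above (e.g., 5% of files capturing 90% of requests on NCSA).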
Figure: Popularity analysis of requested files from four different Web sites. The skewed distribution of requests indicates that caching a minority of the files can satisfy a majority of the requests.
Bestavros confirms these results, adding that the more globally popular a server is, the smaller the fraction of its pages that accounts for most of its accesses. Our results agree with this observation: the two most popular servers, NCSA and FAS, are also the two in which the smallest percentage of files is responsible for the most requests. These results are encouraging because they suggest that caching a small subset of a server's files will reduce the server's load significantly.
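The load-reduction argument can be made concrete: if the k most popular objects are cached, the fraction of requests they absorb is exactly the cache hit rate under a perfect popularity-based placement. A small illustrative sketch (the request stream is invented for the example):

```python
from collections import Counter

# Hypothetical request stream; in practice this comes from a server trace.
requests = ["/a"] * 90 + ["/b"] * 6 + ["/c"] * 3 + ["/d"] * 1

def hit_rate_if_caching_top(requests, k):
    """Fraction of requests satisfied when the k most popular objects are cached."""
    counts = Counter(requests)
    cached = {url for url, _ in counts.most_common(k)}
    return sum(1 for r in requests if r in cached) / len(requests)

print(hit_rate_if_caching_top(requests, 1))  # 0.9
```

With a skew like the one observed in our traces, caching even a single object out of four removes 90% of the load from the origin server, which is the effect server-initiated caching is designed to exploit.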