If server popularity was a continuous phenomenon and requests were always clustered in exactly the same way, then mirroring popular Web sites in a static manner would be optimal, and clever caching algorithms would be unnecessary. Unfortunately, this does not seem to be the case. Using a friends-of-friends algorithm, popular with astrophysics researchers for detecting galactic clusters, we performed a geographical cluster analysis on our traces to determine groups of hosts with common access patterns. The algorithm is simple: if a host is within a certain distance of any member of an existing cluster, then the host joins that cluster. If a host joins multiple clusters, then those clusters are merged together. If the host can not join any cluster, then the host becomes a member of a new cluster.
For the algorithm to work properly, must be set such that the clusters are neither too big nor too small. Through trial and error we found that a range of 50 miles works well for the Internet. Figure displays a cluster analysis of the NCSA and the local FAS traces.
Figure: Cluster analysis of the globally popular www.ncsa.uiuc.edu and our locally popular fas-www.harvard.edu. Each point represents an entire subnetwork plotted geographically; the numbers indicate the total number of requests received from each cluster. Clusters with fewer than 300 requests have been omitted. Notice that the ratio of requests between the East coast and the West coast for the FAS server is roughly 18, whereas the ratio for the NCSA server is roughly 1.6.
Two results are apparent. There are prominent clusters in both California and New England on the two graphs; however, the distribution of requests differs significantly. Secondly, the more popular server has many smaller cluster in addition to those in New England and California. These results indicate that no single server scheme would be able to cache pages for multiple, unrelated servers efficiently. Mirroring our campus information server on the opposite coast might be helpful, but such a simplistic answer would not help the NCSA server, which is popular in 22 separate clusters. On the contrary, a scheme that can autonomously replicate the NCSA server's most popular files in 22 different geographic regions while replicating our campus server in only 4 will be more efficient. Furthermore, dynamic replication schemes also help distribute server load more efficiently.
The graph makes it easy to see the intuitive advantage of distance-sensitive caching: the subnetworks represented by each point lie tantalizingly close to each other. Client caching requires a separate cache on each subnetwork, but distance-sensitive caching enables the entire cluster to autonomously share one local cache.