If server popularity was a continuous phenomenon and requests were
always clustered in exactly the same way, then mirroring popular Web
sites in a static manner would be optimal, and clever caching
algorithms would be unnecessary. Unfortunately, this does not seem to
be the case. Using a friends-of-friends algorithm, popular with
astrophysics researchers for detecting galactic clusters, we performed
a geographical cluster analysis on our traces to determine groups of
hosts with common access patterns. The algorithm is simple: if a host
is within a certain distance
of any member of an existing cluster,
then the host joins that cluster. If a host joins multiple clusters,
then those clusters are merged together. If the host can not join any
cluster, then the host becomes a member of a new cluster.
For the algorithm to work properly,
must be set such that the
clusters are neither too big nor too small. Through trial and
error we found that a range of 50 miles works well for the Internet. Figure
displays a cluster
analysis of the NCSA and the local FAS traces.
Figure: Cluster analysis of the globally popular www.ncsa.uiuc.edu and
our locally popular fas-www.harvard.edu. Each point represents
an entire subnetwork plotted geographically; the numbers indicate the
total number of requests received from each cluster. Clusters with
fewer than 300 requests have been omitted. Notice that the ratio of
requests between the East coast and the West coast for the FAS server
is roughly 18, whereas the ratio for the NCSA server is roughly 1.6.
Two results are apparent. There are prominent clusters in both California and New England on the two graphs; however, the distribution of requests differs significantly. Secondly, the more popular server has many smaller cluster in addition to those in New England and California. These results indicate that no single server scheme would be able to cache pages for multiple, unrelated servers efficiently. Mirroring our campus information server on the opposite coast might be helpful, but such a simplistic answer would not help the NCSA server, which is popular in 22 separate clusters. On the contrary, a scheme that can autonomously replicate the NCSA server's most popular files in 22 different geographic regions while replicating our campus server in only 4 will be more efficient. Furthermore, dynamic replication schemes also help distribute server load more efficiently.
The graph makes it easy to see the intuitive advantage of distance-sensitive caching: the subnetworks represented by each point lie tantalizingly close to each other. Client caching requires a separate cache on each subnetwork, but distance-sensitive caching enables the entire cluster to autonomously share one local cache.