For a server to make the optimal decision about where to cache data, it must have an accurate representation of network topology. As we saw in section , there is currently no way to determine the Internet's topology a priori. We hypothesized that geographical information could be used to hint at which servers were topologically close.
We surveyed the Internet using the traceroute program to measure Internet topology, and we used a file maintained by Merit listing the address of each subnet administrator for the 42,000 subnets on the Internet today. The critical datum in the Merit file is the zip code listed in the address; in conjunction with a geography server, this provides enough information to establish the latitude and longitude of each network administrator. As long as the zip code of the subnet administrator matches the zip code of the subnet as a whole, we can accurately place the subnet geographically. Given this information, computing the distance between two arbitrary hosts on the Internet is a simple calculation, accurate to within a zip code and the size of the subnet. This approach is not effective for subnets that span multiple zip codes, such as backbone networks or regional networks, but it is effective for the local networks that account for a large fraction of the client requests.
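The distance calculation can be sketched as a great-circle computation over the latitude/longitude pairs returned by the geography server. The sketch below uses the standard haversine formula; the coordinates and function name are illustrative and not part of our survey tools.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

# Hypothetical zip-code centroids: Boston, MA and Washington, DC.
print(round(haversine_miles(42.3601, -71.0589, 38.9072, -77.0369)))  # ≈ 394 miles
```

Because the coordinates are zip-code centroids rather than exact host locations, the result is only accurate to within a zip code, as noted above.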
To test the correlation between these two types of information, we selected several hundred hosts in the United States and surveyed each one's distance from Harvard. We measured the latency between our host and each target, as well as the number of network hops between them, using traceroute. We also calculated the distance between them in miles as described above. Since individual workstations are frequently not accessible, our survey settles for any reachable computer on the same subnet as the desired host. If no host on the desired subnet can be reached, a failure is recorded for that host. We ran this program from several other locations around the Internet, including the west coast and Colorado.
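As a sketch of how the per-host measurements might be extracted, the parser below pulls a hop count and a final-hop round-trip time out of traceroute's textual output. This is a simplified illustration, not our survey program: it assumes the common one-line-per-hop output format and ignores complications such as timed-out probes (`*`).

```python
import re

# Matches lines like: " 7  core1.bos.example.net (192.0.2.1)  12.3 ms  11.8 ms  12.9 ms"
HOP_RE = re.compile(r"^\s*(\d+)\s+.*?([\d.]+)\s+ms")

def parse_traceroute(output):
    """Return (hop_count, last_hop_rtt_ms) from traceroute output text."""
    hops = []
    for line in output.splitlines():
        m = HOP_RE.match(line)
        if m:
            hops.append((int(m.group(1)), float(m.group(2))))
    if not hops:
        return 0, None
    return hops[-1][0], hops[-1][1]
```

The final hop's hop number gives the network distance, and its first round-trip time serves as the latency sample for that host.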
We did not expect extremely high correlations because Internet connectivity varies widely: some hosts have high-speed network connections, while others sit on slower, less well-connected networks. Different backbone connections are another source of error; because backbones connect to one another at only a few sites, a file exchanged between two hosts on different backbones, no matter how close the hosts are geographically, may have to travel quite far on the Internet. For example, the hosts maddog.harvard.edu and carrara.bos.marble.com are both located near Boston, but since one is on the MCI backbone and the other is on the Sprintlink backbone, packets between them must pass through Washington, DC, where the two backbones connect.
Figures and display the data from our east coast observations for distance versus network hops, distance versus latency, distance versus backbone hops, and network hops versus latency. The Colorado and west coast observing runs yielded similar results.
Figure: Results of Network Survey: Network Hops and Network Backbone Hops. Note that geographical distance establishes a lower bound for network hops. Note also the number of hosts in the sub-100 mile range that are 0 backbone hops away.
Figure: Results of Network Survey: Network Latency. Note that the latency graph was cropped at 200 ms for clarity; 17 hosts with latencies ranging from 200 ms to 1 s were removed.
In looking for signs that geographical distance predicts network distance (network hops, backbone hops, and latencies), we were encouraged by the apparent correlation shown in the graphs. We also noticed the trend that nearby hosts show the greatest correlation between geographic distance and network distance. Once the distance exceeds 500 miles, the importance of geographic distance decreases.
We hypothesized that if we limited our analysis to hosts on the same backbone network, we would find a stronger correlation between geographical distance and network distance. To examine this hypothesis, we divided the hosts into groups, one for each backbone, and then computed the correlations for each backbone separately. Table presents the results of this study.
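The per-backbone analysis can be sketched as grouping the survey samples by backbone and computing a correlation coefficient within each group. The sketch below uses Pearson's correlation in pure Python; the tuple layout and function names are illustrative assumptions, not our actual analysis code.

```python
from collections import defaultdict
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # no variation in one variable; correlation undefined
    return cov / (sx * sy)

def correlations_by_backbone(samples):
    """samples: iterable of (backbone, miles, hops) tuples.
    Returns {backbone: correlation of miles vs. hops} per group."""
    groups = defaultdict(list)
    for backbone, miles, hops in samples:
        groups[backbone].append((miles, hops))
    return {b: pearson([m for m, _ in pts], [h for _, h in pts])
            for b, pts in groups.items() if len(pts) >= 2}
```

The same grouping applies unchanged to the other pairings in the table (distance versus latency, distance versus backbone hops, and hops versus latency).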
Table: Backbone-based correlations for geographical distance versus network hops, latency, and backbone hops, as well as network hops versus network latency. We have divided our samples into groups based on the backbone to which they are connected. Measurements were taken from a host on the 204.70 backbone (the NSFNET); notice how correlations are strongest overall for other hosts on 204.70.
These observations affirm our hypothesis that the correlations between geographic distance and Internet distance are higher overall when looking at hosts on the same backbone than when looking at all hosts. This result suggests that it will be advantageous to steer clients toward host caches that are both geographically close and on the same backbone network.
We included a comparison of network hops to latency because calculating the expected latency on a network is hard: it requires knowledge of network bandwidth and expected load. If the number of network hops between two computers is related to the latency, then by optimizing to reduce network hops we are also optimizing to reduce latency. Figure indicates that there is a moderate correlation between hops and latency: fewer hops than average implies low latency, and more hops than average implies high latency. This is helpful, because it implies that steering a host from a distant cache to a nearby cache will decrease latency as well as network traffic. As we saw in section , this should be one of the primary goals of replication schemes.
We investigated latency further by following up on a suspicion voiced by Bestavros in a private conversation. He suspected that latency was primarily caused by crossing between backbones, not necessarily by the number of individual backbone hops. This hypothesis would make sense if connection points between backbones proved to be bottlenecks and sources of congestion. We therefore modified our survey to include the number of backbones traversed as well. The results of this new survey are shown in Figure . There is a clear correlation between the maximum latency observed and the number of backbones crossed, although we cannot draw any further conclusions from the data. We hope to follow up on this finding in future work; if it turns out that latency is strongly related to the number of backbones crossed, then simply mirroring web sites on multiple backbones should reduce latency considerably.
Figure: Number of backbones versus latency. There is a clear correlation between the maximum latency observed and the number of backbones crossed. This would support Bestavros' suspicion that crossing backbones accounts for the majority of the Internet's latency. Notice that no latencies greater than 100 ms were observed without crossing at least one backbone boundary.
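Counting the backbones traversed along a path can be sketched by mapping each hop's address to a backbone via its prefix and counting transitions between distinct backbones. The prefix table below is a hypothetical illustration (198.51.100/24 is a documentation range, and the backbone names are placeholders); a real survey would use the actual backbone address assignments.

```python
# Hypothetical prefix-to-backbone table for illustration only.
BACKBONE_PREFIXES = {
    "204.70.": "backbone-A",
    "198.51.100.": "backbone-B",
}

def backbone_of(ip):
    """Return the backbone owning this hop's address, or None for non-backbone hops."""
    for prefix, name in BACKBONE_PREFIXES.items():
        if ip.startswith(prefix):
            return name
    return None

def backbones_crossed(path):
    """Count transitions between distinct backbones along a traceroute path."""
    seen = [b for b in (backbone_of(ip) for ip in path) if b is not None]
    return sum(1 for prev, cur in zip(seen, seen[1:]) if cur != prev)
```

For example, a path that enters one backbone, exits it into a second, and then reaches a local network counts as one crossing, regardless of how many hops it spends inside each backbone.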