The final component of large-scale autonomous replication is efficiently locating the nearest replica of a given file. It is easy to make a copy of a piece of data; deciding which copy to use is difficult. Resource location, for example, was the primary difference between Blaze's distributed file system and a traditional distributed file system. Under his system, it was not necessary to satisfy a cache miss from the primary host. A host could instead locate a copy of the file in another host's cache.
Geographical push-caching is similar to Blaze's system in that cache misses can be satisfied out of other caches; it is different in that the locations of these caches are computed so as to minimize network traffic, and cache misses must be satisfied out of the closest cache. Our resource location scheme will therefore need to be able to locate the closest copy of the file.
Guyton and Schwartz are interested in discovering a nearby resource without any sort of centralized database [19]. This differs from earlier approaches, such as that used in Grapevine [3], which required centralized shared databases.
Guyton and Schwartz try to determine how to choose among a collection of replicated servers such that the selection takes into account network topology [19]. They evaluated a variety of approaches using a network simulator, uncovering a number of tradeoffs between ease of deployment, effectiveness, network cost, and portability. They finally conclude that there is no obvious ``best approach,'' but only a variety of compromises.
At the heart of this research is the fact that in the current Internet
there is no magic black box to determine Internet topology. If this
information were known, then optimal resource location would be not
only possible but trivial, because this global Internet topology map
could be consulted to determine exact host distances. The purpose of
Guyton and Schwartz's research is to determine the cost and
effectiveness of approximating this information through various means;
in section
we extend this research further by
determining how well geographical information approximates Internet
topology.
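As a minimal sketch of what approximating topology by geography could look like, the fragment below picks the replica with the smallest great-circle distance to the client. This is an illustration under stated assumptions, not the scheme evaluated in this work: the function names, the replica table, and the coordinates are hypothetical, and the haversine formula is used only as a standard way to compute surface distance from latitude and longitude.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def great_circle_km(a, b):
    """Haversine distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_replica(client, replicas):
    """Return the name of the replica geographically closest to the client.

    `replicas` maps a hypothetical replica name to its (lat, lon) position.
    """
    return min(replicas, key=lambda name: great_circle_km(client, replicas[name]))


# Illustrative replica sites (approximate city coordinates, assumed for the example).
replicas = {
    "boston": (42.36, -71.06),
    "london": (51.51, -0.13),
    "tokyo": (35.68, 139.69),
}

if __name__ == "__main__":
    client = (40.71, -74.01)  # a client near New York
    print(nearest_replica(client, replicas))
```

The appeal of this approach is that it needs no probing and no routing data, only a position for each host; whether geographic distance actually tracks network distance is precisely the open question.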
Guyton and Schwartz examine the choices that distinguish the various resource discovery approaches. These choices include: does the client passively gather location information, or does the client actively seek the nearest replica? If the client actively seeks, does it do so at the level of the network routing protocols, or at the application level? If at the application level, does the client probe the network looking for the nearest copy, or does the client gather routing tables, thereby building a local copy of the network topology? If the client probes the network, does it do so by selective triangulation or by using measurement servers that attempt to build up topology maps for portions of the Internet?
Each of these choices lies on a spectrum with ease of deployment/high
network cost on one end, and difficult deployment/low network cost on
the other. None of them are optimal, and only the least accurate scale
in a manner appropriate to the World Wide Web. Route probing, for
example, one of the most accurate methods, requires a measurement
server to calculate the shortest path in a dense graph, a non-trivial
calculation. Multiply this effort by the thousands of clients that
would need to use such a service, and this option becomes
infeasible. The fact that an efficient means for detecting Internet
topology does not exist forces us to turn toward more radical
solutions, such as using geography to predict
topology. Section
describes our research in this direction.
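To make the cost of the measurement-server approach mentioned above concrete, the following sketch computes shortest paths from one client over a measured link graph and then selects the closest replica. It is an assumption-laden illustration: the text does not say which algorithm a measurement server would use, so Dijkstra's algorithm is swapped in as a standard choice, and the graph, node names, and link costs are hypothetical.

```python
import heapq


def shortest_distances(graph, source):
    """Dijkstra's algorithm over a dict-of-dicts graph with non-negative costs.

    Returns a dict mapping each reachable node to its distance from `source`.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist


def closest_replica(graph, client, replicas):
    """Return the reachable replica with the smallest path cost, or None."""
    dist = shortest_distances(graph, client)
    reachable = [r for r in replicas if r in dist]
    return min(reachable, key=dist.__getitem__) if reachable else None


# Hypothetical measured topology: nodes are hosts or routers, weights are link costs.
graph = {
    "client": {"r1": 1, "r2": 4},
    "r1": {"client": 1, "r2": 1, "replicaA": 5},
    "r2": {"client": 4, "r1": 1, "replicaB": 1},
    "replicaA": {"r1": 5},
    "replicaB": {"r2": 1},
}

if __name__ == "__main__":
    print(closest_replica(graph, "client", ["replicaA", "replicaB"]))
```

With a binary heap, one such query costs O((V + E) log V), and in a dense graph E grows roughly as V squared. Repeating the computation for each of thousands of clients is what makes the approach hard to scale.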