No distributed wide-area system can be built without an analysis of the underlying Internet strata. A designer must understand the performance issues of the Internet in order to build as efficient a final product as possible. There have been a number of Internet studies; we describe here the most recent reports.
The Applied Network Research Group at the San Diego Supercomputer Center has been studying the Internet for years. They have recently turned their attention to the Web, analyzing access logs from the NCSA server to show that geographic caching helps reduce network traffic, latency, and server load.
Their research anticipates our own, in that it provides motivation without implementation specifics. They found that server overloading is a growing problem for Web servers, but caution that solving the load problem will lead to more file requests, which in turn will aggravate network bandwidth consumption. Any solution to load balancing must therefore distribute load so as to reduce network bandwidth requirements.
They suggest that to help distribute loads, client preference for a cache site might be a function not only of location, but also of current network or server load. To determine how much network bandwidth could be saved by caching files geographically they performed a simulation driven by Web access logs, mapping network IP addresses to states or countries in order to determine geographic location.
In simulating bandwidth savings they examined different timeout values, without regard to replacement algorithms. The metric they used to compute the efficiency of geographic caching was the ratio of the marginal savings in bytes transferred to the cache size. Their simulations achieved an efficiency of at most a factor of 7; we show in section that push-caching can achieve efficiency values of 700 or more, since push-cache servers know how popular the items they cache are, and can therefore set the amount of replication accordingly.
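The efficiency metric above can be sketched concretely. This is a hypothetical illustration; the function name and all byte counts are assumptions chosen to mirror the factors of 7 and 700 discussed in the text, not data from the study.

```python
# Sketch of the geographic-caching efficiency metric:
#   efficiency = marginal bytes saved by the cache / cache size.
# All numbers below are hypothetical illustrations.

def cache_efficiency(bytes_saved, cache_size_bytes):
    """Ratio of marginal savings in bytes transferred to cache size."""
    return bytes_saved / cache_size_bytes

# A demand-driven geographic cache saving 7 bytes of transfer
# per byte of cache space:
print(cache_efficiency(bytes_saved=7_000_000,
                       cache_size_bytes=1_000_000))    # 7.0

# A push cache that replicates only known-popular items can save
# far more transfer per byte of cache space:
print(cache_efficiency(bytes_saved=700_000_000,
                       cache_size_bytes=1_000_000))    # 700.0
```

The point of the metric is that it rewards caching only what is actually requested: a cache full of unpopular files has a large denominator and little savings in the numerator.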
They also suggest that one way to solve the resource location problem is to implement a modified DNS that resolves an IP address for a distributed system to a nearby server or cache. They define ``close'' to include metrics like physical distance and number of network hops. Finally they mention that no caching solution is complete without taking into account file security and different levels of cache time-out for different types of data. We are particularly interested in the possibility of a modified DNS because it would allow us to avoid contacting the primary host in order to locate a nearby replica.
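The modified-DNS idea can be illustrated with a toy resolver that maps a logical name to the ``closest'' replica by hop count. The hostnames, hop counts, and replica table here are entirely hypothetical; a real deployment would measure distance dynamically and fall back to ordinary resolution.

```python
# Toy sketch of a modified DNS-style lookup: resolve a logical name to
# the nearest replica rather than always to the primary host.
# Hostnames and hop counts are hypothetical assumptions.

REPLICAS = {
    "www.example.org": [
        ("replica-east.example.org", 3),    # (server, network hops away)
        ("replica-west.example.org", 11),
        ("primary.example.org", 17),
    ]
}

def resolve_nearest(name, replicas=REPLICAS):
    """Return the replica with the fewest network hops; if the name has
    no replica entry, fall back to resolving the name itself."""
    candidates = replicas.get(name)
    if not candidates:
        return name
    return min(candidates, key=lambda entry: entry[1])[0]

print(resolve_nearest("www.example.org"))     # replica-east.example.org
print(resolve_nearest("unknown.example.org")) # unknown.example.org
```

This captures why such a scheme is attractive: the client never contacts the primary host just to discover that a closer copy exists.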
Viles and French studied the availability and latency of Web servers on the Internet. Ideally, all servers would be available 100% of the time and would have latencies of 100ms or less. They found that in reality the Web does not live up to these ideals: most servers are on average available only 95% of the time, and the average latency is much higher, on the order of 500ms.
They suggest one possible way to improve server latency: allow the client to request several documents at one time. This amortizes the cost of TCP connection setup and teardown over several documents instead of just one, since TCP costs turn out to be an important factor for short documents. They do not address the availability issue further.
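The amortization argument is easy to make concrete with a back-of-the-envelope model. The constants below (100ms per connection, 20ms per short document) are illustrative assumptions, not measurements from the study.

```python
# Why batching requests helps short documents: connection overhead is
# paid once per connection, not once per document.
# The constants are illustrative assumptions.

TCP_SETUP_MS = 100.0         # assumed setup + teardown cost per connection
TRANSFER_MS_PER_DOC = 20.0   # assumed transfer time for one short document

def total_latency(n_docs, docs_per_connection):
    """Total latency when n_docs are fetched docs_per_connection at a time."""
    connections = -(-n_docs // docs_per_connection)  # ceiling division
    return connections * TCP_SETUP_MS + n_docs * TRANSFER_MS_PER_DOC

print(total_latency(10, 1))   # one document per connection: 1200.0 ms
print(total_latency(10, 10))  # all ten on one connection:    300.0 ms
```

With one document per connection the setup cost dominates (1000ms of the 1200ms total); batching all ten reduces it to a single 100ms charge.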
Both of these findings have bearing on our work, because autonomous replication should be able to address both the availability issue and the latency issue. Replicating objects improves availability, because a single server failure no longer eliminates all access to that server's objects. Latency improves as well if both network topology and access history are considered when deciding where to cache objects: when nearby replicas are available, clients observe lower latencies than when accessing the more distant primary host.
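The availability claim can be quantified under a simple model. Assuming each server fails independently (an assumption of this sketch, not a claim from the studies above), an object replicated on n servers is unreachable only when all n are down, so starting from the 95% single-server figure:

```python
# Availability of a replicated object under an independent-failure model.
# The independence assumption is a simplification made for illustration.

def replicated_availability(p, n):
    """Availability of an object with n independent replicas,
    each available with probability p."""
    return 1 - (1 - p) ** n

print(replicated_availability(0.95, 1))  # ~0.95   (no replication)
print(replicated_availability(0.95, 2))  # ~0.9975 (two replicas)
print(replicated_availability(0.95, 3))  # three replicas
```

Even a single extra replica cuts unavailability from 5% to 0.25% in this model, which is why replication attacks the availability problem so directly.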