We are not the only researchers turning our attention toward saving Internet bandwidth. Another group that has been particularly vocal on these issues consists of Bowman, Danzig, Manber, and Schwartz [11][10][12][5].
They believe that there is a need for a hierarchically structured,
extensible object-caching service through which Internet objects can be
retrieved once discovered, and they have already built a system called
Harvest (see section
) to test their ideas.
We share many of their beliefs; we are also trying to build such a system for the World Wide Web, through which access can be provided to almost any Internet object. They have stated that servers should be instrumented to help determine where to place additional replicas; we take that belief to its logical conclusion by designing a system in which servers can autonomously replicate their most popular objects.
They agree that there is little information on deploying replication and caching algorithms to support massively replicated, yet autonomously managed databases, and they also agree that it is necessary to take topology into account in order to optimize network utilization.
The rest of this section explores issues they raise in greater detail.
Danzig, Hall, and Schwartz have studied the amount of Internet traffic caused by FTP, and they show through simulation that it could be reduced by 42% if file caches were installed at strategic spots on the Internet [10]. To generate the traces that drive their simulator, they captured data directly at the Boulder, Colorado entry point to the NSFNET backbone, recording all FTP packets that passed through it. That entry point is responsible for 5-7% of all packets on the NSFNET, so their data is fairly representative of the Internet as a whole.
They propose placing FTP caches at all juncture points between networks, such as between the NSFNET backbone and the NEARNet regional network here in New England. Cache resolution would take place in a hierarchical fashion, with each cache satisfying cache misses from its parent, and so on, until the file is eventually retrieved from its home FTP server.
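The hierarchical resolution described above can be sketched as follows. This is only a minimal illustration, not the design from [10]: the class `CacheNode`, the function `home_ftp_server`, and the three-level backbone/regional/campus topology are hypothetical names chosen for the example, and the sketch ignores cache size limits and consistency.

```python
from typing import Callable, Dict, List, Optional


class CacheNode:
    """One cache in the hierarchy (hypothetical name, for illustration)."""

    def __init__(self, name: str,
                 parent: Optional["CacheNode"] = None,
                 origin: Optional[Callable[[str], bytes]] = None) -> None:
        self.name = name
        self.parent = parent      # next cache up the hierarchy, if any
        self.origin = origin      # home-server fetch, used only at the root
        self.store: Dict[str, bytes] = {}

    def get(self, path: str) -> bytes:
        # Cache hit: answer locally.
        if path in self.store:
            return self.store[path]
        # Cache miss: ask the parent, or the home server at the root.
        if self.parent is not None:
            data = self.parent.get(path)
        elif self.origin is not None:
            data = self.origin(path)
        else:
            raise LookupError(f"{self.name}: no parent or origin for {path}")
        # Keep a copy so later requests stop at this level.
        self.store[path] = data
        return data


origin_fetches: List[str] = []


def home_ftp_server(path: str) -> bytes:
    """Stand-in for the file's home FTP server (assumed for the example)."""
    origin_fetches.append(path)
    return b"contents of " + path.encode()


backbone = CacheNode("backbone", origin=home_ftp_server)
regional = CacheNode("regional", parent=backbone)
campus_a = CacheNode("campus-a", parent=regional)
campus_b = CacheNode("campus-b", parent=regional)

campus_a.get("/pub/README")   # misses at every level, fetched from home server
campus_b.get("/pub/README")   # misses locally, satisfied by the regional cache
```

In this sketch the second request never reaches the home server, because the regional cache retained a copy while serving the first one.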
Their simulation was fairly coarse, simulating the savings across the Internet backbone by mapping all hosts to their corresponding NSFNET entry points. Such a study would be harder to perform today because there are now several backbones besides the NSFNET.
To compute byte savings, they used the metric of byte-hops, where the cost of an FTP transfer is the size of the file in bytes times the number of network hops the file had to make. This is a useful metric for computing network cost, and we will use it in our own simulations, although it is not trivial to calculate the exact number of network hops between two arbitrary hosts on the Internet.
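As a small worked example of the byte-hop metric, the sketch below sums size times hops over a list of transfers and compares a cached run against an uncached one. The function names and the transfer figures are invented for illustration; they are not taken from the traces in [10].

```python
from typing import Iterable, Tuple

# A transfer is (file size in bytes, number of network hops travelled).
Transfer = Tuple[int, int]


def byte_hops(transfers: Iterable[Transfer]) -> int:
    """Total cost: sum of size * hops over all transfers."""
    return sum(size * hops for size, hops in transfers)


def savings_fraction(uncached: Iterable[Transfer],
                     cached: Iterable[Transfer]) -> float:
    """Fraction of byte-hops saved by the cached run relative to the uncached one."""
    baseline = byte_hops(uncached)
    if baseline == 0:
        return 0.0
    return 1 - byte_hops(cached) / baseline


# Made-up numbers: the same 1000-byte file is requested twice.
# Without caching, both requests travel 10 hops to the home server.
uncached_run = [(1000, 10), (1000, 10)]
# With a cache 2 hops away, the second request travels only 2 hops.
cached_run = [(1000, 10), (1000, 2)]

baseline_cost = byte_hops(uncached_run)   # 20000 byte-hops
cached_cost = byte_hops(cached_run)       # 12000 byte-hops
saved = savings_fraction(uncached_run, cached_run)
```

Here the cached run costs 12000 byte-hops against a baseline of 20000, a saving of 40%.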
Their results show a 42% reduction of FTP traffic across the Internet backbone from adding these hierarchical caches. The impressive savings from Web proxies indicate that the Web will yield similar savings, but they also call into question the necessity of hierarchical caching. Recall that Blaze stated that hierarchical caches are often not effective; the performance of Web proxies in the absence of any sort of hierarchical caching seems to justify this hesitation.
Their system attached a time-to-live (TTL) field to each FTP object to help maintain cache consistency. The Web also attaches time-to-live fields to its documents, but these are rarely used and cannot be trusted. One source of debate in the caching community is whether TTL fields can be set effectively without knowing the content of the data to which they apply.
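A TTL-based consistency check can be sketched roughly as below. This is an assumed, simplified model rather than the mechanism used in [10]: the names `TTLCache` and `Entry` are hypothetical, and the current time is passed in explicitly so that the behaviour is easy to follow.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    data: bytes
    expires_at: float   # absolute time after which the copy is considered stale


class TTLCache:
    """Cache whose entries expire after a per-object time-to-live (illustrative)."""

    def __init__(self) -> None:
        self.entries: Dict[str, Entry] = {}

    def put(self, key: str, data: bytes, ttl: float, now: float) -> None:
        # The TTL is chosen by whoever inserts the object.
        self.entries[key] = Entry(data, now + ttl)

    def get(self, key: str, now: float) -> Optional[bytes]:
        entry = self.entries.get(key)
        if entry is None:
            return None
        if now >= entry.expires_at:
            # Stale copy: drop it so the caller refetches from upstream.
            del self.entries[key]
            return None
        return entry.data


cache = TTLCache()
cache.put("/pub/README", b"v1", ttl=60, now=0)
fresh = cache.get("/pub/README", now=59)     # still within the TTL
expired = cache.get("/pub/README", now=60)   # TTL has elapsed
```

The sketch also shows why the choice of TTL is contentious: the cache discards or keeps a copy based only on elapsed time, with no knowledge of whether the underlying file has actually changed.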