Our simulator is divided into three parts; the simulator engine that drove the simulation, the host modules that implemented the caching schemes, and the network module that simulated the Internet.
The engine read in the trace data, and used it to create events. Each event represents a host requesting a World Web document from a server; the actual event consists of the server and host involved, the name of the document, the size of the document, and the date and time of the transaction.
The engine also provides support for periodic jobs that must be run on a frequent basis. When the job is scheduled to run it is removed from the queue and the associated code is run. The code may, as part of its duties, re-insert the job on the queue. This allows us, for example, to reset a host's load every hour, and to snapshot the simulator every six.
There were two different types of host modules: servers and clients. Clients would be handed an event by the engine and would use this information to request a URL from a server. This request would travel through our simulated Internet where appropriate bandwidth would be logged. Servers had all the functionality of clients, but could also serve files. Servers kept track of which files they were caching; since we were most interested in simulating the server's behaviour our simulation also maintained a primary server that always stored the most up-to-date versions of files. Both types of hosts record a variety of load parameters such as number of requests, number of bytes stored, number of bytes sent, cache hits, and stale cache hits.
Depending on the exact scheme we were simulating, the server could either return the request document or redirect the client to another server. Clients could cache not only redirects but also files themselves in order to simulate client-caching. Servers could also push documents onto other servers; a centralized manager kept track of servers willing to accept documents.
The Internet module was the most difficult portion of the simulation
to build. We wanted it to log not only the number of packets passing
across it, but the exact bandwidth used in terms of
, where 1000 bytes sent across 10 network hops would use twice
the bandwidth of 1000 bytes sent across 5 network hops. In order to
log bandwidth accurately the simulator needed an accurate Internet
topology, but it would clearly be impossible to calculate and store
the entire Internet topology. Even if we limited the amount of
information we needed to the 60,000 or so hosts mentioned in the NCSA
access log we would still be faced with an impossible amount of
information.
We also did not want to synthesize topology information because push-caching would be very sensitive to the interplay between geographic information and Internet topology. If we used the same assumptions to generate the topology that push-caching used to approximate it we would see artificially high correlations between predicted performance and actual performance, a clear case of simulator bias.
We were fortunate therefore to be given access to the Internet topology
gathered for a Network Time Protocol server survey by Guyton and
Schwartz [18] and discussed in section
.
Using this topology we were able to accurately measure the number of
hops between two arbitrary hosts, and thereby compare the efficiency
of using miles to predict topology versus using the actual topology
itself. We were unable, however, to accurately measure the number of
backbone hops between two hosts or to calculate the latency between
two hosts using this data. These are factors that we can only crudely
approximate using the results of section
.
Of the 6700 hosts in Guyton and Schwartz's database, we were left with
1700 hosts after removing network routers and hosts on networks that
span multiple zip-codes. Figure
displays
the hosts graphically, plotting each host's longitude and latitude. It
is easy to see the distinctive shape of the United States, as well as
the presence of hosts in Alaska and Hawaii.
Figure: Geographical Host Locations. Each potential client in our database is
represented by a dot, plotted using the longitude and latitude
information gathered for each host. Note the distinctive shape of the
United States, as well as the presence of clients in Hawaii and
Alaska.
Since our results rely so heavily on this data we took several steps
to assure its authenticity. A host from Harvard was conveniently
included in the study; we calculated the number of network hops from
this host on the simulated Internet to all the other hosts on the
simulated Internet. We also calculated the number of network hops from
a real host at Harvard to all the other hosts on the real
Internet. The results are shown in
Figure
.
Figure: Simulated Internet vs. Real Internet. Here we have plotted
network hops to the simulated hosts against networks hops to the real
hosts. The regression fit correlation is
, the fitted
line's equation is
.
The relatively high correlation indicates that the data are similar;
the
offset on the fitted line indicates that the simulated network
hops are lower overall than the real network hops. The differences in
the simulated and real networks stem from three causes. First, the
simulated data were gathered over a year ago and many of the hosts
from the original survey are no longer accessible. Out of the 1700
simulated hosts we were unable to contact 125 real hosts. Secondly,
we did not have access to the actual machine at Harvard present in the
simulated topology. We had to settle for running our traces from a
Harvard machine that is one hop farther away from the backbone than
the simulated machine. Thirdly, the real Internet does not always
perform optimal routing: the path that a packet takes between two
hosts is not always the shortest path. We can not replicate this
routing exactly in our simulated Internet-our packets always
traveled the shortest route. These differences are non-trivial, but
will not skew our results since all of our caching schemes will use the
same topology information.
Since our simulated Internet contains only 1700 hosts and our traces contain up to 50,000 hosts, we needed to create a mapping from real hosts to simulated hosts. Requests in a simulated run appear to be initiated from the mapped host, not from the original host. In computing this mapping we therefore try to maintain geographical information as much as possible. We first try to map a host onto a simulated host in the same subnetwork of the real host. For example, any requests from hosts on the 128.103 subnet will be mapped to a simulated host also on the 123.103 subnet if possible. If such a mapping is not possible then we try to map the host onto a randomly selected host in the same state. This is possible for all states except Vermont or Wyoming, because no hosts in the topology were located in either of those two states. For requests from those states we choose a mapping completely at random.
We did not initially maintain host mappings between simulation runs. Several of our early results showed excessive noise, however, and we realized that this randomization in the host mapping was skewing our results. We therefore set up the mapping such that once a host is mapped it remains mapped for all simulations. This was a valuable lesson because it showed us the sensitivity with which host-mapping affects network bandwidth consumption. We expect therefore that any parameter that affects server selection will affect bandwidth consumption.
The simulator was written in C++ to simplify the implementation of its modules. Each module was written as a C++ class, and we took advantage of inheritance to simplify the substitution of cache-scheme dependent operations. As an example, the network always hands its messages off to a CHost object. Since the server and client classes inherit their interface from the host class, different types of hosts can be substituted into the network without having to change any network code.
This approach is also efficient because scheme-independent code need only be written once in the generic module, and then inherited modules can call it. This applies, for example, to the instrumentation built into all the host classes to monitor load, etc.
To reduce the simulator's memory requirements we used a disk-backed database to store information such as which clients are caching which documents, the simulated Internet topology, and the host mapping.
We have also built into the simulator extensive logging tools that were used for both debugging as well as gathering results. The debug logs enabled us to test the operation of the simulator by running with contrived test cases in order to verify push-cache operation.